Building a news aggregator with the /crawl endpoint

A news aggregator is the canonical scraping pipeline. Pull articles from many sources, normalize, dedup, surface. The pattern is well understood; what's changed in 2026 is that LLM-powered extraction makes the pipeline simple enough to ship as a side project in a weekend, where it used to be a quarter of work for one engineer.

This post is that pipeline. The end product is a daily-ingest news aggregator covering ~50 sources at ~$5/day in extraction cost. The example uses Runo's /crawl endpoint; the architecture works with any extraction API that supports crawling.

What we're building#

The aggregator pulls articles from a configured list of news sources (TechCrunch, The Verge, Hacker News, Ars Technica, etc.), normalizes them into a common schema, dedups across sources (the same story often appears on multiple outlets), and exposes them via a search/feed UI.

Output schema per article:

[
  { "field": "title",         "type": "string", "example": "Apple announces new chip" },
  { "field": "subtitle",      "type": "string", "example": "M5 promises 30% gain" },
  { "field": "author",        "type": "string", "example": "Sarah Chen" },
  { "field": "publishedDate", "type": "date",   "example": "2026-06-05" },
  { "field": "source",        "type": "string", "example": "techcrunch.com" },
  { "field": "category",      "type": "string", "example": "technology" },
  { "field": "summary",       "type": "string", "example": "Apple unveiled..." },
  { "field": "tags",          "type": "array<string>", "example": ["apple", "chips"] },
  { "field": "imageUrl",      "type": "string", "example": "https://..." }
]

The summary field here is the lead paragraph or article description as published, not an LLM-generated summary. Generating summaries from the full body text is a separate step.

The discovery loop: where to find article URLs#

Three patterns, ordered by quality:

1. RSS / Atom feeds (the easiest case)#

Most reputable news sites publish RSS feeds with the latest 20-50 articles. Direct, structured, no scraping required for discovery. Each item has a <link> to the full article.

import feedparser

feed = feedparser.parse("https://techcrunch.com/feed/")
urls = [entry.link for entry in feed.entries]

For ~80% of news sources, this is the discovery layer. Cheap, fast, polite.

2. Sitemap crawl#

If a source doesn't have RSS but has a sitemap.xml, that works too. Sitemaps usually segment by year/month so you can pull recent ones cheaply.

import httpx, re

sitemap = httpx.get("https://example-news.com/sitemap-2026-06.xml").text
urls = re.findall(r"<loc>([^<]+)</loc>", sitemap)
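
The regex above grabs every <loc> in the file. If the sitemap also carries <lastmod> values, you can filter to recent entries while parsing. A minimal sketch using only the standard library; the function name recent_urls is hypothetical, and it assumes a standard sitemaps.org urlset with ISO-style dates:

```python
import xml.etree.ElementTree as ET
from datetime import date

# Namespace used by standard sitemap files (sitemaps.org protocol).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def recent_urls(sitemap_xml: str, since: date) -> list[str]:
    """Return <loc> values whose <lastmod> is on or after `since`."""
    root = ET.fromstring(sitemap_xml)
    urls = []
    for node in root.findall("sm:url", SITEMAP_NS):
        loc = node.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        lastmod = node.findtext("sm:lastmod", default="", namespaces=SITEMAP_NS).strip()
        if not loc:
            continue
        try:
            # lastmod may be a date or a full timestamp; compare on the date part.
            too_old = bool(lastmod) and date.fromisoformat(lastmod[:10]) < since
        except ValueError:
            too_old = False  # unparseable lastmod: keep the URL rather than drop it
        if not too_old:
            urls.append(loc)
    return urls
```

Entries without a lastmod are kept on purpose; the already-seen filter later in the pipeline drops anything you have processed before.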

3. Crawl from the homepage#

For sources without RSS or sitemap, the fallback is crawling the homepage with a follow pattern matching article URLs:

import httpx

resp = httpx.post(
    "https://api.scrapewithruno.com/v1/crawl",
    json={
        "seed_url": "https://example-news.com",
        "schema": ARTICLE_SCHEMA,  # the schema list shown above
        "crawl": {
            "follow_pattern": "https://example-news.com/202[0-9]/*",
            "max_pages": 100,
            "max_depth": 2,
        },
    },
    headers={"X-API-Key": API_KEY},
).json()

The follow_pattern constrains the crawl to article-shaped URLs (most news sites use /year/month/slug URLs). max_depth: 2 lets the crawl follow homepage → category page → article. max_pages: 100 caps the total fetch.

/crawl reserves max_pages from your monthly quota upfront and refunds whatever goes unused. If a homepage only links to 30 articles, you pay for 30, not 100.
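
Before spending quota on a crawl, it can help to sanity-check a follow pattern against a few URLs you expect to match or miss. The sketch below uses Python's shell-style globbing as a stand-in; that Runo's matcher follows the same glob semantics is an assumption, and would_follow is a hypothetical helper, not part of any API:

```python
from fnmatch import fnmatchcase

FOLLOW_PATTERN = "https://example-news.com/202[0-9]/*"


def would_follow(url: str, pattern: str = FOLLOW_PATTERN) -> bool:
    """Local approximation of the crawl filter using shell-style globbing.

    Assumption: the crawler treats the pattern as a glob where `*` matches
    any run of characters and `[0-9]` matches a single digit.
    """
    return fnmatchcase(url, pattern)


if __name__ == "__main__":
    for candidate in (
        "https://example-news.com/2026/06/apple-chip",
        "https://example-news.com/category/tech",
        "https://example-news.com/about",
    ):
        print(candidate, would_follow(candidate))
```

If category pages fail the check, as they do here, confirm how the crawler treats intermediate pages before relying on max_depth: 2 to reach articles through them.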

Configuring sources#

A YAML file makes source config readable:

sources:
  - name: TechCrunch
    domain: techcrunch.com
    feed: https://techcrunch.com/feed/
    category: technology

  - name: The Verge
    domain: theverge.com
    feed: https://www.theverge.com/rss/index.xml
    category: technology

  - name: Hacker News (front page)
    domain: news.ycombinator.com
    discover_method: crawl
    seed: https://news.ycombinator.com
    follow_pattern: "https://news.ycombinator.com/item?id=*"
    max_pages: 30
    category: technology

  - name: Reuters Business
    domain: reuters.com
    feed: https://www.reuters.com/business/feed/
    category: business

50 sources fits in ~200 lines of YAML. Adding a new source is a 5-line change.
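
Once the YAML is parsed (for example with PyYAML's yaml.safe_load), each entry needs to be routed to a discovery method. A minimal sketch of that router; the Source dataclass, the sitemap key, and the rule "feed means RSS unless discover_method says otherwise" are assumptions layered on the config above:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Source:
    name: str
    domain: str
    category: str
    feed: Optional[str] = None
    sitemap: Optional[str] = None          # hypothetical key, not in the YAML above
    discover_method: Optional[str] = None  # explicit override, e.g. "crawl"
    seed: Optional[str] = None
    follow_pattern: Optional[str] = None
    max_pages: int = 100


# Fields each discovery method needs in order to run.
REQUIRED = {"rss": ["feed"], "sitemap": ["sitemap"], "crawl": ["seed", "follow_pattern"]}


def discovery_method(src: Source) -> str:
    """Pick the cheapest method the config supports: RSS, then sitemap, then crawl."""
    if src.discover_method:
        return src.discover_method
    if src.feed:
        return "rss"
    if src.sitemap:
        return "sitemap"
    return "crawl"


def validate(src: Source) -> str:
    """Return the discovery method, or raise ValueError if the config is incomplete."""
    method = discovery_method(src)
    if method not in REQUIRED:
        raise ValueError(f"{src.name}: unknown discover_method {method!r}")
    missing = [field for field in REQUIRED[method] if not getattr(src, field)]
    if missing:
        raise ValueError(f"{src.name}: {method} discovery needs {', '.join(missing)}")
    return method
```

Building each entry with Source(**entry) and calling validate at startup means a typo in the YAML fails loudly before the first scheduled run rather than halfway through it.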

The extraction loop#

For each discovered URL, extract the structured article data. With Runo, the call is:

async def extract_article(client, url):
    return await client.post(
        "https://api.scrapewithruno.com/v1/extract",
        json={"url": url},
        headers={"X-API-Key": STATIC_KEY},  # schema bound to key
        timeout=60,
    )

A static key with the article schema pre-bound is the right choice here. You're calling this thousands of times with the same schema; the smaller payload and prompt-cache hit rate matter.

For batches, use /batch:

results = await client.post(
    "https://api.scrapewithruno.com/v1/batch",
    json={"urls": urls, "options": {"concurrency": 20}},
    headers={"X-API-Key": STATIC_KEY},
    timeout=600,
)

Concurrency of 20 is a reasonable default. Higher works but starts hitting per-host rate limits when many URLs share a domain.
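
One way to soften the shared-domain problem on your side is to interleave URLs by host before submitting, so consecutive requests rarely target the same site. A sketch of a round-robin ordering; whether /batch processes URLs in submission order is an assumption, and interleave_by_host is a hypothetical helper:

```python
from collections import defaultdict, deque
from urllib.parse import urlparse


def interleave_by_host(urls):
    """Reorder URLs round-robin across hosts, preserving per-host order."""
    queues = defaultdict(deque)
    for url in urls:
        queues[urlparse(url).netloc].append(url)

    ordered = []
    while queues:
        # Take one URL from each host that still has work, then repeat.
        for host in list(queues):
            ordered.append(queues[host].popleft())
            if not queues[host]:
                del queues[host]
    return ordered
```

With 20 concurrent slots and 50 sources, this ordering keeps any single host to a small share of in-flight requests until the other hosts run dry.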

Dedup across sources#

The same news event often hits 5-10 outlets within a few hours. Surfacing all 10 to your users is bad UX. Two layers of dedup:

Layer 1: URL canonicalization#

Same article, different URL because of UTM params or trailing slashes:

from urllib.parse import urlparse, urlunparse

def canonicalize(url):
    p = urlparse(url)
    # Strip query string, trailing slash, fragment
    return urlunparse((p.scheme, p.netloc, p.path.rstrip("/"), "", "", ""))

Catches the obvious dupes within a single source.
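
One caveat: stripping the whole query string collapses URLs where the query identifies the article, such as https://news.ycombinator.com/item?id=123. A variant that drops only tracking parameters is sketched below; the function name and the list of tracking keys are illustrative assumptions, so extend them for your sources:

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

# Illustrative list of tracking keys; extend it for the sources you ingest.
TRACKING_PREFIXES = ("utm_",)
TRACKING_PARAMS = {"fbclid", "gclid"}


def canonicalize_keep_query(url):
    """Drop tracking params, fragment, and trailing slash; keep meaningful query keys."""
    p = urlparse(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(p.query, keep_blank_values=True)
        if not key.lower().startswith(TRACKING_PREFIXES)
        and key.lower() not in TRACKING_PARAMS
    ]
    kept.sort()  # stable parameter order so ?a=1&b=2 and ?b=2&a=1 compare equal
    return urlunparse(
        (p.scheme.lower(), p.netloc.lower(), p.path.rstrip("/"), "", urlencode(kept), "")
    )
```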

Layer 2: Cross-source semantic dedup#

Two outlets cover the same Apple announcement; titles differ slightly, bodies differ entirely. Need semantic similarity.

Approach: compute an embedding for each article (title + summary) and cluster within a 24-hour rolling window using a density-based clusterer (DBSCAN in the code below; HDBSCAN also works) or a simple cosine-similarity threshold.

from sentence_transformers import SentenceTransformer
from sklearn.cluster import DBSCAN
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")
texts = [f"{a['title']} {a['summary']}" for a in articles_24h]
embeddings = model.encode(texts)

# DBSCAN with cosine distance threshold ~0.3 for "same story"
clusters = DBSCAN(eps=0.3, min_samples=1, metric="cosine").fit_predict(embeddings)

Each cluster is one news event. Pick a representative article per cluster, preferring the highest-traffic source and breaking ties by earliest publication time. Surface the cluster to the user as one item with "also covered by [N] other outlets."
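
Representative selection can be a single sort key. The sketch below reads the preference as "highest-traffic source first, then earliest publication"; source_rank is a hypothetical mapping you maintain yourself (0 is the highest-traffic outlet), and publishedDate is assumed to be an ISO date string as in the schema:

```python
def pick_representative(cluster_articles, source_rank):
    """Choose one article to display for a cluster.

    source_rank maps a source domain to its rank (0 = highest traffic).
    Unranked sources sort after all ranked ones; ties go to the earliest date.
    """
    unranked = len(source_rank)
    return min(
        cluster_articles,
        key=lambda a: (source_rank.get(a["source"], unranked), a["publishedDate"]),
    )
```

ISO dates sort correctly as plain strings, which is why no date parsing is needed here.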

This is also the data structure that powers "follow this story" features. Each cluster gets a stable cluster_id; new articles that match a cluster get appended.
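
Appending new articles to existing clusters can be done without re-running the full clustering. A greedy sketch in pure Python: compare each new embedding to cluster centroids and join the best match above a similarity threshold. This is a swapped-in approximation, not equivalent to DBSCAN; the 0.7 threshold simply mirrors the 0.3 cosine distance above, and the cluster dict layout is an assumption:

```python
import math
import uuid


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def assign_cluster(embedding, clusters, threshold=0.7):
    """Attach an embedding to the closest cluster, or start a new one.

    clusters maps cluster_id -> {"centroid": [...], "size": n} and is
    updated in place. Returns the cluster_id the article now belongs to.
    """
    best_id, best_sim = None, threshold
    for cluster_id, cluster in clusters.items():
        sim = cosine(embedding, cluster["centroid"])
        if sim >= best_sim:
            best_id, best_sim = cluster_id, sim

    if best_id is None:
        # No cluster is close enough: mint a new stable id.
        best_id = uuid.uuid4().hex
        clusters[best_id] = {"centroid": list(embedding), "size": 1}
        return best_id

    # Running mean keeps the centroid representative as the cluster grows.
    cluster = clusters[best_id]
    n = cluster["size"]
    cluster["centroid"] = [(m * n + x) / (n + 1) for m, x in zip(cluster["centroid"], embedding)]
    cluster["size"] = n + 1
    return best_id
```

Because ids are minted once and never reassigned, a "follow this story" subscription can key on cluster_id even as the cluster keeps growing.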

Refresh cadence#

For a news aggregator, hourly or 30-minute discovery is the right cadence. Articles are highest-value within the first few hours of publication.

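import asyncio
import schedule

# scheduled job
async def discover_and_extract(source_config):
    urls = discover(source_config)            # RSS / sitemap / crawl
    new_urls = filter_already_seen(urls)      # check against database
    if not new_urls:
        return
    articles = await extract_batch(new_urls)
    save_to_db(articles)
    update_clusters(articles)                 # cross-source dedup

def run_for(sources):
    # schedule calls plain functions, so run the coroutines to completion here
    async def run_all():
        await asyncio.gather(*(discover_and_extract(s) for s in sources))
    asyncio.run(run_all())

# every 30 min for hot sources, hourly for slow ones
schedule.every(30).minutes.do(run_for, hot_sources)
schedule.every().hour.do(run_for, slow_sources)
# then call schedule.run_pending() in a loop to fire the jobs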

The "already seen" filter is a database lookup on canonical URL. Skip extraction for URLs you've seen, even if the RSS feed listed them.
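
A minimal sketch of that lookup using SQLite from the standard library. The table name, column names, and the helper filter_new_urls are hypothetical; any database with a unique index on the canonical URL works the same way:

```python
import sqlite3


def open_seen_db(path=":memory:"):
    """Open (or create) the store of canonical URLs already processed."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS seen_urls ("
        " canonical_url TEXT PRIMARY KEY,"
        " first_seen TEXT DEFAULT CURRENT_TIMESTAMP)"
    )
    return conn


def filter_new_urls(conn, canonical_urls):
    """Return only URLs not seen before, recording them as seen in the same pass."""
    new_urls = []
    for url in canonical_urls:
        cur = conn.execute(
            "INSERT OR IGNORE INTO seen_urls (canonical_url) VALUES (?)", (url,)
        )
        if cur.rowcount == 1:  # the row was inserted, so the URL is new
            new_urls.append(url)
    conn.commit()
    return new_urls
```

This marks a URL as seen before extraction succeeds. If you would rather retry failed extractions, record the URL only after the article is saved.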

Cost math#

For a 50-source aggregator with hourly discovery:

| Component | Per day | Per month |
| --- | --- | --- |
| Discovery (RSS feeds + occasional crawl) | ~50 RSS + 10 crawl seeds | ~$2 |
| Extraction (~30 new articles/source/day × 50 sources × $0.001) | ~1,500 articles × $0.001 = $1.50 | ~$45 |
| Embedding (1,500 articles × ~$0.00001) | ~$0.015 | ~$0.50 |
| Database / hosting | ~$1 | ~$30 |
| Total | ~$5/day | ~$80/month |

Fits comfortably in the Runo Pro tier ($59/month for 50K extraction requests; you'd use ~45K). For lower scale (10 sources), Starter ($17/month, 15K requests) is enough.
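
The extraction line and the quota check are simple enough to recompute when your source count changes. A sketch using the figures above; the 30-day month and the constant names are assumptions:

```python
SOURCES = 50
NEW_ARTICLES_PER_SOURCE_PER_DAY = 30   # estimate from the table above
COST_PER_EXTRACTION = 0.001            # dollars per extraction request
DAYS_PER_MONTH = 30                    # assumption for monthly totals
PRO_TIER_REQUESTS = 50_000

articles_per_day = SOURCES * NEW_ARTICLES_PER_SOURCE_PER_DAY
requests_per_month = articles_per_day * DAYS_PER_MONTH
extraction_per_day = articles_per_day * COST_PER_EXTRACTION
extraction_per_month = extraction_per_day * DAYS_PER_MONTH
headroom = PRO_TIER_REQUESTS - requests_per_month

print(f"{articles_per_day} articles/day, {requests_per_month} requests/month")
print(f"extraction: ${extraction_per_day:.2f}/day, ${extraction_per_month:.2f}/month")
print(f"Pro tier headroom: {headroom} requests")
```

The articles-per-source estimate dominates the result. The headroom on Pro is about 10% of the quota, so a run of heavy news days is worth monitoring.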

The summarisation step (optional)#

For a UX where each article gets a one-paragraph summary, run a separate LLM pass after extraction. Two patterns:

  1. Per-cluster summary: summarise the cluster as a whole, citing all sources. Better for "what happened" UX.
  2. Per-article summary: summarise each article individually. Better for "what does this outlet say about the event" UX.

This is a separate LLM call (Gemini Flash, GPT-4o-mini, or similar) at ~$0.0001 per summary. Cluster-level summarisation is cheaper and the UX is usually better.

What about Hacker News specifically#

Hacker News has a real API (https://hacker-news.firebaseio.com) that returns structured story data without scraping. For HN specifically, use the API. For comments and discussion, ditto.

The general principle: when a source publishes a real API, use it. Scraping is for when the API doesn't exist, doesn't expose what you need, or is restricted to verified partners.
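
For Hacker News, that looks roughly like the sketch below. The /v0/topstories.json and /v0/item/<id>.json endpoints are part of the public HN API; the mapping into the article schema, the fixed category, and the extra url field are assumptions made for this pipeline:

```python
import json
from datetime import datetime, timezone
from urllib.parse import urlparse
from urllib.request import urlopen

HN_API = "https://hacker-news.firebaseio.com/v0"


def fetch_json(url, timeout=10):
    """GET a URL and decode the JSON body."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def hn_item_to_article(item):
    """Map an HN story item onto the aggregator's article schema."""
    # Text posts (Ask HN and similar) have no external URL; link to the thread instead.
    url = item.get("url") or f"https://news.ycombinator.com/item?id={item['id']}"
    published = datetime.fromtimestamp(item["time"], tz=timezone.utc)
    return {
        "title": item.get("title", ""),
        "author": item.get("by", ""),
        "publishedDate": published.date().isoformat(),
        "source": urlparse(url).netloc,
        "category": "technology",
        "url": url,  # not in the schema above; kept for the already-seen filter
    }


def top_stories(limit=30):
    """Fetch the current front-page stories as schema-shaped dicts."""
    ids = fetch_json(f"{HN_API}/topstories.json")[:limit]
    items = (fetch_json(f"{HN_API}/item/{story_id}.json") for story_id in ids)
    return [hn_item_to_article(it) for it in items if it and it.get("type") == "story"]
```

This costs one request per story and no extraction credits, since the API already returns structured fields.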

What about paywalled sources#

Think NYT, WSJ, FT, and Bloomberg. Their RSS feeds are public and include lead paragraphs and metadata, but the article bodies are paywalled. Your options:

  1. Use only what's in the RSS feed: title, lead paragraph, link. Most news aggregators do this and it works fine for a "see what's happening" UX. The link drives the click to the source, which is what publishers want.
  2. Subscribe and use authenticated access: requires per-publisher business deals; not practical for an indie aggregator.
  3. Try archive sources: Wayback Machine sometimes has cached copies. Reader-view URLs sometimes bypass soft paywalls. This is a gray area that depends on the publisher's stance; we'd push you toward option 1.

Some scraping APIs include an archive fallback (typically the Wayback Machine; Google's public cache was retired in 2024) when the live page is hard-blocked. For paywalled content this is hit-or-miss and not the recommended path for a commercial product.

Polite crawling#

News sites tend to be sympathetic to aggregators (they want the traffic) but only if you're polite:

  • Respect robots.txt. Most explicitly allow indexing; some restrict specific paths.
  • Per-host concurrency under 5; jitter between requests.
  • Identify your bot in User-Agent (e.g., MyAggregator/1.0 (+https://yoursite.com/bot)). Hiding makes you look adversarial.
  • Don't republish full article bodies verbatim. A lead paragraph plus a link is far easier to defend as fair use than a copied article.

Runo handles per-host pacing automatically. You set the identifying User-Agent in the request options.
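
For the requests you make yourself (RSS and sitemap fetches), the first two rules can be enforced with the standard library. A minimal sketch; the helper names and the pause durations are assumptions, and the User-Agent string is the example from the list above:

```python
import random
import time
from urllib.robotparser import RobotFileParser

USER_AGENT = "MyAggregator/1.0 (+https://yoursite.com/bot)"


def load_robots(robots_txt: str) -> RobotFileParser:
    """Parse the text of a robots.txt file you have already fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed(parser: RobotFileParser, url: str, user_agent: str = USER_AGENT) -> bool:
    """True if robots.txt permits this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


def polite_pause(base_seconds: float = 1.0, jitter_seconds: float = 0.5) -> None:
    """Sleep between requests to one host, with random jitter on top of the base delay."""
    time.sleep(base_seconds + random.uniform(0, jitter_seconds))
```

Cache the parsed robots.txt per host and refresh it periodically, rather than re-fetching it before every request.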

What you build vs what Runo handles#

Runo handles:

  • Bypass (Cloudflare on news sites is increasingly common)
  • Schema-driven extraction
  • Per-host rate adaptation
  • Crawl budget management with refund-on-cancel

You build:

  • Source configuration
  • Discovery layer (RSS / sitemap / crawl router)
  • Already-seen filter (database lookup)
  • Cross-source dedup (embeddings + clustering)
  • Cluster representation in the UI
  • Summarisation pass if you want one
  • Search index (Postgres full-text, Meilisearch, Typesense; pick one)

The build is a few hundred lines of Python plus a database. Single weekend if you've shipped data pipelines before; a week if it's your first.

TL;DR#

  • Discovery is mostly RSS feeds (cheapest, easiest), sitemaps as fallback, and crawl-from-homepage for sources without either.
  • Use a static key with the article schema bound; cuts payload and may hit provider-side prompt cache for cheaper input tokens.
  • /crawl reserves the budget upfront and refunds unused; safe to over-provision max_pages.
  • Cross-source dedup with sentence-transformers + DBSCAN on title+summary embeddings, 24h rolling window. Each cluster is one news event.
  • Total cost for 50 sources, hourly discovery: ~$80/month all-in on the Runo Pro tier.
  • For Hacker News: use the official API. Scrape only when no API exists.
  • For paywalled sources: stick to RSS-published lead paragraphs + link to source. Don't try to bypass paywalls; the legal and ethical exposure isn't worth it.
  • Be polite: respect robots.txt, identify your bot in User-Agent, don't republish full bodies.