A news aggregator is the canonical scraping pipeline. Pull articles from many sources, normalize, dedup, surface. The pattern is well understood; what's changed in 2026 is that LLM-powered extraction makes the pipeline simple enough to ship as a side project in a weekend, where it used to be a quarter of work for one engineer.
This post is that pipeline. The end product is a daily-ingest news aggregator covering ~50 sources for roughly $80/month all-in, of which about $1.50/day is extraction cost. The example uses Runo's /crawl endpoint; the architecture works with any extraction API that supports crawling.
What we're building
The aggregator pulls articles from a configured list of news sources (TechCrunch, The Verge, Hacker News, Ars Technica, etc.), normalizes them into a common schema, dedups across sources (the same story often appears on multiple outlets), and exposes them via a search/feed UI.
Output schema per article:
[
{ "field": "title", "type": "string", "example": "Apple announces new chip" },
{ "field": "subtitle", "type": "string", "example": "M5 promises 30% gain" },
{ "field": "author", "type": "string", "example": "Sarah Chen" },
{ "field": "publishedDate", "type": "date", "example": "2026-06-05" },
{ "field": "source", "type": "string", "example": "techcrunch.com" },
{ "field": "category", "type": "string", "example": "technology" },
{ "field": "summary", "type": "string", "example": "Apple unveiled..." },
{ "field": "tags", "type": "array<string>", "example": ["apple", "chips"] },
{ "field": "imageUrl", "type": "string", "example": "https://..." }
]
summary here is "the lead paragraph or article description," not an LLM-generated summary. Generating summaries from full body text is a separate step.
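The normalization step can be sketched as a small dataclass plus a coercion function. This is a minimal sketch, not Runo's response format: the `Article` class, the `normalize` helper, and the assumption that the extractor returns a flat dict keyed by the field names above are all illustrative.

```python
from dataclasses import dataclass, field
from datetime import date
from urllib.parse import urlparse


@dataclass
class Article:
    # Field names mirror the output schema above.
    title: str
    source: str
    publishedDate: date
    subtitle: str = ""
    author: str = ""
    category: str = ""
    summary: str = ""
    tags: list = field(default_factory=list)
    imageUrl: str = ""


def normalize(raw, url):
    """Coerce one raw extraction result (assumed to be a flat dict) into an Article."""
    title = (raw.get("title") or "").strip()
    if not title:
        raise ValueError(f"no title extracted for {url}")
    # Fall back to the URL's host when the extractor returned no source.
    host = urlparse(url).netloc
    source = raw.get("source") or (host[4:] if host.startswith("www.") else host)
    # Lowercase and de-duplicate tags so "Apple" and "apple" collapse.
    tags = sorted({t.strip().lower() for t in raw.get("tags") or [] if t.strip()})
    return Article(
        title=title,
        source=source,
        # Keep only the YYYY-MM-DD part so full timestamps also parse.
        publishedDate=date.fromisoformat(str(raw["publishedDate"])[:10]),
        subtitle=(raw.get("subtitle") or "").strip(),
        author=(raw.get("author") or "").strip(),
        category=(raw.get("category") or "").strip(),
        summary=(raw.get("summary") or "").strip(),
        tags=tags,
        imageUrl=raw.get("imageUrl") or "",
    )
```

Rejecting records without a title or a parseable date at this point keeps malformed extractions out of the dedup and search layers.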
The discovery loop: where to find article URLs
Three patterns, ordered by quality:
1. RSS / Atom feeds (the easiest case)
Most reputable news sites publish RSS feeds with the latest 20-50 articles. Direct, structured, no scraping required for discovery. Each item has a <link> to the full article.
import feedparser
feed = feedparser.parse("https://techcrunch.com/feed/")
urls = [entry.link for entry in feed.entries]
For ~80% of news sources, this is the discovery layer. Cheap, fast, polite.
2. Sitemap crawl
If a source doesn't have RSS but has a sitemap.xml, that works too. Sitemaps usually segment by year/month so you can pull recent ones cheaply.
import httpx, re
sitemap = httpx.get("https://example-news.com/sitemap-2026-06.xml").text
urls = re.findall(r"<loc>([^<]+)</loc>", sitemap)
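When a sitemap is not segmented by month, filtering on `<lastmod>` is one way to keep only recent URLs. The sketch below uses the standard-library XML parser instead of a regex; the function name `recent_sitemap_urls` and the sample document are illustrative, and it assumes the sitemap follows the sitemaps.org 0.9 schema.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example-news.com/2026/06/a</loc><lastmod>2026-06-04T08:00:00+00:00</lastmod></url>
  <url><loc>https://example-news.com/2026/05/b</loc><lastmod>2026-05-01</lastmod></url>
  <url><loc>https://example-news.com/c</loc></url>
</urlset>"""


def recent_sitemap_urls(xml_text, since):
    """Return <loc> values whose <lastmod> is on or after `since`."""
    root = ET.fromstring(xml_text)
    urls = []
    for node in root.findall("sm:url", SITEMAP_NS):
        loc = node.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        lastmod = node.findtext("sm:lastmod", default="", namespaces=SITEMAP_NS).strip()
        if not loc:
            continue
        # Entries without <lastmod> are kept, since their age is unknown.
        if lastmod and date.fromisoformat(lastmod[:10]) < since:
            continue
        urls.append(loc)
    return urls


print(recent_sitemap_urls(SAMPLE, since=date(2026, 6, 1)))
```

Keeping undated entries is a deliberate choice here: the already-seen filter later in the pipeline drops anything that was ingested before.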
3. Crawl from the homepage
For sources without RSS or sitemap, the fallback is crawling the homepage with a follow pattern matching article URLs:
import httpx

# API_KEY and ARTICLE_SCHEMA (the schema list shown above) are defined elsewhere.
resp = httpx.post(
    "https://api.scrapewithruno.com/v1/crawl",
    json={
        "seed_url": "https://example-news.com",
        "schema": ARTICLE_SCHEMA,
        "crawl": {
            "follow_pattern": "https://example-news.com/202[0-9]/*",
            "max_pages": 100,
            "max_depth": 2,
        },
    },
    headers={"X-API-Key": API_KEY},
    timeout=600,  # httpx's default 5 s timeout is too short for a multi-page crawl
).json()
The follow_pattern constrains the crawl to article-shaped URLs (most news sites use /year/month/slug URLs). max_depth: 2 lets the crawl follow homepage → category page → article. max_pages: 100 caps the total fetch.
/crawl reserves max_pages from your monthly quota upfront and refunds unused. If a homepage only links to 30 articles, you pay for 30, not 100.
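Before spending quota, it can help to check a follow pattern against a few known URLs locally. The sketch below assumes the pattern behaves like a shell-style glob, which is how the example above reads; the post does not specify Runo's actual matching rules, so treat `matches_follow_pattern` as an approximation only.

```python
from fnmatch import fnmatchcase


def matches_follow_pattern(url, pattern):
    """Approximate a glob-style follow_pattern (assumed semantics, case-sensitive)."""
    return fnmatchcase(url, pattern)


PATTERN = "https://example-news.com/202[0-9]/*"

candidates = [
    "https://example-news.com/2026/06/apple-announces-new-chip",
    "https://example-news.com/2019/01/old-story",
    "https://example-news.com/about",
]

for url in candidates:
    print(url, matches_follow_pattern(url, PATTERN))
```

In `fnmatch`, `*` also matches `/`, so this pattern accepts anything under a `/202x/` prefix, including nested month and slug segments.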
Configuring sources
A YAML file makes source config readable:
sources:
  - name: TechCrunch
    domain: techcrunch.com
    feed: https://techcrunch.com/feed/
    category: technology
  - name: The Verge
    domain: theverge.com
    feed: https://www.theverge.com/rss/index.xml
    category: technology
  - name: Hacker News (front page)
    domain: news.ycombinator.com
    discover_method: crawl
    seed: https://news.ycombinator.com
    follow_pattern: "https://news.ycombinator.com/item?id=*"
    max_pages: 30
    category: technology
  - name: Reuters Business
    domain: reuters.com
    feed: https://www.reuters.com/business/feed/
    category: business
50 sources fit in ~200 lines of YAML. Adding a new source is a 5-line change.
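Once the config is loaded (for example with `yaml.safe_load`, which yields a list of dicts under `sources`), a small router can decide which discovery method each source gets. This is a sketch under assumptions: `discovery_plan` is a hypothetical helper, and the `sitemap` key is not part of the config above; it is included only to show where the second pattern would plug in.

```python
def discovery_plan(source):
    """Return (method, target_url) for one source entry from the config."""
    # An explicit discover_method wins over anything inferred from keys.
    if source.get("discover_method") == "crawl":
        return ("crawl", source["seed"])
    if "feed" in source:
        return ("rss", source["feed"])
    if "sitemap" in source:  # hypothetical key, not in the YAML above
        return ("sitemap", source["sitemap"])
    raise ValueError(f"no discovery method for source {source.get('name')!r}")


sources = [
    {"name": "TechCrunch", "domain": "techcrunch.com",
     "feed": "https://techcrunch.com/feed/", "category": "technology"},
    {"name": "Hacker News (front page)", "domain": "news.ycombinator.com",
     "discover_method": "crawl", "seed": "https://news.ycombinator.com",
     "category": "technology"},
]

for s in sources:
    print(s["name"], discovery_plan(s))
```

Failing loudly on a source with no usable method catches config typos at startup instead of silently skipping an outlet.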
The extraction loop
For each discovered URL, extract the structured article data. With Runo, the call is:
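async def extract_article(client, url):
    # client is an async HTTP client (e.g. httpx.AsyncClient); STATIC_KEY is defined elsewhere
    resp = await client.post(
        "https://api.scrapewithruno.com/v1/extract",
        json={"url": url},
        headers={"X-API-Key": STATIC_KEY},  # schema bound to key
        timeout=60,
    )
    resp.raise_for_status()  # surface HTTP errors instead of returning an error body
    return resp.json()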
A static key with the article schema pre-bound is the right choice here. You're calling this thousands of times with the same schema; the smaller payload and prompt-cache hit rate matter.
For batches, use /batch:
# inside an async function, using the same client as above
resp = await client.post(
    "https://api.scrapewithruno.com/v1/batch",
    json={"urls": urls, "options": {"concurrency": 20}},
    headers={"X-API-Key": STATIC_KEY},
    timeout=600,
)
results = resp.json()
Concurrency of 20 is a reasonable default. Higher works but starts hitting per-host rate limits when many URLs share a domain.
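One way to soften the shared-domain problem is to order each batch so that consecutive URLs come from different hosts. The sketch below is a client-side heuristic only; `interleave_by_host` is a hypothetical helper, and it assumes the batch endpoint works through the list roughly in order, which the post does not confirm.

```python
from collections import defaultdict
from itertools import zip_longest
from urllib.parse import urlparse


def interleave_by_host(urls):
    """Reorder URLs round-robin across hosts: a1, b1, c1, a2, b2, ..."""
    by_host = defaultdict(list)
    for url in urls:
        by_host[urlparse(url).netloc].append(url)
    # Each round takes at most one URL per host; shorter lists pad with None.
    rounds = zip_longest(*by_host.values())
    return [url for rnd in rounds for url in rnd if url is not None]


batch = [
    "https://a.example/1",
    "https://a.example/2",
    "https://a.example/3",
    "https://b.example/1",
    "https://c.example/1",
]
print(interleave_by_host(batch))
```

With 20 concurrent slots and URLs spread this way, a single outlet rarely sees more than a handful of simultaneous requests.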
Dedup across sources
The same news event often hits 5-10 outlets within a few hours. Surfacing all 10 to your users is bad UX. Two layers of dedup:
Layer 1: URL canonicalization
Same article, different URL because of UTM params or trailing slashes:
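from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

def canonicalize(url):
    p = urlparse(url)
    # Drop tracking params (utm_*) but keep meaningful ones such as ?id=123,
    # otherwise every item?id=... URL would collapse into one
    kept = [(k, v) for k, v in parse_qsl(p.query) if not k.lower().startswith("utm_")]
    # Also strip the trailing slash and the fragment
    return urlunparse((p.scheme, p.netloc, p.path.rstrip("/"), "", urlencode(kept), ""))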
Catches the obvious dupes within a single source.
Layer 2: Cross-source semantic dedup
Two outlets cover the same Apple announcement; titles differ slightly, bodies differ entirely. Need semantic similarity.
Approach: compute an embedding for each article (title + summary) and cluster within a 24-hour rolling window using a density-based clusterer (DBSCAN or HDBSCAN) or a simple cosine-similarity threshold.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import DBSCAN

model = SentenceTransformer("all-MiniLM-L6-v2")
# articles_24h: the articles published in the last 24 hours
texts = [f"{a['title']} {a['summary']}" for a in articles_24h]
embeddings = model.encode(texts)

# DBSCAN with cosine distance threshold ~0.3 for "same story";
# min_samples=1 means an article with no match forms its own cluster
clusters = DBSCAN(eps=0.3, min_samples=1, metric="cosine").fit_predict(embeddings)
Each cluster is one news event. Pick a representative article per cluster (for example, the highest-traffic source, with earliest publication time as the tie-breaker). Surface the cluster to the user as one item with "also covered by [N] other outlets."
This is also the data structure that powers "follow this story" features. Each cluster gets a stable cluster_id; new articles that match a cluster get appended.
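Batch DBSCAN relabels clusters on every run, so keeping a stable cluster_id needs an incremental step. The sketch below swaps in a simpler technique, greedy nearest-centroid assignment, rather than re-running DBSCAN; `assign_cluster`, the in-memory `clusters` dict, and the 0.7 similarity threshold (the counterpart of a 0.3 cosine distance) are all assumptions for illustration.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def assign_cluster(embedding, clusters, threshold=0.7):
    """Append to the most similar existing cluster, or open a new one.

    clusters maps cluster_id -> {"centroid": [...], "size": n} and is updated in place.
    """
    best_id, best_sim = None, threshold
    for cid, cluster in clusters.items():
        sim = cosine_similarity(embedding, cluster["centroid"])
        if sim >= best_sim:
            best_id, best_sim = cid, sim
    if best_id is None:
        # No cluster is close enough: allocate the next id.
        best_id = max(clusters, default=0) + 1
        clusters[best_id] = {"centroid": list(embedding), "size": 1}
        return best_id
    cluster = clusters[best_id]
    n = cluster["size"]
    # Running mean keeps the centroid representative as articles arrive.
    cluster["centroid"] = [(m * n + x) / (n + 1) for m, x in zip(cluster["centroid"], embedding)]
    cluster["size"] = n + 1
    return best_id


clusters = {}
first = assign_cluster([1.0, 0.0], clusters)
second = assign_cluster([0.9, 0.1], clusters)   # close to the first article
third = assign_cluster([0.0, 1.0], clusters)    # unrelated story
print(first, second, third)
```

In production the `clusters` dict would live in the database and expire entries older than the 24-hour window; greedy assignment can differ from what a full DBSCAN pass would produce, so an occasional batch re-cluster is a reasonable safety net.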
Refresh cadence
For a news aggregator, hourly or 30-minute discovery is the right cadence. Articles are highest-value within the first few hours of publication.
import schedule

# scheduled job
async def discover_and_extract(source_config):
    urls = discover(source_config)          # RSS / sitemap / crawl
    new_urls = filter_already_seen(urls)    # check against database
    if not new_urls:
        return
    articles = await extract_batch(new_urls)
    save_to_db(articles)
    update_clusters(articles)               # cross-source dedup

# every 30 min for hot sources, hourly for slow ones
# run_for(sources) is a sync wrapper that runs discover_and_extract per source,
# e.g. via asyncio.run; a loop calling schedule.run_pending() drives the jobs
schedule.every(30).minutes.do(run_for, hot_sources)
schedule.every().hour.do(run_for, slow_sources)
The "already seen" filter is a database lookup on canonical URL. Skip extraction for URLs you've seen, even if the RSS feed listed them.
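A minimal sketch of that lookup, using SQLite with the canonical URL as primary key. The table name, the `open_seen_db` helper, and the extra `conn` parameter are assumptions; any database with a unique index on the URL column works the same way.

```python
import sqlite3


def open_seen_db(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS seen_urls ("
        "url TEXT PRIMARY KEY, first_seen TEXT DEFAULT CURRENT_TIMESTAMP)"
    )
    return conn


def filter_already_seen(conn, canonical_urls):
    """Return only URLs not stored yet, recording them as seen in the same pass."""
    new_urls = []
    for url in dict.fromkeys(canonical_urls):  # drop in-batch duplicates, keep order
        # INSERT OR IGNORE reports rowcount 1 for a new row, 0 for an existing one.
        cur = conn.execute("INSERT OR IGNORE INTO seen_urls (url) VALUES (?)", (url,))
        if cur.rowcount == 1:
            new_urls.append(url)
    conn.commit()
    return new_urls


conn = open_seen_db()
first_run = filter_already_seen(conn, ["https://a.example/1", "https://b.example/1", "https://a.example/1"])
second_run = filter_already_seen(conn, ["https://b.example/1", "https://c.example/1"])
print(first_run, second_run)
```

Note the trade-off: this marks a URL as seen before extraction succeeds. If failed extractions should be retried, record the URL only after `save_to_db` instead.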
Cost math
For a 50-source aggregator with hourly discovery:
| Component | Per day | Per month |
|---|---|---|
| Discovery (RSS feeds + occasional crawl) | ~50 RSS + 10 crawl seeds | ~$2 |
| Extraction (~30 new articles/source/day × 50 sources × $0.001) | ~1,500 articles × $0.001 = $1.50 | ~$45 |
| Embedding (1,500 articles × ~$0.00001) | ~$0.015 | ~$0.50 |
| Database / hosting | ~$1 | ~$30 |
| Total | ~$2.60/day | ~$80/month |
Fits comfortably in the Runo Pro tier ($59/month for 50K extraction requests; you'd use ~45K). For lower scale (10 sources), Starter ($17/month, 15K requests) is enough.
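The arithmetic behind the table can be reproduced in a few lines, which makes it easy to re-run for a different source count. The unit prices are the ones quoted in the table; the 30-day month and the flat $2 discovery figure are simplifying assumptions.

```python
SOURCES = 50
ARTICLES_PER_SOURCE_PER_DAY = 30
EXTRACTION_COST = 0.001        # $ per extracted article (from the table)
EMBEDDING_COST = 0.00001       # $ per embedded article (from the table)
DISCOVERY_PER_MONTH = 2.0      # $ flat estimate (from the table)
HOSTING_PER_MONTH = 30.0       # $ database / hosting (from the table)
DAYS = 30                      # assumed month length


def monthly_costs(sources=SOURCES):
    articles_per_day = sources * ARTICLES_PER_SOURCE_PER_DAY
    costs = {
        "articles_per_day": articles_per_day,
        "requests_per_month": articles_per_day * DAYS,
        "extraction": articles_per_day * EXTRACTION_COST * DAYS,
        "embedding": articles_per_day * EMBEDDING_COST * DAYS,
        "discovery": DISCOVERY_PER_MONTH,
        "hosting": HOSTING_PER_MONTH,
    }
    costs["total"] = sum(costs[k] for k in ("extraction", "embedding", "discovery", "hosting"))
    return costs


costs = monthly_costs()
print(f"{costs['requests_per_month']} requests, ${costs['total']:.2f}/month, "
      f"${costs['total'] / DAYS:.2f}/day")
```

For 50 sources this gives 45,000 requests and about $77 per month, which is where the rounded ~$80 figure comes from.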
The summarisation step (optional)
For a UX where each article gets a one-paragraph summary, run a separate LLM pass after extraction. Two patterns:
- Per-cluster summary: summarise the cluster as a whole, citing all sources. Better for "what happened" UX.
- Per-article summary: summarise each article individually. Better for "what does this outlet say about the event" UX.
This is a separate LLM call (Gemini Flash, GPT-4o-mini, or similar) at ~$0.0001 per summary. Cluster-level summarisation is cheaper and the UX is usually better.
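For the per-cluster pattern, most of the work is assembling one prompt from every article in the cluster. The sketch below only builds the prompt text; the wording, the `build_cluster_prompt` name, and the 500-character cap per lead are assumptions, and the actual model call is left to whichever LLM client you use.

```python
def build_cluster_prompt(articles, max_chars=500):
    """Build one summarisation prompt covering every article in a cluster."""
    lines = [
        "Summarise the news event below in one paragraph.",
        "Refer to outlets by their bracketed number when they disagree.",
        "",
    ]
    for i, article in enumerate(articles, start=1):
        lead = article["summary"][:max_chars]  # cap input size per outlet
        lines.append(f"[{i}] {article['source']}: {article['title']}")
        lines.append(lead)
        lines.append("")
    return "\n".join(lines)


cluster = [
    {"source": "techcrunch.com", "title": "Apple announces new chip", "summary": "Apple unveiled..."},
    {"source": "theverge.com", "title": "Apple's M5 is here", "summary": "The new chip..."},
]
prompt = build_cluster_prompt(cluster)
print(prompt)
```

One call per cluster instead of one per article is also where the cost advantage comes from.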
What about Hacker News specifically
Hacker News has a real API (https://hacker-news.firebaseio.com) that returns structured story data without scraping. For HN specifically, use the API. For comments and discussion, ditto.
The general principle: when a source publishes a real API, use it. Scraping is for when the API doesn't exist, doesn't expose what you need, or is restricted to verified partners.
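A minimal sketch of the API route for Hacker News, using the public `/v0/topstories.json` and `/v0/item/<id>.json` endpoints. The mapping into the article schema (`hn_item_to_article`) and the injectable `fetch` parameter are illustrative choices, not part of the HN API.

```python
import json
from datetime import datetime, timezone
from urllib.parse import urlparse
from urllib.request import urlopen

HN_API = "https://hacker-news.firebaseio.com/v0"


def fetch_json(url):
    with urlopen(url, timeout=10) as resp:
        return json.load(resp)


def hn_item_to_article(item):
    """Map one HN story item onto (part of) the article schema."""
    # Text posts such as Ask HN have no external url; link to the HN thread instead.
    url = item.get("url") or f"https://news.ycombinator.com/item?id={item['id']}"
    published = datetime.fromtimestamp(item["time"], tz=timezone.utc)
    return {
        "title": item.get("title", ""),
        "author": item.get("by", ""),
        "publishedDate": published.date().isoformat(),
        "source": urlparse(url).netloc,
        "url": url,
        "category": "technology",
    }


def top_stories(limit=30, fetch=fetch_json):
    """Return the current front-page stories as article dicts."""
    ids = fetch(f"{HN_API}/topstories.json")[:limit]
    items = (fetch(f"{HN_API}/item/{story_id}.json") for story_id in ids)
    # Deleted items come back as null; jobs and polls are skipped.
    return [hn_item_to_article(it) for it in items if it and it.get("type") == "story"]
```

Calling `top_stories(limit=30)` makes 31 small JSON requests and no page fetches. Passing a different `fetch` makes the function easy to test offline.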
What about paywalled sources
NYT, WSJ, FT, Bloomberg. Their RSS feeds publicly expose titles, lead paragraphs, and metadata, but the article bodies are paywalled. Your options:
- Use only what's in the RSS feed: title, lead paragraph, link. Most news aggregators do this and it works fine for a "see what's happening" UX. The link drives the click to the source, which is what publishers want.
- Subscribe and use authenticated access: requires per-publisher business deals; not practical for an indie aggregator.
- Try archive sources: Wayback Machine sometimes has cached copies. Reader-view URLs sometimes bypass soft paywalls. This is a gray area that depends on the publisher's stance; we'd push you toward option 1.
Some scraping APIs include an archive fallback (Wayback, Google Cache) when the live page is hard-blocked. For paywalled content this is hit-or-miss and not the recommended path for a commercial product.
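Option 1 needs no extraction call at all: the feed entry already carries what the aggregator displays. The sketch below builds an article record from a single feedparser entry; `article_from_feed_entry` is a hypothetical helper, and it relies only on the standard entry keys `title`, `summary`, `link`, and `published_parsed`, any of which a given feed may omit.

```python
import time


def article_from_feed_entry(entry, source, category=""):
    """Build an article record from one feed entry, without fetching the page."""
    # feedparser sets published_parsed to a time.struct_time when the date parses.
    published = entry.get("published_parsed")
    return {
        "title": entry.get("title", "").strip(),
        "summary": entry.get("summary", "").strip(),  # may contain HTML markup
        "url": entry.get("link", ""),
        "publishedDate": time.strftime("%Y-%m-%d", published) if published else None,
        "source": source,
        "category": category,
    }


entry = {
    "title": " Markets rally ",
    "summary": "Stocks rose on Friday...",
    "link": "https://example-paper.com/markets-rally",
    "published_parsed": time.gmtime(0),
}
print(article_from_feed_entry(entry, "example-paper.com", "business"))
```

feedparser entries are dict-like, so the same function accepts `feed.entries[i]` directly.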
Polite crawling
News sites tend to be sympathetic to aggregators (they want the traffic) but only if you're polite:
- Respect robots.txt. Most explicitly allow indexing; some restrict specific paths.
- Per-host concurrency under 5; jitter between requests.
- Identify your bot in User-Agent (e.g., MyAggregator/1.0 (+https://yoursite.com/bot)). Hiding makes you look adversarial.
- Don't republish full article body verbatim. Lead paragraph + link respects fair use; copying the article doesn't.
Runo handles per-host pacing automatically; you set the bot identification in the request options.
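For requests you make yourself (RSS and sitemap fetches, for instance), the robots.txt check is available in the standard library. This is a minimal sketch: the rules text is a made-up example, and in practice you would fetch each host's robots.txt once and cache the parsed result.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "MyAggregator/1.0 (+https://yoursite.com/bot)"

EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_robots(robots_txt):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed(parser, url):
    # can_fetch matches on the product token before the "/" in the User-Agent.
    return parser.can_fetch(USER_AGENT, url)


robots = build_robots(EXAMPLE_ROBOTS_TXT)
print(allowed(robots, "https://example-news.com/2026/06/story"))
print(allowed(robots, "https://example-news.com/private/draft"))
```

Send the same `USER_AGENT` string in the request headers so the identity you check against robots.txt matches the one the site sees.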
What you build vs what Runo handles
Runo handles:
- Bypass (Cloudflare on news sites is increasingly common)
- Schema-driven extraction
- Per-host rate adaptation
- Crawl budget management with refund-on-cancel
You build:
- Source configuration
- Discovery layer (RSS / sitemap / crawl router)
- Already-seen filter (database lookup)
- Cross-source dedup (embeddings + clustering)
- Cluster representation in the UI
- Summarisation pass if you want one
- Search index (Postgres full-text, Meilisearch, Typesense; pick one)
The build is a few hundred lines of Python plus a database. Single weekend if you've shipped data pipelines before; a week if it's your first.
TL;DR
- Discovery is mostly RSS feeds (cheapest, easiest), sitemaps as fallback, and crawl-from-homepage for sources without either.
- Use a static key with the article schema bound; cuts payload and may hit provider-side prompt cache for cheaper input tokens.
- /crawl reserves the budget upfront and refunds unused; safe to over-provision max_pages.
- Cross-source dedup with sentence-transformers + DBSCAN on title+summary embeddings, 24h rolling window. Each cluster is one news event.
- Total cost for 50 sources, hourly discovery: ~$80/month all-in on the Runo Pro tier.
- For Hacker News: use the official API. Scrape only when no API exists.
- For paywalled sources: stick to RSS-published lead paragraphs + link to source. Don't try to bypass paywalls; the legal and ethical exposure isn't worth it.
- Be polite: respect robots.txt, identify your bot in User-Agent, don't republish full bodies.