LLM extraction vs CSS selectors: why selector-based scraping is dead at scale


For fifteen years, web scraping meant CSS selectors. You inspected the page, found .product-price, wrote the selector, and shipped. The selector approach is intuitive, fast, and free. At any scale beyond a single site, it's now the wrong choice.

This post explains why, with the cost numbers that make the tradeoff concrete. The TL;DR is at the bottom; the math is in the middle.

The selector approach, briefly

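from bs4 import BeautifulSoup

# `html` is the fetched page source as a string.
soup = BeautifulSoup(html, "lxml")
# select_one returns None when nothing matches, so a renamed class raises AttributeError here.
price = soup.select_one(".product-price").get_text(strip=True)
title = soup.select_one("h1.product-name").get_text(strip=True)
# BeautifulSoup exposes the class attribute as a list of class names.
in_stock = "in-stock" in soup.select_one(".availability").get("class", [])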

It works. It's fast. It costs nothing per request. For one site you control, it's still the right answer.

The problem is everything outside that.

Why selectors break

Selectors are coupled to DOM structure. The DOM changes for reasons that have nothing to do with you:

  • A/B tests rotate class names (.product-price becomes .product-price-v2)
  • Design system migrations rewrite the markup wholesale
  • Marketing teams ship new templates without telling engineering
  • Server-side rendering switches to client-side rendering and the element your selector targets now sits in a <template> that never paints
  • Dark-mode theming wraps everything in a new ancestor element
  • Internationalised pages change the structure based on locale

Each of these is a routine deploy on the target site. Each silently breaks your selector. Your monitoring catches it tomorrow when the dashboard shows zero scrapes succeeding for that source. Then someone debugs, rewrites, ships, and waits for the next break.

For ten sites, this is annoying. For a hundred sites, it's a full-time job. For arbitrary user-supplied URLs (which is what scraping APIs accept), it's impossible. You can't write selectors for sites you've never seen.

The LLM approach

Send the cleaned HTML to an LLM with a schema. Get typed JSON back.

{
  "url": "https://example.com/product/widget",
  "schema": [
    { "field": "title",    "type": "string", "example": "Acme Widget Pro" },
    { "field": "price",    "type": "float",  "example": 29.99 },
    { "field": "in_stock", "type": "boolean","example": true }
  ]
}

The LLM doesn't care about class names. It reads the page semantically (this element looks like a product title, this number with a $ looks like a price, this "In Stock" badge looks like availability) and returns the values. Site redesigns don't break it. Sites you've never seen work the first time.
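One way to picture what happens to the field list above: it becomes both a response schema and a short instruction block. The sketch below is an illustration under assumptions, not Runo's implementation. The function names `to_json_schema` and `build_prompt` are hypothetical, the type-name mapping is a guess at how "float" and friends would translate, and providers differ in which schema dialect they accept (some express nullability differently).

```python
import json

# Assumed mapping from the request's type names to JSON Schema type names.
JSON_SCHEMA_TYPES = {
    "string": "string",
    "float": "number",
    "integer": "integer",
    "boolean": "boolean",
}


def to_json_schema(fields):
    """Turn [{"field", "type", "example"}, ...] into a JSON Schema object."""
    properties = {
        # Allowing null lets the model say "not on the page" instead of guessing.
        f["field"]: {"type": [JSON_SCHEMA_TYPES[f["type"]], "null"]}
        for f in fields
    }
    return {"type": "object", "properties": properties, "required": list(properties)}


def build_prompt(fields, cleaned_html):
    """Build the instruction text; each example value is included as an anchor."""
    lines = [
        f'- {f["field"]} ({f["type"]}), e.g. {json.dumps(f["example"])}'
        for f in fields
    ]
    return (
        "Extract these fields from the page. "
        "Use null for anything the page does not state.\n"
        + "\n".join(lines)
        + "\n\nPAGE:\n"
        + cleaned_html
    )


fields = [
    {"field": "title", "type": "string", "example": "Acme Widget Pro"},
    {"field": "price", "type": "float", "example": 29.99},
    {"field": "in_stock", "type": "boolean", "example": True},
]
schema = to_json_schema(fields)
prompt = build_prompt(fields, "<h1>Acme Widget Pro</h1><span>$29.99</span>")
```

Nothing in the prompt or the schema mentions a class name or a DOM path, which is why a redesign of the target page leaves both unchanged.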

This is what Runo does as its core mode. We covered the integration shape in extracting structured JSON from any HTML.

The objection: "LLMs are expensive and slow"

Five years ago, true. Today, not really. Let's do the math.

A typical product page is ~30KB of cleaned HTML, around 8K input tokens after cleaning. A schema-shaped extraction returns ~200 output tokens. On a current cheap-tier model (Gemini Flash-Lite, GPT-4o-mini, Claude Haiku and similar all sit in the same ballpark in 2026):

  • Input: 8,000 tokens × ~$0.10/M = ~$0.0008
  • Output: 200 tokens × ~$0.40/M = ~$0.00008
  • Total: ~$0.00088 per page

That's under one-tenth of a cent. For comparison, a residential proxy request alone is around $0.0002–0.001. The LLM cost is in the same order of magnitude as the proxy cost, which means it's not the dominant cost line in modern scraping.
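The arithmetic above is easy to re-run with your own token counts and prices. A minimal sketch, using the article's figures as inputs; the function name `llm_cost_per_page` is made up for illustration, and the prices are the approximate ones quoted above, not any vendor's current rate card.

```python
def llm_cost_per_page(input_tokens, output_tokens, input_usd_per_m, output_usd_per_m):
    """Per-page LLM cost in USD, given prices per million tokens."""
    return (input_tokens * input_usd_per_m + output_tokens * output_usd_per_m) / 1_000_000


# 8K input tokens at ~$0.10/M plus 200 output tokens at ~$0.40/M.
cost = llm_cost_per_page(8_000, 200, 0.10, 0.40)
print(f"${cost:.5f} per page")  # $0.00088 per page
```

Input tokens account for roughly ten times the output cost here, so trimming the cleaned HTML is the lever that moves the bill.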

Latency: cheap-tier models return in ~1–3 seconds for typical extraction prompts. Combined with fetch and clean, total per-call latency lands well under what most "real" scraping use cases tolerate, and most of it is fetch, not LLM. Selector-based extraction would save maybe 1–2 seconds at the cost of breaking weekly. Latency is no longer the bottleneck.

The other objection: "LLMs hallucinate"

Also true and also less of a problem than it used to be, if you do three things:

  1. Use structured output mode (for example, response_mime_type: "application/json" in the Gemini API, paired with a response schema). Modern Gemini and GPT models with structured output return strictly conforming JSON. The model isn't "writing JSON" any more; it's filling a typed slot.
  2. Provide example values per field. { "field": "price", "type": "float", "example": 29.99 } grounds the model on what shape the answer takes. The example acts as a one-shot prompt anchor.
  3. Enforce type coercion at the API boundary. If the model returns "twenty" for an integer field, the extractor returns null and a TYPE_COERCION_FAILED warning, instead of a fabricated number.

With those three in place, hallucinations land as nulls, not as bad data, which is the failure mode you actually want. Field-level null rates on diverse-URL evaluations land in the low double-digit percentages, with most pages returning fully clean responses.
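The third step, coercion at the boundary, can be sketched as a function that returns either a typed value or a null plus a warning. This is an illustration under assumptions: `coerce_field` and its helper are hypothetical names, and the number-parsing rules (strip commas, take the first number, honour a trailing k/m/b multiplier) are one possible policy rather than a description of any particular extractor.

```python
import re
from decimal import Decimal

FAILED = "TYPE_COERCION_FAILED"
_MULTIPLIERS = {"k": 1_000, "m": 1_000_000, "b": 1_000_000_000}
_NUMBER = re.compile(r"(-?\d+(?:\.\d+)?)\s*([kmb])?\b", re.IGNORECASE)


def _to_decimal(raw):
    """Parse a number out of a model value, or return None if there is none."""
    if isinstance(raw, (int, float)):
        number = Decimal(str(raw))
        return number if number.is_finite() else None
    match = _NUMBER.search(str(raw).replace(",", ""))
    if match is None:
        return None
    multiplier = _MULTIPLIERS.get((match.group(2) or "").lower(), 1)
    return Decimal(match.group(1)) * multiplier


def coerce_field(raw, field_type):
    """Return (value, warning); warning is None when coercion succeeds."""
    if raw is None:
        return None, None  # an honest null is not a coercion failure
    if field_type == "string":
        return str(raw), None
    if field_type == "boolean":
        return (raw, None) if isinstance(raw, bool) else (None, FAILED)
    if field_type in ("integer", "float") and not isinstance(raw, bool):
        number = _to_decimal(raw)
        if number is not None:
            if field_type == "float":
                return float(number), None
            if number == number.to_integral_value():
                return int(number), None
    # Anything that cannot be read as the declared type becomes null + warning.
    return None, FAILED


print(coerce_field("twenty", "integer"))  # (None, 'TYPE_COERCION_FAILED')
print(coerce_field("$29.99", "float"))    # (29.99, None)
```

The important property is the last line of `coerce_field`: a value that does not fit the declared type is never passed through and never replaced with a guess.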

When selectors still beat LLMs

Three cases:

One target site you fully control. If you're scraping your own internal admin panel, selectors are fine. The DOM doesn't change without your knowledge.

Extreme volume against a static-HTML site. If you're pulling 50M pages per month from one site and the per-request cost matters more than maintenance burden, a hand-tuned selector parser is cheaper. For most teams this never applies. 50M/month is a lot of pages.

A site with rich, unambiguous structured markup. If the page already emits JSON-LD (a <script type="application/ld+json"> block) with the data you want, parsing the structured markup directly is faster and cheaper than an LLM call. Good extraction APIs do this on a fast path: when JSON-LD/OpenGraph/Twitter Cards/oEmbed cover the schema fields, the LLM call is skipped entirely.

The hybrid is the answer. Try the structured fast path first; fall back to LLM extraction when it doesn't cover the schema.
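A rough sketch of that hybrid, using only the standard library and covering JSON-LD alone. The names `JsonLdCollector`, `FIELD_PATHS`, `extract`, and `fake_llm` are hypothetical; the mapping from schema fields to schema.org `Product` properties is an assumption for this example; and `llm_extract` stands in for whatever LLM call you use.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collects parsed <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.documents.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # malformed markup: ignore it and let the LLM path handle the page


# Assumed mapping from schema fields to schema.org Product properties.
FIELD_PATHS = {"title": ("name",), "price": ("offers", "price")}


def dig(node, path):
    """Follow a key path through nested dicts; None if any step is missing."""
    for key in path:
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def extract(html, fields, llm_extract):
    """Try the JSON-LD fast path; fall back to the LLM when it misses a field."""
    collector = JsonLdCollector()
    collector.feed(html)
    collector.close()
    for doc in collector.documents:  # top-level lists and @graph are not handled here
        values = {f: dig(doc, FIELD_PATHS.get(f, (f,))) for f in fields}
        if all(v is not None for v in values.values()):
            return values, "fast_path"
    return llm_extract(html, fields), "llm"


def fake_llm(html, fields):
    """Stand-in for a real LLM call, so the sketch runs offline."""
    return {f: None for f in fields}


page = (
    '<html><head><script type="application/ld+json">'
    '{"@type": "Product", "name": "Acme Widget Pro", "offers": {"price": "29.99"}}'
    "</script></head><body></body></html>"
)
result, source = extract(page, ["title", "price"], fake_llm)
```

The fast path returns values as the page published them (the price here is still the string "29.99"), so the same type coercion step applies to both branches.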

When LLMs decisively beat selectors

  • Many target sites. Anything beyond a single source. Selector maintenance cost grows linearly with the number of sites; LLM extraction stays flat.
  • User-supplied URLs. If your product accepts arbitrary URLs from users, you can't pre-write selectors. LLM extraction works first-try.
  • Sites that change frequently. Modern e-commerce, social, news. Selector half-life is weeks; LLM extraction shrugs.
  • Sites with multiple themes or A/B tests. One LLM extraction handles all variants; selectors need branches per variant.
  • Schemas that vary across pages. Sometimes a product page has a sale price, sometimes it doesn't, sometimes a "starting from" price. Selectors need conditional logic; the LLM reads context.
  • Multilingual sites. A class name might be .preis in German and .price in English; the LLM doesn't care.

Cost comparison at realistic scales

| Volume | Selector approach | LLM approach (cheap-tier model, no fast path) |
|---|---|---|
| 1K pages/mo | ~$0 + 0.5 dev-day setup | ~$0.88 + 0 maintenance |
| 100K pages/mo | ~$0 + 1 dev-day/mo maintenance | ~$88 + 0 maintenance |
| 1M pages/mo | ~$0 + 5 dev-days/mo maintenance | ~$880 + 0 maintenance |
| 10M pages/mo | Worth a custom parser team | ~$8,800 + 0 maintenance |

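Price a "dev-day" of maintenance at a typical fully-loaded engineering cost ($1,200/day) and the selector column stops being free: one dev-day a month is $1,200 against ~$88 of LLM spend at 100K pages/month, and five dev-days a month is $6,000 against ~$880 at 1M pages/month. If you're maintaining ~10 sites, selectors are clearly more expensive than LLMs by around 1M pages/month. Add bypass infrastructure cost and selectors fall behind earlier still.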
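To put both columns in the same unit, the dev-days can be priced at the $1,200/day figure and set against the per-page LLM cost. A small sketch using only numbers from this post; the names are illustrative, and the selector side is treated as maintenance only, with per-request cost taken as ~$0 as in the table.

```python
DEV_DAY_USD = 1_200         # fully-loaded engineering cost per day
LLM_USD_PER_PAGE = 0.00088  # per-page figure from the cost breakdown

# (pages per month, selector maintenance in dev-days per month), from the table rows.
rows = [(100_000, 1), (1_000_000, 5)]


def monthly_costs(pages, dev_days):
    """Monthly USD cost of each approach for one table row."""
    return {
        "selectors_usd": dev_days * DEV_DAY_USD,
        "llm_usd": pages * LLM_USD_PER_PAGE,
    }


for pages, dev_days in rows:
    costs = monthly_costs(pages, dev_days)
    print(f"{pages:>9,} pages/mo: selectors ${costs['selectors_usd']:,.0f}, LLM ${costs['llm_usd']:,.0f}")
```

Swap in your own dev-day rate and maintenance estimate; the comparison is sensitive to both.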

For most teams, the right read is: LLMs are now the default and selectors are the optimisation you reach for at extreme scale on extreme-stability targets.

What "good" LLM extraction looks like

If you're rolling your own (or evaluating a vendor), the things that matter:

  • Cleaned HTML, not raw HTML. Strip nav, footer, ads, cookie banners. trafilatura is the workhorse; BeautifulSoup as fallback.
  • Schema with examples. { "field", "type", "example" }. The example value is doing real work as a one-shot anchor.
  • Structured output mode. response_mime_type: "application/json". Don't parse free-form responses.
  • Type coercion at the boundary. "$1.2M" becomes 1200000.0, "35 years old" becomes 35, and dates come back as ISO 8601 strings. If your application code is parsing strings, the extractor failed.
  • Honest nulls. Missing fields return null; never silently drop keys.
  • A fallback path. When the cheap model returns mostly nulls, retry on a stronger model before giving up.
  • Source priority by field type. Identity fields (titles, names, IDs) come from <h1>, <title>, OG tags. Description fields come from body prose. Numeric stats come from page metadata.
  • A pre-filter for long pages. Trim against schema field names so you're not paying for 100K input tokens on a single page.
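Two of the items in this list, honest nulls and the stronger-model fallback, fit in a few lines. The sketch below is one possible shape rather than a reference implementation: `cheap_model` and `strong_model` are placeholder callables for real model calls, and the 0.5 null-rate threshold is an arbitrary assumption standing in for "mostly nulls".

```python
def with_honest_nulls(record, fields):
    """Every schema field appears in the output; missing ones are explicit None."""
    return {field: record.get(field) for field in fields}


def null_rate(record):
    """Fraction of fields that came back as None."""
    if not record:
        return 0.0
    return sum(value is None for value in record.values()) / len(record)


def extract_with_fallback(page, fields, cheap_model, strong_model, max_null_rate=0.5):
    """Use the cheap model first; escalate once if it returns mostly nulls."""
    record = with_honest_nulls(cheap_model(page, fields), fields)
    if null_rate(record) > max_null_rate:
        retry = with_honest_nulls(strong_model(page, fields), fields)
        # Keep the retry only if it actually recovered more fields.
        if null_rate(retry) < null_rate(record):
            record = retry
    return record


# Stand-in models so the sketch runs without network access.
def cheap(page, fields):
    return {"title": "Acme Widget Pro"}  # silently dropped two keys


def strong(page, fields):
    return {"title": "Acme Widget Pro", "price": 29.99}


record = extract_with_fallback("<html>...</html>", ["title", "price", "in_stock"], cheap, strong)
```

In this run the cheap model covers one field in three, so the call escalates, and the field neither model found still appears in the result as an explicit `None`.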

That stack (cleaned input + structured output + example-grounded schema + type coercion + null fallback) is what makes the extraction approach work in production. Runo ships it as a hosted service.
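The pre-filter for long pages is the one item in the list that is easy to leave abstract, so here is a minimal sketch of it. It assumes the cleaned page has already been split into text blocks, scores each block by how often words from the schema field names appear in it, and keeps the best-scoring blocks that fit a character budget. The function name, the three-character keyword minimum, and the scoring rule are all assumptions; matching on English field names will also miss relevant blocks on pages in other languages.

```python
import re


def prefilter(blocks, field_names, max_chars):
    """Keep the blocks most related to the schema, within a character budget."""
    keywords = {
        part
        for name in field_names
        for part in re.split(r"[_\W]+", name.lower())
        if len(part) >= 3  # skip noise words such as "in" from "in_stock"
    }

    def score(block):
        text = block.lower()
        return sum(text.count(keyword) for keyword in keywords)

    # Highest score first; sorted() is stable, so ties stay in document order.
    ranked = sorted(range(len(blocks)), key=lambda i: score(blocks[i]), reverse=True)
    kept, used = set(), 0
    for i in ranked:
        if used + len(blocks[i]) <= max_chars:
            kept.add(i)
            used += len(blocks[i])
    # Return the survivors in their original order so the page still reads top to bottom.
    return [blocks[i] for i in sorted(kept)]


blocks = [
    "About our company " * 20,
    "Price: $29.99, in stock",
    "Newsletter signup " * 20,
]
trimmed = prefilter(blocks, ["title", "price", "in_stock"], max_chars=100)
```

Here only the short block mentioning price and stock survives the 100-character budget; a real budget would be set in tokens and be far larger.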

TL;DR

  • CSS selectors break every time a site redesigns. Maintenance cost grows linearly with the number of sites; LLM extraction stays flat.
  • Modern LLM extraction costs roughly one-tenth of a cent per page on a cheap-tier model, the same order as a residential proxy request. Cost is no longer the objection.
  • Hallucinations land as nulls (not bad data) when you use structured output mode + example-grounded schema + type coercion at the API boundary.
  • Selectors still win for: one site you control, extreme-volume static-HTML scraping, or pages with rich JSON-LD already.
  • LLMs decisively win for: many sites, user-supplied URLs, frequent redesigns, A/B tests, schemas that vary across pages, multilingual content.
  • The right architecture is hybrid: structured-data fast path, then LLM extraction fallback. That's what Runo ships.