Extracting structured JSON from any HTML: a developer's guide

You have a URL. You want JSON. Not "the page as JSON", but JSON shaped exactly the way your downstream code wants to consume it, with price as a float, in_stock as a boolean, and tags as an array<string>. Missing fields should be null, not absent. Wrong types should fail loudly.

This is a guide to doing that well, whether you're building it yourself or evaluating an API like Runo that does it for you. The principles are the same; the build cost differs.

The schema shape that works#

Three keys per schema entry are the minimum that produces consistent output:

{
  "field": "price",
  "type":  "float",
  "example": 29.99
}

That's it. Optionally a fourth: a hint for edge cases ("Use sale price if both are listed").

Why these three:

field is the JSON key you want in the response. Use the same naming convention you use everywhere else in your code (camelCase, snake_case, pick one).
type drives coercion at the API boundary. The supported set in 2026 is roughly: string, integer, float, boolean, date, array<T> for T in string|integer|float.
example is the most underrated field in scraping schemas. It's doing two jobs: documenting what the field looks like for the next developer, and grounding the LLM's interpretation as a one-shot prompt anchor.

The example matters more than people expect. { "field": "price", "type": "float", "example": 29.99 } resolves a lot of ambiguity that a Zod-style type-only schema doesn't. Should "$1,200" be 1200.0 or 1200? The example shows. We covered why this works in LLM extraction vs CSS selectors.
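Since the example carries this much weight, it is worth checking that it actually agrees with the declared type before a schema goes anywhere near a model. The following is a minimal sketch of such a check, not part of any particular API: the SchemaField class, the validate_field helper, and the decision to accept an integer example for a float field are all assumptions made for illustration. The type names come from the list above.

```python
from dataclasses import dataclass
from typing import Any, Optional

# Type names taken from the list above; the Python mapping is an assumption.
SCALAR_TYPES = {"string": str, "integer": int, "float": float,
                "boolean": bool, "date": str}
ARRAY_ITEM_TYPES = {"string": str, "integer": int, "float": float}


@dataclass(frozen=True)
class SchemaField:
    """Hypothetical container for one schema entry."""
    field: str
    type: str
    example: Any
    hint: Optional[str] = None


def _matches(value: Any, py_type: type) -> bool:
    # bool is a subclass of int in Python, so reject it for non-boolean types.
    if py_type is not bool and isinstance(value, bool):
        return False
    if py_type is float:
        return isinstance(value, (int, float))
    return isinstance(value, py_type)


def validate_field(entry: SchemaField) -> list:
    """Return a list of problems; an empty list means the entry looks usable."""
    mismatch = "example does not match declared type"
    if not entry.field:
        return ["field name is empty"]
    if entry.type.startswith("array<") and entry.type.endswith(">"):
        item = entry.type[len("array<"):-1]
        if item not in ARRAY_ITEM_TYPES:
            return [f"unsupported array item type: {item}"]
        ok = isinstance(entry.example, list) and all(
            _matches(v, ARRAY_ITEM_TYPES[item]) for v in entry.example
        )
        return [] if ok else [mismatch]
    if entry.type not in SCALAR_TYPES:
        return [f"unsupported type: {entry.type}"]
    return [] if _matches(entry.example, SCALAR_TYPES[entry.type]) else [mismatch]
```

Running validate_field over every entry at startup catches the common mistake of writing "29.99" (a string) as the example for a float field, which would anchor the model on the wrong output format.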

Type coercion: what to enforce at the boundary#

The whole point of typed JSON is that the consumer doesn't have to parse strings. Coerce aggressively at the extraction layer.

Type	Input examples	Coerced output
`string`	Anything	The string
`integer`	`"35"`, `"35 years old"`, `"35,000"`, `"3.5K"`	`35`, `35`, `35000`, `3500`
`float`	`"$29.99"`, `"$1.2M"`, `"1,200.50"`, `"twelve million"`	`29.99`, `1200000.0`, `1200.5`, `12000000.0`
`boolean`	`"yes"`, `"✓ Verified"`, `"in stock"`, `"out of stock"`	`true`, `true`, `true`, `false`
`date`	`"May 9, 2026"`, `"2026-05-09"`, `"yesterday"`, `"3 days ago"`	ISO 8601 dates: `"2026-05-09"`, `"2026-05-09"`, then relative values resolved against the extraction date (`"2026-05-08"` and `"2026-05-06"` when extracted on 2026-05-09)
`array<string>`	`"actress, producer"`, `"actress and producer"`, `["actress", "producer"]`	`["actress", "producer"]` always

If the value can't be coerced (e.g., "call for pricing" for a float field), return null and emit a TYPE_COERCION_FAILED warning instead of fabricating a number. Loud failure beats silent corruption.
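If you are building the coercion layer yourself, a rule-based pass for the easy cases might look like the sketch below. It is an illustration under stated assumptions, not a reference implementation: the function names, the keyword lists for booleans, and the use of Python's warnings module to surface TYPE_COERCION_FAILED are all choices made here. It takes the first number it finds in the string, assumes US-style separators, and does not parse spelled-out numbers; those return None unless the model has already normalised them.

```python
import re
import warnings
from datetime import date, datetime, timedelta
from typing import Optional

_MULTIPLIER = {"k": 1e3, "thousand": 1e3, "m": 1e6, "million": 1e6,
               "b": 1e9, "billion": 1e9}
_NUMBER = re.compile(
    r"(-?\d[\d,]*(?:\.\d+)?)\s*(thousand|million|billion|k|m|b)?\b",
    re.IGNORECASE,
)
# Keyword lists are illustrative assumptions; extend them for your domain.
_FALSE = re.compile(r"\b(?:no|false|out of stock|unavailable|sold out)\b")
_TRUE = re.compile(r"\b(?:yes|true|verified|in stock|available)\b")


def _fail(field: str, raw: str) -> None:
    """Warn loudly and return None rather than inventing a value."""
    warnings.warn(f"TYPE_COERCION_FAILED: {field}={raw!r}")
    return None


def coerce_float(raw: str, field: str = "?") -> Optional[float]:
    match = _NUMBER.search(raw)
    if match is None:
        return _fail(field, raw)
    value = float(match.group(1).replace(",", ""))
    if match.group(2):
        value *= _MULTIPLIER[match.group(2).lower()]
    return value


def coerce_integer(raw: str, field: str = "?") -> Optional[int]:
    value = coerce_float(raw, field)
    if value is None:
        return None
    if not value.is_integer():
        return _fail(field, raw)
    return int(value)


def coerce_boolean(raw: str, field: str = "?") -> Optional[bool]:
    text = raw.lower()
    # Negative phrases are checked first so "out of stock" is never read as true.
    if _FALSE.search(text):
        return False
    if _TRUE.search(text):
        return True
    return _fail(field, raw)


def coerce_date(raw: str, today: date, field: str = "?") -> Optional[str]:
    text = raw.strip()
    if text.lower() == "yesterday":
        return (today - timedelta(days=1)).isoformat()
    relative = re.fullmatch(r"(\d+) days? ago", text.lower())
    if relative:
        return (today - timedelta(days=int(relative.group(1)))).isoformat()
    for fmt in ("%Y-%m-%d", "%B %d, %Y"):  # month names assume an English locale
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return _fail(field, raw)
```

Passing today into coerce_date explicitly, rather than calling date.today() inside it, keeps relative dates reproducible in tests and makes the "resolved at extraction time" rule visible in the signature.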

Null handling: the rule that saves you#

Unresolvable fields return null. Never silently drop the key.

This is the difference between a working pipeline and a quietly corrupted one. If your downstream code does data["price"], the key being missing is a KeyError; the key being null is something your code can branch on.

{
  "url": "...",
  "data": {
    "title": "Acme Widget",
    "price": null,
    "in_stock": true
  }
}

price is null because the page didn't expose a price (out of stock, "contact for quote", etc.). The consumer knows. Compare to:

{ "data": { "title": "Acme Widget", "in_stock": true } }

Now data["price"] raises. Or worse, your code defaults to 0 and writes a $0 product to your database. The first form is correct; the second causes incidents.

Same rule applies to fields that exist but coerce to a sentinel ("", 0, false). Distinguish "the page said the price is $0" (legitimate) from "we couldn't find a price" (null). Most LLMs will collapse these without explicit guidance. Your prompt has to tell them not to.
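On the consuming side, the rule translates into two small habits: make sure every schema key is present before anything else touches the record, and test for None explicitly instead of relying on truthiness. A minimal sketch, where normalize and price_label are hypothetical helper names:

```python
from typing import Any, Dict, List


def normalize(data: Dict[str, Any], schema_fields: List[str]) -> Dict[str, Any]:
    """Return a dict with exactly the schema's keys; unresolved ones become None."""
    return {name: data.get(name) for name in schema_fields}


def price_label(record: Dict[str, Any]) -> str:
    price = record["price"]  # safe once normalize() has guaranteed the key
    if price is None:        # "we couldn't find a price"
        return "price unavailable"
    # A plain `if not price` would wrongly treat a legitimate 0.0 as missing.
    return f"${price:.2f}"
```

The `is None` comparison is the part that keeps "the page said $0" and "no price found" apart.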

Schema design patterns that work#

Use natural field names#

Don't name fields field_1, field_2. Don't abbreviate. Use the same names a human would write.

{ "field": "publishedDate", "type": "date", "example": "2026-05-09" }

LLMs ground on field names. publishedDate is more specific than date, which could mean the date the page was scraped or the date someone was born.

Pick examples from the target domain#

If you're scraping recipes, your example for prepTime should be a realistic value in the unit you want back, such as 15 (minutes). If you're scraping product reviews, the example for rating should reflect the scale: 4.5 for 1–5 stars, 9 for 1–10 ratings. The example anchors the model on the format.

Use hints sparingly, only for ambiguity#

A hint is a free-form string attached to a field. Most fields don't need one. Reach for hints only when:

There are multiple plausible candidates and you want one specifically (hint: "Use sale price if present, otherwise list price")
The page convention requires interpretation (hint: "Distance in km, not miles")
The format is non-obvious (hint: "Author's display name, not username")

A schema with a hint on every field is a smell. Either the field names aren't expressive enough, or the schema is fighting the data shape.

The order of fields in the schema matters less than people think (modern LLMs are order-insensitive at typical schema sizes), but readability matters. Group identity fields, descriptive fields, numeric stats, and temporal fields together. Your future self will thank you.
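To make the role of the field name, example, and hint concrete, here is one way a prompt builder could render schema entries into instructions for the model. The prompt wording and the render_field and render_schema helpers are hypothetical; this is not how any specific API builds its prompt, only a sketch of the idea that the hint appears only where one was supplied.

```python
import json
from typing import Any, Dict, List


def render_field(entry: Dict[str, Any]) -> str:
    """Render one schema entry as a prompt line (format is an assumption)."""
    line = f'- {entry["field"]} ({entry["type"]}), e.g. {json.dumps(entry["example"])}'
    hint = entry.get("hint")
    if hint:
        line += f". Note: {hint}"
    return line


def render_schema(schema: List[Dict[str, Any]]) -> str:
    header = "Extract these fields. Return null for any field the page does not state."
    return "\n".join([header] + [render_field(entry) for entry in schema])
```

Rendered this way, a schema with a hint on every line is easy to spot at a glance, which is a cheap way to catch the smell described above.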

Don't over-decompose#

[
  { "field": "addressStreet", "type": "string", "example": "..." },
  { "field": "addressCity",   "type": "string", "example": "..." },
  { "field": "addressState",  "type": "string", "example": "..." },
  { "field": "addressZip",    "type": "string", "example": "..." }
]

Versus:

[ { "field": "address", "type": "string", "example": "123 Main St, San Francisco, CA 94103" } ]

The second is more reliable. Pages don't structure addresses as four discrete elements; they structure them as one string. Decompose downstream if you need to. The same logic applies to names (one fullName field beats firstName + lastName for arbitrary sites).
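Decomposing downstream can be as simple as a pattern match that is allowed to fail. The sketch below assumes US-style addresses in the "street, city, ST 12345" shape used in the example above; split_us_address is a hypothetical helper, and anything that doesn't fit the pattern returns None so the original string stays the source of truth.

```python
import re
from typing import Dict, Optional

# Assumes "street, city, two-letter state, ZIP or ZIP+4"; other formats won't match.
_US_ADDRESS = re.compile(
    r"^(?P<street>[^,]+),\s*(?P<city>[^,]+),\s*"
    r"(?P<state>[A-Z]{2})\s+(?P<zip>\d{5}(?:-\d{4})?)$"
)


def split_us_address(address: str) -> Optional[Dict[str, str]]:
    """Split a single-string address into parts, or return None if it doesn't fit."""
    match = _US_ADDRESS.match(address.strip())
    return match.groupdict() if match else None
```

Because the split happens in your own code, a failure here costs you one unparsed address rather than four null fields from the extractor.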

Edge cases that bite#

List pages vs detail pages#

If you point a single-URL extraction at a list page (e.g. a category page with 20 products), you'll usually get the first item or a confused average. Either crawl the list to get detail URLs, then extract from each detail page, or define an array<> schema designed for list extraction.

Pages where the data is in an image#

Hotels, menus, real-estate floor plans, infographics. Sometimes the data your schema asks for is rendered as an image, not text. The text-only LLM pass returns null. The fix is an image-augmentation pass: a vision-capable model looks at the top-scored images on the page and fills the null fields. Some scraping APIs (Runo on Scale via process_images: true) ship this as a built-in option for image-heavy pages.

JSON-shaped fields embedded in JS#

Some sites render the page client-side and ship the data as JSON inside a <script> tag (__NEXT_DATA__, window.__NUXT__). An extractor that reads only the visible markup misses this; a headless browser renders the DOM and gets it. If your extraction pipeline only does a plain fetch and discards script contents, you'll silently lose data on Next.js/Nuxt sites. The page returns 200 OK with content, but the visible content is the loading skeleton, not the product.
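When the payload is already present in the fetched HTML, you can sometimes read it without a browser at all. The sketch below uses only the standard library to pull the JSON out of a script element whose id is __NEXT_DATA__. The extract_next_data helper is hypothetical, it covers only pages that follow that Next.js convention, and where the fields you want sit inside the payload varies per site.

```python
import json
from html.parser import HTMLParser
from typing import Any, List, Optional


class _NextDataParser(HTMLParser):
    """Collects the text inside <script id="__NEXT_DATA__">."""

    def __init__(self) -> None:
        super().__init__()
        self._inside = False
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("id") == "__NEXT_DATA__":
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.chunks.append(data)


def extract_next_data(html: str) -> Optional[Any]:
    """Return the parsed payload, or None if it is absent or not valid JSON."""
    parser = _NextDataParser()
    parser.feed(html)
    parser.close()
    if not parser.chunks:
        return None
    try:
        return json.loads("".join(parser.chunks))
    except json.JSONDecodeError:
        return None
```

A None result is the signal to fall back to headless rendering rather than to trust the skeleton markup.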

Sites that paginate via JS#

Infinite-scroll sites only render the first viewport. Headless browser scrolling is the fix; you scroll N times, wait for network idle, then extract. This is fiddly enough that it's usually worth being explicit about pagination behaviour rather than hoping the headless setup handles it.

Cloudflare/Datadome between you and the page#

The extraction layer doesn't matter if the fetch layer is returning a "Verifying you are human" interstitial. The HTML you feed the LLM has to be the real page. We covered the bypass stack in how to scrape Cloudflare-protected sites.
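A cheap guard between the fetch layer and the LLM is to refuse HTML that looks like a challenge page. The sketch below is a heuristic only: apart from "Verifying you are human", which is quoted above, the marker strings are illustrative assumptions, the helper names are hypothetical, and a real page that happens to contain one of these phrases would be a false positive.

```python
# Marker phrases are assumptions for illustration; tune them against what
# your fetch layer actually receives.
CHALLENGE_MARKERS = (
    "verifying you are human",
    "just a moment...",
    "checking your browser",
)


class BlockedPageError(Exception):
    """Raised when the fetched body looks like an anti-bot interstitial."""


def looks_like_challenge(html: str) -> bool:
    lowered = html.lower()
    return any(marker in lowered for marker in CHALLENGE_MARKERS)


def require_real_page(html: str) -> str:
    """Return the HTML unchanged, or raise before it reaches the extractor."""
    if looks_like_challenge(html):
        raise BlockedPageError("fetch returned a challenge page, not the target")
    return html
```

Raising here turns a page of confident nulls into an explicit fetch failure that you can retry or route through a different fetch strategy.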

A complete example#

curl -X POST https://api.scrapewithruno.com/v1/extract \
  -H "X-API-Key: $RUNO_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/product/widget-pro",
    "schema": [
      { "field": "title",       "type": "string",        "example": "Acme Widget Pro" },
      { "field": "price",       "type": "float",         "example": 29.99 },
      { "field": "currency",    "type": "string",        "example": "USD" },
      { "field": "inStock",     "type": "boolean",       "example": true },
      { "field": "rating",      "type": "float",         "example": 4.5 },
      { "field": "reviewCount", "type": "integer",       "example": 142 },
      { "field": "tags",        "type": "array<string>", "example": ["productivity", "office"] }
    ]
  }'

Response:

{
  "url": "https://example.com/product/widget-pro",
  "status": "success",
  "render_mode": "fetch",
  "data": {
    "title": "Acme Widget Pro",
    "price": 29.99,
    "currency": "USD",
    "inStock": true,
    "rating": 4.5,
    "reviewCount": 142,
    "tags": ["productivity", "office"]
  }
}

That's the pattern. Schema in, typed JSON out. Full reference in the Runo docs.
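If you'd rather make the same call from Python, a standard-library version might look like the sketch below. The endpoint, header name, and body shape are copied from the curl example above; build_extract_request and missing_fields are hypothetical helpers, and the error handling you'd want around the network call is left out.

```python
import json
import urllib.request
from typing import Any, Dict, List

ENDPOINT = "https://api.scrapewithruno.com/v1/extract"  # from the curl example


def build_extract_request(url: str, schema: List[Dict[str, Any]],
                          api_key: str) -> urllib.request.Request:
    """Build (but do not send) the POST request shown in the curl example."""
    body = json.dumps({"url": url, "schema": schema}).encode("utf-8")
    return urllib.request.Request(
        ENDPOINT,
        data=body,
        method="POST",
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
    )


def missing_fields(response: Dict[str, Any], schema: List[Dict[str, Any]]) -> List[str]:
    """List schema fields whose key is absent from response["data"]."""
    data = response.get("data") or {}
    return [entry["field"] for entry in schema if entry["field"] not in data]


# Sending the request (not executed here):
#   with urllib.request.urlopen(build_extract_request(url, schema, key)) as resp:
#       payload = json.load(resp)
#   assert missing_fields(payload, schema) == []
```

The missing_fields check is a belt-and-braces guard on your side: a null value passes it, an absent key does not.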

Build vs buy#

If you build this yourself, the components are: HTTP client with TLS impersonation, hardened headless browser, HTML cleaner (trafilatura + custom strippers), prompt builder, LLM client with retry/backoff/key rotation, type coercion layer, schema validator, error taxonomy. None of those are individually hard. All of them together, plus maintenance, bypass infrastructure, and LLM cost engineering, add up to about a quarter of focused work for one engineer to get to a usable v1, and the result never stops needing attention.

If you buy, Runo ships the full stack with a free tier of 500 requests/month. Pricing scales from there. The docs walk through every option.

TL;DR#

The minimum viable schema entry is { field, type, example }. The example field is doing real work as a one-shot LLM anchor; never skip it.
Coerce types at the API boundary. "$1.2M" becomes 1200000.0, never a string for the consumer to parse.
Unresolvable fields return null and are never silently dropped. The consumer should be able to branch on missingness.
Don't over-decompose schemas. One address string is more reliable than four address-component fields.
Watch for: list-vs-detail pages, image-embedded data, JS-rendered content, anti-bot interstitials returned as the page body.
Build it yourself if you have a quarter and ongoing maintenance budget; otherwise, Runo ships the stack.