Schema design patterns for e-commerce extraction


E-commerce extraction looks like the easy case until you try to write one schema that works across Amazon, Shopify storefronts, BigCommerce, Magento, and a long tail of bespoke sites. The fields look the same; the variations bite.

This post covers the schema patterns we've seen work across thousands of e-commerce sites. The schemas are usable as-is; the principles transfer to any LLM-based extraction stack. Background on the schema model is in extracting structured JSON from any HTML.

The product page baseline

Start with the spine that covers ~90% of product pages:

[
  { "field": "title",         "type": "string",        "example": "Acme Widget Pro" },
  { "field": "brand",         "type": "string",        "example": "Acme" },
  { "field": "sku",           "type": "string",        "example": "AW-PRO-2026" },
  { "field": "price",         "type": "float",         "example": 29.99 },
  { "field": "currency",      "type": "string",        "example": "USD" },
  { "field": "originalPrice", "type": "float",         "example": 39.99 },
  { "field": "inStock",       "type": "boolean",       "example": true },
  { "field": "rating",        "type": "float",         "example": 4.5 },
  { "field": "reviewCount",   "type": "integer",       "example": 142 },
  { "field": "description",   "type": "string",        "example": "The Acme Widget Pro features..." },
  { "field": "imageUrls",     "type": "array<string>", "example": ["https://..."] }
]

Notes on each field:

title: usually <h1> or product header. LLMs handle this well; rarely a problem.

brand: not always visible on the page; sometimes it appears only in meta tags or the breadcrumb trail. The LLM picks it up from JSON-LD if present and otherwise falls back to inference from the title ("Acme Widget Pro" → brand "Acme").

sku: many sites have multiple identifiers (SKU, MPN, GTIN, ASIN, ISBN). Pick the one your downstream system needs. If you need multiple, declare separate fields.

price: float, not string. Coercion handles "$29.99", "29,99 €", "USD 29.99". Don't try to pack the currency into the price field; the LLM will sometimes confuse the two. Use a separate currency field.

originalPrice: only meaningful when the product is on sale. On full-price products, this returns null. That's correct: null here means the page shows no pre-sale price, not that extraction failed.

inStock: boolean. Coerces from "In Stock", "Add to Cart", "Sold Out", "Backorder", "Available". Add a hint if your domain has unusual conventions: hint: "False if 'pre-order' or 'coming soon'".

rating / reviewCount: float for rating (4.5 stars is common), integer for count. Often surfaced in microdata; the structured fast path catches them without an LLM call.

imageUrls: array of all product image URLs. Don't try to filter to "the main one" in extraction. Capture everything, pick downstream.
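
The price note above relies on coercion from display strings to floats. If you roll your own stack, a minimal sketch of that step might look like the following. The function name parse_price and the separator heuristic (a lone separator followed by one or two digits is treated as the decimal point) are assumptions for illustration, not a description of how any particular API does it.

```python
import re
from typing import Optional


def parse_price(raw: str) -> Optional[float]:
    """Coerce a displayed price such as "$29.99" or "29,99 €" to a float."""
    # Grab the first run of digits and separators; currency symbols are ignored.
    match = re.search(r"\d[\d.,]*", raw)
    if match is None:
        return None
    number = match.group(0).rstrip(".,")
    last_dot, last_comma = number.rfind("."), number.rfind(",")
    if last_dot != -1 and last_comma != -1:
        # Both separators present: whichever comes last is the decimal point.
        decimal_sep = "." if last_dot > last_comma else ","
    else:
        # Heuristic: a lone separator followed by 1-2 digits is a decimal
        # point; otherwise it is a thousands separator ("1,299").
        sep = "." if last_dot != -1 else ","
        last = max(last_dot, last_comma)
        digits_after = len(number) - last - 1
        decimal_sep = sep if last != -1 and digits_after in (1, 2) else ""
    cleaned = "".join(ch for ch in number if ch.isdigit() or ch == decimal_sep)
    if decimal_sep:
        cleaned = cleaned.replace(decimal_sep, ".")
    try:
        return float(cleaned)
    except ValueError:
        return None
```

The heuristic is ambiguous for values like "1.299" (read here as 1299), so pairing it with the currency field or the site's locale is safer when that information is available.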

The variants problem

Most products have variants: size, color, configuration. Schemas have to model this.

Three approaches, ordered by complexity:

Approach 1: Single product, default variant only

[
  { "field": "title",   "type": "string", "example": "Acme T-Shirt" },
  { "field": "price",   "type": "float",  "example": 24.99 },
  { "field": "color",   "type": "string", "example": "navy" },
  { "field": "size",    "type": "string", "example": "M" }
]

Returns whichever variant the product page defaults to. Fine for "show me the price" use cases; loses information about the variant matrix.

Approach 2: Per-page extraction with variant URL crawl

Each variant has its own URL (?variant=123 or /sku-name). Crawl the variant URLs, extract each as a separate product record.

Slower but complete. Use when you need full variant data (price per color/size, stock per variant).

Approach 3: Variants as an array field

[
  { "field": "title",    "type": "string",        "example": "Acme T-Shirt" },
  { "field": "variants", "type": "array<string>", "example": ["{color: 'navy', size: 'M', price: 24.99, inStock: true}"] }
]

LLM extracts the variant matrix as an array of stringified objects. You parse downstream.

This works but is fragile across sites: some hide variant data in JS, some only show selected variants. For high-fidelity extraction, approach 2 is more reliable. For quick extraction, approach 1 is fine.
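
The "parse downstream" step for approach 3 needs a tolerant parser, because the stringified objects in the example use unquoted keys and single-quoted strings, so they are not valid JSON. A minimal sketch, assuming flat objects shaped like the example above (parse_loose_object is a hypothetical helper name; nested objects are not handled):

```python
import re
from typing import Any, Dict

# key: value pairs where the value is a quoted string or a bare token.
_PAIR = re.compile(
    r"""(\w+)\s*:\s*('(?:[^'\\]|\\.)*'|"(?:[^"\\]|\\.)*"|[^,}]+)"""
)


def parse_loose_object(text: str) -> Dict[str, Any]:
    """Parse a flat, JS-style object literal into a dict."""
    record: Dict[str, Any] = {}
    for key, raw in _PAIR.findall(text):
        raw = raw.strip()
        if raw[:1] in ("'", '"'):
            record[key] = raw[1:-1]          # quoted string
        elif raw in ("true", "false"):
            record[key] = raw == "true"      # boolean
        elif raw == "null":
            record[key] = None
        else:
            try:
                record[key] = float(raw) if "." in raw else int(raw)
            except ValueError:
                record[key] = raw            # leave unknown tokens as text
    return record


variants = [
    parse_loose_object(item)
    for item in ["{color: 'navy', size: 'M', price: 24.99, inStock: true}"]
]
```

The same helper would apply to the other stringified-array fields in this post (products, tieredPricing, relatedProducts), since they share the same shape.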

Category pages: list extraction

Category pages list multiple products per page. The schema becomes:

[
  {
    "field": "products",
    "type": "array<string>",
    "example": ["{title: 'Widget A', price: 29.99, url: 'https://...'}"]
  }
]

The LLM returns an array of stringified product objects. Post-process into structured records.

For pagination, use /crawl with a follow pattern:

curl -X POST https://api.scrapewithruno.com/v1/crawl \
  -H "X-API-Key: $RUNO_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "seed_url": "https://example-shop.com/category/widgets",
    "schema": [...],
    "crawl": {
      "follow_pattern": "https://example-shop.com/category/widgets?page=*",
      "max_pages": 30,
      "max_depth": 1
    }
  }'

For high-fidelity catalog ingestion, use the category pages only for URL discovery, then call /extract per product URL with the full product schema. Two calls but cleaner data.
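
Between the two calls sits a small URL-discovery step: turn the category records into a clean list of product URLs. A minimal sketch, assuming the category records have already been parsed into dicts with a url key (the function name product_urls is hypothetical, and the /extract calls themselves are left out):

```python
from typing import Dict, Iterable, List
from urllib.parse import urldefrag, urljoin


def product_urls(category_url: str, records: Iterable[Dict[str, object]]) -> List[str]:
    """Resolve, de-fragment, and de-duplicate product URLs, keeping page order."""
    seen = set()
    urls: List[str] = []
    for record in records:
        href = record.get("url")
        if not isinstance(href, str) or not href:
            continue  # the listing had no usable link for this product
        # Relative links are resolved against the category page they came from.
        absolute, _fragment = urldefrag(urljoin(category_url, href))
        if absolute not in seen:
            seen.add(absolute)
            urls.append(absolute)
    return urls
```

Each resulting URL is then a candidate for a per-product extraction with the full product schema.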

Edge cases that bite

Multi-currency sites

Some sites show prices in the visitor's currency (geo-detected). To pin the currency you want, set Accept-Language and, where the site honors one, a currency header such as Accept-Currency (not a standard HTTP header, so support varies by site), or use the URL parameter the site exposes (?currency=USD). Without this, you may get inconsistent currencies across requests.

"MSRP" vs "list" vs "sale"

Some sites display three prices: MSRP (manufacturer suggested), list (their normal), sale (current). Three fields:

[
  { "field": "msrp",      "type": "float", "example": 39.99 },
  { "field": "listPrice", "type": "float", "example": 29.99 },
  { "field": "salePrice", "type": "float", "example": 24.99 }
]

Or one price (current selling price) plus originalPrice (whatever the strikethrough is). The 2-field model is usually enough; the 3-field model is for retail-analytics use cases.

Bundle pricing

"Buy 2 for $50" or "3-pack: $79.99". The "price" of the product is ambiguous. Decide what your downstream needs:

  • Per-unit price (parse and divide)
  • Bundle price (the displayed value)
  • Both, with pricePerUnit and bundlePrice fields

The LLM extracts what's on the page; the policy decision is yours.
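
If you choose the "both" option, the per-unit figure can be derived downstream from the displayed text. A minimal sketch, assuming dollar-denominated offers phrased like the two examples above (parse_bundle and its patterns are illustrative; real sites phrase bundles in many more ways):

```python
import re
from typing import Optional, Tuple

# Patterns for "Buy 2 for $50" and "3-pack: $79.99" style offers.
_BUNDLE_PATTERNS = (
    re.compile(r"buy\s+(\d+)\s+for\s+\$?(\d+(?:\.\d+)?)", re.IGNORECASE),
    re.compile(r"(\d+)\s*-?\s*pack\s*:?\s*\$?(\d+(?:\.\d+)?)", re.IGNORECASE),
)


def parse_bundle(text: str) -> Optional[Tuple[float, float]]:
    """Return (bundlePrice, pricePerUnit), or None if no offer is recognized."""
    for pattern in _BUNDLE_PATTERNS:
        match = pattern.search(text)
        if match:
            quantity = int(match.group(1))
            bundle_price = float(match.group(2))
            if quantity == 0:
                return None  # avoid dividing by zero on malformed text
            return bundle_price, round(bundle_price / quantity, 2)
    return None
```

Rounding to two decimals is itself a policy choice; keep the unrounded value if downstream needs exact comparisons.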

Subscription pricing

"$9.99/month" or "$99/year". The price is contextual. Add a pricingType field:

{ "field": "pricingType", "type": "string", "example": "subscription_monthly" }

With a hint: "One of: one_time, subscription_monthly, subscription_yearly, free". Helps downstream code branch correctly.

Per-quantity pricing (B2B)

Wholesale and B2B sites often show price as "1-9: $29.99, 10-49: $24.99, 50+: $19.99". The LLM can extract this as an array:

{
  "field": "tieredPricing",
  "type": "array<string>",
  "example": ["{minQty: 1, maxQty: 9, price: 29.99}"]
}

Parse downstream. For most B2C use cases you won't see this; for B2B catalog ingestion it's standard.
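
Once the tiers are parsed into records, the usual downstream question is "what is the unit price at quantity N?". A minimal sketch of that lookup, assuming an open-ended tier such as "50+" is represented with maxQty set to None (that representation is an assumption; the example above only shows a bounded tier):

```python
from typing import List, Optional, TypedDict


class Tier(TypedDict):
    minQty: int
    maxQty: Optional[int]  # None models an open-ended tier such as "50+"
    price: float


def unit_price(tiers: List[Tier], quantity: int) -> Optional[float]:
    """Return the per-unit price for a quantity, or None if no tier matches."""
    for tier in sorted(tiers, key=lambda t: t["minQty"]):
        upper = tier["maxQty"]
        if quantity >= tier["minQty"] and (upper is None or quantity <= upper):
            return tier["price"]
    return None


tiers: List[Tier] = [
    {"minQty": 1, "maxQty": 9, "price": 29.99},
    {"minQty": 10, "maxQty": 49, "price": 24.99},
    {"minQty": 50, "maxQty": None, "price": 19.99},
]
```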

Out-of-stock with future restock

"Available June 15, 2026" is neither in-stock nor out-of-stock cleanly. Add an availabilityDate field:

{ "field": "availabilityDate", "type": "date", "example": "2026-06-15" }

Returns null for products available now. Returns the date for backorders/preorders. Combined with inStock: false it gives downstream a complete picture.
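
Downstream, the two fields can be folded into a single status. A minimal sketch, assuming availabilityDate arrives as an ISO date string as in the example; the status labels (in_stock, available_later, out_of_stock, unknown) are hypothetical names, not part of any schema above:

```python
from datetime import date
from typing import Optional


def availability_status(
    in_stock: Optional[bool],
    availability_date: Optional[str],
    today: date,
) -> str:
    """Combine inStock and availabilityDate into one downstream status."""
    if in_stock is None and availability_date is None:
        return "unknown"  # the page exposed neither signal
    if in_stock:
        return "in_stock"
    if availability_date is not None and date.fromisoformat(availability_date) > today:
        return "available_later"  # backorder or pre-order with a known date
    return "out_of_stock"
```

Passing today explicitly keeps the function deterministic, which makes it easier to test and to replay against historical extractions.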

Bundles vs grouped products

Some pages show "this item with accessories" (a bundle product) and others show "these accessories also available" (related products). Confusion is common. The fix: extract the focal product first, related items separately:

[
  { "field": "title",         "type": "string",        "example": "..." },
  { "field": "price",         "type": "float",         "example": 29.99 },
  { "field": "relatedProducts","type": "array<string>", "example": ["{title: '...', url: '...'}"] }
]

The LLM puts the main product in the top-level fields and other products in the array.

Reviews extraction

Reviews are usually paginated under the product page. Schema for reviews:

[
  { "field": "reviewerName",     "type": "string",  "example": "Sarah M." },
  { "field": "rating",           "type": "float",   "example": 5.0 },
  { "field": "reviewDate",       "type": "date",    "example": "2026-04-15" },
  { "field": "title",            "type": "string",  "example": "Love it" },
  { "field": "body",             "type": "string",  "example": "I bought this..." },
  { "field": "verifiedPurchase", "type": "boolean", "example": true },
  { "field": "helpfulCount",     "type": "integer", "example": 12 }
]

For per-page extraction (multiple reviews per page), wrap as an array. For sentiment analysis on top of this data, see sentiment analysis from product reviews.

Inventory and pricing tracking

For competitive price monitoring, you usually want fewer fields per product, more frequent extraction:

[
  { "field": "title",   "type": "string",  "example": "Acme Widget" },
  { "field": "price",   "type": "float",   "example": 29.99 },
  { "field": "inStock", "type": "boolean", "example": true },
  { "field": "sku",     "type": "string",  "example": "AW-PRO" }
]

Smaller schema, smaller extraction cost (~$0.0006 vs ~$0.001 per page on Runo Pro tier). At 100K SKUs tracked daily, the per-field savings add up.
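
To make the arithmetic concrete, here is a small sketch using the two per-page figures quoted above. It assumes one page per SKU per day and a 30-day month, both of which are simplifications for illustration:

```python
def daily_cost(pages_per_day: int, cost_per_page: float) -> float:
    """Extraction spend per day at a flat per-page price."""
    return pages_per_day * cost_per_page


SKUS_TRACKED = 100_000  # assumes one product page per SKU, fetched once a day

small_schema = daily_cost(SKUS_TRACKED, 0.0006)
full_schema = daily_cost(SKUS_TRACKED, 0.001)
daily_saving = full_schema - small_schema
monthly_saving = daily_saving * 30  # assumes a 30-day month

print(f"small schema: ${small_schema:.2f}/day")
print(f"full schema:  ${full_schema:.2f}/day")
print(f"saving:       ${daily_saving:.2f}/day, ${monthly_saving:.2f}/month")
```

Under those assumptions the smaller schema costs about $60 per day against $100, a difference of roughly $1,200 per month.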

The full pipeline for price monitoring is in monitoring competitor prices with a scraping API.

Image-rich products

Hotels, real estate, fashion. The page has photos that contain information not in the body text. Floor plans on real estate, food photos on menus, product shots that show details.

A vision-augmentation pass on top of text extraction handles this. Null fields from the text pass are re-evaluated against the page's top-scored images. Useful for:

  • Menu item ingredients (often only in the menu image)
  • Fashion product details (color callouts, fabric textures)
  • Real estate floor plan dimensions
  • Product spec sheets rendered as images

The marginal per-page cost is small (a few image-token reads on top of the text extraction). Some scraping APIs (Runo on Scale via process_images: true) ship this as a built-in option; rolling your own is a vision-model call against the top-N images on the page.
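
If you roll your own, the merge rule described above (only null fields from the text pass are re-evaluated) is the part worth getting right. A minimal sketch of that rule; the vision-model call itself is omitted, and the function names are hypothetical:

```python
from typing import Any, Dict, List


def null_fields(record: Dict[str, Any]) -> List[str]:
    """Fields the text pass could not fill; only these go to the vision pass."""
    return [name for name, value in record.items() if value is None]


def merge_vision_pass(
    text_result: Dict[str, Any], vision_result: Dict[str, Any]
) -> Dict[str, Any]:
    """Fill nulls from the vision pass without overwriting text-pass values."""
    merged = dict(text_result)
    for name in null_fields(text_result):
        value = vision_result.get(name)
        if value is not None:
            merged[name] = value
    return merged
```

Never letting the vision pass overwrite a non-null text value is a design choice in this sketch: text extraction is treated as the more trustworthy source when both are available.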

Schema versioning

When you change a schema in production, downstream consumers can break. Practical patterns:

  • Add fields, never remove or rename. Removing a field breaks downstream; adding doesn't.
  • Version the API surface, not the schema. Your API endpoints stay stable; the schema can evolve.
  • Static keys with bound schemas: use one key per major schema version. Old code keeps using the old key with the old schema; new code uses the new key. Migration is per-consumer.
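
The "add fields, never remove or rename" rule is easy to enforce mechanically before a schema change ships. A minimal sketch of such a check over the field-list format used in this post (breaking_changes is a hypothetical helper; treating a type change as breaking is an added assumption, not something stated above):

```python
from typing import Dict, List

Schema = List[Dict[str, object]]


def breaking_changes(old: Schema, new: Schema) -> List[str]:
    """List changes in `new` that would break consumers of `old`."""
    new_types = {spec["field"]: spec.get("type") for spec in new}
    problems: List[str] = []
    for spec in old:
        name = spec["field"]
        if name not in new_types:
            # A rename looks identical to a removal from the consumer's side.
            problems.append(f"removed or renamed: {name}")
        elif new_types[name] != spec.get("type"):
            problems.append(f"type changed: {name}")
    return problems
```

An empty list means the change is purely additive; anything else is a signal to cut a new schema version rather than edit the existing one in place.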

Common mistakes

A few patterns to avoid:

Over-decomposing addresses or names

Don't:

[
  { "field": "addressStreet" },
  { "field": "addressCity" },
  { "field": "addressState" },
  { "field": "addressZip" }
]

Do:

[ { "field": "address", "type": "string", "example": "123 Main St, San Francisco, CA 94103" } ]

The LLM extracts addresses more reliably as a single field. Decompose downstream with usaddress or similar.

Asking for fields that aren't on the page

If you ask for manufacturerCountryOfOrigin and the page doesn't expose it, you get null 90% of the time. That's not a failure. It's the right answer. But don't be surprised. Pick fields that match what the source actually publishes.

Confusing "out of stock" with "discontinued"

These are different states. inStock: false covers both; if you need to distinguish, add:

{ "field": "isDiscontinued", "type": "boolean", "example": false }

Most sites don't expose this clearly, so the field will often be null. That's fine.

Treating price as a string

This is the single most common mistake. A string price ("$29.99") forces every downstream consumer to parse it. A float price (29.99) is one number you can sort and compare. Always use float.

Putting a full product schema together

For a comprehensive e-commerce extraction:

[
  { "field": "title",          "type": "string",        "example": "Acme Widget Pro" },
  { "field": "brand",          "type": "string",        "example": "Acme" },
  { "field": "sku",            "type": "string",        "example": "AW-PRO-2026" },
  { "field": "gtin",           "type": "string",        "example": "0123456789012" },
  { "field": "price",          "type": "float",         "example": 29.99 },
  { "field": "currency",       "type": "string",        "example": "USD" },
  { "field": "originalPrice",  "type": "float",         "example": 39.99 },
  { "field": "inStock",        "type": "boolean",       "example": true },
  { "field": "availabilityDate","type": "date",         "example": "2026-06-15" },
  { "field": "rating",         "type": "float",         "example": 4.5 },
  { "field": "reviewCount",    "type": "integer",       "example": 142 },
  { "field": "description",    "type": "string",        "example": "The Acme Widget Pro..." },
  { "field": "category",       "type": "string",        "example": "tools > hand tools > widgets" },
  { "field": "specifications", "type": "array<string>", "example": ["weight: 1.2 lbs", "material: steel"] },
  { "field": "imageUrls",      "type": "array<string>", "example": ["https://..."] }
]

Fifteen fields cover what most e-commerce ingestion needs. Add domain-specific fields as needed.
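
A schema like this can also double as a downstream sanity check on extracted records. A minimal sketch, assuming the type names used throughout this post and ISO-formatted dates as in the examples (type_errors is a hypothetical helper; null is always accepted, since a missing value is a valid answer):

```python
from datetime import date
from typing import Any, Callable, Dict, List


def _is_iso_date(value: Any) -> bool:
    if not isinstance(value, str):
        return False
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return True


# bool is a subclass of int in Python, so it is excluded from numeric checks.
_CHECKS: Dict[str, Callable[[Any], bool]] = {
    "string": lambda v: isinstance(v, str),
    "float": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
    "integer": lambda v: isinstance(v, int) and not isinstance(v, bool),
    "boolean": lambda v: isinstance(v, bool),
    "date": _is_iso_date,
    "array<string>": lambda v: isinstance(v, list)
    and all(isinstance(item, str) for item in v),
}


def type_errors(schema: List[Dict[str, Any]], record: Dict[str, Any]) -> List[str]:
    """Report fields whose non-null values do not match the declared type."""
    errors: List[str] = []
    for spec in schema:
        name, kind = spec["field"], spec["type"]
        value = record.get(name)
        if value is None:
            continue  # null means "not on the page", which is allowed
        check = _CHECKS.get(kind)
        if check is not None and not check(value):
            errors.append(f"{name}: expected {kind}, got {type(value).__name__}")
    return errors
```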

TL;DR

  • Start with an 11-field spine: title, brand, SKU, price, currency, originalPrice, inStock, rating, reviewCount, description, imageUrls. Add as needed.
  • price is float, not string. Coercion handles "$29.99", "29,99 €". Always.
  • Variants: pick approach by use case. Default-only is fastest; per-variant URL crawl is most complete.
  • Edge cases: multi-currency (pin headers), bundle pricing (extract both forms), subscription (add pricingType), B2B tiered (tieredPricing array), pre-orders (availabilityDate).
  • Don't over-decompose addresses or names. One string is more reliable than four components.
  • Static keys with bound schemas cut payload and let you version schemas per consumer.
  • For image-heavy products (hotels, fashion, real estate), enable process_images: true on Runo Scale for vision augmentation.
  • Common mistake: asking for fields the source doesn't publish. The LLM returns null, which is correct.