Schema design patterns for e-commerce extraction

E-commerce extraction looks like the easy case until you try to write one schema that works across Amazon, Shopify storefronts, BigCommerce, Magento, and a long tail of bespoke sites. The fields look the same; the variations bite.

This post collects the schema patterns we've seen work across thousands of e-commerce sites. The schemas are usable as-is; the principles transfer to any LLM-based extraction stack. Background on the schema model is in extracting structured JSON from any HTML.

The product page baseline

Start with the spine that covers ~90% of product pages:

[
  { "field": "title",         "type": "string",        "example": "Acme Widget Pro" },
  { "field": "brand",         "type": "string",        "example": "Acme" },
  { "field": "sku",           "type": "string",        "example": "AW-PRO-2026" },
  { "field": "price",         "type": "float",         "example": 29.99 },
  { "field": "currency",      "type": "string",        "example": "USD" },
  { "field": "originalPrice", "type": "float",         "example": 39.99 },
  { "field": "inStock",       "type": "boolean",       "example": true },
  { "field": "rating",        "type": "float",         "example": 4.5 },
  { "field": "reviewCount",   "type": "integer",       "example": 142 },
  { "field": "description",   "type": "string",        "example": "The Acme Widget Pro features..." },
  { "field": "imageUrls",     "type": "array<string>", "example": ["https://..."] }
]

Notes on each field:

title: usually <h1> or product header. LLMs handle this well; rarely a problem.

brand: not always on the page; sometimes only in meta tags or the breadcrumb. The LLM picks it up from JSON-LD if present, and otherwise falls back to inferring it from the title ("Acme Widget Pro" → brand "Acme").

sku: many sites have multiple identifiers (SKU, MPN, GTIN, ASIN, ISBN). Pick the one your downstream system needs. If you need multiple, declare separate fields.

price: float, not string. Coercion handles "$29.99", "29,99 €", "USD 29.99". Don't try to carry the currency inside the price field; the LLM will sometimes confuse the two. Use a separate currency field.

originalPrice: only meaningful when the product is on sale. On full-price products, this returns null. That's correct: null says the page shows no strikethrough price, which a placeholder such as 0 or a copy of price would hide.

inStock: boolean. Coerces from "In Stock", "Add to Cart", "Sold Out", "Backorder", "Available". Add a hint if your domain has unusual conventions: hint: "False if 'pre-order' or 'coming soon'".

rating / reviewCount: float for rating (4.5 stars is common), integer for count. Often surfaced in microdata; the structured fast path catches them without an LLM call.

imageUrls: array of all product image URLs. Don't try to filter to "the main one" in extraction. Capture everything, pick downstream.
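If you want to sanity-check coerced values downstream, or reproduce the same coercion outside the API, a minimal sketch might look like the following. The function names and phrase lists are our own assumptions for illustration, not a description of how any particular extraction service implements coercion.

```python
import re
from typing import Optional

# Hypothetical phrase lists; extend them for your own domain.
IN_STOCK_PHRASES = ("in stock", "add to cart", "available")
OUT_OF_STOCK_PHRASES = (
    "sold out", "out of stock", "backorder",
    "pre-order", "coming soon", "unavailable",
)


def coerce_price(raw: str) -> Optional[float]:
    """Turn a displayed price such as '$29.99' or '29,99 €' into a float."""
    match = re.search(r"\d[\d.,]*", raw)
    if match is None:
        return None
    number = match.group(0).rstrip(".,")
    last_sep = max(number.rfind("."), number.rfind(","))
    if last_sep != -1 and len(number) - last_sep - 1 in (1, 2):
        # One or two trailing digits: treat the last separator as the decimal mark.
        whole = re.sub(r"[.,]", "", number[:last_sep])
        return float(f"{whole}.{number[last_sep + 1:]}")
    # Otherwise every separator is a thousands separator.
    return float(re.sub(r"[.,]", "", number))


def coerce_in_stock(raw: str) -> Optional[bool]:
    """Map button or badge text to a boolean; None when nothing matches."""
    text = raw.strip().lower()
    # Check negative phrases first so "unavailable" is not read as "available".
    if any(phrase in text for phrase in OUT_OF_STOCK_PHRASES):
        return False
    if any(phrase in text for phrase in IN_STOCK_PHRASES):
        return True
    return None
```

The decimal-mark heuristic is deliberately simple: it will misread a three-decimal price, so treat it as a starting point rather than a complete parser.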

The variants problem

Most products have variants: size, color, configuration. Schemas have to model this.

Three approaches, ordered by complexity:

Approach 1: Single product, default variant only

[
  { "field": "title",   "type": "string", "example": "Acme T-Shirt" },
  { "field": "price",   "type": "float",  "example": 24.99 },
  { "field": "color",   "type": "string", "example": "navy" },
  { "field": "size",    "type": "string", "example": "M" }
]

Returns whichever variant the product page defaults to. Fine for "show me the price" use cases; loses information about the variant matrix.

Approach 2: Per-page extraction with variant URL crawl

Each variant has its own URL (?variant=123 or /sku-name). Crawl the variant URLs, extract each as a separate product record.

Slower but complete. Use when you need full variant data (price per color/size, stock per variant).

Approach 3: Variants as an array field

[
  { "field": "title",    "type": "string",        "example": "Acme T-Shirt" },
  { "field": "variants", "type": "array<string>", "example": ["{color: 'navy', size: 'M', price: 24.99, inStock: true}"] }
]

The LLM extracts the variant matrix as an array of stringified objects. You parse them downstream.

This works but is fragile across sites: some hide variant data in JS, some only show selected variants. For high-fidelity extraction, approach 2 is more reliable. For quick extraction, approach 1 is fine.
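Parsing the stringified objects from approach 3 takes a little care, because strings shaped like the example above (bare keys, single quotes) are not valid JSON. Here is a minimal sketch of a tolerant parser; `parse_loose_object` is a hypothetical helper name, and it assumes string values do not themselves contain single quotes or `, key:` sequences.

```python
import json
import re
from typing import Any


def parse_loose_object(text: str) -> dict[str, Any]:
    """Parse a stringified object, tolerating bare keys and single quotes."""
    try:
        return json.loads(text)  # already valid JSON: nothing to repair
    except json.JSONDecodeError:
        pass
    # Quote bare keys: {color: ...} -> {"color": ...}
    fixed = re.sub(r"([{,]\s*)([A-Za-z_]\w*)(\s*:)", r'\1"\2"\3', text)
    # Swap single-quoted strings for properly escaped double-quoted ones.
    fixed = re.sub(r"'([^']*)'", lambda m: json.dumps(m.group(1)), fixed)
    return json.loads(fixed)


raw_variants = ["{color: 'navy', size: 'M', price: 24.99, inStock: true}"]
variants = [parse_loose_object(item) for item in raw_variants]
```

The same helper should work for any other field in this post that returns an array of stringified objects. If a string still fails to parse after the repairs, `json.loads` raises, which is usually what you want: log the raw string and move on.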

Category pages: list extraction

Category pages list multiple products per page. The schema becomes:

[
  {
    "field": "products",
    "type": "array<string>",
    "example": ["{title: 'Widget A', price: 29.99, url: 'https://...'}"]
  }
]

The LLM returns an array of stringified product objects. Post-process into structured records.

For pagination, use /crawl with a follow pattern:

curl -X POST https://api.scrapewithruno.com/v1/crawl \
  -H "X-API-Key: $RUNO_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "seed_url": "https://example-shop.com/category/widgets",
    "schema": [...],
    "crawl": {
      "follow_pattern": "https://example-shop.com/category/widgets?page=*",
      "max_pages": 30,
      "max_depth": 1
    }
  }'

For high-fidelity catalog ingestion, use the category pages only for URL discovery, then call /extract per product URL with the full product schema. Two passes, but cleaner data.
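The URL-discovery half of that pipeline is mostly bookkeeping: resolve relative links, drop fragments, and de-duplicate before queuing product pages. A minimal sketch, assuming the category records have already been parsed into dicts with a `url` key (`product_urls` is a hypothetical helper name):

```python
from urllib.parse import urldefrag, urljoin


def product_urls(category_url: str, products: list[dict]) -> list[str]:
    """Collect absolute, de-duplicated product URLs in first-seen order."""
    seen: set[str] = set()
    urls: list[str] = []
    for product in products:
        href = product.get("url")
        if not href:
            continue  # the listing card had no link; skip it
        # Resolve relative links against the category page, then drop #fragments.
        absolute, _fragment = urldefrag(urljoin(category_url, href))
        if absolute not in seen:
            seen.add(absolute)
            urls.append(absolute)
    return urls
```

Each URL in the returned list then becomes one per-product extraction call with the full product schema.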

Edge cases that bite

Multi-currency sites

Some sites show prices in the visitor's currency (geo-detected). Pin the currency you want: set Accept-Language along with whatever currency header or cookie the site honors (Accept-Currency is not a standard HTTP header, so many sites ignore it), or use the URL parameter the site exposes (?currency=USD). Without this, you may get inconsistent currency across requests.
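The URL-parameter route can be sketched with the standard library alone. The parameter name `currency` is an assumption here; use whichever name the target site actually exposes.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Sent alongside the request; the locale value is only an example.
HEADERS = {"Accept-Language": "en-US"}


def pin_currency(url: str, currency: str = "USD", param: str = "currency") -> str:
    """Return url with the currency query parameter set, replacing any old value."""
    parts = urlsplit(url)
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != param
    ]
    query.append((param, currency))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

Rewriting the URL before every request keeps the currency stable even when the crawler follows links that carry a different value.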

"MSRP" vs "list" vs "sale"

Some sites display three prices: MSRP (manufacturer suggested), list (their normal), sale (current). Three fields:

[
  { "field": "msrp",      "type": "float", "example": 39.99 },
  { "field": "listPrice", "type": "float", "example": 29.99 },
  { "field": "salePrice", "type": "float", "example": 24.99 }
]

Or one price (current selling price) plus originalPrice (whatever the strikethrough is). The 2-field model is usually enough; the 3-field model is for retail-analytics use cases.
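If you extract the three-field model but a consumer only wants the two-field one, the collapse is a small policy function. The sketch below assumes the strikethrough is the nearest higher reference price (list price first, then MSRP); that choice is ours, so adjust it to match how your sources display prices.

```python
from typing import Optional


def to_two_field(
    msrp: Optional[float],
    list_price: Optional[float],
    sale_price: Optional[float],
) -> dict[str, Optional[float]]:
    """Collapse msrp/listPrice/salePrice into price + originalPrice."""
    # Current selling price: the sale price if there is one, else the list price.
    price = sale_price if sale_price is not None else list_price
    # Strikethrough candidates, nearest reference price first.
    candidates = [
        ref for ref in (list_price, msrp)
        if ref is not None and price is not None and ref > price
    ]
    original = candidates[0] if candidates else None
    return {"price": price, "originalPrice": original}
```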

Bundle pricing

"Buy 2 for $50" or "3-pack: $79.99". The "price" of the product is ambiguous. Decide what your downstream needs:

Per-unit price (parse and divide)
Bundle price (the displayed value)
Both, with pricePerUnit and bundlePrice fields

The LLM extracts what's on the page; the policy decision is yours.
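If you take the "both" option, the divide step can live downstream. A minimal sketch for the two phrasings quoted above; the regular expressions and the `parse_bundle` name are illustrative assumptions, and real sites will need more patterns.

```python
import re
from typing import Optional

# Hypothetical patterns covering "Buy 2 for $50" and "3-pack: $79.99".
BUNDLE_PATTERNS = (
    re.compile(r"buy\s+(?P<qty>\d+)\s+for\s+\$(?P<price>\d+(?:\.\d+)?)", re.IGNORECASE),
    re.compile(r"(?P<qty>\d+)-pack:?\s*\$(?P<price>\d+(?:\.\d+)?)", re.IGNORECASE),
)


def parse_bundle(text: str) -> Optional[dict[str, float]]:
    """Return quantity, bundlePrice and pricePerUnit, or None if no pattern matches."""
    for pattern in BUNDLE_PATTERNS:
        match = pattern.search(text)
        if match is None:
            continue
        quantity = int(match.group("qty"))
        if quantity == 0:
            return None  # avoid dividing by zero on malformed text
        bundle_price = float(match.group("price"))
        return {
            "quantity": quantity,
            "bundlePrice": bundle_price,
            "pricePerUnit": round(bundle_price / quantity, 2),
        }
    return None
```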

Subscription pricing

"$9.99/month" or "$99/year". The price is contextual. Add a pricingType field:

{ "field": "pricingType", "type": "string", "example": "subscription_monthly" }

With a hint: "One of: one_time, subscription_monthly, subscription_yearly, free". Helps downstream code branch correctly.
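Downstream, the branch on pricingType might look like this minimal sketch. It rejects values outside the hinted set, since a hint guides the LLM but does not guarantee the output; `monthly_cost` is a hypothetical helper name.

```python
from typing import Optional

# The allowed values from the hint.
PRICING_TYPES = {"one_time", "subscription_monthly", "subscription_yearly", "free"}


def monthly_cost(price: Optional[float], pricing_type: str) -> Optional[float]:
    """Recurring cost per month, or None when the product is not a subscription."""
    if pricing_type not in PRICING_TYPES:
        raise ValueError(f"unexpected pricingType: {pricing_type!r}")
    if price is None:
        return None
    if pricing_type == "subscription_monthly":
        return price
    if pricing_type == "subscription_yearly":
        return round(price / 12, 2)
    return None  # one_time and free have no recurring cost
```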

Per-quantity pricing (B2B)

Wholesale and B2B sites often show price as "1-9: $29.99, 10-49: $24.99, 50+: $19.99". The LLM can extract this as an array:

{
  "field": "tieredPricing",
  "type": "array<string>",
  "example": ["{minQty: 1, maxQty: 9, price: 29.99}"]
}

Parse downstream. For most B2C use cases you won't see this; for B2B catalog ingestion it's standard.
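Once the tier strings are parsed into dicts, looking up the unit price for an order quantity is a short loop. This sketch assumes an open-ended tier such as "50+" arrives with maxQty missing or null; that convention is ours, so check what your extraction actually returns.

```python
from typing import Optional


def price_for_quantity(tiers: list[dict], quantity: int) -> Optional[float]:
    """Return the unit price for quantity, or None if no tier covers it."""
    for tier in sorted(tiers, key=lambda t: t["minQty"]):
        upper = tier.get("maxQty")  # None means open-ended, e.g. "50+"
        if quantity >= tier["minQty"] and (upper is None or quantity <= upper):
            return tier["price"]
    return None


tiers = [
    {"minQty": 1, "maxQty": 9, "price": 29.99},
    {"minQty": 10, "maxQty": 49, "price": 24.99},
    {"minQty": 50, "maxQty": None, "price": 19.99},
]
```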

Out-of-stock with future restock

"Available June 15, 2026" is neither in-stock nor out-of-stock cleanly. Add an availabilityDate field:

{ "field": "availabilityDate", "type": "date", "example": "2026-06-15" }

Returns null for products available now. Returns the date for backorders/preorders. Combined with inStock: false it gives downstream a complete picture.
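One way downstream code might combine the two fields into a single status is sketched below. The status labels and the `availability_status` name are our own; the `today` parameter exists only to make the function testable.

```python
from datetime import date
from typing import Optional


def availability_status(
    in_stock: Optional[bool],
    availability_date: Optional[str],
    today: Optional[date] = None,
) -> str:
    """Combine inStock and availabilityDate into one status label."""
    today = today or date.today()
    if in_stock:
        return "in_stock"
    if availability_date is not None:
        # availabilityDate is expected as an ISO date, e.g. "2026-06-15".
        if date.fromisoformat(availability_date) > today:
            return "restock_scheduled"
    if in_stock is None:
        return "unknown"
    return "out_of_stock"
```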

Bundles vs grouped products

Some pages show "this item with accessories" (a bundle product) and others show "these accessories also available" (related products). Confusion is common. The fix: extract the focal product first, related items separately:

[
  { "field": "title",         "type": "string",        "example": "..." },
  { "field": "price",         "type": "float",         "example": 29.99 },
  { "field": "relatedProducts","type": "array<string>", "example": ["{title: '...', url: '...'}"] }
]

The LLM puts the main product in the top-level fields and other products in the array.

Reviews extraction

Reviews are usually paginated under the product page. Schema for reviews:

[
  { "field": "reviewerName",     "type": "string",  "example": "Sarah M." },
  { "field": "rating",           "type": "float",   "example": 5.0 },
  { "field": "reviewDate",       "type": "date",    "example": "2026-04-15" },
  { "field": "title",            "type": "string",  "example": "Love it" },
  { "field": "body",             "type": "string",  "example": "I bought this..." },
  { "field": "verifiedPurchase", "type": "boolean", "example": true },
  { "field": "helpfulCount",     "type": "integer", "example": 12 }
]

For per-page extraction (multiple reviews per page), wrap as an array. For sentiment analysis on top of this data, see sentiment analysis from product reviews.

Inventory and pricing tracking

For competitive price monitoring, you usually want fewer fields per product, more frequent extraction:

[
  { "field": "title",   "type": "string",  "example": "Acme Widget" },
  { "field": "price",   "type": "float",   "example": 29.99 },
  { "field": "inStock", "type": "boolean", "example": true },
  { "field": "sku",     "type": "string",  "example": "AW-PRO" }
]

Smaller schema, smaller extraction cost (~$0.0006 vs ~$0.001 per page on Runo Pro tier). At 100K SKUs tracked daily, that is roughly $60 versus $100 a day, so the savings add up.

The full pipeline for price monitoring is in monitoring competitor prices with a scraping API.

Image-rich products

Hotels, real estate, fashion. The page has photos that contain information not in the body text. Floor plans on real estate, food photos on menus, product shots that show details.

A vision-augmentation pass on top of text extraction handles this. Null fields from the text pass are re-evaluated against the page's top-scored images. Useful for:

Menu item ingredients (often only in the menu image)
Fashion product details (color callouts, fabric textures)
Real estate floor plan dimensions
Product spec sheets rendered as images

The marginal per-page cost is small (a few image-token reads on top of the text extraction). Some scraping APIs (Runo on Scale via process_images: true) ship this as a built-in option; rolling your own is a vision-model call against the top-N images on the page.

Schema versioning

When you change a schema in production, downstream consumers can break. Practical patterns:

Add fields, never remove or rename. Removing a field breaks downstream; adding doesn't.
Version the API surface, not the schema. Your API endpoints stay stable; the schema can evolve.
Static keys with bound schemas: use one key per major schema version. Old code keeps using the old key with the old schema; new code uses the new key. Migration is per-consumer.
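The "add fields, never remove or rename" rule is easy to enforce mechanically, for example in CI before a schema change ships. A minimal sketch follows; treating a type change as breaking is our own extension of the rule, and `breaking_changes` is a hypothetical helper name.

```python
def breaking_changes(old_schema: list[dict], new_schema: list[dict]) -> list[str]:
    """List changes in new_schema that would break consumers of old_schema."""
    new_types = {spec["field"]: spec.get("type") for spec in new_schema}
    problems = []
    for spec in old_schema:
        name = spec["field"]
        if name not in new_types:
            # A rename looks the same as a removal from the consumer's side.
            problems.append(f"removed or renamed: {name}")
        elif new_types[name] != spec.get("type"):
            problems.append(
                f"type changed: {name} ({spec.get('type')} -> {new_types[name]})"
            )
    return problems
```

An empty list means the new schema only adds fields, so existing consumers should keep working.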

Common mistakes

A few patterns to avoid:

Over-decomposing addresses or names

Don't:

[
  { "field": "addressStreet" },
  { "field": "addressCity" },
  { "field": "addressState" },
  { "field": "addressZip" }
]

Do:

[ { "field": "address", "type": "string", "example": "123 Main St, San Francisco, CA 94103" } ]

The LLM extracts addresses more reliably as a single field. Decompose downstream with usaddress or similar.

Asking for fields that aren't on the page

If you ask for manufacturerCountryOfOrigin and the page doesn't expose it, you get null 90% of the time. That's not a failure. It's the right answer. But don't be surprised. Pick fields that match what the source actually publishes.

Confusing "out of stock" with "discontinued"

These are different states. inStock: false covers both; if you need to distinguish, add:

{ "field": "isDiscontinued", "type": "boolean", "example": false }

Most sites don't expose this clearly, so the field will often be null. That's fine.

Treating `price` as a string

This is the single most common mistake. A string price ("$29.99") forces every downstream consumer to parse it. A float price (29.99) is one number you can sort and compare. Always use float.

Putting a full product schema together

For a comprehensive e-commerce extraction:

[
  { "field": "title",          "type": "string",        "example": "Acme Widget Pro" },
  { "field": "brand",          "type": "string",        "example": "Acme" },
  { "field": "sku",            "type": "string",        "example": "AW-PRO-2026" },
  { "field": "gtin",           "type": "string",        "example": "0123456789012" },
  { "field": "price",          "type": "float",         "example": 29.99 },
  { "field": "currency",       "type": "string",        "example": "USD" },
  { "field": "originalPrice",  "type": "float",         "example": 39.99 },
  { "field": "inStock",        "type": "boolean",       "example": true },
  { "field": "availabilityDate","type": "date",         "example": "2026-06-15" },
  { "field": "rating",         "type": "float",         "example": 4.5 },
  { "field": "reviewCount",    "type": "integer",       "example": 142 },
  { "field": "description",    "type": "string",        "example": "The Acme Widget Pro..." },
  { "field": "category",       "type": "string",        "example": "tools > hand tools > widgets" },
  { "field": "specifications", "type": "array<string>", "example": ["weight: 1.2 lbs", "material: steel"] },
  { "field": "imageUrls",      "type": "array<string>", "example": ["https://..."] }
]

These 15 fields cover what most e-commerce ingestion needs. Add domain-specific fields as needed.
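Before records reach a database, it can help to check each one against the schema that produced it. The sketch below assumes the type names used throughout this post map onto Python types as shown; that mapping and the `validate` helper are ours. Null is always accepted, since it is a legitimate answer for a field the page doesn't publish.

```python
from datetime import date


def _is_number(value) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def _is_iso_date(value) -> bool:
    if not isinstance(value, str):
        return False
    try:
        date.fromisoformat(value)
        return True
    except ValueError:
        return False


# Assumed mapping from the schema's type names to Python checks.
CHECKS = {
    "string": lambda v: isinstance(v, str),
    "float": _is_number,
    "integer": lambda v: isinstance(v, int) and not isinstance(v, bool),
    "boolean": lambda v: isinstance(v, bool),
    "date": _is_iso_date,
    "array<string>": lambda v: isinstance(v, list) and all(isinstance(i, str) for i in v),
}


def validate(record: dict, schema: list[dict]) -> list[str]:
    """Return one message per field whose value does not match its declared type."""
    errors = []
    for spec in schema:
        name, type_name = spec["field"], spec["type"]
        value = record.get(name)
        if value is None:
            continue  # null means "not on the page", which is a valid answer
        if not CHECKS[type_name](value):
            errors.append(f"{name}: expected {type_name}, got {value!r}")
    return errors
```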

TL;DR

Start with an 11-field spine: title, brand, SKU, price, currency, originalPrice, inStock, rating, reviewCount, description, imageUrls. Add as needed.
price is float, not string. Coercion handles "$29.99", "29,99 €". Always.
Variants: pick approach by use case. Default-only is fastest; per-variant URL crawl is most complete.
Edge cases: multi-currency (pin headers), bundle pricing (extract both forms), subscription (add pricingType), B2B tiered (tieredPricing array), pre-orders (availabilityDate).
Don't over-decompose addresses or names. One string is more reliable than four components.
Static keys with bound schemas cut payload and let you version schemas per consumer.
For image-heavy products (hotels, fashion, real estate), enable process_images: true on Runo Scale for vision augmentation.
Common mistake: asking for fields the source doesn't publish. The LLM returns null, which is correct.