E-commerce extraction looks like the easy case until you try to write one schema that works across Amazon, Shopify storefronts, BigCommerce, Magento, and a long tail of bespoke sites. The fields look the same; the variations bite.
This post collects the schema patterns we've seen work across thousands of e-commerce sites. The schemas are usable as-is; the principles transfer to any LLM-based extraction stack. Background on the schema model is in extracting structured JSON from any HTML.
The product page baseline#
Start with the spine that covers ~90% of product pages:
[
{ "field": "title", "type": "string", "example": "Acme Widget Pro" },
{ "field": "brand", "type": "string", "example": "Acme" },
{ "field": "sku", "type": "string", "example": "AW-PRO-2026" },
{ "field": "price", "type": "float", "example": 29.99 },
{ "field": "currency", "type": "string", "example": "USD" },
{ "field": "originalPrice", "type": "float", "example": 39.99 },
{ "field": "inStock", "type": "boolean", "example": true },
{ "field": "rating", "type": "float", "example": 4.5 },
{ "field": "reviewCount", "type": "integer", "example": 142 },
{ "field": "description", "type": "string", "example": "The Acme Widget Pro features..." },
{ "field": "imageUrls", "type": "array<string>", "example": ["https://..."] }
]
Notes on each field:
title: usually <h1> or product header. LLMs handle this well; rarely a problem.
brand: not always on the page; sometimes only in meta tags or breadcrumb. The LLM picks it up from JSON-LD if present, falls back to inference from the title ("Acme Widget Pro" → brand "Acme").
sku: many sites have multiple identifiers (SKU, MPN, GTIN, ASIN, ISBN). Pick the one your downstream system needs. If you need multiple, declare separate fields.
price: float, not string. Coercion handles "$29.99", "29,99 €", "USD 29.99". Don't try to capture the currency inside the price field; the LLM will sometimes confuse the two. Use a separate currency field.
originalPrice: only meaningful when the product is on sale. On full-price products, this returns null. That's correct: null distinguishes "no sale" from "we couldn't find a sale price."
inStock: boolean. Coerces from "In Stock", "Add to Cart", "Sold Out", "Backorder", "Available". Add a hint if your domain has unusual conventions: hint: "False if 'pre-order' or 'coming soon'".
rating / reviewCount: float for rating (4.5 stars is common), integer for count. Often surfaced in microdata; the structured fast path catches them without an LLM call.
imageUrls: array of all product image URLs. Don't try to filter to "the main one" in extraction. Capture everything, pick downstream.
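If you post-process raw availability text yourself, the inStock coercion described in the notes above can be sketched roughly as follows. This is an illustration under assumptions, not Runo's implementation: the function name coerce_in_stock and both phrase lists are hypothetical, and treating backorder and pre-order as false follows the hint shown for inStock.

```python
from typing import Optional

# Hypothetical phrase lists; extend them for your own domain.
NEGATIVE_PHRASES = (
    "sold out", "out of stock", "unavailable",
    "backorder", "pre-order", "preorder", "coming soon",
)
POSITIVE_PHRASES = ("in stock", "add to cart", "available")


def coerce_in_stock(text: str) -> Optional[bool]:
    """Map raw availability text to True, False, or None (unknown)."""
    lowered = text.strip().lower()
    # Check negatives first so "unavailable" never matches "available".
    if any(phrase in lowered for phrase in NEGATIVE_PHRASES):
        return False
    if any(phrase in lowered for phrase in POSITIVE_PHRASES):
        return True
    return None  # no recognizable signal; keep it null rather than guess


print(coerce_in_stock("Add to Cart"))
print(coerce_in_stock("Pre-order now"))
```

Returning None for unrecognized text keeps the same null semantics as the rest of the schema: unknown stays unknown.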
The variants problem#
Most products have variants: size, color, configuration. Schemas have to model this.
Three approaches, ordered by complexity:
Approach 1: Single product, default variant only#
[
{ "field": "title", "type": "string", "example": "Acme T-Shirt" },
{ "field": "price", "type": "float", "example": 24.99 },
{ "field": "color", "type": "string", "example": "navy" },
{ "field": "size", "type": "string", "example": "M" }
]
Returns whichever variant the product page defaults to. Fine for "show me the price" use cases; loses information about the variant matrix.
Approach 2: Per-page extraction with variant URL crawl#
Each variant has its own URL (?variant=123 or /sku-name). Crawl the variant URLs, extract each as a separate product record.
Slower but complete. Use when you need full variant data (price per color/size, stock per variant).
Approach 3: Variants as an array field#
[
{ "field": "title", "type": "string", "example": "Acme T-Shirt" },
{ "field": "variants", "type": "array<string>", "example": ["{color: 'navy', size: 'M', price: 24.99, inStock: true}"] }
]
The LLM extracts the variant matrix as an array of stringified objects. You parse them downstream.
This works but is fragile across sites: some hide variant data in JS, some only show selected variants. For high-fidelity extraction, approach 2 is more reliable. For quick extraction, approach 1 is fine.
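For approach 3, the downstream parse might look like the sketch below. It assumes the strings come back in the loose object style of the example (bare keys, single-quoted values); the actual output format depends on the model and the site, so treat parse_loose_object and parse_variants as hypothetical helpers and expect to adjust them.

```python
import json
import re
from typing import Any, Dict, List, Optional

# Bare keys such as `color:` that directly follow `{` or `,`.
_BARE_KEY = re.compile(r'([{,]\s*)([A-Za-z_][A-Za-z0-9_]*)\s*:')
# Simple single-quoted strings. Values containing apostrophes are NOT handled.
_SINGLE_QUOTED = re.compile(r"'([^']*)'")


def parse_loose_object(text: str) -> Optional[Dict[str, Any]]:
    """Parse "{color: 'navy', price: 24.99}" style strings into a dict."""
    try:
        parsed = json.loads(text)  # already valid JSON? use it directly
    except json.JSONDecodeError:
        # Re-quote string values first, then quote the bare keys.
        fixed = _SINGLE_QUOTED.sub(lambda m: json.dumps(m.group(1)), text)
        fixed = _BARE_KEY.sub(r'\1"\2":', fixed)
        try:
            parsed = json.loads(fixed)
        except json.JSONDecodeError:
            return None
    return parsed if isinstance(parsed, dict) else None


def parse_variants(raw_variants: List[str]) -> List[Dict[str, Any]]:
    """Keep only the entries that parse; drop the rest."""
    parsed = (parse_loose_object(item) for item in raw_variants)
    return [variant for variant in parsed if variant is not None]


print(parse_variants(["{color: 'navy', size: 'M', price: 24.99, inStock: true}"]))
```

Dropping unparseable entries silently is a policy choice; for catalog ingestion you may prefer to log them so that fragile sites surface early.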
Category pages: list extraction#
Category pages list multiple products per page. The schema becomes:
[
{
"field": "products",
"type": "array<string>",
"example": ["{title: 'Widget A', price: 29.99, url: 'https://...'}"]
}
]
The LLM returns an array of stringified product objects. Post-process into structured records.
For pagination, use /crawl with a follow pattern:
curl -X POST https://api.scrapewithruno.com/v1/crawl \
-H "X-API-Key: $RUNO_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"seed_url": "https://example-shop.com/category/widgets",
"schema": [...],
"crawl": {
"follow_pattern": "https://example-shop.com/category/widgets?page=*",
"max_pages": 30,
"max_depth": 1
}
}'
For high-fidelity catalog ingestion, use the category pages only for URL discovery, then call /extract per product URL with the full product schema. Two calls but cleaner data.
Edge cases that bite#
Multi-currency sites#
Some sites show prices in the visitor's currency (geo-detected). Pin the currency you want with whatever the site honors: the Accept-Language header, a currency header such as Accept-Currency (not a standard HTTP header, so it only works where the site implements it), or the URL parameter the site exposes (?currency=USD). Without this, you may get inconsistent currency across requests.
"MSRP" vs "list" vs "sale"#
Some sites display three prices: MSRP (manufacturer suggested), list (their normal), sale (current). Three fields:
[
{ "field": "msrp", "type": "float", "example": 39.99 },
{ "field": "listPrice", "type": "float", "example": 29.99 },
{ "field": "salePrice", "type": "float", "example": 24.99 }
]
Or one price (current selling price) plus originalPrice (whatever the strikethrough is). The 2-field model is usually enough; the 3-field model is for retail-analytics use cases.
Bundle pricing#
"Buy 2 for $50" or "3-pack: $79.99". The "price" of the product is ambiguous. Decide what your downstream needs:
- Per-unit price (parse and divide)
- Bundle price (the displayed value)
- Both, with pricePerUnit and bundlePrice fields
The LLM extracts what's on the page; the policy decision is yours.
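If you choose the per-unit option, the "parse and divide" step could look like this sketch. It only covers the two phrasings quoted above, assumes prices without thousands separators, and uses hypothetical names (BundlePrice, parse_bundle); real storefront copy will need more patterns.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class BundlePrice:
    quantity: int
    bundle_price: float

    @property
    def price_per_unit(self) -> float:
        # Rounded to cents; pick your own rounding policy if this matters.
        return round(self.bundle_price / self.quantity, 2)


# Patterns for "Buy 2 for $50" and "3-pack: $79.99" style text only.
_BUNDLE_PATTERNS = [
    re.compile(r"buy\s+(?P<qty>\d+)\s+for\s+\$?(?P<price>\d+(?:\.\d+)?)", re.I),
    re.compile(r"(?P<qty>\d+)\s*-?\s*pack\D*?(?P<price>\d+(?:\.\d+)?)", re.I),
]


def parse_bundle(text: str) -> Optional[BundlePrice]:
    """Return quantity and bundle price, or None if the text is not a bundle."""
    for pattern in _BUNDLE_PATTERNS:
        match = pattern.search(text)
        if match:
            return BundlePrice(int(match["qty"]), float(match["price"]))
    return None


for label in ("Buy 2 for $50", "3-pack: $79.99"):
    bundle = parse_bundle(label)
    print(label, "->", bundle.price_per_unit)
```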
Subscription pricing#
"$9.99/month" or "$99/year". The price is contextual. Add a pricingType field:
{ "field": "pricingType", "type": "string", "example": "subscription_monthly" }
With a hint: "One of: one_time, subscription_monthly, subscription_yearly, free". Helps downstream code branch correctly.
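As an illustration of that downstream branching, here is a minimal sketch that normalizes a price to a first-year cost. The four pricingType values come from the hint above; the function name first_year_cost and the choice to return None for unknown values are assumptions.

```python
from typing import Optional

PRICING_TYPES = {"one_time", "subscription_monthly", "subscription_yearly", "free"}


def first_year_cost(price: Optional[float], pricing_type: Optional[str]) -> Optional[float]:
    """Normalize a price to what the buyer pays over the first 12 months."""
    if pricing_type not in PRICING_TYPES:
        return None  # null or unexpected value: let the caller decide
    if pricing_type == "free":
        return 0.0
    if price is None:
        return None
    if pricing_type == "subscription_monthly":
        return round(price * 12, 2)
    # one_time and subscription_yearly both cost `price` in year one
    return price


print(first_year_cost(9.99, "subscription_monthly"))
print(first_year_cost(99.0, "subscription_yearly"))
```

Rejecting values outside the allowed set matters because the hint constrains the model but does not guarantee it.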
Per-quantity pricing (B2B)#
Wholesale and B2B sites often show price as "1-9: $29.99, 10-49: $24.99, 50+: $19.99". The LLM can extract this as an array:
{
"field": "tieredPricing",
"type": "array<string>",
"example": ["{minQty: 1, maxQty: 9, price: 29.99}"]
}
Parse downstream. For most B2C use cases you won't see this; for B2B catalog ingestion it's standard.
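Once the tier strings are parsed into dicts, looking up the price for a given order quantity is a short function. The sketch below assumes the minQty / maxQty / price keys from the example and assumes an open-ended tier ("50+") arrives with maxQty missing or null; unit_price_for is a hypothetical name.

```python
from typing import Any, Dict, List, Optional


def unit_price_for(quantity: int, tiers: List[Dict[str, Any]]) -> Optional[float]:
    """Return the unit price of the tier that contains `quantity`, if any."""
    for tier in tiers:
        min_qty = tier.get("minQty", 1)
        max_qty = tier.get("maxQty")  # None means open-ended, e.g. "50+"
        if quantity >= min_qty and (max_qty is None or quantity <= max_qty):
            return tier["price"]
    return None  # quantity falls outside every published tier


tiers = [
    {"minQty": 1, "maxQty": 9, "price": 29.99},
    {"minQty": 10, "maxQty": 49, "price": 24.99},
    {"minQty": 50, "maxQty": None, "price": 19.99},
]
print(unit_price_for(12, tiers))
```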
Out-of-stock with future restock#
"Available June 15, 2026" is neither in-stock nor out-of-stock cleanly. Add an availabilityDate field:
{ "field": "availabilityDate", "type": "date", "example": "2026-06-15" }
Returns null for products available now. Returns the date for backorders/preorders. Combined with inStock: false it gives downstream a complete picture.
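Downstream, the two fields can be folded into a single status. The sketch below is one possible policy, not a prescribed one; the status labels and the availability_status name are assumptions.

```python
from datetime import date
from typing import Optional


def availability_status(
    in_stock: Optional[bool],
    availability_date: Optional[date],
    today: date,
) -> str:
    """Combine inStock and availabilityDate into one downstream status."""
    if in_stock:
        return "in_stock"
    if availability_date is not None and availability_date > today:
        return "restock_scheduled"  # backorder or pre-order with a known date
    if in_stock is None:
        return "unknown"  # the page gave no stock signal at all
    return "out_of_stock"


print(availability_status(False, date(2026, 6, 15), today=date(2026, 5, 1)))
```

Passing today explicitly keeps the function deterministic, which makes it easier to test and to re-run over historical snapshots.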
Bundles vs grouped products#
Some pages show "this item with accessories" (a bundle product) and others show "these accessories also available" (related products). Confusion is common. The fix: extract the focal product first, related items separately:
[
{ "field": "title", "type": "string", "example": "..." },
{ "field": "price", "type": "float", "example": 29.99 },
{ "field": "relatedProducts","type": "array<string>", "example": ["{title: '...', url: '...'}"] }
]
The LLM puts the main product in the top-level fields and other products in the array.
Reviews extraction#
Reviews are usually paginated under the product page. Schema for reviews:
[
{ "field": "reviewerName", "type": "string", "example": "Sarah M." },
{ "field": "rating", "type": "float", "example": 5.0 },
{ "field": "reviewDate", "type": "date", "example": "2026-04-15" },
{ "field": "title", "type": "string", "example": "Love it" },
{ "field": "body", "type": "string", "example": "I bought this..." },
{ "field": "verifiedPurchase", "type": "boolean", "example": true },
{ "field": "helpfulCount", "type": "integer", "example": 12 }
]
For per-page extraction (multiple reviews per page), wrap as an array. For sentiment analysis on top of this data, see sentiment analysis from product reviews.
Inventory and pricing tracking#
For competitive price monitoring, you usually want fewer fields per product, more frequent extraction:
[
{ "field": "title", "type": "string", "example": "Acme Widget" },
{ "field": "price", "type": "float", "example": 29.99 },
{ "field": "inStock", "type": "boolean", "example": true },
{ "field": "sku", "type": "string", "example": "AW-PRO" }
]
Smaller schema, smaller extraction cost (~$0.0006 vs ~$0.001 per page on Runo Pro tier). At 100K SKUs tracked daily, the per-page savings add up.
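To make that concrete, here is the arithmetic using the approximate per-page prices quoted above. The figures are the post's rough numbers, not a pricing guarantee, and the 30-day month is an assumption for illustration.

```python
SKUS_PER_DAY = 100_000
FULL_SCHEMA_COST = 0.001   # approx. USD per page, full product schema
SLIM_SCHEMA_COST = 0.0006  # approx. USD per page, 4-field monitoring schema

daily_full = SKUS_PER_DAY * FULL_SCHEMA_COST
daily_slim = SKUS_PER_DAY * SLIM_SCHEMA_COST
daily_saving = daily_full - daily_slim
monthly_saving = daily_saving * 30  # assuming a 30-day month

print(f"full schema:  ${daily_full:.2f}/day")
print(f"slim schema:  ${daily_slim:.2f}/day")
print(f"saving:       ${daily_saving:.2f}/day, ${monthly_saving:.2f}/month")
```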
The full pipeline for price monitoring is in monitoring competitor prices with a scraping API.
Image-rich products#
Hotels, real estate, fashion. The page has photos that contain information not in the body text. Floor plans on real estate, food photos on menus, product shots that show details.
A vision-augmentation pass on top of text extraction handles this. Null fields from the text pass are re-evaluated against the page's top-scored images. Useful for:
- Menu item ingredients (often only in the menu image)
- Fashion product details (color callouts, fabric textures)
- Real estate floor plan dimensions
- Product spec sheets rendered as images
The marginal per-page cost is small (a few image-token reads on top of the text extraction). Some scraping APIs (Runo on Scale via process_images: true) ship this as a built-in option; rolling your own is a vision-model call against the top-N images on the page.
Schema versioning#
When you change a schema in production, downstream consumers can break. Practical patterns:
- Add fields, never remove or rename. Removing a field breaks downstream; adding doesn't.
- Version the API surface, not the schema. Your API endpoints stay stable; the schema can evolve.
- Static keys with bound schemas: use one key per major schema version. Old code keeps using the old key with the old schema; new code uses the new key. Migration is per-consumer.
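The "add fields, never remove or rename" rule is easy to enforce mechanically, for example in CI before a schema change ships. The sketch below assumes schemas in the field/type list format used throughout this post; breaking_changes is a hypothetical helper, and flagging type changes as breaking is an added assumption.

```python
from typing import Any, Dict, List

Schema = List[Dict[str, Any]]


def breaking_changes(old: Schema, new: Schema) -> List[str]:
    """List changes in `new` that would break consumers of `old`."""
    new_types = {field["field"]: field.get("type") for field in new}
    problems = []
    for field in old:
        name = field["field"]
        if name not in new_types:
            # A rename looks identical to a removal from the consumer's side.
            problems.append(f"removed or renamed: {name}")
        elif new_types[name] != field.get("type"):
            problems.append(
                f"type changed: {name} ({field.get('type')} -> {new_types[name]})"
            )
    return problems  # an empty list means the change is purely additive


v1 = [{"field": "title", "type": "string"}, {"field": "price", "type": "float"}]
v2 = [{"field": "title", "type": "string"}, {"field": "price", "type": "string"},
      {"field": "gtin", "type": "string"}]
print(breaking_changes(v1, v2))
```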
Common mistakes#
A few patterns to avoid:
Over-decomposing addresses or names#
Don't:
[
{ "field": "addressStreet" },
{ "field": "addressCity" },
{ "field": "addressState" },
{ "field": "addressZip" }
]
Do:
[ { "field": "address", "type": "string", "example": "123 Main St, San Francisco, CA 94103" } ]
The LLM extracts addresses more reliably as a single field. Decompose downstream with usaddress or similar.
Asking for fields that aren't on the page#
If you ask for manufacturerCountryOfOrigin and the page doesn't expose it, you get null 90% of the time. That's not a failure. It's the right answer. But don't be surprised. Pick fields that match what the source actually publishes.
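One way to find out what a source actually publishes is to measure per-field null rates over a sample of extracted records. A minimal sketch, assuming records are plain dicts keyed by field name (null_rates is a hypothetical helper):

```python
from collections import Counter
from typing import Any, Dict, List


def null_rates(records: List[Dict[str, Any]], fields: List[str]) -> Dict[str, float]:
    """Fraction of records in which each field is missing or null."""
    if not records:
        return {}
    nulls = Counter()
    for record in records:
        for field in fields:
            if record.get(field) is None:
                nulls[field] += 1
    return {field: nulls[field] / len(records) for field in fields}


sample = [
    {"title": "Widget A", "price": 29.99, "manufacturerCountryOfOrigin": None},
    {"title": "Widget B", "price": 19.99, "manufacturerCountryOfOrigin": None},
    {"title": "Widget C", "price": None, "manufacturerCountryOfOrigin": "DE"},
    {"title": "Widget D", "price": 9.99},
]
print(null_rates(sample, ["title", "price", "manufacturerCountryOfOrigin"]))
```

A field that is null in nearly every record for a given site is a candidate to drop from that site's schema.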
Confusing "out of stock" with "discontinued"#
These are different states. inStock: false covers both; if you need to distinguish, add:
{ "field": "isDiscontinued", "type": "boolean", "example": false }
Most sites don't expose this clearly, so the field will often be null. That's fine.
Treating price as a string#
This is the single most common mistake. A string price ("$29.99") forces every downstream consumer to parse it. A float price (29.99) is one number you can sort and compare. Always use float.
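If you already hold string prices from an older pipeline, a one-time normalization could look like the sketch below. It is not the extractor's coercion logic, just an illustration; parse_price is a hypothetical helper, and it assumes a lone dot is a decimal point, so a European "1.299" would be misread as 1.299.

```python
import re
from typing import Optional

_NUMBER = re.compile(r"\d[\d.,\s]*")


def parse_price(raw: str) -> Optional[float]:
    """Best-effort conversion of a displayed price string to a float."""
    match = _NUMBER.search(raw)
    if match is None:
        return None
    num = re.sub(r"\s", "", match.group(0)).rstrip(".,")
    last_dot, last_comma = num.rfind("."), num.rfind(",")
    if last_dot != -1 and last_comma != -1:
        # Both present: the rightmost separator is the decimal mark.
        decimal, thousands = (".", ",") if last_dot > last_comma else (",", ".")
        num = num.replace(thousands, "").replace(decimal, ".")
    elif last_comma != -1:
        # Only commas: decimal if exactly two digits follow, else thousands.
        digits_after = len(num) - last_comma - 1
        num = num.replace(",", ".") if digits_after == 2 else num.replace(",", "")
    try:
        return float(num)
    except ValueError:
        return None


for text in ("$29.99", "29,99 €", "USD 29.99", "1.299,00 €"):
    print(text, "->", parse_price(text))
```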
Putting a full product schema together#
For a comprehensive e-commerce extraction:
[
{ "field": "title", "type": "string", "example": "Acme Widget Pro" },
{ "field": "brand", "type": "string", "example": "Acme" },
{ "field": "sku", "type": "string", "example": "AW-PRO-2026" },
{ "field": "gtin", "type": "string", "example": "0123456789012" },
{ "field": "price", "type": "float", "example": 29.99 },
{ "field": "currency", "type": "string", "example": "USD" },
{ "field": "originalPrice", "type": "float", "example": 39.99 },
{ "field": "inStock", "type": "boolean", "example": true },
{ "field": "availabilityDate","type": "date", "example": "2026-06-15" },
{ "field": "rating", "type": "float", "example": 4.5 },
{ "field": "reviewCount", "type": "integer", "example": 142 },
{ "field": "description", "type": "string", "example": "The Acme Widget Pro..." },
{ "field": "category", "type": "string", "example": "tools > hand tools > widgets" },
{ "field": "specifications", "type": "array<string>", "example": ["weight: 1.2 lbs", "material: steel"] },
{ "field": "imageUrls", "type": "array<string>", "example": ["https://..."] }
]
These 15 fields cover what most e-commerce ingestion needs. Add domain-specific fields as needed.
TL;DR#
- Start with an 11-field spine: title, brand, SKU, price, currency, originalPrice, inStock, rating, reviewCount, description, imageUrls. Add as needed.
- price is float, not string. Coercion handles "$29.99", "29,99 €". Always.
- Variants: pick approach by use case. Default-only is fastest; per-variant URL crawl is most complete.
- Edge cases: multi-currency (pin headers), bundle pricing (extract both forms), subscription (add pricingType), B2B tiered (tieredPricing array), pre-orders (availabilityDate).
- Don't over-decompose addresses or names. One string is more reliable than four components.
- Static keys with bound schemas cut payload and let you version schemas per consumer.
- For image-heavy products (hotels, fashion, real estate), enable process_images: true on Runo Scale for vision augmentation.
- Common mistake: asking for fields the source doesn't publish. The LLM returns null, which is correct.