Scraping JavaScript-heavy SPAs: Next.js, Nuxt, and React in 2026

You curl a product page. The response is 200 OK with 12KB of HTML. You parse it. There's no product. There's no price. There's a <div id="__next"></div> and a giant <script> tag.

This is the most common silent failure in web scraping in 2026. The page renders fine in your browser because your browser runs JavaScript. Your scraper doesn't, so it gets the loading skeleton. Below is what actually happens, what your options are, and which option is the cheap one most people miss.

What "single-page app" means for your scraper

Modern frontends ship in three rough flavours:

  1. SSR (server-side rendering): the server runs the framework and returns fully-formed HTML with the data inlined. Your plain fetch works.
  2. CSR (client-side rendering): the server returns an empty shell. The browser fetches data over XHR/fetch and renders it. Your plain fetch returns the shell.
  3. Hybrid (SSG + hydration / partial SSR / streaming): the server returns HTML plus a JSON blob that hydrates the client. Your plain fetch returns the HTML and the blob.

Most production sites in 2026 are hybrid. Next.js, Nuxt, Remix, SvelteKit, and Astro all ship server-rendered HTML by default. Pure-CSR React apps still exist, but mainly for internal dashboards, not public-facing pages with SEO needs.

The hybrid case is good news for you. The data your scraper wants is probably already in the response, just not where you're looking.

The blob you're missing

Look for these script tags in the HTML you already have:

Framework | Script ID / shape
--- | ---
Next.js | <script id="__NEXT_DATA__" type="application/json">
Nuxt 3 | <script id="__NUXT_DATA__" type="application/json">
Nuxt 2 | <script>window.__NUXT__=...</script>
Remix | <script>window.__remixContext = ...</script>
SvelteKit | <script type="application/json" data-sveltekit-fetched> (per-fetch payloads)
Apollo Client | <script>window.__APOLLO_STATE__=...</script>
Generic Redux | <script>window.__INITIAL_STATE__=...</script>

The contents are JSON describing the props the framework used to render the page. For a product page, that JSON usually contains the entire product object, often with more fields than the rendered page exposes (internal IDs, full description text, A/B variant flags, related items).

If your schema fields are in the blob, you don't need a headless browser at all. Plain HTTP fetch, regex out the script tag, parse JSON, done. That is roughly an order of magnitude faster than rendering, and close to free.
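The lookup above can be sketched as one helper that checks for several of these markers at once. This is a minimal sketch, not a complete parser: the function name find_state_blobs is hypothetical, it only handles the id-based JSON scripts and the window.X = {...} assignments from the table, and it assumes the assigned value is a plain JSON literal. Sites that assign a function expression (common with window.__NUXT__) are skipped rather than evaluated.

```python
import json
import re

# Script ids whose content is expected to be pure JSON.
JSON_SCRIPT_IDS = ("__NEXT_DATA__", "__NUXT_DATA__")
# Globals that some sites assign a JSON literal to.
WINDOW_GLOBALS = ("__APOLLO_STATE__", "__INITIAL_STATE__", "__NUXT__")


def find_state_blobs(html: str) -> dict:
    """Return {marker: parsed_value} for every blob that parses as JSON."""
    found = {}
    for script_id in JSON_SCRIPT_IDS:
        pattern = r'<script\b[^>]*\bid="%s"[^>]*>(.*?)</script>' % re.escape(script_id)
        m = re.search(pattern, html, re.DOTALL)
        if m:
            try:
                found[script_id] = json.loads(m.group(1))
            except json.JSONDecodeError:
                pass  # tag is present but the content is not valid JSON
    decoder = json.JSONDecoder()
    for name in WINDOW_GLOBALS:
        m = re.search(r'window\.%s\s*=\s*' % re.escape(name), html)
        if not m:
            continue
        try:
            # raw_decode parses one JSON value starting at the given index
            # and ignores whatever follows it (a trailing ";" or more code).
            value, _ = decoder.raw_decode(html, m.end())
        except json.JSONDecodeError:
            continue  # e.g. a function expression rather than a JSON literal
        found[name] = value
    return found
```

One caveat on shape: a Nuxt 3 payload typically parses as a flat array of cross-references rather than a nested object, so expect an extra decoding step there before you can read fields by name.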

A worked example: extracting from __NEXT_DATA__

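import json
import re

import httpx

# Plain HTTP fetch: no browser, no JavaScript execution.
resp = httpx.get("https://example.com/product/widget-pro", follow_redirects=True)
resp.raise_for_status()

# Next.js inlines the page props as JSON in this script tag.
m = re.search(
    r'<script id="__NEXT_DATA__"[^>]*>(.+?)</script>',
    resp.text, re.DOTALL
)
if m:
    data = json.loads(m.group(1))
    # The key path below is site-specific; inspect the blob to find yours.
    product = data["props"]["pageProps"]["product"]
    print(product["price"], product["title"], product["sku"])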

A handful of lines for the path that works. The path that doesn't work involves spinning up Chromium, waiting for network idle, taking a DOM snapshot, and re-parsing: several seconds slower per page and roughly 100MB more memory.
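The key path props.pageProps.product is specific to each site, and it moves when the site refactors. If you'd rather not hard-code it, one option is to search the parsed blob for the first object that carries the fields you want. A minimal sketch, assuming the target object holds all wanted keys at the same level (find_first is a hypothetical helper name):

```python
def find_first(node, wanted_keys):
    """Depth-first search for the first dict that contains every wanted key."""
    if isinstance(node, dict):
        if all(key in node for key in wanted_keys):
            return node
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        return None  # scalars cannot contain the target
    for child in children:
        hit = find_first(child, wanted_keys)
        if hit is not None:
            return hit
    return None
```

Called as find_first(data, ("price", "sku")), it returns the matching object or None. It trades precision for resilience: if several objects carry the same keys (related items, for instance), you get whichever comes first in document order, so check the result against a known page.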

If you're using Runo or any LLM-based extractor, the script blob is fed to the LLM as part of the cleaned page content. The LLM picks values out of the JSON automatically without you having to know the framework's data shape. This is one of the quiet wins of LLM extraction over selector-based scrapers, covered in LLM extraction vs CSS selectors.

When the blob isn't enough

Some pages truly defer data fetch to the client. Symptoms:

  • The HTML response is small (under 5KB body content)
  • No __NEXT_DATA__ / __NUXT_DATA__ / equivalent script
  • Visible XHR or fetch() calls in browser devtools fetching the data after page load

Three options, in order of cost:

1. Call the JSON API directly

Open your browser's network tab, find the XHR that returns the data, copy the URL and headers. Most of the time you can hit it directly.

import httpx

# URL and headers copied from the request seen in the browser's network tab.
data = httpx.get(
    "https://example.com/api/products/widget-pro",
    headers={"Accept": "application/json", "X-Requested-With": "XMLHttpRequest"},
).json()

This is usually the cheapest, fastest, and most reliable approach. The downsides:

  • The API may require auth tokens minted by the page
  • The API may rate-limit aggressively
  • The API surface can change without notice (no SEO incentive to keep it stable)

For one-off scraping, the risks are tolerable. For production pipelines against many sites, the per-site reverse engineering cost adds up fast.
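Of the three downsides, rate limiting is the one you can handle generically. Here is a minimal sketch of a polite retry loop, assuming the API signals throttling with 429 or 503 and may send a numeric Retry-After header. The function name and the send callable (which returns a status, headers, body tuple) are hypothetical; injecting it keeps the retry logic independent of the HTTP client.

```python
import time


def fetch_json_with_backoff(send, url, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call send(url) -> (status, headers, body) and retry only on 429/503."""
    for attempt in range(max_attempts):
        status, headers, body = send(url)
        if status == 200:
            return body
        if status in (401, 403):
            # Likely a page-minted token or an anti-bot block; retrying won't help.
            raise PermissionError(f"{url} returned {status}")
        if status not in (429, 503):
            raise RuntimeError(f"{url} returned unexpected status {status}")
        if attempt == max_attempts - 1:
            break  # out of attempts, no point sleeping again
        # Honor Retry-After when it is a number of seconds, else back off exponentially.
        retry_after = str(headers.get("Retry-After", ""))
        delay = float(retry_after) if retry_after.isdigit() else base_delay * 2 ** attempt
        sleep(delay)
    raise RuntimeError(f"{url} still rate-limited after {max_attempts} attempts")
```

With httpx, send can be a small wrapper that returns the response's status_code, headers, and parsed JSON body (or None on a non-200 response). A 401 or 403 is raised immediately because it usually points at the auth-token problem, which needs a different fix than waiting.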

2. Render with a headless browser

If the API path is blocked or the data is computed client-side, you need a real browser. Playwright with stealth patches handles 90% of this:

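import asyncio
from playwright.async_api import async_playwright

async def render(url: str) -> str:
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        try:
            page = await browser.new_page()
            # Cap the wait: networkidle can hang on pages that keep connections open.
            await page.goto(url, wait_until="networkidle", timeout=30_000)
            return await page.content()
        finally:
            await browser.close()

html = asyncio.run(render("https://example.com/product/widget-pro"))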

Costs: ~2-5 seconds per page, plus memory and CPU for the browser. It is usually worth pre-warming a browser instance and reusing tabs.

The traps to watch for: wait_until="networkidle" can hang on sites that keep long-polling connections open; cap with a timeout. Modern anti-bot vendors detect headless Chromium via subtle DOM/navigator inconsistencies. Use patchright or playwright-stealth to patch the most common signals. We covered the full bypass story in how to scrape Cloudflare-protected sites.

3. Use a scraping API that auto-escalates

Plain fetch is free, fast, and works for ~80% of pages. Headless rendering is slow and expensive but works for the remaining 20%. The right architecture tries cheap first, escalates only when needed.

Runo's render_js: "auto" mode does this transparently. Plain fetch first; if the response looks empty (block signature, framework markers, near-empty body), escalate to stealth headless. The caller sees the result either way; only the latency tells you which path ran.

curl -X POST https://api.scrapewithruno.com/v1/extract \
  -H "X-API-Key: $RUNO_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/product/widget-pro",
    "schema": [
      {"field": "title", "type": "string", "example": "Widget Pro"},
      {"field": "price", "type": "float",  "example": 29.99}
    ],
    "options": {"render_js": "auto"}
  }'

The response includes render_mode: "fetch" | "headless" so you know which path ran without changing your code.

Detection heuristics: when to escalate

If you're rolling your own, the rules of thumb that work in practice:

  • Body under 500 chars: probably a shell, escalate
  • Visible text under 200 chars with HTML over 5KB: probably a JS-rendered page, escalate
  • Presence of framework markers (__NEXT_DATA__, __NUXT_DATA__, data-reactroot, ng-version): try the script blob first, escalate if blob is missing or doesn't contain your fields
  • HTTP 403 / 429 / 503 with anti-bot signature: don't escalate yet; try TLS impersonation first (curl_cffi), then headless

The last point matters more than people expect. A lot of "JS-required" failures are actually fingerprint failures at the TLS layer: the server rejects the client during or right after the handshake, before any JavaScript would have had a chance to run. Your fix is curl_cffi impersonating a real Chrome TLS handshake, not a headless browser.
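These rules of thumb fit in one small function. The sketch below uses only the standard library. The names next_step and visible_text are hypothetical, character counts stand in for byte sizes, and any 403/429/503 is treated as a possible block without inspecting the body for a vendor signature, which a real implementation would want to do. Status is checked first because block pages are often short enough to be mistaken for an empty shell.

```python
from html.parser import HTMLParser

FRAMEWORK_MARKERS = ("__NEXT_DATA__", "__NUXT_DATA__", "data-reactroot", "ng-version")


class _VisibleText(HTMLParser):
    """Collects text that is not inside <script> or <style>."""
    HIDDEN = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HIDDEN:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.HIDDEN and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if not self.depth:
            self.parts.append(data)


def visible_text(html):
    """Return the page's non-script text with whitespace collapsed."""
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def next_step(status, html):
    """Map a plain-fetch result to 'ok', 'try_blob', 'tls_impersonation', or 'headless'."""
    if status in (403, 429, 503):
        return "tls_impersonation"  # possible fingerprint block, not a JS problem
    if any(marker in html for marker in FRAMEWORK_MARKERS):
        return "try_blob"  # escalate later if the blob lacks your fields
    if len(html) < 500:
        return "headless"  # almost certainly an empty shell
    if len(html) > 5_000 and len(visible_text(html)) < 200:
        return "headless"  # lots of markup, almost no readable text
    return "ok"
```

The thresholds are the ones from the list above; tune them against pages you know are good and pages you know are shells.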

Pagination, infinite scroll, and click-to-load

The other class of "broken on plain fetch" pages is infinite scroll and click-to-load. The first viewport renders fine; everything below requires user interaction.

Three patterns:

  1. Find the underlying API: same as above. Infinite scroll usually paginates with ?page=N or a cursor, easy to enumerate.
  2. Headless with scroll loop: render the page, scroll to bottom in a loop until height stops changing or you hit a max scroll count. Works but slow.
  3. Headless with click loop: simulate clicks on "Load more" until the button disappears. Fragile; selectors break.

For most use cases, option 1 is the cheap one. Option 2 is the catch-all. Option 3 is rarely worth the maintenance.
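Option 1 usually reduces to a short loop. A minimal sketch for the cursor-based case, assuming each page of the API response is a dict with an "items" list and a "next_cursor" value; both field names are hypothetical, so substitute whatever the network tab shows. The fetch_page callable is injected so the loop stays independent of the HTTP client.

```python
def iter_items(fetch_page, max_pages=100):
    """Yield items from a cursor-paginated API until the cursor runs out."""
    cursor = None  # None means "first page"
    seen = set()
    for _ in range(max_pages):  # hard cap so a misbehaving API can't loop forever
        page = fetch_page(cursor)
        yield from page["items"]
        cursor = page.get("next_cursor")
        if not cursor or cursor in seen:
            return  # no more pages, or the API handed back a cursor we already used
        seen.add(cursor)
```

For ?page=N style pagination the same shape works with an integer counter in place of the cursor, stopping when a page comes back empty.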

Frameworks that do their own thing

A few frameworks need special handling:

  • Astro with islands: most content is server-rendered HTML; only specific interactive components hydrate. Plain fetch usually works.
  • Qwik: aggressively code-split with resumability. Initial HTML is complete; you don't need to render JS for content extraction.
  • Phoenix LiveView: server pushes HTML diffs over WebSocket. The initial HTML is complete; later updates require the WebSocket connection. For scraping the initial state, plain fetch works.

The pattern: most modern frameworks ship complete HTML by default for SEO reasons. Pure-CSR is rarer than it was in the React-with-CRA era.

Cost comparison

For a 1M-page-per-month scraping job hitting hybrid SSR sites:

Approach | Per-page time | Per-page cost | Reliability
--- | --- | --- | ---
Plain fetch + script blob extraction | ~0.5s | ~$0.00001 | High when blob present
API direct call (when reverse-engineered) | ~0.3s | ~$0.00001 | High; brittle to API changes
Headless browser (always) | ~3-5s | ~$0.0003 | High
Auto-escalating fetch then headless | ~0.8s avg | ~$0.00007 avg | High
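The blended cost in the last row can be sanity-checked with a few lines of arithmetic. This sketch assumes the roughly 20% escalation rate quoted earlier and that an escalated page pays for both the failed fetch and the headless render; both are assumptions about how the average was derived, not something the table states.

```python
PAGES_PER_MONTH = 1_000_000
FETCH_COST = 0.00001      # per page, from the table
HEADLESS_COST = 0.0003    # per page, from the table
ESCALATION_RATE = 0.20    # assumption: share of pages that need headless


def blended_cost(fetch_cost, headless_cost, escalation_rate):
    """Every page pays for the fetch; escalated pages also pay for headless."""
    return fetch_cost + escalation_rate * headless_cost


auto = blended_cost(FETCH_COST, HEADLESS_COST, ESCALATION_RATE)
print(f"auto-escalating:  ${auto:.5f}/page, ${auto * PAGES_PER_MONTH:,.0f}/month")
print(f"always headless:  ${HEADLESS_COST * PAGES_PER_MONTH:,.0f}/month")
print(f"plain fetch only: ${FETCH_COST * PAGES_PER_MONTH:,.0f}/month")
```

Under those assumptions the blended figure comes out at about $0.00007 per page, or roughly $70 a month against $300 for always-headless. The average latency depends on your real escalation rate, so measure it rather than deriving it.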

Of the approaches that handle every page, the auto-escalating path wins on cost and latency at scale. Doing it yourself means writing block-detection heuristics and a per-host memory of what tier worked. Doing it via Runo means setting render_js: "auto" and not thinking about it.
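If you do roll your own, the per-host memory is the part people skip. A minimal sketch of the idea, with hypothetical names throughout: each tier is a (name, fetch) pair where fetch(url) returns HTML or None, and looks_ok is whatever emptiness or block check you trust. Nothing here reflects how any particular service implements it.

```python
from urllib.parse import urlsplit


class EscalatingFetcher:
    """Tries tiers cheapest-first and remembers which tier worked per host."""

    def __init__(self, tiers, looks_ok):
        self.tiers = tiers        # ordered list of (name, fetch_callable)
        self.looks_ok = looks_ok  # callable(html) -> bool
        self.start_tier = {}      # host -> index of the tier that last worked

    def fetch(self, url):
        host = urlsplit(url).netloc
        first = self.start_tier.get(host, 0)
        for index in range(first, len(self.tiers)):
            name, fetch = self.tiers[index]
            html = fetch(url)
            if html is not None and self.looks_ok(html):
                self.start_tier[host] = index  # skip cheaper tiers next time
                return name, html
        raise RuntimeError(f"all tiers failed for {url}")
```

A typical ladder would be plain fetch, then TLS impersonation, then headless. This version never steps back down once a host has been escalated; a production version would expire the memory now and then, since sites do loosen their defenses.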

What to verify when you're stuck

If a scrape returns empty data on a JS-heavy site, work through this checklist before reaching for the headless hammer:

  1. Search the response HTML for __NEXT_DATA__, __NUXT_DATA__, __APOLLO_STATE__. If present, parse the JSON.
  2. Open browser devtools, check the network tab for an XHR returning the data. If found, call it directly.
  3. Check response status: 403 / 429 / 503 means anti-bot, not JS. Try curl_cffi with impersonate="chrome124" first.
  4. Check response body size. Under 500 chars? Probably needs JS. Over 30KB but mostly script tags? Look for the data blob.
  5. Only after the above: spin up Playwright.

TL;DR

  • "JS-heavy" sites usually still ship the data in the HTML response, in a __NEXT_DATA__ / __NUXT_DATA__ / equivalent script blob. Parse the JSON; skip the headless browser.
  • For real CSR pages, try the underlying JSON API before headless rendering. It's faster, cheaper, more reliable.
  • Headless is the catch-all but the slowest and most expensive option. Use stealth patches (patchright, playwright-stealth) to avoid fingerprint blocks.
  • A lot of "JS-required" failures are actually TLS fingerprint failures. Try curl_cffi impersonation before reaching for Chromium.
  • Build an auto-escalating fetcher (plain → TLS impersonation → headless) or use one. Runo's render_js: "auto" ships this server-side.