How to scrape Cloudflare-protected sites without getting blocked

If you've tried to scrape a Cloudflare-protected site this year, you know the routine. curl returns the "Verifying you are human" interstitial. Headless Chrome gets caught by the bot manager. Residential proxies help on some sites and don't on others. The bypass landscape is messier than it was even six months ago because the defender side is iterating fast, and a lot of the public scraping advice is from 2023 and no longer works.

This is what's actually working in 2026. The pattern is a ladder of techniques ordered by cost and complexity: start with the free ones and reach for paid tools only when the cheap ones run out.

What "Cloudflare bot protection" actually checks

Cloudflare runs a stack, not a check. Knowing the layers tells you which technique addresses which signal.

IP reputation. Datacenter IPs (AWS, GCP, DigitalOcean) get scored hostile by default. Residential IPs are scored by their history.
TLS fingerprint (JA3/JA4). The way your client negotiates TLS (cipher order, extension order, signature algorithms) identifies the library you're using, regardless of the user-agent string. requests, httpx, Go net/http, and Chrome all have distinct, identifiable fingerprints.
HTTP/2 fingerprint. Frame settings, header order, priority frames. Same idea as TLS, one layer up.
Header signature. Real browsers send 15+ headers in a specific order with realistic values (sec-ch-ua-*, accept-language, sec-fetch-*). Most scraping libraries send 4–6.
JavaScript challenge. A small JS payload runs in your page context and the result is checked. Headless browsers can run it; plain HTTP clients cannot.
Browser fingerprint. Once JS runs, Cloudflare can fingerprint the browser via canvas, WebGL, audio context, screen properties, font enumeration, navigator properties. Headless Chromium leaks here unless heavily patched.
Behavioural signals. Mouse movement patterns, request cadence, page-time, click positions. Hard to fake at scale.
CAPTCHA / Turnstile. The escalation path when the score is ambiguous. Solvable, at a cost.

Each layer rejects you for a different reason. There is no single "Cloudflare bypass". There's a ladder of techniques addressing each layer in order of cost.
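
As a rough reference, the layer list can be written down as a lookup from detection layer to the technique that mainly targets it. The pairings below are one reading of this article rather than anything Cloudflare publishes, and the two marked as assumptions are inferred, not stated in the text.

```python
# Detection layer -> the technique from this article that mainly targets it.
# The pairings are this sketch's interpretation, not an official mapping.
LAYER_TO_TECHNIQUE = {
    "ip_reputation": "residential proxy rotation",
    "tls_fingerprint": "TLS fingerprint impersonation",
    # Assumption: impersonation libraries also mimic HTTP/2 settings.
    "http2_fingerprint": "TLS fingerprint impersonation",
    "header_signature": "plain HTTP fetch with realistic headers",
    "js_challenge": "hardened headless browser",
    "browser_fingerprint": "hardened headless browser",
    # Assumption: pacing only partly addresses behavioural scoring.
    "behavioural_signals": "rate limiting, jitter, and back-off",
    "captcha_turnstile": "CAPTCHA solver",
}


def technique_for(layer):
    """Return the technique to try first when a given layer is rejecting you."""
    return LAYER_TO_TECHNIQUE[layer]
```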

The bypass ladder

This is the hierarchy. Always start cheap.

Level 1. Plain HTTP fetch

httpx or requests with a real-browser user-agent and 12+ realistic headers. Works on a surprising amount of "protected" content because Cloudflare's score for a benign-looking GET to a public marketing page is often passable. Cost: $0.

If this returns content with the actual data, ship it. Don't escalate just because the site uses Cloudflare. Escalate when you actually get blocked.
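
A minimal sketch of what "12+ realistic headers" can look like, assuming a Chrome 124 desktop profile on Windows. Every value here is illustrative; copy the real set from your own browser's network panel before relying on it. The helper name browser_headers is made up for this example.

```python
# Illustrative Chrome 124 / Windows values, not captured from a live browser.
CHROME_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
)


def browser_headers(accept_language="en-US,en;q=0.9"):
    """Return navigation-style headers whose versions agree with CHROME_UA."""
    return {
        "sec-ch-ua": '"Chromium";v="124", "Google Chrome";v="124", "Not-A.Brand";v="99"',
        "sec-ch-ua-mobile": "?0",
        "sec-ch-ua-platform": '"Windows"',
        "upgrade-insecure-requests": "1",
        "user-agent": CHROME_UA,
        "accept": (
            "text/html,application/xhtml+xml,application/xml;q=0.9,"
            "image/avif,image/webp,*/*;q=0.8"
        ),
        "sec-fetch-site": "none",
        "sec-fetch-mode": "navigate",
        "sec-fetch-user": "?1",
        "sec-fetch-dest": "document",
        # Advertise only encodings your HTTP client can actually decode.
        "accept-encoding": "gzip, deflate, br",
        "accept-language": accept_language,
    }
```

Pass the result as the headers argument of your HTTP client. Some clients add or reorder headers of their own, so check what actually goes out on the wire.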

Level 2. TLS fingerprint impersonation

The next bar. Use curl_cffi or tls-client to impersonate Chrome's TLS handshake at the JA3/JA4 level. This alone fixes a non-trivial fraction of failed plain fetches, because the rejection was happening at the TLS layer, not the JS challenge layer.

from curl_cffi import requests
# impersonate selects the browser profile to mimic; keep the headers consistent with it
r = requests.get(url, impersonate="chrome124", headers={...})

Cost: $0. Speed: same as plain HTTP. If level 1 fails and level 2 succeeds, great. You saved a headless browser.

Level 3. Hardened headless browser

When the page legitimately needs JS execution, reach for a headless browser. Vanilla Playwright/Puppeteer is heavily fingerprinted, so use a hardened distribution:

patchright: drop-in Playwright with stealth patches
camoufox: Firefox build with anti-fingerprinting baked in
playwright-stealth: patches on top of upstream Playwright

Configure realistic viewport, locale, timezone, and font fingerprints. Disable automation flags (--disable-blink-features=AutomationControlled). Inject realistic sec-ch-ua-* headers.

A well-tuned hardened headless setup defeats most consumer-facing Cloudflare deployments, including a lot of e-commerce, news, and social. Cost: $0 plus your compute.
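
The configuration advice above can be sketched against upstream Playwright's Python API. This assumes Playwright is installed; a hardened distribution such as patchright is described as drop-in, so the same shape should apply, but that is an assumption here. The viewport, locale, and timezone values are placeholders, and header injection is left out.

```python
LAUNCH_ARGS = ["--disable-blink-features=AutomationControlled"]


def context_options(locale="en-GB", timezone_id="Europe/London"):
    """Keyword arguments for Playwright's browser.new_context().

    Keep locale and timezone consistent with each other and with the
    geography of the IP you exit from. Values here are placeholders.
    """
    return {
        "viewport": {"width": 1366, "height": 768},
        "locale": locale,
        "timezone_id": timezone_id,
    }


def fetch_rendered(url):
    """Load a page in Chromium and return the rendered HTML."""
    # Imported lazily so the config helpers work without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True, args=LAUNCH_ARGS)
        try:
            context = browser.new_context(**context_options())
            page = context.new_page()
            # A challenge page may need extra waiting after this returns.
            page.goto(url, wait_until="domcontentloaded")
            return page.content()
        finally:
            browser.close()
```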

Level 4. Cookie persistence

When a Cloudflare-protected site issues cf_clearance (or another vendor issues its equivalent, such as DataDome's datadome cookie, Akamai's _abck, or PerimeterX's _pxhd), persist the cookies per host. Many sites grade you up to "trusted" after one successful challenge, and subsequent requests skip the challenge entirely.

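A short-TTL Redis (or equivalent) cache keyed by hostname is enough. Reusing a cf_clearance cookie across requests for the same domain often saves 8+ seconds per fetch because the challenge never runs. Replay it with the same user-agent and exit IP that earned it; Cloudflare generally rejects the cookie otherwise.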

Cost: $0 plus a small storage line.
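
Here is a minimal in-memory stand-in for that hostname-keyed cache, to show the shape before wiring it to Redis. The class name ClearanceCache and the 30-minute default TTL are assumptions made for this sketch, not values from any vendor.

```python
import time
from urllib.parse import urlsplit


class ClearanceCache:
    """Short-TTL cookie store keyed by hostname (in-memory stand-in for Redis)."""

    def __init__(self, ttl_seconds=1800, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested
        self._store = {}  # hostname -> (expires_at, cookies)

    def put(self, url, cookies):
        """Remember the cookies earned on this URL's host."""
        host = urlsplit(url).hostname
        self._store[host] = (self._clock() + self._ttl, dict(cookies))

    def get(self, url):
        """Return unexpired cookies for this URL's host, or an empty dict."""
        host = urlsplit(url).hostname
        entry = self._store.get(host)
        if entry is None:
            return {}
        expires_at, cookies = entry
        if self._clock() >= expires_at:
            del self._store[host]
            return {}
        return dict(cookies)
```

Call put after a fetch that passed a challenge, and get before the next request to the same host, merging the result into that request's cookies.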

Level 5. CAPTCHA solver

When the site demands a Turnstile, hCaptcha, or reCAPTCHA token, you need a solver. CapSolver and 2Captcha are common options; pricing is roughly $0.0008–0.003 per solve depending on type.

This is where free techniques end. CAPTCHA solvers cost real money per request and add latency (typically 8–25 seconds per solve). Use them only on hosts you've identified as requiring them. Don't solve speculatively.

Level 6. Residential proxy rotation

When IP reputation is the dominant signal (think e-commerce sites that score AWS-range IPs hostile regardless of fingerprint), rotate through residential IPs from a provider such as IPRoyal, Bright Data, or Smartproxy. Cost: ~$0.0002–0.001 per request depending on provider and concurrency.

Levels 5 and 6 stack. Some sites need both.
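
To see what stacking does to a budget, here is a back-of-envelope estimate using the per-request ranges quoted above. The workload (100,000 requests, all proxied, 10% hitting a CAPTCHA) is a made-up example, and the function names are hypothetical.

```python
# Per-request price ranges quoted in this article, in USD.
CAPTCHA_RANGE = (0.0008, 0.003)
PROXY_RANGE = (0.0002, 0.001)


def cost_range(n_requests, price_range, fraction=1.0):
    """Low and high cost when `fraction` of requests pay this price."""
    low, high = price_range
    billable = n_requests * fraction
    return billable * low, billable * high


def stacked_cost(n_requests, captcha_fraction):
    """Every request is proxied; only some of them also need a solve."""
    proxy_low, proxy_high = cost_range(n_requests, PROXY_RANGE)
    cap_low, cap_high = cost_range(n_requests, CAPTCHA_RANGE, captcha_fraction)
    return proxy_low + cap_low, proxy_high + cap_high


low, high = stacked_cost(100_000, captcha_fraction=0.10)
print(f"${low:.2f} to ${high:.2f}")
```

For that workload the proxies come to $20–100 and the 10,000 solves to $8–30, so the stack lands at roughly $28–130.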

Level 7. Archive fallback

When everything fails, the data may already exist somewhere. Run parallel races against:

Google Cache (webcache.googleusercontent.com/search?q=cache:URL), though Google retired its public cache in 2024, so treat this source as best-effort
Wayback Machine (web.archive.org/web/2*/URL)
AMP version (URL?output=amp for Google-cached AMP pages)
Reader-view extractors

Whichever responds first with usable content wins. The data may be a few hours stale but the bypass cost is $0.
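
The race itself needs nothing beyond the standard library. The sketch below assumes each source is a callable that takes the URL and returns text; those fetchers, and the usability check, are left to you and are hypothetical here.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def race(sources, url, is_usable):
    """Run every source in parallel; return (name, body) for the first usable hit.

    sources maps a label to a callable taking the URL and returning text.
    Returns None when no source produces usable content.
    """
    if not sources:
        return None
    pool = ThreadPoolExecutor(max_workers=len(sources))
    futures = {pool.submit(fetch, url): name for name, fetch in sources.items()}
    try:
        for future in as_completed(futures):
            try:
                body = future.result()
            except Exception:
                continue  # a failed source simply drops out of the race
            if is_usable(body):
                return futures[future], body
        return None
    finally:
        # Don't wait for the losers: queued work is cancelled and
        # already-running fetches finish in the background.
        pool.shutdown(wait=False, cancel_futures=True)
```

Give every fetcher its own timeout, since a slow loser keeps its thread busy until it returns.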

What not to do

A lot of public scraping advice is now actively counterproductive. Avoid:

Spoofing only the user-agent. Defenders have been catching UA-only spoofing since 2018. The TLS fingerprint, HTTP/2 fingerprint, and sec-ch-ua-* headers all need to match the UA you claim.
Random user-agent rotation per request. Real browsers don't change their UA between requests on the same session. Rotation is a fingerprint.
Using Selenium for new work in 2026. Selenium's automation flags leak everywhere. Playwright + stealth patches is strictly better.
Stacking residential proxies before trying TLS impersonation. Proxies are expensive; TLS impersonation is free and often the actual fix. Test in cost order.
Speculative CAPTCHA solving. Don't solve a CAPTCHA you weren't asked for. Wait for the challenge to appear.
Ignoring robots.txt. Even if you intend to scrape regardless, the robots check tells you what the site considers acceptable, which informs etiquette like rate limits and User-Agent identification. The legal climate in 2026 increasingly cares about this.
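
For the robots.txt point, Python's standard library already parses the file. A small sketch, assuming you have fetched the robots.txt text yourself; the function name robots_policy and the bot name in the usage are placeholders.

```python
from urllib.robotparser import RobotFileParser


def robots_policy(robots_txt, user_agent, url):
    """Return (allowed, crawl_delay) for a URL given robots.txt text.

    crawl_delay is None when the file sets no Crawl-delay for this agent.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url), parser.crawl_delay(user_agent)


ROBOTS = "User-agent: *\nDisallow: /private/\nCrawl-delay: 2\n"
allowed, delay = robots_policy(ROBOTS, "example-bot", "https://example.com/products")
print(allowed, delay)
```

A Crawl-delay, when present, is a reasonable floor for your per-host pacing.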

Detecting blocks proactively

The wrong move is to wait for a timeout or a string of failures before escalating. The right move is to inspect every response as soon as it comes back and skip ahead the moment it looks like a block. Signals worth checking:

Body length under 500 chars on a normally-large page
Visible text under 200 chars when HTML is multi-KB
HTTP status 402, 403, 406, 429, 503
Block signatures in body: "Cloudflare", "Datadome", "Verifying you are human", "Access denied", "Just a moment", "PerimeterX", "Akamai"
Specific challenge markers: __cf_chl_opt or _pxAppId in the HTML, or a cf-mitigated response header

If any of these fire, skip the result and escalate one tier. This avoids feeding "Verifying you are human" pages into your downstream LLM extractor and getting back nonsense.
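
The signals above fit in one function. This is a sketch, assuming the thresholds from the list plus a 2 KB cut-off for "multi-KB" HTML, which is a guess. The short-body check is applied to every page here, so tune it for endpoints that are legitimately small, and note that a bare vendor name like "Cloudflare" can also appear in ordinary articles.

```python
from html.parser import HTMLParser

BLOCK_STATUSES = {402, 403, 406, 429, 503}
BLOCK_PHRASES = (
    "cloudflare", "datadome", "verifying you are human", "access denied",
    "just a moment", "perimeterx", "akamai",
)
HTML_MARKERS = ("__cf_chl_opt", "_pxAppId")


class _TextExtractor(HTMLParser):
    """Collects text outside <script> and <style> elements."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def visible_text(html):
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def block_reason(status, body, headers=None):
    """Return a short reason string if the response looks blocked, else None."""
    if status in BLOCK_STATUSES:
        return f"status {status}"
    # cf-mitigated is checked as a response header, which is where it is sent.
    if headers and any(name.lower() == "cf-mitigated" for name in headers):
        return "cf-mitigated header"
    if len(body) < 500:
        return "body under 500 chars"
    if len(body) >= 2048 and len(visible_text(body)) < 200:
        return "little visible text"
    lowered = body.lower()
    for phrase in BLOCK_PHRASES:
        if phrase in lowered:
            return f"phrase: {phrase}"
    for marker in HTML_MARKERS:
        if marker in body:
            return f"marker: {marker}"
    return None
```

Returning the reason rather than a bare boolean makes it easier to log why a tier was skipped.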

Per-host memory

After enough requests against a domain, you learn its bypass profile. Cache it. A short-TTL key (24 hours is a sensible default) recording the lowest level that succeeded means the next request to the same host skips lower levels entirely, saving meaningful time on known-walled hosts.

Don't cache too aggressively, though. Sites change. A 24-hour TTL gives the cache utility without trapping you on stale assumptions.
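
Per-host memory and the ladder fit together in a few lines. In this sketch the ladder is a list of callables ordered cheapest first; those callables, the is_blocked check, and the names HostProfile and fetch_with_ladder are all hypothetical, and the in-memory dict stands in for whatever store you use.

```python
import time
from urllib.parse import urlsplit


class HostProfile:
    """Remembers the lowest ladder level that worked for each host."""

    def __init__(self, ttl_seconds=24 * 3600, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._levels = {}  # hostname -> (expires_at, level index)

    def start_level(self, url):
        """Level to start from: the cached one, or 0 if unknown or expired."""
        entry = self._levels.get(urlsplit(url).hostname)
        if entry is None or self._clock() >= entry[0]:
            return 0
        return entry[1]

    def record(self, url, level):
        host = urlsplit(url).hostname
        self._levels[host] = (self._clock() + self._ttl, level)


def fetch_with_ladder(url, ladder, profile, is_blocked):
    """Try levels cheapest first, starting at the host's remembered level.

    Returns (level, result) on success or None if every level was blocked.
    """
    for level in range(profile.start_level(url), len(ladder)):
        try:
            result = ladder[level](url)
        except Exception:
            continue  # treat a hard failure like a block and escalate
        if not is_blocked(result):
            profile.record(url, level)
            return level, result
    return None
```

Once the entry expires, the next request starts again from level 0, which is what keeps the cache from trapping you at an expensive tier after a site relaxes.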

Rate, jitter, and back-off

Even when bypass works, getting blocked because you hit a site too hard is a separate failure mode. A few rules:

Per-host concurrency cap (4 parallel requests per host is a sensible default)
Jitter between requests on the same host (200–800ms randomised)
Exponential back-off on 429 and 503, honouring Retry-After headers
Adapt the back-off based on observed response times. If a site slows down, you're hammering it.

Aggressive scrapers get banned. Polite scrapers get data.
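
The first three rules reduce to a few small helpers. This is a sketch using the defaults above; the 1-second base and 60-second cap are assumptions, only the numeric form of Retry-After is parsed, and adapting to observed response times is not covered.

```python
import random
import threading
from urllib.parse import urlsplit

MAX_PER_HOST = 4
_slots = {}
_slots_lock = threading.Lock()


def host_slot(url):
    """Semaphore capping parallel requests to this URL's host."""
    host = urlsplit(url).hostname
    with _slots_lock:
        if host not in _slots:
            _slots[host] = threading.BoundedSemaphore(MAX_PER_HOST)
        return _slots[host]


def jitter_delay():
    """Random pause between requests to the same host, in seconds."""
    return random.uniform(0.2, 0.8)


def backoff_delay(attempt, retry_after=None, base=1.0, cap=60.0):
    """Seconds to wait after a 429 or 503; attempt counts from 0.

    A numeric Retry-After wins. The HTTP-date form is not parsed in this
    sketch and falls through to exponential back-off.
    """
    if retry_after is not None:
        try:
            return max(0.0, float(retry_after))
        except ValueError:
            pass
    return min(cap, base * 2 ** attempt)
```

Typical use is to wrap each request in `with host_slot(url):` and sleep for jitter_delay() seconds before releasing the slot.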

When to give up and use an extraction API

If you're reading this looking to build all of the above, that's a reasonable choice and a real engineering investment. The bypass surface is never "done" because the defender side keeps moving.

If your real goal is data and bypass is a cost on the way there, an AI extraction API like Runo ships the full ladder by default, with bot-walled-site bypass enabled on the paid tiers. The docs walk through the schema-driven extraction layer that sits on top.

If you want to roll it yourself, the layered approach above is the working pattern. Start cheap, escalate proactively when you detect blocks, cache the host profile, and never speculate.

TL;DR

Cloudflare runs a stack of checks (IP, TLS fingerprint, HTTP/2 fingerprint, headers, JS challenge, browser fingerprint, behaviour, CAPTCHA). One bypass technique addresses one layer.
Use a layered ladder: plain HTTP, TLS impersonation (curl_cffi), hardened headless (patchright/camoufox), cookie persistence, CAPTCHA solver, residential proxies, archive fallback.
Always start cheap; escalate proactively on detected block signatures, not after timeout.
Don't UA-spoof without matching TLS. Don't rotate UA per request. Don't use Selenium for new work. Don't speculate on CAPTCHAs.
Cache per-host bypass profile with a short TTL. Jitter, back off, honour Retry-After.
If you'd rather not maintain the stack, Runo ships it by default.