Building a real-estate data pipeline with a scraping API

Real estate is one of the cleanest scraping use cases there is. Listings are public, the schema is well understood, and the value of fresh structured data for everything from comparables to investor analytics to relocation tools is obvious. The hard part is the pipeline that takes "URLs from five different portals" and produces "one normalized table I can query."

This walkthrough covers that pipeline: the schema, the dedup logic, the refresh cadence, and the cost numbers. The examples use Runo; the architecture works with any extraction API.

The data model

Listings have more variation than you'd expect. A San Francisco condo, a rural land parcel, a New York co-op, and a vacation rental are all "real estate" but their meaningful fields barely overlap. The trick is a schema with a stable spine plus optional category-specific extensions.

Here's a stable spine that covers ~95% of residential and commercial listings:

[
  { "field": "listingTitle",      "type": "string",        "example": "3BR/2BA Modern Condo with City View" },
  { "field": "address",           "type": "string",        "example": "123 Main St, San Francisco, CA 94103" },
  { "field": "price",             "type": "float",         "example": 875000.00 },
  { "field": "currency",          "type": "string",        "example": "USD" },
  { "field": "listingType",       "type": "string",        "example": "for sale" },
  { "field": "propertyType",      "type": "string",        "example": "condo" },
  { "field": "bedrooms",          "type": "integer",       "example": 3 },
  { "field": "bathrooms",         "type": "float",         "example": 2.5 },
  { "field": "squareFeet",        "type": "integer",       "example": 1450 },
  { "field": "lotSize",           "type": "string",        "example": "0.15 acres" },
  { "field": "yearBuilt",         "type": "integer",       "example": 2015 },
  { "field": "listedDate",        "type": "date",          "example": "2026-04-22" },
  { "field": "description",       "type": "string",        "example": "Stunning corner unit..." },
  { "field": "amenities",         "type": "array<string>", "example": ["parking", "gym", "pool"] },
  { "field": "agentName",         "type": "string",        "example": "Sarah Chen" },
  { "field": "brokerage",         "type": "string",        "example": "Compass" },
  { "field": "imageUrls",         "type": "array<string>", "example": ["https://..."] }
]

Float for bathrooms (because half-baths exist), string for lotSize (acres, sq ft, hectares all show up; normalize downstream), array for amenities and images. The listingType and propertyType will need post-processing into your enum (Zillow says "Condo", Redfin says "Condominium", same thing).

For commercial or rental listings, add: monthlyRent, leaseTerm, availableDate, petsAllowed. Reuse the spine for everything else.
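
One way to express the spine-plus-extensions idea in code is a pair of dataclasses. This is a sketch, not part of any API: the class names `Listing` and `RentalExtension` are made up for illustration, and making every spine field except `address` optional is an assumption about how forgiving you want ingest to be.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class RentalExtension:
    """Category-specific fields that only apply to rentals (illustrative)."""
    monthlyRent: Optional[float] = None
    leaseTerm: Optional[str] = None
    availableDate: Optional[date] = None
    petsAllowed: Optional[bool] = None


@dataclass
class Listing:
    """The stable spine. Only the address is required in this sketch."""
    address: str
    listingTitle: Optional[str] = None
    price: Optional[float] = None          # None when the page says "Contact for price"
    currency: Optional[str] = None
    listingType: Optional[str] = None
    propertyType: Optional[str] = None
    bedrooms: Optional[int] = None
    bathrooms: Optional[float] = None      # float because half-baths exist
    squareFeet: Optional[int] = None
    lotSize: Optional[str] = None          # raw string; units are normalized downstream
    yearBuilt: Optional[int] = None
    listedDate: Optional[date] = None
    description: Optional[str] = None
    amenities: list[str] = field(default_factory=list)
    agentName: Optional[str] = None
    brokerage: Optional[str] = None
    imageUrls: list[str] = field(default_factory=list)
    rental: Optional[RentalExtension] = None  # attached only for rental listings


condo = Listing(address="123 Main St, San Francisco, CA 94103",
                price=875000.0, bedrooms=3, bathrooms=2.5)
rental = Listing(address="456 Oak Ave, Oakland, CA 94607", listingType="for rent",
                 rental=RentalExtension(monthlyRent=3200.0, petsAllowed=True))
```

Keeping the extension as a nested optional object means queries against the spine never have to care which category a listing belongs to.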

The fetch loop

Two stages: discover URLs, then extract from each.

Stage 1: discovery

Three patterns depending on the portal:

Sitemap-based: most large real-estate portals expose /sitemap.xml with all listing URLs. Easiest case. Pull the sitemap, filter by region or date, queue the URLs.
Search-page crawl: hit the search results page for a region and follow the listing links. Use Runo's /crawl endpoint with follow_pattern set to the listing URL pattern.
API endpoints: some portals have undocumented JSON APIs that return listing data directly. Faster when available; brittle when they change.

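Sitemap-based is the cleanest because the sitemap is deliberately published for crawlers, often through a Sitemap: line in robots.txt itself. That does not exempt individual URLs from robots.txt rules, so still check the Disallow entries that cover the listing paths.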

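import re

import httpx

resp = httpx.get("https://example-portal.com/sitemap-listings.xml", timeout=30)
resp.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
# A regex is enough for a flat sitemap; a sitemap index needs one more fetch per child file.
urls = re.findall(r"<loc>\s*(https://example-portal\.com/listing/[^<\s]+)\s*</loc>", resp.text)
print(f"Discovered {len(urls)} listings")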
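
The "filter by region or date" step can be sketched with the standard library's XML parser. The function name `recent_listing_urls` and the `/listing/` path prefix are assumptions for illustration, and the sketch assumes `<lastmod>` values start with a full `YYYY-MM-DD` date when present.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def recent_listing_urls(sitemap_xml: str, since: date, path_prefix: str = "/listing/") -> list[str]:
    """Return listing URLs whose <lastmod> is on or after `since`."""
    root = ET.fromstring(sitemap_xml)
    urls = []
    for entry in root.findall("sm:url", SITEMAP_NS):
        loc = entry.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        lastmod = entry.findtext("sm:lastmod", default="", namespaces=SITEMAP_NS).strip()
        if path_prefix not in loc:
            continue  # not a listing page
        # <lastmod> is optional; keep entries without one rather than drop them silently.
        if lastmod and date.fromisoformat(lastmod[:10]) < since:
            continue
        urls.append(loc)
    return urls


sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example-portal.com/listing/1</loc><lastmod>2026-04-22</lastmod></url>
  <url><loc>https://example-portal.com/listing/2</loc><lastmod>2025-01-01</lastmod></url>
  <url><loc>https://example-portal.com/about</loc></url>
</urlset>"""

fresh = recent_listing_urls(sample, since=date(2026, 4, 1))
print(fresh)
```

Filtering on `<lastmod>` before queueing is the cheapest place to avoid re-extracting pages that have not changed, provided the portal keeps that field honest.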

Stage 2: extraction

For each URL, call the extraction API with the schema:

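import asyncio

import httpx

API_KEY = "sk_static_..."  # static key with the listing schema bound
MAX_IN_FLIGHT = 10  # illustrative cap; tune it to your plan's rate limit

async def extract(client, sem, url):
    async with sem:  # an unbounded gather would fire every request at once
        r = await client.post(
            "https://api.scrapewithruno.com/v1/extract",
            json={"url": url},
            headers={"X-API-Key": API_KEY},
            timeout=60,
        )
    r.raise_for_status()
    return r.json()

async def main(urls):
    sem = asyncio.Semaphore(MAX_IN_FLIGHT)
    async with httpx.AsyncClient() as client:
        # return_exceptions=True keeps one failed URL from discarding the rest
        return await asyncio.gather(
            *(extract(client, sem, u) for u in urls), return_exceptions=True
        )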

A static key with the schema pre-bound (rather than sending the schema in every request) cuts payload size. Worth it if you're running this against thousands of listings.

For batches over ~50 URLs, use /batch directly:

r = httpx.post(
    "https://api.scrapewithruno.com/v1/batch",
    json={"urls": urls, "options": {"concurrency": 10}},
    headers={"X-API-Key": API_KEY},
    timeout=600,
)

/batch charges 1 request per URL, runs them concurrently, and supports cancellation via DELETE /v1/jobs/{job_id} if you need to abort.

Normalization

Different portals report things differently. Some patterns to handle on ingest:

Issue	Example	Normalization
Property type vocabulary	"Condo" / "Condominium" / "Apartment"	Map to enum: `{condo, single-family, multi-family, townhouse, land, commercial}`
Bath fractions	"2.5" / "2 full, 1 half" / "2F1H"	Parse to float: full + 0.5 × half
Square footage	"1,450 sq ft" / "1450 SF" / "135 m²"	Parse + convert all to sq ft
Lot size units	"0.15 acres" / "6,500 sq ft" / "650 m²"	Parse + convert to acres (or sq ft, pick one)
Listing date	"Listed 5 days ago" / "April 22, 2026" / "2026-04-22"	Schema's `date` type handles this; ISO 8601 out
Address format	"123 Main St" vs "123 Main Street"	USPS standardization (use a library like `usaddress`)
Price absent ("Contact for price")	`null`	Honest null preserves the signal
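
The bath and area rows in the table can be sketched as two small parsers. This is a minimal sketch: the function names, the list of accepted unit spellings, and the choice of square feet as the common unit are assumptions, and real portal strings will need more patterns than these.

```python
import re
from typing import Optional

SQFT_PER_M2 = 10.7639
SQFT_PER_ACRE = 43560.0


def parse_bathrooms(raw: str) -> Optional[float]:
    """'2.5', '2 full, 1 half', '2F1H' -> 2.5; unrecognized input -> None."""
    text = raw.strip().lower()
    if re.fullmatch(r"\d+(\.\d+)?", text):
        return float(text)
    m = re.fullmatch(r"(\d+)\s*f(?:ull)?\s*,?\s*(?:(\d+)\s*h(?:alf)?)?", text)
    if m:
        # full + 0.5 x half, as in the table
        return int(m.group(1)) + 0.5 * int(m.group(2) or 0)
    return None


def parse_area_sqft(raw: str) -> Optional[float]:
    """'1,450 sq ft', '1450 SF', '135 m²', '0.15 acres' -> square feet."""
    m = re.fullmatch(r"([\d,]*\.?\d+)\s*(.+)", raw.strip().lower())
    if not m:
        return None
    value = float(m.group(1).replace(",", ""))
    unit = m.group(2).replace(".", "").replace(" ", "")
    if unit in {"sqft", "sf", "ft²", "ft2", "squarefeet"}:
        return value
    if unit in {"m²", "m2", "sqm", "squaremeters"}:
        return value * SQFT_PER_M2
    if unit in {"acre", "acres", "ac"}:
        return value * SQFT_PER_ACRE
    return None  # unknown unit: surface it instead of guessing
```

Returning `None` for anything unrecognized keeps bad guesses out of the table and gives you a simple query for finding the formats you have not handled yet.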

Note that the LLM extraction layer handles a lot of this for you. The date type coerces relative dates. The float type parses currency strings. The schema's example value anchors the format. What you still own: the enum mappings (property type, listing type) and unit conversions (sq ft, acres).
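
For the enum mappings, a plain alias dictionary is usually enough. The sketch below is illustrative: the alias table and the function name `normalize_property_type` are assumptions, and folding "apartment" into `condo` is a judgment call you may want to make differently.

```python
import re
from typing import Optional

# Illustrative alias table; extend it as new portal vocabulary shows up.
PROPERTY_TYPE_ALIASES = {
    "condo": "condo",
    "condominium": "condo",
    "apartment": "condo",  # assumption: apartments for sale are treated as condos
    "single family": "single-family",
    "single family home": "single-family",
    "house": "single-family",
    "multi family": "multi-family",
    "duplex": "multi-family",
    "townhouse": "townhouse",
    "townhome": "townhouse",
    "land": "land",
    "lot": "land",
    "commercial": "commercial",
}


def normalize_property_type(raw: Optional[str]) -> Optional[str]:
    """Map a portal's label onto the enum; unknown labels come back as None."""
    if not raw:
        return None
    # Lowercase and collapse punctuation so "Single-Family" and "single family" match.
    key = re.sub(r"[^a-z]+", " ", raw.lower()).strip()
    return PROPERTY_TYPE_ALIASES.get(key)


print(normalize_property_type("Condominium"))
print(normalize_property_type("Single-Family"))
```

Logging every label that maps to `None` gives you the list of aliases to add next, which is less error-prone than fuzzy matching.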

Dedup

Same listing across multiple portals, same listing posted twice on the same portal, same listing relisted after price drop. All real. The dedup key has to be robust to all three.

The cheapest robust key is (normalized_address, propertyType, bedrooms, listingType), computed after propertyType and listingType have been mapped to your enums so that portal vocabulary doesn't split one property into two keys. Tighten with (latitude, longitude) if you have geocoding. For relisted listings, a soft dedup on (normalized_address, listedDate within 90 days) lets you collapse "same property, listed in March, relisted in May at lower price" into one record with a price history.

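from hashlib import sha256

def dedup_key(listing):
    # normalize_address is your own helper (for example built on usaddress); it is not defined here.
    # propertyType and listingType should already be enum values at this point,
    # otherwise "Condo" and "Condominium" hash to different keys.
    parts = [
        normalize_address(listing["address"]),
        listing["propertyType"],
        str(listing["bedrooms"]),
        listing["listingType"],
    ]
    # 16 hex characters = 64 bits, plenty for a few million listings
    return sha256("|".join(parts).encode()).hexdigest()[:16]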
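
The soft dedup for relistings can be sketched as a merge step. The record shape here is an assumption: plain dicts with a hypothetical `normalized_address` key, `listedDate` as a `datetime.date`, and a `priceHistory` list that this sketch introduces.

```python
from datetime import date, timedelta

RELIST_WINDOW = timedelta(days=90)


def merge_relisting(existing: dict, incoming: dict) -> bool:
    """Fold `incoming` into `existing` if it looks like a relist of the same property.

    Returns True when the records were merged, False when they should stay separate.
    """
    same_address = existing["normalized_address"] == incoming["normalized_address"]
    gap = abs(incoming["listedDate"] - existing["listedDate"])
    if not (same_address and gap <= RELIST_WINDOW):
        return False
    # Seed the history with the original listing the first time we merge.
    history = existing.setdefault(
        "priceHistory", [(existing["listedDate"], existing["price"])]
    )
    history.append((incoming["listedDate"], incoming["price"]))
    if incoming["listedDate"] >= existing["listedDate"]:
        # The newer listing becomes the current state of the record.
        existing["listedDate"] = incoming["listedDate"]
        existing["price"] = incoming["price"]
    return True


march = {"normalized_address": "123 main st", "listedDate": date(2026, 3, 10), "price": 900000.0}
may = {"normalized_address": "123 main st", "listedDate": date(2026, 5, 5), "price": 850000.0}
merged = merge_relisting(march, may)
```

After the call, `march` carries the May price plus a two-entry price history, which is the "one record with a price history" shape described above.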

Store the dedup key in your database with a unique index. On ingest, upsert by dedup key with the latest scrape's data.
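
A minimal sketch of that upsert, using SQLite from the standard library so it runs without a server (it needs SQLite 3.24 or newer). The table layout and the function name `upsert_listing` are assumptions; the `ON CONFLICT ... DO UPDATE` clause has the same shape in Postgres.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE listings (
        dedup_key  TEXT PRIMARY KEY,  -- the primary key doubles as the unique index
        address    TEXT,
        price      REAL,
        scraped_at TEXT
    )
    """
)


def upsert_listing(conn, key, address, price, scraped_at):
    """Insert a listing, or overwrite the stored row with the latest scrape."""
    conn.execute(
        """
        INSERT INTO listings (dedup_key, address, price, scraped_at)
        VALUES (?, ?, ?, ?)
        ON CONFLICT(dedup_key) DO UPDATE SET
            address    = excluded.address,
            price      = excluded.price,
            scraped_at = excluded.scraped_at
        """,
        (key, address, price, scraped_at),
    )


upsert_listing(conn, "ab12cd34ef56ab78", "123 Main St", 900000.0, "2026-04-22")
upsert_listing(conn, "ab12cd34ef56ab78", "123 Main St", 875000.0, "2026-04-29")
rows = conn.execute("SELECT dedup_key, price, scraped_at FROM listings").fetchall()
```

The second call updates the existing row instead of creating a duplicate, so `rows` holds a single record with the later price.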

Refresh cadence

Real estate moves fast in some markets and slow in others. A blanket "scrape everything daily" pipeline burns money re-fetching listings that haven't changed. A smarter cadence keys off signals:

Signal	Refresh
Active listing in hot market (LA, NYC, SF)	Daily
Active listing in slow market	Every 3 days
Listing under contract / pending	Weekly (track when it goes off-market)
Sold listing (terminal state)	Never (archive)
Listing older than 180 days, no price changes	Weekly
Listing with recent price changes	Daily for 7 days, then back off

If your scraping API ships an application-layer result cache, it handles a chunk of this for free. Check whether it lets you override the TTL per request, which is what you want for hot listings.
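
The cadence table can be encoded as one function that a scheduler calls per listing. This is a sketch under assumptions: the status strings, the parameter names, and the precedence between overlapping rules (a recent price change beats listing age) are choices made here, not something the table dictates.

```python
from datetime import timedelta
from typing import Optional


def refresh_interval(
    status: str,
    hot_market: bool,
    days_on_market: int,
    days_since_price_change: Optional[int],
) -> Optional[timedelta]:
    """Return how long to wait before the next scrape, or None to stop scraping."""
    if status == "sold":
        return None  # terminal state: archive
    if status == "pending":
        return timedelta(days=7)  # just watching for it to go off-market
    if days_since_price_change is not None and days_since_price_change <= 7:
        return timedelta(days=1)  # short daily watch after a price change
    if days_on_market > 180:
        return timedelta(days=7)  # old listing with no recent price movement
    return timedelta(days=1) if hot_market else timedelta(days=3)


print(refresh_interval("active", hot_market=False, days_on_market=30, days_since_price_change=None))
```

Storing the result as a `next_scrape_at` timestamp on each row turns the scheduler into a single indexed query.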

Geographic enrichment

Listings come with addresses, not coordinates. You'll want both for filtering, mapping, distance queries, and clustering. Two approaches:

Geocode at ingest with a service like Mapbox or Google Geocoding. Per-request cost ~$0.005 at scale; cheaper if you cache.
Bulk geocode periodically with a batch geocoding service. Cheaper per-record; longer feedback loop.

Cache the result by normalized address, not by raw input. Addresses get re-scraped many times; you only need to geocode each unique address once.
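
A sketch of caching by normalized address. The `GeocodeCache` class, the `simple_normalize` stand-in, and the fake geocoder are all hypothetical; in practice you would pass in a wrapper around your geocoding provider and a proper address normalizer, and back the cache with a table rather than a dict.

```python
from typing import Callable, Dict, Optional, Tuple

Coords = Tuple[float, float]


def simple_normalize(address: str) -> str:
    """Stand-in normalizer: lowercase, drop commas, collapse whitespace."""
    return " ".join(address.lower().replace(",", " ").split())


class GeocodeCache:
    def __init__(
        self,
        geocode: Callable[[str], Optional[Coords]],
        normalize: Callable[[str], str] = simple_normalize,
    ):
        self._geocode = geocode
        self._normalize = normalize
        self._cache: Dict[str, Optional[Coords]] = {}
        self.provider_calls = 0  # how many lookups actually cost money

    def lookup(self, raw_address: str) -> Optional[Coords]:
        key = self._normalize(raw_address)
        if key not in self._cache:
            self.provider_calls += 1
            self._cache[key] = self._geocode(key)
        return self._cache[key]


def fake_geocode(address: str) -> Optional[Coords]:
    """Placeholder for a real provider call."""
    return (37.7749, -122.4194)


cache = GeocodeCache(fake_geocode)
first = cache.lookup("123 Main St, San Francisco, CA")
second = cache.lookup("123  MAIN ST San Francisco, CA ")
print(cache.provider_calls)  # one provider call covers both spellings
```

Failed lookups are cached as `None` too, so an address the provider cannot resolve is not retried on every scrape.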

Cost math

Per-listing cost components, at typical scale (1M listings/month):

Component	Per-listing cost	Notes
Extraction API (Scale tier of a hosted vendor)	~$0.001	Tier pricing varies by vendor
Geocoding	~$0.0001	Mapbox bulk, with caching
Database (Postgres)	~$0.00002	Amortised across ingest+queries
Storage (images, raw HTML archive)	~$0.00001	S3 with lifecycle policies
Total	~$0.0011	Per fresh listing extracted

For 1M listings/month: ~$1,000 in extraction plus ~$130 in auxiliary services, about $1,130 in total. Compare that to the cost of building the bypass + extraction stack yourself.
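
The arithmetic behind those figures, as a sketch you can rerun with your own vendor's prices. The per-listing rates are copied from the rows of the table above; the function name `monthly_cost` is made up for illustration.

```python
# Per-listing costs in USD, taken from the component rows of the table.
PER_LISTING_USD = {
    "extraction": 0.001,
    "geocoding": 0.0001,
    "database": 0.00002,
    "storage": 0.00001,
}


def monthly_cost(listings_per_month: int) -> dict:
    """Return a per-component breakdown plus the total, rounded to cents."""
    breakdown = {
        name: round(unit * listings_per_month, 2)
        for name, unit in PER_LISTING_USD.items()
    }
    breakdown["total"] = round(sum(breakdown.values()), 2)
    return breakdown


print(monthly_cost(1_000_000))
# {'extraction': 1000.0, 'geocoding': 100.0, 'database': 20.0, 'storage': 10.0, 'total': 1130.0}
```

Extraction is close to 90% of the bill at these rates, so the refresh cadence is the lever that matters most.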

What about MLS?

For US residential listings, the underlying truth is the regional MLS (Multiple Listing Service). Most public portals (Zillow, Redfin, Realtor.com) license MLS feeds and republish.

If you're a licensed agent or working with one, you can subscribe directly to the MLS's RESO Web API feed (or RETS, the legacy protocol it replaces, where an MLS still offers it) and skip the public portals entirely. Authoritative, structured, no scraping.

If you're not a licensed agent, the public-portal route is your option, and it's a fine option. Just understand that you're scraping the redistribution, not the source. Stale or selectively-presented data is possible.

Scraping etiquette for real estate

Real-estate portals tend to be lawyered up and have anti-scraping infrastructure. Stay safe:

Stick to logged-out scraping of public listing pages.
Respect robots.txt.
Per-host concurrency under 5; jitter between requests.
Don't republish photos or descriptions verbatim. Extract structured facts (price, beds, baths, address), which generally aren't copyrightable on their own.
If you're going to commercialize, get a lawyer to look at your specific use case. We covered the legal landscape in "Is web scraping legal in 2026".
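
The robots.txt and jitter points in the list above can be sketched with the standard library alone. The function names and the default delay values are assumptions; the robots.txt body is passed in as text so the check can be tested without a network call.

```python
import random
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Check a URL against an already-downloaded robots.txt body."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def jittered_delay(base_seconds: float = 1.0, spread: float = 0.5) -> float:
    """Return a delay of base_seconds, randomized by +/- spread (as a fraction)."""
    return base_seconds * random.uniform(1 - spread, 1 + spread)


robots = """User-agent: *
Disallow: /account/
"""

ok = allowed_by_robots(robots, "https://example-portal.com/listing/1")
blocked = allowed_by_robots(robots, "https://example-portal.com/account/settings")
delay = jittered_delay()  # pass this to time.sleep() or asyncio.sleep() between requests
```

Filtering the URL queue through `allowed_by_robots` at discovery time is cheaper than checking at fetch time, since disallowed URLs never enter the queue.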

Putting it together

The full pipeline:

sitemap discovery
    ↓
URL queue (Redis / SQS)
    ↓
extraction worker (calls Runo /extract or /batch)
    ↓
normalization (enum mapping, unit conversion)
    ↓
dedup (compute dedup_key, upsert)
    ↓
geocoding enrichment (cached by address)
    ↓
Postgres (with PostGIS for spatial queries)
    ↓
your application (search UI, alerts, analytics)

For a v1, you can compress the queue + workers + database into one Python script running on a single box. Scale out when you need to.
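
As a rough sketch of that single-script v1, the stages can be plain callables wired together around an in-memory queue and a dict standing in for the database. Everything here is hypothetical scaffolding: `run_pipeline` and its parameters are made up, and the fake `extract` below replaces the real API call.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List, Optional


def run_pipeline(
    urls: Iterable[str],
    extract: Callable[[str], Optional[dict]],
    normalize: Callable[[dict], dict],
    key_fn: Callable[[dict], str],
    store: Dict[str, dict],
) -> List[str]:
    """Drain the URL queue through extract -> normalize -> dedup upsert.

    Returns the URLs whose extraction produced nothing, for later retry.
    """
    queue = deque(urls)
    failed = []
    while queue:
        url = queue.popleft()
        raw = extract(url)
        if raw is None:
            failed.append(url)
            continue
        listing = normalize(raw)
        store[key_fn(listing)] = listing  # upsert: the latest scrape wins
    return failed


# Fake extraction results standing in for API responses.
pages = {
    "https://example-portal.com/listing/1": {"address": "123 Main St", "price": 875000.0},
    "https://example-portal.com/listing/1?ref=search": {"address": "123 MAIN ST", "price": 870000.0},
}
store: Dict[str, dict] = {}
failed = run_pipeline(
    urls=list(pages) + ["https://example-portal.com/listing/404"],
    extract=pages.get,
    normalize=lambda raw: {**raw, "address": raw["address"].lower()},
    key_fn=lambda listing: listing["address"],
    store=store,
)
```

Because each stage is injected, swapping the deque for Redis or the dict for Postgres later does not change the loop.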

The complete cost for a side project with ~10K listings refreshed daily: ~$15-25/month on the Runo Starter tier. The same project at 1M listings/month: ~$1,100/month total. Either is significantly cheaper than rolling your own bypass + extraction stack.

What we ship vs what you build

Using Runo for the extraction layer:

Bypass (Cloudflare, Datadome, Akamai if a portal uses them) handled
Schema-driven extraction with type coercion
Per-host rate adaptation
Cancellable jobs
Result caching

What you still build:

The discovery loop (sitemaps, crawl seeds)
The dedup logic (dedup keys, upsert)
The normalization mappings (enums, units)
The geocoding integration
The query and presentation layer

The extraction layer is usually the part where vendors deliver the most value per engineering hour. The pipeline glue is yours either way.

TL;DR

Real-estate scraping has a stable schema spine: address, price, beds, baths, sq ft, dates, agent, images. Add category-specific fields for rentals, commercial, land.
Discovery via sitemap is cleanest; search-page crawl works when sitemaps are absent; undocumented JSON APIs are fastest but brittle.
Static-key extraction (schema pre-bound) cuts payload at scale.
Normalize: property type and listing type to enums, units to your standard (sq ft + acres), addresses with USPS standardization.
Dedup key: (normalized_address, propertyType, bedrooms, listingType). Tighten with lat/lng if geocoded.
Smart refresh cadence beats blanket daily scrapes. Hot listings daily, sold listings archived, price changes trigger short-term daily watch.
Total cost at 1M listings/month: ~$1,100 with Runo Scale tier + auxiliary services. Significantly cheaper than building the stack yourself.
Don't republish photos or descriptions; extract structured facts. Stay logged-out. Respect robots.txt.