Building a real-estate data pipeline with a scraping API

Real estate is one of the cleanest scraping use cases there is. Listings are public, the schema is well understood, and the value of fresh structured data for everything from comparables to investor analytics to relocation tools is obvious. The hard part is the pipeline that takes "URLs from five different portals" and produces "one normalized table I can query."

This walkthrough is that pipeline: the schema, the dedup logic, the refresh cadence, and the cost numbers. The examples use Runo; the architecture works with any extraction API.

The data model

Listings have more variation than you'd expect. A San Francisco condo, a rural land parcel, a New York co-op, and a vacation rental are all "real estate" but their meaningful fields barely overlap. The trick is a schema with a stable spine plus optional category-specific extensions.

Here's a stable spine that covers ~95% of residential and commercial listings:

[
  { "field": "listingTitle",      "type": "string",        "example": "3BR/2BA Modern Condo with City View" },
  { "field": "address",           "type": "string",        "example": "123 Main St, San Francisco, CA 94103" },
  { "field": "price",             "type": "float",         "example": 875000.00 },
  { "field": "currency",          "type": "string",        "example": "USD" },
  { "field": "listingType",       "type": "string",        "example": "for sale" },
  { "field": "propertyType",      "type": "string",        "example": "condo" },
  { "field": "bedrooms",          "type": "integer",       "example": 3 },
  { "field": "bathrooms",         "type": "float",         "example": 2.5 },
  { "field": "squareFeet",        "type": "integer",       "example": 1450 },
  { "field": "lotSize",           "type": "string",        "example": "0.15 acres" },
  { "field": "yearBuilt",         "type": "integer",       "example": 2015 },
  { "field": "listedDate",        "type": "date",          "example": "2026-04-22" },
  { "field": "description",       "type": "string",        "example": "Stunning corner unit..." },
  { "field": "amenities",         "type": "array<string>", "example": ["parking", "gym", "pool"] },
  { "field": "agentName",         "type": "string",        "example": "Sarah Chen" },
  { "field": "brokerage",         "type": "string",        "example": "Compass" },
  { "field": "imageUrls",         "type": "array<string>", "example": ["https://..."] }
]

Float for bathrooms (because half-baths exist), string for lotSize (acres, sq ft, hectares all show up; normalize downstream), array for amenities and images. The listingType and propertyType will need post-processing into your enum (Zillow says "Condo", Redfin says "Condominium", same thing).

For commercial or rental listings, add: monthlyRent, leaseTerm, availableDate, petsAllowed. Reuse the spine for everything else.
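
One way to picture "stable spine plus optional extensions" on the consuming side is a pair of dataclasses. This is a minimal sketch, not Runo's schema format: the class names are hypothetical, only a subset of the spine is shown, and the types chosen for the rental fields (for example a boolean petsAllowed) are assumptions.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class RentalExtension:
    """Optional fields that only apply to rental or commercial listings."""
    monthlyRent: Optional[float] = None
    leaseTerm: Optional[str] = None        # assumed free text, e.g. "12 months"
    availableDate: Optional[date] = None
    petsAllowed: Optional[bool] = None     # assumed boolean


@dataclass
class Listing:
    """Stable spine (subset shown); every portal maps into these fields."""
    address: str
    listingType: str
    propertyType: str
    price: Optional[float] = None          # None when the portal hides the price
    currency: Optional[str] = None
    bedrooms: Optional[int] = None
    bathrooms: Optional[float] = None      # float so half-baths survive
    squareFeet: Optional[int] = None
    lotSize: Optional[str] = None          # raw string; units normalized downstream
    listedDate: Optional[date] = None
    amenities: list[str] = field(default_factory=list)
    rental: Optional[RentalExtension] = None  # category-specific extension


condo = Listing(
    address="123 Main St, San Francisco, CA 94103",
    listingType="for sale",
    propertyType="condo",
    price=875000.0,
    currency="USD",
    bedrooms=3,
    bathrooms=2.5,
)
```

A for-sale condo leaves rental as None; a rental listing fills in a RentalExtension and reuses everything else.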

The fetch loop

Two stages: discover URLs, then extract from each.

Stage 1: discovery

Three patterns depending on the portal:

  1. Sitemap-based: most large real-estate portals expose /sitemap.xml with all listing URLs. Easiest case. Pull the sitemap, filter by region or date, queue the URLs.
  2. Search-page crawl: hit the search results page for a region and follow the listing links. Use Runo's /crawl endpoint with follow_pattern set to the listing URL pattern.
  3. API endpoints: some portals have undocumented JSON APIs that return listing data directly. Faster when available; brittle when they change.

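Sitemap-based discovery is the cleanest because the sitemap is deliberately published for crawlers, and it is usually advertised through the Sitemap: directive in robots.txt itself. Still check that robots.txt does not disallow the listing paths the sitemap points to.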

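import re
import httpx

# Fetch the listings sitemap; fail loudly on a non-2xx response.
resp = httpx.get("https://example-portal.com/sitemap-listings.xml", timeout=30)
resp.raise_for_status()
# Pull every listing URL out of the <loc> elements.
urls = re.findall(r"<loc>(https://example-portal\.com/listing/[^<]+)</loc>", resp.text)
print(f"Discovered {len(urls)} listings")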
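
The "filter by region or date" step can be sketched with the standard library's XML parser instead of a regex. This assumes the portal's sitemap follows the sitemaps.org format and fills in `<lastmod>`; the function name and the sample XML are hypothetical.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def recent_listing_urls(sitemap_xml: str, since: date, prefix: str) -> list[str]:
    """Return <loc> URLs under `prefix` whose <lastmod> is on or after `since`."""
    root = ET.fromstring(sitemap_xml)
    urls = []
    for entry in root.findall("sm:url", SITEMAP_NS):
        loc = entry.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        lastmod = entry.findtext("sm:lastmod", default="", namespaces=SITEMAP_NS).strip()
        if not loc.startswith(prefix):
            continue
        # Entries without <lastmod> are kept: we cannot prove they are stale.
        if lastmod and date.fromisoformat(lastmod[:10]) < since:
            continue
        urls.append(loc)
    return urls


SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example-portal.com/listing/1</loc><lastmod>2026-04-22</lastmod></url>
  <url><loc>https://example-portal.com/listing/2</loc><lastmod>2025-01-05T10:00:00+00:00</lastmod></url>
  <url><loc>https://example-portal.com/about</loc></url>
  <url><loc>https://example-portal.com/listing/3</loc></url>
</urlset>"""

fresh = recent_listing_urls(
    SAMPLE, date(2026, 1, 1), "https://example-portal.com/listing/"
)
```

Here fresh keeps listings 1 and 3 and drops the stale listing 2 and the non-listing page. Filtering by region works the same way if the portal encodes the region in the URL path.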

Stage 2: extraction

For each URL, call the extraction API with the schema:

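import asyncio

import httpx

API_KEY = "sk_static_..."  # static key with the listing schema bound
MAX_CONCURRENCY = 5        # assumed cap on in-flight requests; tune to your plan

async def extract(client, sem, url):
    # The semaphore keeps gather() from firing every request at once.
    async with sem:
        r = await client.post(
            "https://api.scrapewithruno.com/v1/extract",
            json={"url": url},
            headers={"X-API-Key": API_KEY},
            timeout=60,
        )
    r.raise_for_status()  # surface HTTP errors instead of parsing an error body
    return r.json()

async def main(urls):
    sem = asyncio.Semaphore(MAX_CONCURRENCY)
    async with httpx.AsyncClient() as client:
        # return_exceptions=True keeps one failed URL from discarding the rest.
        return await asyncio.gather(
            *(extract(client, sem, u) for u in urls), return_exceptions=True
        )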

A static key with the schema pre-bound (rather than sending the schema in every request) cuts payload size. Worth it if you're running this against thousands of listings.

For batches over ~50 URLs, use /batch directly:

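import httpx

# One call submits the whole URL list; the API runs the URLs concurrently.
r = httpx.post(
    "https://api.scrapewithruno.com/v1/batch",
    json={"urls": urls, "options": {"concurrency": 10}},
    headers={"X-API-Key": API_KEY},
    timeout=600,
)
r.raise_for_status()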

/batch charges 1 request per URL, runs them concurrently, and supports cancellation via DELETE /v1/jobs/{job_id} if you need to abort.

Normalization

Different portals report things differently. Some patterns to handle on ingest:

| Issue | Example | Normalization |
| --- | --- | --- |
| Property type vocabulary | "Condo" / "Condominium" / "Apartment" | Map to enum: {condo, single-family, multi-family, townhouse, land, commercial} |
| Bath fractions | "2.5" / "2 full, 1 half" / "2F1H" | Parse to float: full + 0.5 × half |
| Square footage | "1,450 sq ft" / "1450 SF" / "135 m²" | Parse + convert all to sq ft |
| Lot size units | "0.15 acres" / "6,500 sq ft" / "650 m²" | Parse + convert to acres (or sq ft, pick one) |
| Listing date | "Listed 5 days ago" / "April 22, 2026" / "2026-04-22" | Schema's date type handles this; ISO 8601 out |
| Address format | "123 Main St" vs "123 Main Street" | USPS standardization (use a library like usaddress) |
| Price absent | "Contact for price" | null (an honest null preserves the signal) |

Note that the LLM extraction layer handles a lot of this for you. The date type coerces relative dates. The float type parses currency strings. The schema's example value anchors the format. What you still own: the enum mappings (property type, listing type) and unit conversions (sq ft, acres).
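
The parts you still own can be small pure functions. A minimal sketch covering the enum mapping, bath fractions, and square-footage conversion from the table above; the function names and the mapping entries are illustrative assumptions, and ambiguous labels such as "Apartment" are deliberately left unmapped so you can decide them per portal.

```python
import re
from typing import Optional

# Illustrative mapping; extend it per portal as new labels show up.
PROPERTY_TYPE_ENUM = {
    "condo": "condo",
    "condominium": "condo",
    "single family": "single-family",
    "single-family": "single-family",
    "townhouse": "townhouse",
    "townhome": "townhouse",
    "land": "land",
}

SQFT_PER_M2 = 10.7639


def normalize_property_type(raw: str) -> Optional[str]:
    """Map a portal's label to the enum; None means 'unmapped, review it'."""
    return PROPERTY_TYPE_ENUM.get(raw.strip().lower())


def parse_baths(raw: str) -> Optional[float]:
    """Turn '2.5', '2 full, 1 half', or '2F1H' into full + 0.5 * half."""
    text = raw.strip().lower()
    if re.fullmatch(r"\d+(\.\d+)?", text):
        return float(text)
    m = re.fullmatch(r"(\d+)\s*f(?:ull)?\s*,?\s*(\d+)\s*h(?:alf)?", text)
    if m:
        return int(m.group(1)) + 0.5 * int(m.group(2))
    return None


def parse_sqft(raw: str) -> Optional[int]:
    """Turn '1,450 sq ft', '1450 SF', or '135 m²' into whole square feet."""
    m = re.fullmatch(r"([\d,.]+)\s*(sq\.?\s*ft\.?|sf|m²|m2|sqm)", raw.strip().lower())
    if not m:
        return None
    value = float(m.group(1).replace(",", ""))
    if m.group(2) in ("m²", "m2", "sqm"):
        value *= SQFT_PER_M2
    return round(value)
```

Returning None for anything unrecognized, rather than guessing, keeps bad parses visible in the same way a null price does.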

Dedup

Same listing across multiple portals, same listing posted twice on the same portal, same listing relisted after price drop. All real. The dedup key has to be robust to all three.

The cheapest robust key is (normalized_address, propertyType, bedrooms, listingType). Tighten with (latitude, longitude) if you have geocoding. For relisted listings, a soft dedup on (normalized_address, listedDate within 90 days) lets you collapse "same property, listed in March, relisted in May at lower price" into one record with a price history.

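from hashlib import sha256

def dedup_key(listing):
    # normalize_address is your own helper (e.g. USPS-style standardization).
    # propertyType and listingType must already be mapped to your enums,
    # otherwise "Condo" and "Condominium" hash to different keys.
    parts = [
        normalize_address(listing["address"]),
        listing["propertyType"],
        str(listing["bedrooms"]),
        listing["listingType"],
    ]
    # 16 hex chars keeps 64 bits of the digest as a compact key.
    return sha256("|".join(parts).encode()).hexdigest()[:16]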

Store the dedup key in your database with a unique index. On ingest, upsert by dedup key with the latest scrape's data.
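
A minimal sketch of that upsert, using the standard library's sqlite3 so it runs anywhere; PostgreSQL accepts the same INSERT ... ON CONFLICT ... DO UPDATE form. The table layout, the upsert_listing helper, and the sample key are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE listings (
        dedup_key  TEXT PRIMARY KEY,   -- primary key doubles as the unique index
        address    TEXT NOT NULL,
        price      REAL,
        scraped_at TEXT NOT NULL
    )
    """
)


def upsert_listing(conn, dedup_key, address, price, scraped_at):
    """Insert a new listing, or overwrite the stored row with the latest scrape."""
    conn.execute(
        """
        INSERT INTO listings (dedup_key, address, price, scraped_at)
        VALUES (?, ?, ?, ?)
        ON CONFLICT(dedup_key) DO UPDATE SET
            address    = excluded.address,
            price      = excluded.price,
            scraped_at = excluded.scraped_at
        """,
        (dedup_key, address, price, scraped_at),
    )
    conn.commit()


# The same property scraped twice, the second time after a price drop.
upsert_listing(conn, "ab12cd34ef56ab78", "123 MAIN ST", 875000.0, "2026-04-22")
upsert_listing(conn, "ab12cd34ef56ab78", "123 MAIN ST", 850000.0, "2026-05-10")
```

After both calls the table holds a single row carrying the later price. If you want a price history, write the old price to a separate table before the update rather than overwriting it.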

Refresh cadence

Real estate moves fast in some markets and slow in others. A blanket "scrape everything daily" pipeline burns money re-fetching listings that haven't changed. A smarter cadence is based on signal:

| Signal | Refresh |
| --- | --- |
| Active listing in hot market (LA, NYC, SF) | Daily |
| Active listing in slow market | Every 3 days |
| Listing under contract / pending | Weekly (track when it goes off-market) |
| Sold listing (terminal state) | Never (archive) |
| Listing older than 180 days, no price changes | Weekly |
| Listing with recent price changes | Daily for 7 days, then back off |

If your scraping API ships an application-layer result cache, it handles a chunk of this for free. Many APIs also let you override the cache TTL per request, which is useful for hot listings.
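
The cadence rules can be encoded as one small function that a scheduler calls per listing. A sketch under stated assumptions: the status strings, the HOT_MARKETS set, and the order in which overlapping rules win (terminal and pending states first, then recent price changes, then age, then market heat) are choices made here, not something the rules themselves dictate.

```python
from datetime import date, timedelta
from typing import Optional

HOT_MARKETS = {"LA", "NYC", "SF"}


def refresh_interval(
    status: str,
    market: str,
    listed: date,
    last_price_change: Optional[date],
    today: date,
) -> Optional[timedelta]:
    """Return how long to wait before re-scraping, or None to archive."""
    if status == "sold":
        return None                      # terminal state: never refresh
    if status == "pending":
        return timedelta(days=7)         # watch for it going off-market
    if last_price_change and (today - last_price_change).days <= 7:
        return timedelta(days=1)         # short-term daily watch
    if (today - listed).days > 180 and last_price_change is None:
        return timedelta(days=7)         # old listing with no price changes
    if market in HOT_MARKETS:
        return timedelta(days=1)
    return timedelta(days=3)
```

Storing the returned interval as a next_scrape_at timestamp on each row lets the discovery queue pull only the listings that are due.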

Geographic enrichment

Listings come with addresses, not coordinates. You'll want both for filtering, mapping, distance queries, and clustering. Two approaches:

  1. Geocode at ingest with a service like Mapbox or Google Geocoding. Per-request cost ~$0.005 at scale; cheaper if you cache.
  2. Bulk geocode periodically with a batch geocoding service. Cheaper per-record; longer feedback loop.

Cache the result by normalized address, not by raw input. Addresses get re-scraped many times; you only need to geocode each unique address once.
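
A sketch of caching by normalized address, written as a wrapper around whatever geocoding call you use. CachedGeocoder, simple_normalize, and fake_geocode are hypothetical names; fake_geocode stands in for a real provider request, and the in-memory dict would be a database table or Redis in practice.

```python
from typing import Callable, Dict, Optional, Tuple

Coords = Tuple[float, float]


class CachedGeocoder:
    """Wraps any geocoding callable and caches results by normalized address."""

    def __init__(
        self,
        geocode: Callable[[str], Optional[Coords]],
        normalize: Callable[[str], str],
    ):
        self._geocode = geocode
        self._normalize = normalize
        self._cache: Dict[str, Optional[Coords]] = {}
        self.calls = 0  # number of paid lookups actually made

    def lookup(self, raw_address: str) -> Optional[Coords]:
        key = self._normalize(raw_address)
        if key not in self._cache:
            self.calls += 1
            self._cache[key] = self._geocode(key)
        return self._cache[key]


def simple_normalize(address: str) -> str:
    """Placeholder normalizer: uppercase and collapse whitespace."""
    return " ".join(address.upper().split())


def fake_geocode(address: str) -> Optional[Coords]:
    """Stand-in for a real provider call; returns fixed coordinates."""
    return (37.7749, -122.4194)


geocoder = CachedGeocoder(fake_geocode, simple_normalize)
geocoder.lookup("123 Main St, San Francisco, CA 94103")
geocoder.lookup("  123 main st,  San Francisco, CA 94103 ")
```

Both lookups resolve to the same cache key, so geocoder.calls stays at 1 even though the raw strings differ.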

Cost math

Per-listing cost components, at typical scale (1M listings/month):

| Component | Per-listing cost | Notes |
| --- | --- | --- |
| Extraction API (Scale tier of a hosted vendor) | ~$0.001 | Tier pricing varies by vendor |
| Geocoding | ~$0.0001 | Mapbox bulk, with caching |
| Database (Postgres) | ~$0.00002 | Amortised across ingest + queries |
| Storage (images, raw HTML archive) | ~$0.00001 | S3 with lifecycle policies |
| Total | ~$0.0011 | Per fresh listing extracted |

For 1M listings/month: ~$1,000 in extraction plus ~$130 in auxiliary services. Compare that to the cost of building the bypass + extraction stack yourself.
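
The arithmetic is easy to re-run for your own volume. A small sketch using the per-listing estimates from the table; the rates are those rough figures, not quoted vendor prices, and the function name is hypothetical.

```python
# Per-listing estimates in USD, taken from the cost table.
PER_LISTING_USD = {
    "extraction": 0.001,
    "geocoding": 0.0001,
    "database": 0.00002,
    "storage": 0.00001,
}


def monthly_cost(listings_per_month: int) -> dict:
    """Break the monthly bill down by component, plus a total, in USD."""
    costs = {
        name: round(rate * listings_per_month, 2)
        for name, rate in PER_LISTING_USD.items()
    }
    costs["total"] = round(sum(costs.values()), 2)
    return costs


bill = monthly_cost(1_000_000)
```

At 1M listings/month, extraction accounts for about $1,000 of the bill and the three auxiliary components add about $130, so extraction pricing is the number worth negotiating.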

What about MLS?

For US residential listings, the underlying truth is the regional MLS (Multiple Listing Service). Most public portals (Zillow, Redfin, Realtor.com) license MLS feeds and republish.

If you're a licensed agent or working with one, you can subscribe directly to RESO Web API feeds (or legacy RETS feeds, which RESO has deprecated in favor of the Web API) and skip the public portals entirely. Authoritative, structured, no scraping.

If you're not a licensed agent, the public-portal route is your option, and it's a fine option. Just understand that you're scraping the redistribution, not the source. Stale or selectively-presented data is possible.

Scraping etiquette for real estate

Real-estate portals tend to be lawyered up and have anti-scraping infrastructure. Stay safe:

  • Stick to logged-out scraping of public listing pages.
  • Respect robots.txt.
  • Per-host concurrency under 5; jitter between requests.
  • Don't republish photos or descriptions verbatim. Extract structured facts (price, beds, baths, address), which are generally not copyrightable on their own.
  • If you're going to commercialize, get a lawyer to look at your specific use case. We covered the legal landscape in "Is web scraping legal in 2026".
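
Two of these rules are easy to enforce in code: filtering URLs against robots.txt and adding jitter between requests. A minimal sketch with the standard library's urllib.robotparser; the user-agent string, the sample robots.txt, and the delay values are assumptions.

```python
import random
from urllib.robotparser import RobotFileParser

USER_AGENT = "listing-pipeline-bot"  # hypothetical user-agent string


def build_robots(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that was already fetched elsewhere."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_urls(parser: RobotFileParser, urls: list[str]) -> list[str]:
    """Keep only the URLs robots.txt permits for our user agent."""
    return [u for u in urls if parser.can_fetch(USER_AGENT, u)]


def jittered_delay(base_seconds: float = 1.0, spread: float = 0.5) -> float:
    """Base delay plus or minus up to `spread` as a fraction of the base."""
    return base_seconds * (1 + random.uniform(-spread, spread))


ROBOTS = """User-agent: *
Disallow: /account/
Allow: /listing/
"""

parser = build_robots(ROBOTS)
ok = allowed_urls(
    parser,
    [
        "https://example-portal.com/listing/123",
        "https://example-portal.com/account/settings",
    ],
)
```

Only the listing URL survives the filter. Sleeping for jittered_delay() seconds between requests to the same host keeps the traffic from arriving in a fixed rhythm.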

Putting it together

The full pipeline:

sitemap discovery
    ↓
URL queue (Redis / SQS)
    ↓
extraction worker (calls Runo /extract or /batch)
    ↓
normalization (enum mapping, unit conversion)
    ↓
dedup (compute dedup_key, upsert)
    ↓
geocoding enrichment (cached by address)
    ↓
Postgres (with PostGIS for spatial queries)
    ↓
your application (search UI, alerts, analytics)

For a v1, you can compress the queue + workers + database into one Python script running on a single box. Scale out when you need to.

The complete cost for a side project with ~10K listings refreshed daily: ~$15-25/month on the Runo Starter tier. The same project at 1M listings/month: ~$1,100/month total. Either is significantly cheaper than rolling your own bypass + extraction stack.

What we ship vs what you build

Using Runo for the extraction layer:

  • Bypass (Cloudflare, Datadome, Akamai if a portal uses them) handled
  • Schema-driven extraction with type coercion
  • Per-host rate adaptation
  • Cancellable jobs
  • Result caching

What you still build:

  • The discovery loop (sitemaps, crawl seeds)
  • The dedup logic (dedup keys, upsert)
  • The normalization mappings (enums, units)
  • The geocoding integration
  • The query and presentation layer

The extraction layer is usually the part where vendors deliver the most value per engineering hour. The pipeline glue is yours either way.

TL;DR

  • Real-estate scraping has a stable schema spine: address, price, beds, baths, sq ft, dates, agent, images. Add category-specific fields for rentals, commercial, land.
  • Discovery via sitemap is cleanest; search-page crawl works when sitemaps are absent; undocumented JSON APIs are fastest but brittle.
  • Static-key extraction (schema pre-bound) cuts payload at scale.
  • Normalize: property type and listing type to enums, units to your standard (sq ft + acres), addresses with USPS standardization.
  • Dedup key: (normalized_address, propertyType, bedrooms, listingType). Tighten with lat/lng if geocoded.
  • Smart refresh cadence beats blanket daily scrapes. Hot listings daily, sold listings archived, price changes trigger short-term daily watch.
  • Total cost at 1M listings/month: ~$1,100 with Runo Scale tier + auxiliary services. Significantly cheaper than building the stack yourself.
  • Don't republish photos or descriptions; extract structured facts. Stay logged-out. Respect robots.txt.