Lead generation from public web data: a builder's guide

Lead generation is the highest-value scraping use case for most B2B companies, and also the one most likely to land you in legal trouble if done sloppily. Everyone wants to "scrape LinkedIn" because that's where the names live. The problem: LinkedIn's terms of service explicitly forbid it, the platform sues regularly, and the data quality from scraped LinkedIn profiles is often worse than alternatives that are fully legal.

This post is about building a lead-gen pipeline from public sources that produces better data, costs less, and doesn't risk your business. The legal context is in "Is web scraping legal in 2026?"; this post is the operational playbook.

What "qualified lead" actually means

A lead is the contact information of a person at a company who might buy your thing. "Qualified" means you have enough context to know whether they're a fit before reaching out. The fields that matter:

  • Person: name, role/title, work email, work phone (rare, optional)
  • Company: name, domain, industry, size, location, funding stage (B2B SaaS-relevant), tech stack (relevant for some products)
  • Context: why this person is a potential fit (recent hiring, a job posting matching your ICP, a public statement about a problem you solve)

The "context" field is what separates good outbound from spam. A list of 10K names with no context is junk. A list of 200 names where each has a documented reason to care converts.

Where the data actually lives

Three categories of public source, ordered by quality:

Tier 1: Company websites

The single highest-quality source. The company tells you directly:

  • What they do (/about, /products)
  • Who works there (/team, /leadership)
  • What roles they're hiring (/careers, also a buying signal)
  • Their tech stack (footer logos, blog posts, conference talks)
  • Contact patterns (info@, often deducible from team page)

Limitations: most company sites don't list every employee. You'll get leadership and a sample, not a full org chart.

Tier 2: Public registries and directories

Government registries and other structured directories are gold for verifiable company facts:

  • SEC filings (US public companies): EDGAR has 10-K, 10-Q, S-1, proxy statements. Officer/director names, executive comp, business descriptions, all public.
  • Companies House (UK): officer details, filing history, registered address.
  • OpenCorporates: aggregated registry data across 140+ jurisdictions.
  • Crunchbase, PitchBook (paid APIs): funding and exec data, structured.
  • GitHub: org pages, contributor profiles, tech stack signal.
  • Conference speaker pages: people who care enough about a topic to give a talk are warm leads for related products.
  • Patent filings (USPTO, EPO): inventor names tied to companies, technical depth signal.
  • Job boards (company career pages, LinkedIn Jobs, We Work Remotely, Wellfound, formerly AngelList Talent): roles open right now signal what the company is investing in.

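Most of these are explicitly public, and registry filings are often legally required to be public. The free sources can almost always be scraped politely; for the paid ones (Crunchbase, PitchBook), use the API you are paying for.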

Tier 3: Aggregator sites

Apollo, ZoomInfo, Clearbit, Hunter.io, UpLead. These build their datasets by aggregating across many sources (some scraping, some submitted, some emails verified by sending).

You can either subscribe to one of them and skip the build, or you can build your own pipeline against the underlying sources. Build-vs-buy math depends on your volume:

  • Under 1,000 leads/month: subscribe to an aggregator
  • 1K to 50K leads/month: hybrid (aggregator for the easy stuff, your pipeline for the niches)
  • Over 50K leads/month: build, with one or two aggregators as fallback enrichment

Avoid: scraping LinkedIn directly

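The Ninth Circuit's hiQ v. LinkedIn rulings (covered in our legal post) held that scraping publicly accessible pages likely isn't a CFAA violation. But the case ended in 2022 with hiQ found in breach of LinkedIn's user agreement, and the practical risks remain: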

  • LinkedIn's TOS forbids automated access; logged-in scraping breaches the contract you agreed to at signup.
  • LinkedIn is well-funded for legal action and pursues it.
  • The data quality from scraped LinkedIn (especially if you're hiding your scraping) is often worse than what you'd get by combining 2-3 explicit public sources.

If you specifically need LinkedIn data, the safe path is LinkedIn's official Sales Navigator product and its partner integrations, or a vendor that can document how it licenses its data (ask for this in writing before relying on it). Yes, it costs money. The alternative costs more in the long run.

A reference pipeline

For a B2B SaaS company building outbound on US tech companies, a reference pipeline:

Step 1: build a company list
    ↓
Step 2: enrich each company (size, industry, funding, tech stack)
    ↓
Step 3: identify decision-makers per company
    ↓
Step 4: find work emails for those decision-makers
    ↓
Step 5: capture context (why this person, why now)
    ↓
Step 6: dedup, score, hand to sales

Let's walk through each step with concrete tools.

Step 1: build a company list

Sources for the seed list:

  • Crunchbase (paid) for funded startups in your ICP categories
  • BuiltWith for companies using a specific technology
  • GitHub org search for companies with a public repo footprint
  • Job boards (Greenhouse, Lever, Ashby) for companies hiring specific roles
  • Conference attendee/speaker lists

You'll end up with 1K-50K company domains. Dedup on root domain (example.com, not app.example.com).
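
A minimal sketch of root-domain dedup, using only the standard library. The names `root_domain`, `dedup_domains`, and the small `TWO_LABEL_SUFFIXES` set are illustrative assumptions; a production pipeline would more likely consult the full Public Suffix List than a hand-picked set.

```python
from urllib.parse import urlsplit

# Assumption: a tiny hand-picked set of two-label public suffixes.
# A real pipeline should consult the full Public Suffix List instead.
TWO_LABEL_SUFFIXES = {"co.uk", "org.uk", "com.au", "co.jp", "com.br"}


def root_domain(url_or_host: str) -> str:
    """Reduce a URL or bare hostname to its root domain."""
    value = url_or_host.strip().lower()
    # urlsplit only fills in .hostname when a scheme separator is present.
    if "://" not in value:
        value = "//" + value
    host = (urlsplit(value).hostname or "").rstrip(".")
    labels = host.split(".")
    if len(labels) <= 2:
        return host
    suffix = ".".join(labels[-2:])
    keep = 3 if suffix in TWO_LABEL_SUFFIXES else 2
    return ".".join(labels[-keep:])


def dedup_domains(raw: list[str]) -> list[str]:
    """Return unique root domains, preserving first-seen order."""
    seen: dict[str, None] = {}
    for item in raw:
        domain = root_domain(item)
        if domain:
            seen.setdefault(domain, None)
    return list(seen)
```

Calling `dedup_domains(["app.example.com", "https://www.example.com", "other.io"])` collapses the first two entries into a single `example.com` record.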

Step 2: enrich each company

For each company domain, fetch the company website and extract a structured profile:

[
  { "field": "companyName",   "type": "string",        "example": "Acme Corp" },
  { "field": "tagline",       "type": "string",        "example": "Modern infrastructure for X" },
  { "field": "industry",      "type": "string",        "example": "B2B SaaS" },
  { "field": "headquarters",  "type": "string",        "example": "San Francisco, CA" },
  { "field": "employeeCount", "type": "string",        "example": "11-50" },
  { "field": "foundedYear",   "type": "integer",       "example": 2021 },
  { "field": "products",      "type": "array<string>", "example": ["Acme Cloud", "Acme Edge"] },
  { "field": "technologies",  "type": "array<string>", "example": ["React", "Postgres", "AWS"] }
]

Run this once per company against the company's /about or homepage. Schema-driven extraction (e.g. with Runo) handles the variation across thousands of differently-designed company sites without you writing per-site selectors.

For BuiltWith-style tech-stack data, you can also pull /robots.txt, /.well-known/security.txt, and inspect HTTP response headers for clues (Cloudflare, Vercel, CloudFront, AWS).
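
As a rough illustration of the header inspection, the sketch below maps a response's headers to infrastructure hints. The function name `infra_clues` is hypothetical, and the header checks are heuristics based on commonly observed headers (`CF-RAY`, `X-Vercel-Id`, `X-Amz-Cf-Id`, `Via`, `Server`), not an exhaustive or guaranteed fingerprint. It takes a plain dict, so it works with whatever HTTP client you already use.

```python
def infra_clues(headers: dict[str, str]) -> set[str]:
    """Guess hosting/CDN vendors from HTTP response headers (heuristic)."""
    # Header names are case-insensitive, so normalise before matching.
    h = {name.lower(): value.lower() for name, value in headers.items()}
    server = h.get("server", "")
    clues: set[str] = set()

    if "cloudflare" in server or "cf-ray" in h:
        clues.add("Cloudflare")
    if "vercel" in server or "x-vercel-id" in h:
        clues.add("Vercel")
    if "x-amz-cf-id" in h or "cloudfront" in h.get("via", ""):
        clues.add("CloudFront")
    # Any x-amz-* header (S3, CloudFront, API Gateway) suggests AWS.
    if "amazons3" in server or any(name.startswith("x-amz-") for name in h):
        clues.add("AWS")
    return clues
```

Treat the result as a weak signal to merge into the `technologies` field, since a CDN in front of a site says little about what runs behind it.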

Step 3: identify decision-makers

For each company, scrape the team or leadership page:

[
  { "field": "name",  "type": "string", "example": "Sarah Chen" },
  { "field": "title", "type": "string", "example": "VP Engineering" },
  { "field": "bio",   "type": "string", "example": "Sarah leads the engineering team..." }
]

A /team page usually lists many people at once, so define the schema as an array of these objects. If each person also has their own detail page, crawl those pages and apply the single-person schema to each one.

For enterprise companies that don't list staff publicly, supplement with:

  • Recent press releases mentioning hires
  • SEC proxy statements (US public companies must disclose officer comp)
  • Conference speaker pages (find the speakers who work at target companies)
  • GitHub contributor data (for engineering hires at companies with open source)

Step 4: find work emails

Two approaches:

  1. Pattern guessing + verification. Most companies use one of ~12 email patterns (first.last@, first@, flast@, etc.). Determine the pattern from one known address (often easy to find on the company site or via a press release), then generate candidates for other staff and verify with an email-verification service (NeverBounce, ZeroBounce, Hunter).
  2. Email-finding APIs. Hunter, Apollo, and similar return work emails for a given (name, domain) pair. Per-lookup cost ~$0.05-0.10.

The pattern + verification approach is cheaper at scale; the API approach is faster to ship.
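
The pattern-guessing half can be sketched as follows. The pattern names, the `PATTERNS` table, and the helper functions are illustrative assumptions covering only a handful of common formats; the names in the usage note are made up. Verification is deliberately left out: candidates still need to go through a verification service before anyone sends to them.

```python
from typing import Callable, Optional

# Assumption: a small subset of common local-part patterns.
PATTERNS: dict[str, Callable[[str, str], str]] = {
    "first.last": lambda f, l: f"{f}.{l}",
    "first_last": lambda f, l: f"{f}_{l}",
    "firstlast": lambda f, l: f"{f}{l}",
    "flast": lambda f, l: f"{f[0]}{l}",
    "firstl": lambda f, l: f"{f}{l[0]}",
    "first": lambda f, l: f,
}


def split_name(full_name: str) -> tuple[str, str]:
    """Return (first, last) in lowercase, letters only."""
    parts = ["".join(ch for ch in p if ch.isalpha()) for p in full_name.lower().split()]
    parts = [p for p in parts if p]
    if len(parts) < 2:
        raise ValueError(f"need first and last name, got {full_name!r}")
    return parts[0], parts[-1]


def infer_pattern(known_name: str, known_email: str) -> Optional[str]:
    """Find which pattern produces the known address, if any."""
    first, last = split_name(known_name)
    local = known_email.split("@", 1)[0].lower()
    for name, build in PATTERNS.items():
        if build(first, last) == local:
            return name
    return None


def candidate_email(full_name: str, domain: str, pattern: str) -> str:
    """Generate an unverified candidate address for another employee."""
    first, last = split_name(full_name)
    return f"{PATTERNS[pattern](first, last)}@{domain.lower()}"
```

For example, if a press release shows Sarah Chen at `schen@acme.com`, `infer_pattern` returns `"flast"`, and `candidate_email("Dana Whitfield", "acme.com", "flast")` yields `dwhitfield@acme.com` as a candidate to verify.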

Step 5: capture context

This is the step that separates qualified leads from a list. For each lead, capture why this person at this company is a fit. Patterns:

  • They posted a job opening matching your ICP signal (hiring a "VP Data" if you sell to data leaders)
  • The company recently raised a funding round (Crunchbase signal)
  • They published a blog post or talk about a problem you solve
  • They added a tech stack signal that aligns with your product
  • They appeared in a recent industry list (Forbes, FastCompany, Inc.)
  • They moved roles recently (job change is a buying-window signal)

Each context type has its own scrape source. The pipeline annotates each lead with which signals fired.
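
One way to structure the annotation step is a table of small detector functions, each reading the evidence already scraped for a lead. Everything here is a hypothetical sketch: the evidence keys (`open_roles`, `days_since_funding`, `months_in_role`), the keyword list, and the time windows are assumptions to tune against your own ICP.

```python
from typing import Callable

# Assumption: role keywords that indicate an ICP match.
ICP_ROLE_KEYWORDS = ("vp data", "head of data")


def hiring_match(evidence: dict) -> bool:
    """True if any open role title contains an ICP keyword."""
    roles = evidence.get("open_roles", [])
    return any(kw in role.lower() for role in roles for kw in ICP_ROLE_KEYWORDS)


def recent_funding(evidence: dict) -> bool:
    """True if the company raised within the last 180 days (assumed window)."""
    days = evidence.get("days_since_funding")
    return days is not None and days <= 180


def recent_job_change(evidence: dict) -> bool:
    """True if the person has been in role six months or less (assumed window)."""
    months = evidence.get("months_in_role")
    return months is not None and months <= 6


DETECTORS: dict[str, Callable[[dict], bool]] = {
    "hiring_match": hiring_match,
    "recent_funding": recent_funding,
    "recent_job_change": recent_job_change,
}


def fired_signals(evidence: dict) -> list[str]:
    """Names of every signal that fired for this lead's evidence."""
    return [name for name, detect in DETECTORS.items() if detect(evidence)]
```

Keeping detectors independent means a new context source only adds one function and one table entry, and the fired signal names can be quoted directly in the first outbound message.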

Step 6: dedup, score, hand to sales

Dedup on (work_email) if you have it, otherwise (name, domain). Score on (number of context signals fired) × (ICP fit score). Hand the top decile to sales.
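
The dedup and scoring rule above can be sketched like this. The `Lead` dataclass and its field names are assumptions for illustration, as is the choice to treat `icp_fit` as a 0-to-1 score and to merge signals from duplicate records into the first one seen.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Lead:
    name: str
    domain: str
    work_email: Optional[str] = None
    signals: set[str] = field(default_factory=set)
    icp_fit: float = 0.0  # assumption: 0.0 (no fit) to 1.0 (perfect fit)


def dedup_key(lead: Lead) -> tuple:
    """Key on work email when present, otherwise on (name, domain)."""
    if lead.work_email:
        return ("email", lead.work_email.lower())
    return ("name_domain", lead.name.lower(), lead.domain.lower())


def dedup(leads: list[Lead]) -> list[Lead]:
    """Keep the first record per key and merge signals from duplicates into it."""
    unique: dict[tuple, Lead] = {}
    for lead in leads:
        kept = unique.setdefault(dedup_key(lead), lead)
        if kept is not lead:
            kept.signals |= lead.signals
    return list(unique.values())


def score(lead: Lead) -> float:
    """(number of context signals fired) x (ICP fit score)."""
    return len(lead.signals) * lead.icp_fit


def top_decile(leads: list[Lead]) -> list[Lead]:
    """Highest-scoring 10% of deduplicated leads (at least one if any exist)."""
    ranked = sorted(dedup(leads), key=score, reverse=True)
    return ranked[: max(1, len(ranked) // 10)]
```

A multiplicative score means a lead with zero signals or zero ICP fit scores zero, which matches the point that names without context are not worth handing to sales.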

Cost math

For a pipeline producing 10K qualified leads/month:

Step                                                           Cost
Company list source (Crunchbase / BuiltWith / etc.)            $300-1,500/mo
Company-page extraction (10K companies × 2 pages × ~$0.001)    ~$20
Team-page extraction (10K companies × 1 page × ~$0.001)        ~$10
Email verification (10K × $0.005)                              $50
Job-posting / context scraping                                 ~$30
Total scraping infrastructure (Runo Pro tier)                  $59/mo

Total: roughly $470-$1,670/mo for 10K qualified leads. Compare to:

  • Buying equivalent leads from Apollo / ZoomInfo: $1,000-$2,500/mo for similar volume but lower context quality
  • Hiring a sales researcher to build manually: ~$5,000/mo for ~2K leads

The pipeline wins on cost and quality if you have engineering bandwidth to build it once.
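
To make the arithmetic checkable, here is a small sketch that recomputes the total from the figures quoted in this section. The variable and function names are illustrative; swap in your own volumes and unit prices.

```python
LEADS_PER_MONTH = 10_000

# Per-unit prices and flat fees as quoted in this section (USD per month).
MONTHLY_ITEMS = {
    "company-page extraction": LEADS_PER_MONTH * 2 * 0.001,
    "team-page extraction": LEADS_PER_MONTH * 1 * 0.001,
    "email verification": LEADS_PER_MONTH * 0.005,
    "context scraping": 30.0,
    "Runo Pro tier": 59.0,
}
LIST_SOURCE_RANGE = (300.0, 1500.0)  # company list source, low and high


def pipeline_total_range() -> tuple[float, float]:
    """Monthly pipeline cost at the low and high list-source price."""
    base = sum(MONTHLY_ITEMS.values())
    low, high = LIST_SOURCE_RANGE
    return base + low, base + high


def cost_per_lead(monthly_cost: float, leads: int) -> float:
    """Average cost of one lead at a given monthly spend and volume."""
    return monthly_cost / leads


if __name__ == "__main__":
    low, high = pipeline_total_range()
    print(f"pipeline: ${low:,.0f}-${high:,.0f}/mo")
    print(f"per lead: ${cost_per_lead(low, LEADS_PER_MONTH):.3f}-"
          f"${cost_per_lead(high, LEADS_PER_MONTH):.3f}")
    print(f"manual researcher per lead: ${cost_per_lead(5000, 2000):.2f}")
```

The line items other than the list source sum to about $169, which is where the $470-$1,670 range comes from once the $300-1,500 list source is added.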

Privacy and personal data

Scraping personal data (names, work emails, titles) triggers GDPR for any EU resident in your dataset. The defensive posture:

  1. Document a Legitimate Interests Assessment (LIA) per EDPB guidance. B2B prospecting is a recognized legitimate interest, but you need the assessment on file.
  2. Build a deletion pipeline. Data subjects can request deletion; you need to honor it within one month.
  3. Don't scrape sensitive categories. Health, ethnicity, political opinion, sexual orientation. These are special categories under GDPR and require explicit consent.
  4. Honor opt-outs. If someone replies "remove me," remove them from your database, not just the next campaign.
  5. Don't enrich with intimate personal details. Personal social media, partner names, kids' schools. None of that should appear in your CRM. Stay strictly professional.
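
Items 2 and 4 can share one mechanism: delete the record and remember that the address must never be re-imported. The sketch below is one possible design, not a compliance recipe. The names (`SuppressionList`, `handle_removal_request`, `import_lead`) and the in-memory dict standing in for the lead database are assumptions, and whether storing a hash of the address is acceptable in your situation is a question for counsel.

```python
import hashlib


def _fingerprint(email: str) -> str:
    """Stable hash of a normalised address, so the raw email need not be kept."""
    return hashlib.sha256(email.strip().lower().encode("utf-8")).hexdigest()


class SuppressionList:
    """Remembers removed addresses so later scrapes cannot re-add them."""

    def __init__(self) -> None:
        self._hashes: set[str] = set()

    def add(self, email: str) -> None:
        self._hashes.add(_fingerprint(email))

    def blocks(self, email: str) -> bool:
        return _fingerprint(email) in self._hashes


def handle_removal_request(email: str, leads: dict[str, dict],
                           suppression: SuppressionList) -> bool:
    """Suppress the address and delete its record; True if a record existed."""
    suppression.add(email)
    return leads.pop(email.strip().lower(), None) is not None


def import_lead(record: dict, leads: dict[str, dict],
                suppression: SuppressionList) -> bool:
    """Store a scraped lead unless its address has been suppressed."""
    email = record["work_email"].strip().lower()
    if suppression.blocks(email):
        return False
    leads[email] = record
    return True
```

Checking the suppression list at import time is what makes "remove them from your database, not just the next campaign" hold after the next scrape run.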

CCPA (California) is similar but less stringent. Many other US states have followed (Colorado, Virginia, Connecticut, Utah, Texas as of 2025). Build for the strictest jurisdiction your prospects might be in, which in practice is GDPR.

Email deliverability

Scraped lead lists often get burned by senders who blast them without warming up. The work email is technically valid, but the sending reputation tanks because:

  • High bounce rate from stale data
  • Low engagement from cold recipients
  • Spam complaints from people who didn't expect contact

Defensive practices:

  • Verify every email before sending; cull bounces aggressively
  • Warm up sending domains with low-volume engagement first
  • Use a separate sending domain or subdomain (e.g. mail.yourcompany.com) to protect your primary domain reputation
  • Personalize each first message with at least one specific signal from your context capture
  • Stop sending to a thread the moment the recipient says no

Where Runo fits

Runo handles the extraction layer: pull company sites, team pages, job postings, press releases, SEC filings, and conference pages, and get back typed JSON. The schema-driven approach means you write the schema once and it works across thousands of differently-designed sites without per-site code. Combined with the bypass stack (Cloudflare, DataDome, etc. handled server-side), it removes the "scrape everything" plumbing from your pipeline.

You build the orchestration: discovery, dedup, scoring, CRM integration, GDPR compliance, sending pipeline.

The free Runo tier is 500 requests/month, enough to validate the pipeline against a few hundred prospects before scaling up.

TL;DR

  • Best lead-gen sources are company websites, public registries (SEC EDGAR, Companies House), and job boards. Avoid logged-in LinkedIn scraping; the legal exposure isn't worth it.
  • Pipeline: company list → company enrichment → decision-maker identification → email finding → context capture → scoring → handoff.
  • Schema-driven extraction (e.g. via Runo) handles variation across thousands of differently-designed company sites without per-site selectors.
  • Context capture (job postings, funding signals, recent talks) is what separates qualified leads from spam fodder. Build for context, not just emails.
  • 10K qualified leads/month costs ~$470-$1,670 to produce yourself; buying equivalent volume from Apollo/ZoomInfo is $1K-$2.5K with lower context quality.
  • GDPR applies to EU residents' data: document an LIA, build a deletion pipeline, honor opt-outs, and keep enrichment strictly professional.
  • Email deliverability matters: verify before sending, warm up domains, separate sending domain, personalize, stop on first no.