Product reviews are the highest-signal source of customer truth on the internet, and most companies barely use them. The reasons are predictable: the data is hard to get out of review platforms, the volume is too high to read manually, and naive sentiment analysis ("positive vs negative") loses the actually useful information.
This post is a working pipeline that solves all three. The output is per-review structured data that includes overall sentiment, aspect-level sentiment (price, quality, shipping, support), specific complaints, and feature requests. You can pipe that into a dashboard, a competitive analysis, or a product roadmap.
What we're trying to extract#
A naive sentiment pipeline gives you {review_text: "...", sentiment: "positive"}. That's almost worthless. The actually useful schema:
[
{ "field": "reviewerName", "type": "string", "example": "Sarah M." },
{ "field": "rating", "type": "float", "example": 4.5 },
{ "field": "reviewDate", "type": "date", "example": "2026-04-15" },
{ "field": "verifiedPurchase", "type": "boolean", "example": true },
{ "field": "title", "type": "string", "example": "Great battery life, mediocre camera" },
{ "field": "body", "type": "string", "example": "I've had this phone for two months..." },
{ "field": "overallSentiment", "type": "string", "example": "positive" },
{ "field": "aspectSentiments", "type": "array<string>", "example": ["battery: positive", "camera: negative"] },
{ "field": "specificIssues", "type": "array<string>", "example": ["camera focus is slow in low light"] },
{ "field": "featureRequests", "type": "array<string>", "example": ["wireless charging support"] },
{ "field": "useCase", "type": "string", "example": "primary phone for travel photography" }
]
That schema gives you data you can actually act on. "23 of last month's reviews mentioned slow camera focus" is a product fix. "47% of negative reviews mention shipping" is an ops fix. "The most-requested feature is wireless charging" is a roadmap item.
Why this works with LLM extraction#
Aspect-based sentiment is the place where traditional NLP libraries (VADER, TextBlob, even fine-tuned BERTs) fall apart. They give you per-document sentiment fine, but they can't separate "the battery is great" from "the camera is terrible" within the same paragraph. You need a model that understands the review as a whole and attributes feelings to specific aspects.
LLMs do this natively. Pass the review text + a schema with aspectSentiments as an array, and the model returns the breakdown. This is the same pattern as extracting structured JSON from any HTML, applied to a specific use case.
The example value in the schema (["battery: positive", "camera: negative"]) tells the model the format you want. Without that anchor, you'd get inconsistent shapes (["battery", "positive"] one time, {aspect: "battery", sentiment: "positive"} the next). With it, the model produces the same shape every call.
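Downstream code still has to split those "aspect: sentiment" strings back apart. A minimal sketch of that step, assuming the four-value sentiment vocabulary discussed later in this post; the function and type names here are illustrative, not part of any API:

```python
from typing import List, NamedTuple

# Assumed vocabulary; adjust to whatever values your schema allows.
ALLOWED = {"positive", "neutral", "negative", "mixed"}


class AspectSentiment(NamedTuple):
    aspect: str
    sentiment: str


def parse_aspect_sentiments(items: List[str]) -> List[AspectSentiment]:
    """Parse "aspect: sentiment" strings, skipping entries that do not match the shape."""
    parsed = []
    for item in items:
        # Split on the last colon so an aspect name containing a colon still parses.
        aspect, sep, sentiment = item.rpartition(":")
        aspect, sentiment = aspect.strip().lower(), sentiment.strip().lower()
        if sep and aspect and sentiment in ALLOWED:
            parsed.append(AspectSentiment(aspect, sentiment))
    return parsed
```

Entries that fail the check are dropped silently here; in a real pipeline you would probably count them, since a rising reject rate means the model has drifted from the anchored format.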
The fetch loop#
Two stages: discover review URLs, extract from each.
Discovering review URLs#
For most e-commerce platforms, reviews are paginated under a product URL:
- Amazon: /product/.../reviews?pageNumber=N (logged-in scraping not advised; use the public review fragment)
- Trustpilot: /review/{domain}?page=N. Clean public URLs
- G2 / Capterra (B2B SaaS reviews): /products/{slug}/reviews?page=N
- Yelp: /biz/{slug}?start={offset}
- Google Maps reviews: harder; usually requires the Places API rather than scraping
The cleanest cases are Trustpilot, G2, and Capterra. All three are explicitly designed for sharing and have stable review URLs.
For each product, you walk the pagination until you hit the last page or a date threshold (e.g., "stop after 6 months of history"). Use Runo's /crawl endpoint with a follow_pattern matching the next-page URL:
curl -X POST https://api.scrapewithruno.com/v1/crawl \
-H "X-API-Key: $RUNO_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"seed_url": "https://www.trustpilot.com/review/example.com",
"schema": [...the schema above...],
"crawl": {
"follow_pattern": "https://www.trustpilot.com/review/example.com?page=*",
"max_pages": 50,
"max_depth": 1
}
}'
/crawl reserves max_pages from your quota upfront and refunds unused. If the product only has 12 pages of reviews, you pay for 12, not 50.
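The date threshold ("stop after 6 months of history") can be enforced client-side on the extracted reviewDate values. This is a sketch under two assumptions: pages come back newest-first, and dates arrive as ISO strings like the schema example. The function name and the 183-day default are hypothetical.

```python
from datetime import date, timedelta
from typing import List


def page_is_past_cutoff(review_dates: List[str], today: date, max_age_days: int = 183) -> bool:
    """Return True when every review on the page is older than the cutoff.

    An empty page returns False so that a failed extraction does not
    silently end the walk.
    """
    cutoff = today - timedelta(days=max_age_days)
    parsed = [date.fromisoformat(d) for d in review_dates]
    return bool(parsed) and all(d < cutoff for d in parsed)
```

With newest-first ordering, the first page for which this returns True is a reasonable place to stop processing.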
Extracting from review pages#
A review page has many reviews. Two ways to get them out:
- Define a schema with array-of-objects fields and let the LLM return all reviews on the page in one call.
- Have the LLM return one "primary review" per call and crawl per-review URLs (when the platform exposes them, like Trustpilot).
Option 1 is cheaper and faster (one LLM call per page). Option 2 produces cleaner data when reviews have rich detail pages with photos, "helpful" counts, and replies.
For most use cases, option 1 wins. The schema becomes:
[
{
"field": "reviews",
"type": "array<string>",
"example": ["{name: 'Sarah', rating: 5, body: '...'}"]
}
]
Then you post-process the array into structured records. If your extraction API supports nested schemas, use one instead; Runo uses a flat schema by design, so there you extract each review to a string and parse it.
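The post-processing step might look like the following. It assumes you have anchored the example so the model emits each review as a valid JSON object string (the loose {name: 'Sarah', ...} form would not parse with a JSON parser), and that name, rating, and body are the keys you asked for. The function name is illustrative.

```python
import json
from typing import Dict, List, Tuple

# Keys taken from the example above; extend to match your own schema.
REQUIRED_KEYS = {"name", "rating", "body"}


def parse_review_strings(raw_reviews: List[str]) -> Tuple[List[Dict], List[str]]:
    """Split extracted strings into parsed records and rejected raw strings."""
    records, rejected = [], []
    for raw in raw_reviews:
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            rejected.append(raw)
            continue
        # Keep only JSON objects that carry every required key.
        if isinstance(record, dict) and REQUIRED_KEYS.issubset(record):
            records.append(record)
        else:
            rejected.append(raw)
    return records, rejected
```

Keeping the rejected strings lets you re-run or inspect the pages where the model broke format, without losing them.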
A cleaner Runo pattern: extract page-level summary fields (overall product rating, review count, distribution) plus the most prominent review per page, and rely on /crawl to walk all pages.
Sentiment computation#
The schema above asks the LLM for sentiment directly. Some patterns to make this reliable:
Use specific aspects per category#
A "battery" aspect makes sense for phones; nonsensical for shoes. The aspects you ask for should match the product category. Two ways:
- Pre-curated aspect lists per category: maintain a mapping (phones → [battery, camera, screen, performance, build], shoes → [comfort, durability, sizing, style]).
- Open-ended aspect extraction: ask the LLM to identify aspects from the review without prompting, then post-cluster across reviews.
Option 1 produces more consistent dashboards. Option 2 catches surprises ("nobody complains about the battery, but 30% mention the speaker quality").
A hybrid that works well: ask the LLM for both. Pre-curated aspects always evaluated; "other aspects mentioned" as an additional array.
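One way to wire up the hybrid is to generate the two schema fields from the category mapping. This is a sketch: the otherAspects field name is made up for illustration, and it assumes hint is accepted on array fields the same way it is on string fields.

```python
from typing import Dict, List

# Mapping from the text above; extend per category you track.
CATEGORY_ASPECTS: Dict[str, List[str]] = {
    "phones": ["battery", "camera", "screen", "performance", "build"],
    "shoes": ["comfort", "durability", "sizing", "style"],
}


def aspect_fields(category: str) -> List[dict]:
    """Build a pre-curated aspect field plus an open-ended catch-all field."""
    aspects = CATEGORY_ASPECTS[category]
    return [
        {
            "field": "aspectSentiments",
            "type": "array<string>",
            "example": [f"{aspects[0]}: positive", f"{aspects[1]}: negative"],
            "hint": "Evaluate each of: " + ", ".join(aspects),
        },
        {
            # Hypothetical field for aspects outside the curated list.
            "field": "otherAspects",
            "type": "array<string>",
            "example": ["speaker: negative"],
            "hint": "Aspects the reviewer mentions that are not covered by aspectSentiments",
        },
    ]
```

Calling aspect_fields("shoes") yields fields whose example and hint mention comfort and durability rather than battery and camera, so the anchor matches the category.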
Anchor the sentiment vocabulary#
Sentiment is a string field. Without anchoring, you'll get "positive", "good", "4/5", "mostly positive", "happy", all meaning roughly the same thing. The example value pins it:
{ "field": "overallSentiment", "type": "string", "example": "positive" }
Better: use an explicit hint that names the allowed values:
{
"field": "overallSentiment",
"type": "string",
"example": "positive",
"hint": "One of: positive, neutral, negative, mixed"
}
The LLM treats the hint as constrained-output guidance. Almost all responses will conform.
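"Almost all" still leaves a tail, so it is worth validating on your side. A minimal sketch, where the synonym table is a hypothetical starting point to grow from values you observe in your own data:

```python
from typing import Optional

ALLOWED_SENTIMENTS = ("positive", "neutral", "negative", "mixed")

# Hypothetical synonym table; extend it from off-vocabulary values you actually see.
SYNONYMS = {"good": "positive", "happy": "positive", "bad": "negative"}


def normalize_sentiment(value: str) -> Optional[str]:
    """Map a raw model output onto the allowed vocabulary, or None if unknown."""
    cleaned = value.strip().lower()
    if cleaned in ALLOWED_SENTIMENTS:
        return cleaned
    return SYNONYMS.get(cleaned)
```

Returning None instead of guessing keeps unknown values such as "4/5" visible, so you can decide whether to add a mapping or tighten the hint.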
Detect mixed sentiment correctly#
A review that says "Great phone but terrible support" is neither positive nor negative. It's mixed. If your schema only allows positive/negative/neutral, the LLM has to pick one and you lose information.
Add mixed as an allowed sentiment value and watch how often it shows up. On consumer products, mixed is often 15-30% of reviews and contains the most actionable signal (the customer wanted to like it but something specific stopped them).
Aggregation: from per-review to insights#
Per-review structured data is the input. The output is dashboards and roadmap items. Patterns:
Aspect-level rolling averages#
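Group by aspect and week, then average sentiment-as-numeric (positive=1, neutral=0, negative=-1, mixed=-0.3). The query below buckets by week; widen it to a rolling 30-day window if weekly counts are too small to be stable.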
SELECT
aspect,
DATE_TRUNC('week', review_date) AS week,
AVG(CASE
WHEN sentiment = 'positive' THEN 1
WHEN sentiment = 'negative' THEN -1
WHEN sentiment = 'mixed' THEN -0.3
ELSE 0
END) AS avg_sentiment,
COUNT(*) AS n
FROM aspect_sentiments
GROUP BY aspect, week;
Plot. Drop in sentiment for shipping over four weeks? Investigate before customer support is buried.
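The "drop over four weeks" check can be automated on the weekly averages that query produces. A sketch, assuming the scores for one aspect are passed oldest-first; the 0.2 threshold is an arbitrary illustrative default, not a recommendation:

```python
from typing import List


def sustained_drop(weekly_scores: List[float], weeks: int = 4, min_drop: float = 0.2) -> bool:
    """True when the last `weeks` values never rise and fall by at least `min_drop` overall."""
    if len(weekly_scores) < weeks:
        return False
    window = weekly_scores[-weeks:]
    # Compare each week with the one before it.
    non_increasing = all(later <= earlier for earlier, later in zip(window, window[1:]))
    return non_increasing and (window[0] - window[-1]) >= min_drop
```

Requiring both conditions avoids alerting on a single noisy week or on a slow drift that stays within the threshold.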
Issue clustering#
specificIssues is an array of free-text complaints. Cluster across reviews to find the top themes. Use either embedding-based clustering (sentence-transformers + HDBSCAN) or LLM-based clustering with a "group similar issues" prompt.
from sentence_transformers import SentenceTransformer
import hdbscan
issues = [...]  # placeholder: flat list of every "specificIssues" string across reviews
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(issues)  # one embedding vector per issue string
clusterer = hdbscan.HDBSCAN(min_cluster_size=5)  # ignore themes with fewer than 5 mentions
clusters = clusterer.fit_predict(embeddings)  # one cluster label per issue
Each cluster is a recurring complaint. HDBSCAN assigns the label -1 to points it treats as noise, so exclude those before ranking. Surface the top 10 to product weekly.
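Turning the label array into a ranked list takes a few lines. This sketch takes the issues list and the clusters labels from the snippet above as plain sequences; the helper name is hypothetical, and using the first member as the cluster's representative is a simplification.

```python
from collections import defaultdict
from typing import List, Sequence, Tuple


def top_clusters(issues: Sequence[str], labels: Sequence[int], top_n: int = 10) -> List[Tuple[int, str]]:
    """Return (size, example_issue) for the largest clusters, skipping noise (-1)."""
    members = defaultdict(list)
    for issue, label in zip(issues, labels):
        if label != -1:  # HDBSCAN marks unclustered points with -1
            members[int(label)].append(issue)
    # sorted() is stable, so equally sized clusters keep first-seen order.
    ranked = sorted(members.values(), key=len, reverse=True)
    return [(len(group), group[0]) for group in ranked[:top_n]]
```

For a nicer report, you could replace the first-member example with the issue closest to the cluster centroid, or ask an LLM to name each cluster.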
Feature request leaderboard#
Same pattern on featureRequests. The list, sorted by frequency, is your candidate roadmap.
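Before reaching for clustering, a plain frequency count is a useful baseline. Note that this is exact-match counting after lowercasing, not the embedding approach, so paraphrases ("wireless charging" vs "Qi charging") are counted separately. The function name is illustrative.

```python
from collections import Counter
from typing import Dict, List, Tuple


def feature_leaderboard(reviews: List[Dict], top_n: int = 10) -> List[Tuple[str, int]]:
    """Count how many reviews mention each feature request (exact match, case-insensitive)."""
    counts: Counter = Counter()
    for review in reviews:
        # A set ensures one review cannot vote twice for the same request.
        unique = {req.strip().lower() for req in review.get("featureRequests", [])}
        counts.update(unique)
    return counts.most_common(top_n)
```

If the long tail of this list is full of near-duplicates, that is the signal to switch to the clustering approach.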
Competitor comparison#
If you scrape your competitors' reviews too, you can compare sentiment side-by-side. "Our battery sentiment is +0.45; competitor X is +0.12" is actionable. "Their support sentiment beats ours by 30 points; here are the specific complaints driving it" is gold for an ops planning meeting.
Cost math#
Per-review extraction cost on Runo:
- ~1 review-page extraction = 1 request ≈ $0.001 (Pro tier effective rate)
- A typical product has 100-500 reviews across 5-25 pages
- Refresh weekly (only fetch new pages since last scrape)
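For 100 products tracked weekly, the upper bound is ~10K review pages/month (100 products × up to 25 pages × ~4 weekly refreshes), which is about $10/month on the Pro tier at $0.001 per page; fetching only new pages brings it lower. That's the entire pipeline cost for monitoring sentiment across 100 products. Compare to a sentiment analysis SaaS (Brandwatch, Sprout, etc.) at $1K-$5K/month.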
The build is a few hundred lines of Python plus a Postgres database. The complexity is in the dashboard, not the data layer.
Pitfalls#
A few things that bite people:
Reviewer bias toward extremes#
The 1-star and 5-star reviews are over-represented relative to the actual quality experience. A 4.7 average rating doesn't mean "everyone likes it"; it means "people who liked it AND people who hated it both bothered to write." The neutral middle is silent.
For sentiment monitoring, this is fine (you care about the loud opinions). For predicting customer satisfaction broadly, you need a corrective (e.g., trust survey data more than reviews for tail behavior).
Fake reviews#
Both fake-positive (paid) and fake-negative (competitor sabotage). Filters:
- High rating + no verified purchase + little reviewer history = likely fake-positive
- Cluster of similar 1-star reviews on a specific product within a short window from new accounts = likely fake-negative
- Detectable AI-generated text patterns (overly formal, generic, hits all the talking points)
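The first filter is simple enough to express as a rule. In this sketch, rating and verifiedPurchase come from the schema earlier in the post, while reviewerReviewCount is a hypothetical extra field you would need to extract, and both thresholds are illustrative:

```python
from typing import Dict


def looks_fake_positive(review: Dict, min_rating: float = 4.5, max_history: int = 1) -> bool:
    """Flag high-rated, unverified reviews from accounts with little history."""
    history = review.get("reviewerReviewCount")  # hypothetical field
    if history is None:
        # Unknown history is not evidence of a fake; do not flag.
        return False
    return (
        review.get("rating", 0) >= min_rating
        and not review.get("verifiedPurchase", False)
        and history <= max_history
    )
```

Treat the result as a flag for review or down-weighting rather than grounds for deletion; each condition on its own is common among genuine reviewers.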
For most use cases, the noise is tolerable. Fake reviews are roughly noise, not directional bias. For high-stakes decisions (like deciding whether to acquire a company based on review sentiment), invest in fake-detection.
Translating across languages#
International products get reviews in many languages. Two approaches:
- Translate to English at extraction time, run sentiment in English. Cheaper, but you lose nuance ("hella good" doesn't translate from English to anything else cleanly either).
- Run sentiment in the original language. Modern LLMs handle 30+ languages competently. Higher cost, better fidelity for nuance.
For an English-speaking product team consuming the dashboard, option 1 usually wins. For monitoring sentiment in markets you specifically operate in, option 2.
Reply threads#
Some platforms let merchants reply to reviews. The reply changes the meaning of the original. A 1-star review with a thoughtful merchant reply ("we shipped you a replacement, sorry for the trouble") signals different things than the same review with no response.
Extract replies as a separate field; surface "review with reply" vs "review without reply" in your dashboard.
What good looks like#
A working sentiment pipeline produces:
- Sentiment dashboard per product, per aspect, refreshed on each crawl, with weekly trends
- Top 10 specific complaints per product, with frequency and trend
- Top 10 feature requests per product
- Competitor comparison on the same axes
- Alerting on sentiment drops (e.g. notify product when an aspect drops below a threshold)
- Source links so anyone can drill down to the actual reviews behind a number
The first version takes a couple of weeks to ship if you're handling extraction with a scraping API like Runo and a few weeks longer if you're building the bypass + extraction stack yourself.
TL;DR#
- The schema is the product. Extract aspect-level sentiment, specific issues, feature requests, use case, not just one positive/negative bucket.
- LLM extraction handles aspect-based sentiment natively. Traditional NLP libraries can't separate "battery great, camera terrible" within one paragraph.
- Anchor sentiment vocabulary with example and hint fields. Add mixed as a category; it's where the actionable signal lives.
- Discover review URLs via crawl with follow_pattern; extract per page; aggregate per product.
- Cost at 100 products tracked weekly: ~$10/month on the Runo Pro tier. Compare to $1K-$5K/month for sentiment SaaS.
- Cluster issues and feature requests with embeddings + HDBSCAN to surface themes weekly.
- Watch for: reviewer bias toward extremes, fake reviews, translation tradeoffs, merchant replies that change context.