A production-grade Python scraper that collects ratings and customer reviews for local businesses (HVAC, plumbing, restaurants, anything in the Yelp/Angi/HomeAdvisor world) from Google Maps, Yelp, BBB, Angi, and HomeAdvisor — even from data-center IPs that those sites normally block.
input CSV (business name + city + state)
│
▼
review.runner ─── 8 rows in parallel ──┐
│ │
▼ ▼
PostgreSQL reviews table ◄──── 5 site scrapers per row
(UPSERT on row_idx) google_maps · yelp · bbb · angi · homeadvisor
│
▼
Phase 2: retry sweep (only blocked sites; partial UPDATE)
│
▼
Final CSV export at end of run → review_output.csv
Each of the five target sites has its own anti-bot system, and a naive scraper hits a wall within minutes. Here's the actual landscape and the technique that works for each:
| Site | What blocks you | What works in this repo |
|---|---|---|
| Google Maps | rate limits, Selenium fingerprint | SeleniumBase CDP Mode (activate_cdp_mode) — drives Chrome via DevTools Protocol with no navigator.webdriver tells. Verified extracts up to 100 reviews per place. |
| BBB | reCAPTCHA passive widget, IP reputation | CDP mode + direct bbb.org/search discovery. Letter-grade selector span.bpr-letter-grade mapped A+ → 5.0, F → 1.0. |
| Angi | obfuscated SERP ("pseudo text"), Cloudflare on profile pages | DuckDuckGo HTML SERP discovery → direct profile URL → CDP open. Profile pages embed full JSON-LD with aggregateRating and review[]. Robust against class-name churn. |
| HomeAdvisor | dead /c.{name}.{city}.{state}.html URL pattern, Cloudflare |
DuckDuckGo discovery for /rated.<slug>.<id>.html → CDP open → JSON-LD parse. |
| Yelp | DataDome blocks every datacenter IP server-wide (verified across 30+ vectors: classic UC, CDP, curl_cffi w/ 5 browser fingerprints, robots.txt, m.yelp.com, GraphQL, Wayback) | Three-layer fallback: (1) Yelp Academic Open Dataset offline lookup — 160K businesses, 7M reviews; (2) Yelp Fusion API (free 5000/day); (3) terminal NOT_FOUND. |
No proxy is enough on its own. The breakthroughs were per-site:
- Replacing Google site-search discovery (heavily rate-limited from cloud IPs) with DuckDuckGo HTML.
- Switching from CSS selectors to JSON-LD where available — sites change class names every quarter, but their schema.org markup stays stable for SEO.
- Recognising that Yelp simply cannot be scraped from cloud IPs and pivoting to the open dataset + Fusion API.
- 5 sites in parallel per row × N rows in parallel — default
MAX_PARALLEL_ROWS=8×SITES_PER_ROW=5= 40 concurrent Chrome sub-processes. Each row's site work runs in aThreadPoolExecutor; each row runs in anotherThreadPoolExecutor. Tunable inreview/config.py. - PostgreSQL is the source of truth.
reviewstable mirrors the CSV schema, withJSONBcolumns for review arrays so you can later query individual review attributes (yelp_reviews @> '[{"rating":5}]'). The CSV is exported once at the end — so you get both formats. - Crash-safe resume.
review/review_progress.jsonis checkpointed every 5 s. Kill the runner mid-run, restart it, and it picks up exactly where it left off. UPSERT-by-row-idx handles re-writes safely. - Two-phase scraping. Phase 1 walks every row once. Phase 2 retry-sweeps any site that returned
blocked/error(up to 3 passes), updating only those columns and leaving Phase-1 data intact. - Cleanup daemon. Reaps orphan Chromes / Xvfb every 5 min by walking
ppid==1and killing only those — active workers never get touched. - Per-site retry with rotation. SOCKS5 bridge starts in the worker thread, gives each retry a fresh outbound IP, classifies blocks vs not-found vs done.
- SSH-disconnect-safe launcher (
review/run_production.sh) — runs insidetmux, survives logout, supportsstatus/stopsubcommands. - Yelp dataset matcher with HVAC stop-word filtering — drops false matches like "Air Care" → "Prompt Care of Central Florida" while keeping legitimate fuzzy matches like "AAA Heating and Cooling Inc." → "AAA Heating & Cooling".
# 1) Clone + venv
git clone https://github.com/SafeerAbbas624/local-business-reviews-scraper.git
cd local-business-reviews-scraper
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
# 2) PostgreSQL (Ubuntu)
sudo apt install -y postgresql tmux
sudo -u postgres createuser -P scraper # set a password
sudo -u postgres createdb -O scraper hvac_reviews
# 3) Configure
cp .env.example .env
$EDITOR .env # fill in POSTGRES_PASSWORD, optional YELP_API_KEY
# 4) Provide your input list of businesses
# (CSV with at least: business_name, business_city, business_state)
cp /path/to/your/businesses.csv hvac_companies_cleaned.csv
# 5) Optional: enable Yelp via the Open Dataset (offline, free)
mkdir -p review/yelp_dataset && cd review/yelp_dataset
curl -L -o yelp_academic_dataset_business.json.gz \
https://archive.org/download/yelp_academic_dataset_review.json/yelp_academic_dataset_business.json.gz
curl -L -o yelp_academic_dataset_review.json.gz \
https://archive.org/download/yelp_academic_dataset_review.json/yelp_academic_dataset_review.json.gz
cd ../..
python -m review.yelp_dataset_loader build hvac_companies_cleaned.csv
# 6) Optional: SOCKS5 / HTTP proxies (recommended for >1000 rows)
echo "socks5://USER:PASS@HOST:PORT" > proxies.txt
# 7) Launch (runs in tmux, survives SSH disconnect)
bash review/run_production.shMonitor:
bash review/run_production.sh status # last 20 log lines + completed-row count
tmux attach -t review-scraper # live view (Ctrl-b d to detach)
tail -f review/review.log # raw log
bash review/run_production.sh stop # graceful SIGTERM → flush progresshvac_companies_cleaned.csv should have one row per business. The runner only needs three columns to find the business; the rest are passed through verbatim into the output:
| column | required | example |
|---|---|---|
business_name |
yes | AAA Heating and Cooling Inc. |
business_city |
yes | Portland |
business_state |
yes | OR |
address |
recommended | 123 Main St |
website_url |
recommended | https://example.com |
keyword, location, google_page, business_description, services_offered, contact_phone, contact_email, supply_location, logo_url |
optional | passed through |
PostgreSQL table reviews (one row per business):
CREATE TABLE reviews (
row_idx INTEGER PRIMARY KEY,
-- input columns (passed through)
business_name TEXT,
business_city TEXT,
business_state TEXT,
-- ... (all input columns mirrored)
-- per-site results
google_maps_rating REAL,
google_maps_reviews JSONB, -- [{"author","rating","date","text"}, ...]
yelp_rating REAL,
yelp_reviews JSONB,
bbb_rating REAL, -- letter grade → numeric (A+=5.0, F=1.0)
angi_rating REAL,
angi_reviews JSONB,
homeadvisor_rating REAL,
homeadvisor_reviews JSONB,
created_at TIMESTAMPTZ DEFAULT now(),
updated_at TIMESTAMPTZ DEFAULT now()
);Each *_reviews JSONB cell is an array of:
[
{"author": "Alice", "rating": 5.0, "date": "2024-08-12",
"text": "Quick service, fair price, would call again."},
...
]Final CSV export uses the same column names; review arrays are JSON-encoded into a single cell (1:1 with the DB).
Everything tunable lives in review/config.py:
| knob | default | what it does |
|---|---|---|
MAX_PARALLEL_ROWS |
8 | rows processed concurrently |
SITES_PER_ROW |
5 | site scrapers in parallel within a row |
SITE_HARD_TIMEOUT |
120 | seconds before killing a hung Chrome subprocess |
ROW_HARD_TIMEOUT |
2400 | hard cap per row (40 min) |
DEFAULT_MAX_RETRIES |
5 | retry attempts per site |
HEAVY_BLOCKER_RETRIES |
10 | for sites known to block more (Yelp/BBB/Angi) |
MAX_REVIEWS_PER_SITE |
100 | cap on reviews stored per business per site |
MEM_CLEANUP_INTERVAL |
300 | seconds between orphan Chrome reaps |
RETRY_MAX_PASSES |
3 | Phase 2 sweep iterations |
For 4-core / 8 GB boxes, drop MAX_PARALLEL_ROWS to 2.
For 32-core / 128 GB boxes with paid residential proxies, push it to 16.
local-business-reviews-scraper/
├── review/
│ ├── runner.py # main entrypoint: process_rows() with thread-pool
│ ├── browser.py # spawns Chrome subprocess per (row, site) attempt
│ ├── base_scraper.py # abstract base — search() + scrape() contract
│ ├── block_detect.py # CAPTCHA / Cloudflare interstitial detection
│ ├── cleanup_daemon.py # 5-min orphan-Chrome reaper
│ ├── config.py # all tunable knobs
│ ├── csv_io.py # CSV reader (input list)
│ ├── db.py # PostgreSQL: schema, UPSERT, partial-update, export
│ ├── progress.py # crash-safe per-row state on disk
│ ├── proxies.py # proxies.txt loader + thread-safe rotation
│ ├── yelp_dataset_loader.py # Yelp open-dataset matcher (offline)
│ ├── run_production.sh # tmux launcher + status/stop subcommands
│ └── sites/
│ ├── _common.py # CDP helpers, JSON-LD parser, DDG search
│ ├── google_maps.py # SeleniumBase UC + tabs/scroll for review cards
│ ├── yelp.py # 3-layer fallback (dataset → API → not_found)
│ ├── bbb.py # CDP + letter-grade selector
│ ├── angi.py # DDG discovery + JSON-LD
│ └── homeadvisor.py # DDG discovery + JSON-LD
├── socks_bridge.py # local HTTP-to-SOCKS5 bridge (Chrome lacks SOCKS5 auth)
├── requirements.txt
├── .env.example
└── README.md
This project exists because none of the standard scraping advice ("just use undetected-chromedriver", "rotate proxies") is enough for these sites in 2025. Specifics:
Classic selenium and undetected-chromedriver ship with detectable JavaScript hooks. seleniumbase's activate_cdp_mode() drives Chrome via the Chrome DevTools Protocol — same as Chrome itself talks to its devtools — with none of the WebDriver tells. CDP mode bypasses Cloudflare's bot challenge on Angi profile pages where uc_open_with_reconnect() got 403.
Yelp returns HTTP 403 even on /robots.txt from any datacenter IP. We confirmed this across:
- Selenium UC + CDP mode
curl_cffiimpersonating Chrome 124 / 120 / 110 / Safari / Edge- Direct connection (no proxy)
- Wayback Machine (no snapshots for the URLs we needed)
- Bing/DDG SERP scraping (don't surface yelp.com/biz links)
- Google Maps "Reviews from the web" (Google de-emphasizes Yelp on its own properties)
The only working paths from a cloud IP are the Yelp Fusion API (free 5000/day) and the Yelp Open Dataset (160K businesses, 7M reviews, regional coverage). This repo wires both.
Angi and HomeAdvisor change their CSS class names roughly every quarter (Reviews_reviewsContainer__1DSUP becomes Reviews_reviewsContainer__abc12). Their schema.org JSON-LD blocks stay stable because they're SEO-load-bearing — Google needs them. We parse those instead. See review/sites/_common.py:jsonld_extract_rating_and_reviews.
Google's site:angi.com "biz name" city state works once or twice per IP, then it's recaptcha-walled for an hour. DuckDuckGo's HTML site (html.duckduckgo.com/html/?q=…) tolerates many more queries from the same IP and indexes the same pages. We use it for Angi, HomeAdvisor, and Yelp profile-URL discovery. See _common.py:ddg_html_search.
Empirical numbers on the test box (18-core / 94 GB, single rotating SOCKS5 proxy):
| concurrency | rows / hour | total throughput |
|---|---|---|
MAX_PARALLEL_ROWS=4 |
36.7 | 880 / day |
MAX_PARALLEL_ROWS=8 |
61.6 | 1480 / day |
Per-row breakdown (typical, with proxy):
| site | typical wall time | failure rate |
|---|---|---|
| google_maps | 60 – 240 s (slowest tail) | ~10 % time out |
| yelp (dataset path) | ~5 s | n/a — offline |
| bbb | 30 – 90 s | ~0 % (after CDP mode) |
| angi | 20 – 60 s | ~5 % (Cloudflare on profile) |
| homeadvisor | 20 – 60 s | ~0 % |
Ceilings and what to do about them:
- Going above
MAX_PARALLEL_ROWS=8on a single rotating proxy account starts to trip per-account rate limits → retry rate climbs and net throughput plateaus. Add a second proxy account or upgrade to residential proxies before pushing higher. - Roughly 30 % of nationwide rows are not-found on
homeadvisorandbbb— that's correct (most small HVAC businesses don't have BBB/HA profiles), not a scraper bug.
-- progress so far
SELECT COUNT(*) FROM reviews;
-- top-rated businesses by Google Maps in TX
SELECT business_name, business_city, google_maps_rating
FROM reviews
WHERE business_state = 'TX' AND google_maps_rating IS NOT NULL
ORDER BY google_maps_rating DESC LIMIT 20;
-- find businesses with 5-star Yelp reviews (querying nested JSONB)
SELECT business_name, jsonb_array_length(yelp_reviews) AS n_reviews
FROM reviews
WHERE yelp_reviews @> '[{"rating": 5.0}]';
-- per-site coverage
SELECT
COUNT(*) FILTER (WHERE google_maps_rating IS NOT NULL) AS gm_done,
COUNT(*) FILTER (WHERE bbb_rating IS NOT NULL) AS bbb_done,
COUNT(*) FILTER (WHERE angi_rating IS NOT NULL) AS angi_done,
COUNT(*) FILTER (WHERE homeadvisor_rating IS NOT NULL) AS ha_done,
COUNT(*) FILTER (WHERE yelp_rating IS NOT NULL) AS yelp_done
FROM reviews;PRs welcome. The parts most likely to need ongoing maintenance:
review/sites/google_maps.py— Google Maps DOM changes occasionally; the tab-click + scroll loop may need new selectors.review/sites/_common.py:cdp_is_blocked— block-page heuristics drift.- New sites: subclass
SiteScraper, register it inreview/sites/__init__.py:SITE_REGISTRY, add columns inreview/db.py.
MIT — do whatever you want, just don't blame me when Yelp adds new defenses next quarter.