Local Business Reviews Scraper

A production-grade Python scraper that collects ratings and customer reviews for local businesses (HVAC, plumbing, restaurants, anything in the Yelp/Angi/HomeAdvisor world) from Google Maps, Yelp, BBB, Angi, and HomeAdvisor — even from data-center IPs that those sites normally block.

input CSV (business name + city + state)
        │
        ▼
   review.runner   ─── 8 rows in parallel ──┐
        │                                    │
        ▼                                    ▼
   PostgreSQL  reviews table   ◄──── 5 site scrapers per row
   (UPSERT on row_idx)              google_maps · yelp · bbb · angi · homeadvisor
        │
        ▼
   Phase 2: retry sweep (only blocked sites; partial UPDATE)
        │
        ▼
   Final CSV export at end of run  →  review_output.csv

What's hard about this and how this repo solves it

Each of the five target sites has its own anti-bot system, and a naive scraper hits a wall within minutes. Here's the actual landscape and the technique that works for each:

Site	What blocks you	What works in this repo
Google Maps	rate limits, Selenium fingerprint	SeleniumBase CDP Mode (`activate_cdp_mode`) — drives Chrome via DevTools Protocol with no `navigator.webdriver` tells. Verified extracts up to 100 reviews per place.
BBB	reCAPTCHA passive widget, IP reputation	CDP mode + direct `bbb.org/search` discovery. Letter-grade selector `span.bpr-letter-grade` mapped A+ → 5.0, F → 1.0.
Angi	obfuscated SERP ("pseudo text"), Cloudflare on profile pages	DuckDuckGo HTML SERP discovery → direct profile URL → CDP open. Profile pages embed full JSON-LD with `aggregateRating` and `review[]`. Robust against class-name churn.
HomeAdvisor	dead `/c.{name}.{city}.{state}.html` URL pattern, Cloudflare	DuckDuckGo discovery for `/rated.<slug>.<id>.html` → CDP open → JSON-LD parse.
Yelp	DataDome blocks every datacenter IP server-wide (verified across 30+ vectors: classic UC, CDP, curl_cffi w/ 5 browser fingerprints, robots.txt, m.yelp.com, GraphQL, Wayback)	Three-layer fallback: (1) Yelp Academic Open Dataset offline lookup — 160K businesses, 7M reviews; (2) Yelp Fusion API (free 5000/day); (3) terminal NOT_FOUND.

No proxy is enough on its own. The breakthroughs were per-site:

Replacing Google site-search discovery (heavily rate-limited from cloud IPs) with DuckDuckGo HTML.
Switching from CSS selectors to JSON-LD where available — sites change class names every quarter, but their schema.org markup stays stable for SEO.
Recognising that Yelp simply cannot be scraped from cloud IPs and pivoting to the open dataset + Fusion API.

Features

5 sites in parallel per row × N rows in parallel — default MAX_PARALLEL_ROWS=8 × SITES_PER_ROW=5 = 40 concurrent Chrome sub-processes. Each row's site work runs in a ThreadPoolExecutor; each row runs in another ThreadPoolExecutor. Tunable in review/config.py.
PostgreSQL is the source of truth. reviews table mirrors the CSV schema, with JSONB columns for review arrays so you can later query individual review attributes (yelp_reviews @> '[{"rating":5}]'). The CSV is exported once at the end — so you get both formats.
Crash-safe resume. review/review_progress.json is checkpointed every 5 s. Kill the runner mid-run, restart it, and it picks up exactly where it left off. UPSERT-by-row-idx handles re-writes safely.
Two-phase scraping. Phase 1 walks every row once. Phase 2 retry-sweeps any site that returned blocked / error (up to 3 passes), updating only those columns and leaving Phase-1 data intact.
Cleanup daemon. Reaps orphan Chromes / Xvfb every 5 min by walking ppid==1 and killing only those — active workers never get touched.
Per-site retry with rotation. SOCKS5 bridge starts in the worker thread, gives each retry a fresh outbound IP, classifies blocks vs not-found vs done.
SSH-disconnect-safe launcher (review/run_production.sh) — runs inside tmux, survives logout, supports status / stop subcommands.
Yelp dataset matcher with HVAC stop-word filtering — drops false matches like "Air Care" → "Prompt Care of Central Florida" while keeping legitimate fuzzy matches like "AAA Heating and Cooling Inc." → "AAA Heating & Cooling".

Quick start

# 1) Clone + venv
git clone https://github.com/SafeerAbbas624/local-business-reviews-scraper.git
cd local-business-reviews-scraper
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt

# 2) PostgreSQL (Ubuntu)
sudo apt install -y postgresql tmux
sudo -u postgres createuser -P scraper            # set a password
sudo -u postgres createdb -O scraper hvac_reviews

# 3) Configure
cp .env.example .env
$EDITOR .env                                       # fill in POSTGRES_PASSWORD, optional YELP_API_KEY

# 4) Provide your input list of businesses
#    (CSV with at least: business_name, business_city, business_state)
cp /path/to/your/businesses.csv hvac_companies_cleaned.csv

# 5) Optional: enable Yelp via the Open Dataset (offline, free)
mkdir -p review/yelp_dataset && cd review/yelp_dataset
curl -L -o yelp_academic_dataset_business.json.gz \
   https://archive.org/download/yelp_academic_dataset_review.json/yelp_academic_dataset_business.json.gz
curl -L -o yelp_academic_dataset_review.json.gz \
   https://archive.org/download/yelp_academic_dataset_review.json/yelp_academic_dataset_review.json.gz
cd ../..
python -m review.yelp_dataset_loader build hvac_companies_cleaned.csv

# 6) Optional: SOCKS5 / HTTP proxies (recommended for >1000 rows)
echo "socks5://USER:PASS@HOST:PORT" > proxies.txt

# 7) Launch (runs in tmux, survives SSH disconnect)
bash review/run_production.sh

Monitor:

bash review/run_production.sh status      # last 20 log lines + completed-row count
tmux attach -t review-scraper             # live view (Ctrl-b d to detach)
tail -f review/review.log                 # raw log
bash review/run_production.sh stop        # graceful SIGTERM → flush progress

Input format

hvac_companies_cleaned.csv should have one row per business. The runner only needs three columns to find the business; the rest are passed through verbatim into the output:

column	required	example
`business_name`	yes	`AAA Heating and Cooling Inc.`
`business_city`	yes	`Portland`
`business_state`	yes	`OR`
`address`	recommended	`123 Main St`
`website_url`	recommended	`https://example.com`
`keyword`, `location`, `google_page`, `business_description`, `services_offered`, `contact_phone`, `contact_email`, `supply_location`, `logo_url`	optional	passed through

Output schema

PostgreSQL table reviews (one row per business):

CREATE TABLE reviews (
    row_idx              INTEGER PRIMARY KEY,
    -- input columns (passed through)
    business_name        TEXT,
    business_city        TEXT,
    business_state       TEXT,
    -- ... (all input columns mirrored)

    -- per-site results
    google_maps_rating   REAL,
    google_maps_reviews  JSONB,    -- [{"author","rating","date","text"}, ...]
    yelp_rating          REAL,
    yelp_reviews         JSONB,
    bbb_rating           REAL,     -- letter grade → numeric (A+=5.0, F=1.0)
    angi_rating          REAL,
    angi_reviews         JSONB,
    homeadvisor_rating   REAL,
    homeadvisor_reviews  JSONB,

    created_at           TIMESTAMPTZ DEFAULT now(),
    updated_at           TIMESTAMPTZ DEFAULT now()
);

Each *_reviews JSONB cell is an array of:

[
  {"author": "Alice", "rating": 5.0, "date": "2024-08-12",
   "text": "Quick service, fair price, would call again."},
  ...
]

Final CSV export uses the same column names; review arrays are JSON-encoded into a single cell (1:1 with the DB).

Configuration cheat-sheet

Everything tunable lives in review/config.py:

knob	default	what it does
`MAX_PARALLEL_ROWS`	8	rows processed concurrently
`SITES_PER_ROW`	5	site scrapers in parallel within a row
`SITE_HARD_TIMEOUT`	120	seconds before killing a hung Chrome subprocess
`ROW_HARD_TIMEOUT`	2400	hard cap per row (40 min)
`DEFAULT_MAX_RETRIES`	5	retry attempts per site
`HEAVY_BLOCKER_RETRIES`	10	for sites known to block more (Yelp/BBB/Angi)
`MAX_REVIEWS_PER_SITE`	100	cap on reviews stored per business per site
`MEM_CLEANUP_INTERVAL`	300	seconds between orphan Chrome reaps
`RETRY_MAX_PASSES`	3	Phase 2 sweep iterations

For 4-core / 8 GB boxes, drop MAX_PARALLEL_ROWS to 2. For 32-core / 128 GB boxes with paid residential proxies, push it to 16.

Project layout

local-business-reviews-scraper/
├── review/
│   ├── runner.py               # main entrypoint: process_rows() with thread-pool
│   ├── browser.py              # spawns Chrome subprocess per (row, site) attempt
│   ├── base_scraper.py         # abstract base — search() + scrape() contract
│   ├── block_detect.py         # CAPTCHA / Cloudflare interstitial detection
│   ├── cleanup_daemon.py       # 5-min orphan-Chrome reaper
│   ├── config.py               # all tunable knobs
│   ├── csv_io.py               # CSV reader (input list)
│   ├── db.py                   # PostgreSQL: schema, UPSERT, partial-update, export
│   ├── progress.py             # crash-safe per-row state on disk
│   ├── proxies.py              # proxies.txt loader + thread-safe rotation
│   ├── yelp_dataset_loader.py  # Yelp open-dataset matcher (offline)
│   ├── run_production.sh       # tmux launcher + status/stop subcommands
│   └── sites/
│       ├── _common.py          # CDP helpers, JSON-LD parser, DDG search
│       ├── google_maps.py      # SeleniumBase UC + tabs/scroll for review cards
│       ├── yelp.py             # 3-layer fallback (dataset → API → not_found)
│       ├── bbb.py              # CDP + letter-grade selector
│       ├── angi.py             # DDG discovery + JSON-LD
│       └── homeadvisor.py      # DDG discovery + JSON-LD
├── socks_bridge.py             # local HTTP-to-SOCKS5 bridge (Chrome lacks SOCKS5 auth)
├── requirements.txt
├── .env.example
└── README.md

Anti-bot notes (the techniques)

This project exists because none of the standard scraping advice ("just use undetected-chromedriver", "rotate proxies") is enough for these sites in 2025. Specifics:

1. Selenium fingerprinting → SeleniumBase CDP Mode

Classic selenium and undetected-chromedriver ship with detectable JavaScript hooks. seleniumbase's activate_cdp_mode() drives Chrome via the Chrome DevTools Protocol — same as Chrome itself talks to its devtools — with none of the WebDriver tells. CDP mode bypasses Cloudflare's bot challenge on Angi profile pages where uc_open_with_reconnect() got 403.

2. Datacenter-IP blocks → DataDome / Cloudflare

Yelp returns HTTP 403 even on /robots.txt from any datacenter IP. We confirmed this across:

Selenium UC + CDP mode
curl_cffi impersonating Chrome 124 / 120 / 110 / Safari / Edge
Direct connection (no proxy)
Wayback Machine (no snapshots for the URLs we needed)
Bing/DDG SERP scraping (don't surface yelp.com/biz links)
Google Maps "Reviews from the web" (Google de-emphasizes Yelp on its own properties)

The only working paths from a cloud IP are the Yelp Fusion API (free 5000/day) and the Yelp Open Dataset (160K businesses, 7M reviews, regional coverage). This repo wires both.

3. Class-hash churn → JSON-LD over CSS selectors

Angi and HomeAdvisor change their CSS class names roughly every quarter (Reviews_reviewsContainer__1DSUP becomes Reviews_reviewsContainer__abc12). Their schema.org JSON-LD blocks stay stable because they're SEO-load-bearing — Google needs them. We parse those instead. See review/sites/_common.py:jsonld_extract_rating_and_reviews.

4. SERP discovery → DuckDuckGo HTML

Google's site:angi.com "biz name" city state works once or twice per IP, then it's recaptcha-walled for an hour. DuckDuckGo's HTML site (html.duckduckgo.com/html/?q=…) tolerates many more queries from the same IP and indexes the same pages. We use it for Angi, HomeAdvisor, and Yelp profile-URL discovery. See _common.py:ddg_html_search.

Performance characteristics

Empirical numbers on the test box (18-core / 94 GB, single rotating SOCKS5 proxy):

concurrency	rows / hour	total throughput
`MAX_PARALLEL_ROWS=4`	36.7	880 / day
`MAX_PARALLEL_ROWS=8`	61.6	1480 / day

Per-row breakdown (typical, with proxy):

site	typical wall time	failure rate
google_maps	60 – 240 s (slowest tail)	~10 % time out
yelp (dataset path)	~5 s	n/a — offline
bbb	30 – 90 s	~0 % (after CDP mode)
angi	20 – 60 s	~5 % (Cloudflare on profile)
homeadvisor	20 – 60 s	~0 %

Ceilings and what to do about them:

Going above MAX_PARALLEL_ROWS=8 on a single rotating proxy account starts to trip per-account rate limits → retry rate climbs and net throughput plateaus. Add a second proxy account or upgrade to residential proxies before pushing higher.
Roughly 30 % of nationwide rows are not-found on homeadvisor and bbb — that's correct (most small HVAC businesses don't have BBB/HA profiles), not a scraper bug.

Useful queries

-- progress so far
SELECT COUNT(*) FROM reviews;

-- top-rated businesses by Google Maps in TX
SELECT business_name, business_city, google_maps_rating
FROM reviews
WHERE business_state = 'TX' AND google_maps_rating IS NOT NULL
ORDER BY google_maps_rating DESC LIMIT 20;

-- find businesses with 5-star Yelp reviews (querying nested JSONB)
SELECT business_name, jsonb_array_length(yelp_reviews) AS n_reviews
FROM reviews
WHERE yelp_reviews @> '[{"rating": 5.0}]';

-- per-site coverage
SELECT
  COUNT(*) FILTER (WHERE google_maps_rating IS NOT NULL) AS gm_done,
  COUNT(*) FILTER (WHERE bbb_rating IS NOT NULL)         AS bbb_done,
  COUNT(*) FILTER (WHERE angi_rating IS NOT NULL)        AS angi_done,
  COUNT(*) FILTER (WHERE homeadvisor_rating IS NOT NULL) AS ha_done,
  COUNT(*) FILTER (WHERE yelp_rating IS NOT NULL)        AS yelp_done
FROM reviews;

Contributing

PRs welcome. The parts most likely to need ongoing maintenance:

review/sites/google_maps.py — Google Maps DOM changes occasionally; the tab-click + scroll loop may need new selectors.
review/sites/_common.py:cdp_is_blocked — block-page heuristics drift.
New sites: subclass SiteScraper, register it in review/sites/__init__.py:SITE_REGISTRY, add columns in review/db.py.

License

MIT — do whatever you want, just don't blame me when Yelp adds new defenses next quarter.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Local Business Reviews Scraper

What's hard about this and how this repo solves it

Features

Quick start

Input format

Output schema

Configuration cheat-sheet

Project layout

Anti-bot notes (the techniques)

1. Selenium fingerprinting → SeleniumBase CDP Mode

2. Datacenter-IP blocks → DataDome / Cloudflare

3. Class-hash churn → JSON-LD over CSS selectors

4. SERP discovery → DuckDuckGo HTML

Performance characteristics

Useful queries

Contributing

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
review		review
.env.example		.env.example
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt
socks_bridge.py		socks_bridge.py

Folders and files

Latest commit

History

Repository files navigation

Local Business Reviews Scraper

What's hard about this and how this repo solves it

Features

Quick start

Input format

Output schema

Configuration cheat-sheet

Project layout

Anti-bot notes (the techniques)

1. Selenium fingerprinting → SeleniumBase CDP Mode

2. Datacenter-IP blocks → DataDome / Cloudflare

3. Class-hash churn → JSON-LD over CSS selectors

4. SERP discovery → DuckDuckGo HTML

Performance characteristics

Useful queries

Contributing

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages