Skip to content

SafeerAbbas624/local-business-reviews-scraper

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Local Business Reviews Scraper

A production-grade Python scraper that collects ratings and customer reviews for local businesses (HVAC, plumbing, restaurants, anything in the Yelp/Angi/HomeAdvisor world) from Google Maps, Yelp, BBB, Angi, and HomeAdvisor — even from data-center IPs that those sites normally block.

Python 3.10+ PostgreSQL License: MIT

input CSV (business name + city + state)
        │
        ▼
   review.runner   ─── 8 rows in parallel ──┐
        │                                    │
        ▼                                    ▼
   PostgreSQL  reviews table   ◄──── 5 site scrapers per row
   (UPSERT on row_idx)              google_maps · yelp · bbb · angi · homeadvisor
        │
        ▼
   Phase 2: retry sweep (only blocked sites; partial UPDATE)
        │
        ▼
   Final CSV export at end of run  →  review_output.csv

What's hard about this and how this repo solves it

Each of the five target sites has its own anti-bot system, and a naive scraper hits a wall within minutes. Here's the actual landscape and the technique that works for each:

Site What blocks you What works in this repo
Google Maps rate limits, Selenium fingerprint SeleniumBase CDP Mode (activate_cdp_mode) — drives Chrome via DevTools Protocol with no navigator.webdriver tells. Verified extracts up to 100 reviews per place.
BBB reCAPTCHA passive widget, IP reputation CDP mode + direct bbb.org/search discovery. Letter-grade selector span.bpr-letter-grade mapped A+ → 5.0, F → 1.0.
Angi obfuscated SERP ("pseudo text"), Cloudflare on profile pages DuckDuckGo HTML SERP discovery → direct profile URL → CDP open. Profile pages embed full JSON-LD with aggregateRating and review[]. Robust against class-name churn.
HomeAdvisor dead /c.{name}.{city}.{state}.html URL pattern, Cloudflare DuckDuckGo discovery for /rated.<slug>.<id>.html → CDP open → JSON-LD parse.
Yelp DataDome blocks every datacenter IP server-wide (verified across 30+ vectors: classic UC, CDP, curl_cffi w/ 5 browser fingerprints, robots.txt, m.yelp.com, GraphQL, Wayback) Three-layer fallback: (1) Yelp Academic Open Dataset offline lookup — 160K businesses, 7M reviews; (2) Yelp Fusion API (free 5000/day); (3) terminal NOT_FOUND.

No proxy is enough on its own. The breakthroughs were per-site:

  • Replacing Google site-search discovery (heavily rate-limited from cloud IPs) with DuckDuckGo HTML.
  • Switching from CSS selectors to JSON-LD where available — sites change class names every quarter, but their schema.org markup stays stable for SEO.
  • Recognising that Yelp simply cannot be scraped from cloud IPs and pivoting to the open dataset + Fusion API.

Features

  • 5 sites in parallel per row × N rows in parallel — default MAX_PARALLEL_ROWS=8 × SITES_PER_ROW=5 = 40 concurrent Chrome sub-processes. Each row's site work runs in a ThreadPoolExecutor; each row runs in another ThreadPoolExecutor. Tunable in review/config.py.
  • PostgreSQL is the source of truth. reviews table mirrors the CSV schema, with JSONB columns for review arrays so you can later query individual review attributes (yelp_reviews @> '[{"rating":5}]'). The CSV is exported once at the end — so you get both formats.
  • Crash-safe resume. review/review_progress.json is checkpointed every 5 s. Kill the runner mid-run, restart it, and it picks up exactly where it left off. UPSERT-by-row-idx handles re-writes safely.
  • Two-phase scraping. Phase 1 walks every row once. Phase 2 retry-sweeps any site that returned blocked / error (up to 3 passes), updating only those columns and leaving Phase-1 data intact.
  • Cleanup daemon. Reaps orphan Chromes / Xvfb every 5 min by walking ppid==1 and killing only those — active workers never get touched.
  • Per-site retry with rotation. SOCKS5 bridge starts in the worker thread, gives each retry a fresh outbound IP, classifies blocks vs not-found vs done.
  • SSH-disconnect-safe launcher (review/run_production.sh) — runs inside tmux, survives logout, supports status / stop subcommands.
  • Yelp dataset matcher with HVAC stop-word filtering — drops false matches like "Air Care" → "Prompt Care of Central Florida" while keeping legitimate fuzzy matches like "AAA Heating and Cooling Inc." → "AAA Heating & Cooling".

Quick start

# 1) Clone + venv
git clone https://github.com/SafeerAbbas624/local-business-reviews-scraper.git
cd local-business-reviews-scraper
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt

# 2) PostgreSQL (Ubuntu)
sudo apt install -y postgresql tmux
sudo -u postgres createuser -P scraper            # set a password
sudo -u postgres createdb -O scraper hvac_reviews

# 3) Configure
cp .env.example .env
$EDITOR .env                                       # fill in POSTGRES_PASSWORD, optional YELP_API_KEY

# 4) Provide your input list of businesses
#    (CSV with at least: business_name, business_city, business_state)
cp /path/to/your/businesses.csv hvac_companies_cleaned.csv

# 5) Optional: enable Yelp via the Open Dataset (offline, free)
mkdir -p review/yelp_dataset && cd review/yelp_dataset
curl -L -o yelp_academic_dataset_business.json.gz \
   https://archive.org/download/yelp_academic_dataset_review.json/yelp_academic_dataset_business.json.gz
curl -L -o yelp_academic_dataset_review.json.gz \
   https://archive.org/download/yelp_academic_dataset_review.json/yelp_academic_dataset_review.json.gz
cd ../..
python -m review.yelp_dataset_loader build hvac_companies_cleaned.csv

# 6) Optional: SOCKS5 / HTTP proxies (recommended for >1000 rows)
echo "socks5://USER:PASS@HOST:PORT" > proxies.txt

# 7) Launch (runs in tmux, survives SSH disconnect)
bash review/run_production.sh

Monitor:

bash review/run_production.sh status      # last 20 log lines + completed-row count
tmux attach -t review-scraper             # live view (Ctrl-b d to detach)
tail -f review/review.log                 # raw log
bash review/run_production.sh stop        # graceful SIGTERM → flush progress

Input format

hvac_companies_cleaned.csv should have one row per business. The runner only needs three columns to find the business; the rest are passed through verbatim into the output:

column required example
business_name yes AAA Heating and Cooling Inc.
business_city yes Portland
business_state yes OR
address recommended 123 Main St
website_url recommended https://example.com
keyword, location, google_page, business_description, services_offered, contact_phone, contact_email, supply_location, logo_url optional passed through

Output schema

PostgreSQL table reviews (one row per business):

CREATE TABLE reviews (
    row_idx              INTEGER PRIMARY KEY,
    -- input columns (passed through)
    business_name        TEXT,
    business_city        TEXT,
    business_state       TEXT,
    -- ... (all input columns mirrored)

    -- per-site results
    google_maps_rating   REAL,
    google_maps_reviews  JSONB,    -- [{"author","rating","date","text"}, ...]
    yelp_rating          REAL,
    yelp_reviews         JSONB,
    bbb_rating           REAL,     -- letter grade → numeric (A+=5.0, F=1.0)
    angi_rating          REAL,
    angi_reviews         JSONB,
    homeadvisor_rating   REAL,
    homeadvisor_reviews  JSONB,

    created_at           TIMESTAMPTZ DEFAULT now(),
    updated_at           TIMESTAMPTZ DEFAULT now()
);

Each *_reviews JSONB cell is an array of:

[
  {"author": "Alice", "rating": 5.0, "date": "2024-08-12",
   "text": "Quick service, fair price, would call again."},
  ...
]

Final CSV export uses the same column names; review arrays are JSON-encoded into a single cell (1:1 with the DB).


Configuration cheat-sheet

Everything tunable lives in review/config.py:

knob default what it does
MAX_PARALLEL_ROWS 8 rows processed concurrently
SITES_PER_ROW 5 site scrapers in parallel within a row
SITE_HARD_TIMEOUT 120 seconds before killing a hung Chrome subprocess
ROW_HARD_TIMEOUT 2400 hard cap per row (40 min)
DEFAULT_MAX_RETRIES 5 retry attempts per site
HEAVY_BLOCKER_RETRIES 10 for sites known to block more (Yelp/BBB/Angi)
MAX_REVIEWS_PER_SITE 100 cap on reviews stored per business per site
MEM_CLEANUP_INTERVAL 300 seconds between orphan Chrome reaps
RETRY_MAX_PASSES 3 Phase 2 sweep iterations

For 4-core / 8 GB boxes, drop MAX_PARALLEL_ROWS to 2. For 32-core / 128 GB boxes with paid residential proxies, push it to 16.


Project layout

local-business-reviews-scraper/
├── review/
│   ├── runner.py               # main entrypoint: process_rows() with thread-pool
│   ├── browser.py              # spawns Chrome subprocess per (row, site) attempt
│   ├── base_scraper.py         # abstract base — search() + scrape() contract
│   ├── block_detect.py         # CAPTCHA / Cloudflare interstitial detection
│   ├── cleanup_daemon.py       # 5-min orphan-Chrome reaper
│   ├── config.py               # all tunable knobs
│   ├── csv_io.py               # CSV reader (input list)
│   ├── db.py                   # PostgreSQL: schema, UPSERT, partial-update, export
│   ├── progress.py             # crash-safe per-row state on disk
│   ├── proxies.py              # proxies.txt loader + thread-safe rotation
│   ├── yelp_dataset_loader.py  # Yelp open-dataset matcher (offline)
│   ├── run_production.sh       # tmux launcher + status/stop subcommands
│   └── sites/
│       ├── _common.py          # CDP helpers, JSON-LD parser, DDG search
│       ├── google_maps.py      # SeleniumBase UC + tabs/scroll for review cards
│       ├── yelp.py             # 3-layer fallback (dataset → API → not_found)
│       ├── bbb.py              # CDP + letter-grade selector
│       ├── angi.py             # DDG discovery + JSON-LD
│       └── homeadvisor.py      # DDG discovery + JSON-LD
├── socks_bridge.py             # local HTTP-to-SOCKS5 bridge (Chrome lacks SOCKS5 auth)
├── requirements.txt
├── .env.example
└── README.md

Anti-bot notes (the techniques)

This project exists because none of the standard scraping advice ("just use undetected-chromedriver", "rotate proxies") is enough for these sites in 2025. Specifics:

1. Selenium fingerprinting → SeleniumBase CDP Mode

Classic selenium and undetected-chromedriver ship with detectable JavaScript hooks. seleniumbase's activate_cdp_mode() drives Chrome via the Chrome DevTools Protocol — same as Chrome itself talks to its devtools — with none of the WebDriver tells. CDP mode bypasses Cloudflare's bot challenge on Angi profile pages where uc_open_with_reconnect() got 403.

2. Datacenter-IP blocks → DataDome / Cloudflare

Yelp returns HTTP 403 even on /robots.txt from any datacenter IP. We confirmed this across:

  • Selenium UC + CDP mode
  • curl_cffi impersonating Chrome 124 / 120 / 110 / Safari / Edge
  • Direct connection (no proxy)
  • Wayback Machine (no snapshots for the URLs we needed)
  • Bing/DDG SERP scraping (don't surface yelp.com/biz links)
  • Google Maps "Reviews from the web" (Google de-emphasizes Yelp on its own properties)

The only working paths from a cloud IP are the Yelp Fusion API (free 5000/day) and the Yelp Open Dataset (160K businesses, 7M reviews, regional coverage). This repo wires both.

3. Class-hash churn → JSON-LD over CSS selectors

Angi and HomeAdvisor change their CSS class names roughly every quarter (Reviews_reviewsContainer__1DSUP becomes Reviews_reviewsContainer__abc12). Their schema.org JSON-LD blocks stay stable because they're SEO-load-bearing — Google needs them. We parse those instead. See review/sites/_common.py:jsonld_extract_rating_and_reviews.

4. SERP discovery → DuckDuckGo HTML

Google's site:angi.com "biz name" city state works once or twice per IP, then it's recaptcha-walled for an hour. DuckDuckGo's HTML site (html.duckduckgo.com/html/?q=…) tolerates many more queries from the same IP and indexes the same pages. We use it for Angi, HomeAdvisor, and Yelp profile-URL discovery. See _common.py:ddg_html_search.


Performance characteristics

Empirical numbers on the test box (18-core / 94 GB, single rotating SOCKS5 proxy):

concurrency rows / hour total throughput
MAX_PARALLEL_ROWS=4 36.7 880 / day
MAX_PARALLEL_ROWS=8 61.6 1480 / day

Per-row breakdown (typical, with proxy):

site typical wall time failure rate
google_maps 60 – 240 s (slowest tail) ~10 % time out
yelp (dataset path) ~5 s n/a — offline
bbb 30 – 90 s ~0 % (after CDP mode)
angi 20 – 60 s ~5 % (Cloudflare on profile)
homeadvisor 20 – 60 s ~0 %

Ceilings and what to do about them:

  • Going above MAX_PARALLEL_ROWS=8 on a single rotating proxy account starts to trip per-account rate limits → retry rate climbs and net throughput plateaus. Add a second proxy account or upgrade to residential proxies before pushing higher.
  • Roughly 30 % of nationwide rows are not-found on homeadvisor and bbb — that's correct (most small HVAC businesses don't have BBB/HA profiles), not a scraper bug.

Useful queries

-- progress so far
SELECT COUNT(*) FROM reviews;

-- top-rated businesses by Google Maps in TX
SELECT business_name, business_city, google_maps_rating
FROM reviews
WHERE business_state = 'TX' AND google_maps_rating IS NOT NULL
ORDER BY google_maps_rating DESC LIMIT 20;

-- find businesses with 5-star Yelp reviews (querying nested JSONB)
SELECT business_name, jsonb_array_length(yelp_reviews) AS n_reviews
FROM reviews
WHERE yelp_reviews @> '[{"rating": 5.0}]';

-- per-site coverage
SELECT
  COUNT(*) FILTER (WHERE google_maps_rating IS NOT NULL) AS gm_done,
  COUNT(*) FILTER (WHERE bbb_rating IS NOT NULL)         AS bbb_done,
  COUNT(*) FILTER (WHERE angi_rating IS NOT NULL)        AS angi_done,
  COUNT(*) FILTER (WHERE homeadvisor_rating IS NOT NULL) AS ha_done,
  COUNT(*) FILTER (WHERE yelp_rating IS NOT NULL)        AS yelp_done
FROM reviews;

Contributing

PRs welcome. The parts most likely to need ongoing maintenance:

  • review/sites/google_maps.py — Google Maps DOM changes occasionally; the tab-click + scroll loop may need new selectors.
  • review/sites/_common.py:cdp_is_blocked — block-page heuristics drift.
  • New sites: subclass SiteScraper, register it in review/sites/__init__.py:SITE_REGISTRY, add columns in review/db.py.

License

MIT — do whatever you want, just don't blame me when Yelp adds new defenses next quarter.

About

Multi-site review scraper (Google Maps, Yelp, BBB, Angi, HomeAdvisor) — Python + SeleniumBase CDP mode + PostgreSQL + Yelp Open Dataset offline lookup. Bypasses DataDome and Cloudflare via curl_cffi TLS fingerprint impersonation.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors