feat: add 9 scrapers (52 → 61 stable sources) + docs support table#32

Merged
okkymabruri merged 6 commits into main from feature/source-expansion-2026-05-batch3
May 23, 2026

Conversation

@okkymabruri

Owner

Summary

Adds 9 new scrapers bringing total stable sources from 52 → 61.

New Sources

| Source | Mode | Notes |
|---|---|---|
| Fajar | search + latest | WordPress /?s= search; Sulawesi regional coverage |
| Mojok | search + latest | WordPress search; satire/opinion/youth angle |
| Grid | search + latest | /search?q= endpoint; lifestyle/pop culture |
| Hipwee | search + latest | WordPress search; youth demographic angle |
| Jakarta Globe | search + latest | /search/{keyword}; English coverage |
| RMOL | search + latest | /tag/{keyword} tag-based search; political commentary |
| CNA Indonesia | latest only | Algolia JS search unsupported; /terbaru |
| Niaga.Asia | search + latest | WordPress search; business/economy; Kalimantan |
| Jakarta Selaras | search + latest | Custom CMS; RSS/sitemap keyword filtering |
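
For readers unfamiliar with the search patterns in the table, here is a minimal sketch of how each URL shape could be built. The function names and the `example.com` base URL are hypothetical and not taken from this repository, and the keyword encoding each site actually expects is an assumption.

```python
from urllib.parse import quote, urlencode


def wordpress_search_url(base_url: str, keyword: str) -> str:
    """WordPress-style search: <base>/?s=<keyword> (hypothetical helper)."""
    return f"{base_url.rstrip('/')}/?{urlencode({'s': keyword})}"


def query_search_url(base_url: str, keyword: str) -> str:
    """Query-string search: <base>/search?q=<keyword> (hypothetical helper)."""
    return f"{base_url.rstrip('/')}/search?{urlencode({'q': keyword})}"


def path_search_url(base_url: str, keyword: str, prefix: str = "search") -> str:
    """Path-based search: <base>/<prefix>/<keyword>, e.g. /search/... or /tag/...

    Assumption: the keyword is percent-encoded; a real site may expect
    hyphens or another separator instead.
    """
    return f"{base_url.rstrip('/')}/{prefix}/{quote(keyword, safe='')}"
```

Keeping one small builder per URL shape means a latest-only source such as CNA Indonesia simply has no search builder at all.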

Fixes Applied

  • Mojok link filtering: Added /cdn-cgi/, /login/, /wp- skips; tightened regex
  • Hipwee link filtering: Added /cdn-cgi/, /dashboard/, /profile/, /login/ skips
  • Fajar category extraction: Fixed to use meta article:section not URL year
  • RMOL latest pagination: Returns None after page 1 to avoid repeated pages
  • RMOL registry name: Changed to "RMOL" (was "RM.ID" which conflicted with rmid)
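
The link-filtering fixes above can be sketched as follows. This is an illustration only: `SKIP_FRAGMENTS` and `is_article_link` are hypothetical names, the skip list mirrors the Mojok entry, and treating "2-segment" as "at least two path segments" is an assumption rather than the scraper's actual rule.

```python
from urllib.parse import urlparse

# Path fragments that mark non-article links (taken from the Mojok fix above).
SKIP_FRAGMENTS = ("/cdn-cgi/", "/login/", "/wp-")


def is_article_link(href: str, min_segments: int = 2) -> bool:
    """Return True if href looks like an article URL (hypothetical helper)."""
    path = urlparse(href).path
    # Drop CDN, login, and WordPress asset/admin links.
    if any(fragment in path for fragment in SKIP_FRAGMENTS):
        return False
    # Require enough non-empty path segments, e.g. /section/slug/.
    segments = [segment for segment in path.split("/") if segment]
    return len(segments) >= min_segments
```

Checking the parsed path rather than the raw href keeps query strings and hostnames from triggering a skip by accident.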

Docs

  • Replaced 2-column table with 5-column Source|Slug|Search|Latest|Notes table

Verification

  • Registry: 61 total, 0 validation issues
  • Ruff: All checks passed
  • Non-network tests: 109 passed, 1 skipped
  • Live smoke parsing verified for all new scrapers

- README.md: update Supported Websites from 48 to 52, add 4 new
  (BeritaSatu, Gatra, Harian Jogja, SWA), add KBR, remove stale
  Hukumonline entry, reorder alphabetically
- docs/index.md: update count to 52, add all 12 missing scrapers
  (Bali Post, DailySocial, Gatra, Harian Jogja, Kaltim Post, KBR,
  Pantau, Project Multatuli, SWA, VOA Indonesia, VOI.id, BeritaSatu)
- docs/getting-started.md: update example output to 52
- docs/troubleshooting.md: update scraper count to 52

Priority batch:
- Fajar (fajar): Sulawesi regional, WordPress search
- Mojok (mojok): youth/satire/opinion, WordPress search
- Grid (grid): lifestyle/pop culture, Kompas Gramedia network
- Hipwee (hipwee): youth demographic, WordPress search
- Jakarta Globe (jakartaglobe): English alternative, custom search

Stretch:
- RMOL (rmol): political commentary, tag-based search
- CNA Indonesia (cnaindonesia): SE Asia international, latest-only

Updated README.md, docs/index.md, docs/changelog.md.

- Mojok: add cdn-cgi/login/wp- skips, require 2-segment article paths
- Hipwee: add cdn-cgi/dashboard/profile/login/user skips, allow
  2-segment article URLs (narasi/slug, showbiz/slug)
- Fajar: fix category extraction from meta instead of URL year
- RMOL: build_latest_url returns None after page 1 (no validated
  paginated index), preventing repeated pages
- Registry: RMOL display name changed to 'RMOL' (was 'RM.ID')
- Docs: getting-started, troubleshooting, index counts updated to 59

- WordPress /?s= search + latest support
- Business/economy focus with Kalimantan coverage
- Updated README, docs, changelog counts to 60
…ble sources)

- Custom CMS with RSS keyword filtering for search mode
- Latest mode via /rss feed
- Article URL pattern: /detail/{id}/{slug}
- Replaced 2-column Source|Domain table with 5-column
  Source|Slug|Search|Latest|Notes table in docs/index.md
- Updated README count to 61

- jakartaselarascoid: fix AttributeError on self.current_keyword;
  store keyword during build_search_url, read in parse_article_links
- voi: fix parse_latest_article_links excluding all article URLs;
  condition was '/artikel/ not in href' which dropped every article
- viva: add null guards for all .find().get_text() calls;
  replace with select_one + explicit None checks
- suaramerdeka: fix build_latest_url page>1 sending search endpoint
  with no query; now uses /page/{page}/ for proper pagination
- tvrinews: fix category extraction for subdomain URLs;
  use urlparse hostname subdomain instead of base_url.replace()
- rmid: fix parse_article_links keyword filter using only
  self.keywords[0]; now uses _current_keyword set per-search
- suara: move seen_urls set outside pagination loop to preserve
  cross-page deduplication
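
Two of the fixes in this list follow patterns that are easy to get wrong, so a minimal sketch may help: keeping the `seen_urls` set outside the pagination loop (suara), and deriving a category from the hostname subdomain via `urlparse` (tvrinews). The function names, the `root_domain` parameter, and the `www` exclusion are hypothetical and do not reflect the scrapers' actual code.

```python
from typing import Iterable, List, Optional
from urllib.parse import urlparse


def collect_unique_links(pages: Iterable[Iterable[str]]) -> List[str]:
    """Deduplicate article URLs across all pages, preserving first-seen order."""
    seen_urls = set()  # created once, before the loop, so it spans every page
    unique: List[str] = []
    for page_links in pages:
        for url in page_links:
            if url in seen_urls:
                continue
            seen_urls.add(url)
            unique.append(url)
    return unique


def category_from_subdomain(url: str, root_domain: str) -> Optional[str]:
    """Return the subdomain label as a category, or None if there is none."""
    host = urlparse(url).hostname or ""
    suffix = "." + root_domain
    if host.endswith(suffix):
        subdomain = host[: -len(suffix)]
        # Assumption: "www" is not a meaningful category.
        if subdomain and subdomain != "www":
            return subdomain
    return None
```

If `seen_urls` were re-created inside the outer loop, an article listed on both page 1 and page 2 would be returned twice, which is the behavior the suara fix avoids.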
@okkymabruri okkymabruri merged commit 1279486 into main May 23, 2026
8 checks passed
@okkymabruri okkymabruri deleted the feature/source-expansion-2026-05-batch3 branch May 23, 2026 14:22