feat: add 9 scrapers (52 → 61 stable sources) + docs support table #32
Merged
- README.md: update Supported Websites from 48 to 52, add 4 new (BeritaSatu, Gatra, Harian Jogja, SWA), add KBR, remove stale Hukumonline entry, reorder alphabetically
- docs/index.md: update count to 52, add all 12 missing scrapers (Bali Post, DailySocial, Gatra, Harian Jogja, Kaltim Post, KBR, Pantau, Project Multatuli, SWA, VOA Indonesia, VOI.id, BeritaSatu)
- docs/getting-started.md: update example output to 52
- docs/troubleshooting.md: update scraper count to 52
Priority batch:
- Fajar (fajar): Sulawesi regional, WordPress search
- Mojok (mojok): youth/satire/opinion, WordPress search
- Grid (grid): lifestyle/pop culture, Kompas Gramedia network
- Hipwee (hipwee): youth demographic, WordPress search
- Jakarta Globe (jakartaglobe): English alternative, custom search
Stretch:
- RMOL (rmol): political commentary, tag-based search
- CNA Indonesia (cnaindonesia): SE Asia international, latest-only
Updated README.md, docs/index.md, docs/changelog.md.
- Mojok: add cdn-cgi/login/wp- skips, require 2-segment article paths
- Hipwee: add cdn-cgi/dashboard/profile/login/user skips, allow 2-segment article URLs (narasi/slug, showbiz/slug)
- Fajar: fix category extraction from meta instead of URL year
- RMOL: build_latest_url returns None after page 1 (no validated paginated index), preventing repeated pages
- Registry: RMOL display name changed to 'RMOL' (was 'RM.ID')
- Docs: getting-started, troubleshooting, index counts updated to 59
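The RMOL change above can be illustrated with a minimal sketch. It assumes a scraper whose `build_latest_url(page)` is called once per page by some driver loop that stops when it receives `None`. The class name, the `collect_latest_urls` helper, and the placeholder base URL are all hypothetical and are not taken from the repository.

```python
from typing import List, Optional


class RmolLatestSketch:
    """Hypothetical stand-in for the RMOL scraper; not the project's class."""

    base_url = "https://example.invalid"  # placeholder, not the real site URL

    def build_latest_url(self, page: int) -> Optional[str]:
        # No validated paginated index exists, so only page 1 is served.
        # Returning None tells the caller to stop instead of refetching
        # the same front page for page 2, 3, ...
        if page > 1:
            return None
        return f"{self.base_url}/"


def collect_latest_urls(scraper: RmolLatestSketch, max_pages: int = 5) -> List[str]:
    """Hypothetical driver loop: ask for each page until the scraper says stop."""
    urls: List[str] = []
    for page in range(1, max_pages + 1):
        url = scraper.build_latest_url(page)
        if url is None:
            break
        urls.append(url)
    return urls
```

With this contract, a request for five pages yields a single index URL rather than five copies of the same page.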
- WordPress /?s= search + latest support
- Business/economy focus with Kalimantan coverage
- Updated README, docs, changelog counts to 60
…ble sources)
- Custom CMS with RSS keyword filtering for search mode
- Latest mode via /rss feed
- Article URL pattern: /detail/{id}/{slug}
- Replaced 2-column Source|Domain table with 5-column
Source|Slug|Search|Latest|Notes table in docs/index.md
- Updated README count to 61
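A rough sketch of what "RSS keyword filtering for search mode" could look like, using only the standard library. The function name, the sample feed, and the choice to match the keyword against the item title and description are assumptions for illustration. The only detail taken from the commit is the `/detail/{id}/{slug}` article URL pattern.

```python
import re
import xml.etree.ElementTree as ET
from typing import List

# Article URL pattern from the commit message: /detail/{id}/{slug}
ARTICLE_PATH = re.compile(r"/detail/[^/]+/[^/?#]+")

# Made-up feed for demonstration only.
SAMPLE_FEED = """<rss version="2.0"><channel>
<item><title>Harga batu bara naik</title><link>https://example.invalid/detail/101/harga-batu-bara-naik</link><description>Ekonomi</description></item>
<item><title>Jadwal liga</title><link>https://example.invalid/detail/102/jadwal-liga</link><description>Olahraga</description></item>
<item><title>Batu bara</title><link>https://example.invalid/tag/batu-bara</link><description></description></item>
</channel></rss>"""


def filter_rss_items(rss_xml: str, keyword: str) -> List[str]:
    """Return article links whose title or description contains the keyword."""
    root = ET.fromstring(rss_xml)
    needle = keyword.casefold()
    links: List[str] = []
    for item in root.iter("item"):
        title = item.findtext("title") or ""
        description = item.findtext("description") or ""
        link = (item.findtext("link") or "").strip()
        # Skip feed entries that are not article pages (tags, categories, ...).
        if not ARTICLE_PATH.search(link):
            continue
        # Case-insensitive substring match stands in for a real search endpoint.
        if needle in f"{title} {description}".casefold():
            links.append(link)
    return links
```

Filtering the feed client-side means search results are limited to whatever the feed currently lists, which is a trade-off of this approach when a site exposes no search endpoint.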
- jakartaselarascoid: fix AttributeError on self.current_keyword;
store keyword during build_search_url, read in parse_article_links
- voi: fix parse_latest_article_links excluding all article URLs;
condition was '/artikel/ not in href' which dropped every article
- viva: add null guards for all .find().get_text() calls;
replace with select_one + explicit None checks
- suaramerdeka: fix build_latest_url page>1 sending search endpoint
with no query; now uses /page/{page}/ for proper pagination
- tvrinews: fix category extraction for subdomain URLs;
use urlparse hostname subdomain instead of base_url.replace()
- rmid: fix parse_article_links keyword filter using only
self.keywords[0]; now uses _current_keyword set per-search
- suara: move seen_urls set outside pagination loop to preserve
cross-page deduplication
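Two of the fixes listed above (tvrinews subdomain categories and suara cross-page deduplication) can be sketched as follows. Both helpers are hypothetical illustrations with made-up names and a placeholder domain; they show the general technique, not the code in the scrapers.

```python
from typing import Iterable, List, Optional
from urllib.parse import urlparse


def category_from_subdomain(url: str, root_domain: str) -> Optional[str]:
    """Derive a category from the hostname's subdomain via urlparse.

    Parsing the hostname avoids string replacement on the full URL, which
    breaks when the scheme, a "www" prefix, or the path differs from base_url.
    """
    host = (urlparse(url).hostname or "").lower()
    suffix = "." + root_domain.lower()
    if not host.endswith(suffix):
        return None  # bare root domain or an unrelated host
    subdomain = host[: -len(suffix)]
    if subdomain in ("", "www"):
        return None
    return subdomain


def collect_unique_links(pages: Iterable[Iterable[str]]) -> List[str]:
    """Deduplicate article URLs across all pages, keeping first-seen order."""
    seen_urls = set()  # created once, outside the pagination loop
    unique: List[str] = []
    for links in pages:
        for url in links:
            if url in seen_urls:
                continue
            seen_urls.add(url)
            unique.append(url)
    return unique
```

If `seen_urls` were recreated inside the outer loop, an article listed on both page 1 and page 2 would be returned twice, which is the behaviour the suara fix removes.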
Summary
Adds 9 new scrapers bringing total stable sources from 52 → 61.
New Sources
Fixes Applied
Docs
Verification