Convert webpage HTML into Markdown that is easier to use in RAG/LLM pipelines.
This repo has two parts:
- Web app (`index.html`) for manual conversion
- CLI (`cli/`) for scripted and batch workflows
- Converts HTML to Markdown (Turndown + GFM)
- Supports CSS selector targeting (`--selector` / selector input)
- Cleans common page noise (ads, nav, cookie banners, hidden elements)
- Optional media stripping with context placeholders
- Optional link stripping
- Table normalization/alignment
- Code fence language detection from HTML classes
- Metadata extraction (title/meta tags/canonical/Open Graph/Twitter + JSON-LD)
- Basic boilerplate dedupe in output
- Smart content fallback when converting full `body`
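
As an illustration of code fence language detection from HTML classes, here is a minimal Python sketch. It is not the project's implementation: the function name `detect_fence_language` and the class prefixes it recognizes (`language-`, `lang-`, `highlight-source-`) are assumptions based on common syntax-highlighter conventions.

```python
import re

# Class-name prefixes commonly used by syntax highlighters.
# These prefixes are an assumption, not taken from this repo.
_LANG_PATTERN = re.compile(r"^(?:language|lang|highlight-source)-([\w+#-]+)$")


def detect_fence_language(class_attr: str) -> str:
    """Return the language hinted at by an HTML class attribute, or ''."""
    for token in class_attr.split():
        match = _LANG_PATTERN.match(token.lower())
        if match:
            return match.group(1)
    return ""


if __name__ == "__main__":
    # A <code class="hljs language-python"> element would yield a python fence.
    print(detect_fence_language("hljs language-python"))
```

When no class matches, the sketch returns an empty string, which corresponds to emitting a fence with no language tag.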
- Web app fetching relies on public CORS proxies.
- Web app cannot reliably handle many JS-rendered, anti-bot, or authenticated pages.
- CLI `--render-js` needs Playwright installed and a browser binary available.
- This project does not do chunking, embedding, retrieval, or vector indexing.
Open `index.html` directly, or serve locally:
python -m http.server 8000
Then open http://localhost:8000.
cd cli
npm install
node bin/md4llm.js https://example.com
Optional JS-render support:
npm install playwright
npx playwright install chromium
# Basic conversion
md4llm https://example.com
# Extract only article content
md4llm https://example.com -s "article" -o output.md
# JSON output with metadata
md4llm https://example.com --meta --format json
# Strip links and media for cleaner embeddings
md4llm https://example.com --no-links --strip-media
# Batch mode
md4llm --batch urls.txt -o ./output/
# Use browser rendering for JS-heavy pages
md4llm https://example.com/docs --render-js-auto
- Align Tables
- Strip Media
- Smart Clean
- Extract Meta
- Keep Links
- Ctrl+Enter: Convert
- Ctrl+Shift+C: Copy output
- Ctrl+Shift+F: Fetch URL
- Ctrl+Shift+X: Clear input
- ?: Show help
- Esc: Close modal
{
"markdown": "...",
"metadata": {},
"sourceUrl": "https://example.com/page",
"selector": "article",
"timestamp": "2026-03-08T12:00:00.000Z",
"options": {},
"stats": {
"characters": 0,
"words": 0,
"lines": 0
}
}
Typical flow:
- Convert page(s) to Markdown with relevant selector/options.
- Normalize/chunk in your ingestion pipeline.
- Store chunks + metadata in your retrieval store.
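
The flow above can be sketched in Python for the chunking step, which this project leaves to your pipeline. The field names `markdown`, `metadata`, and `sourceUrl` come from the JSON output shape shown earlier; the helpers `chunk_markdown` and `build_records`, the character-based window, and the record layout are assumptions for illustration only. A real pipeline would probably split on headings or token counts instead.

```python
import json


def chunk_markdown(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    """Split text into character windows of `size` that overlap by `overlap`."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the current window already reaches the end of the text
        start += step
    return chunks


def build_records(doc: dict, size: int = 800, overlap: int = 100) -> list[dict]:
    """Pair each chunk with the source URL and metadata from the JSON output."""
    return [
        {
            "id": f"{doc['sourceUrl']}#{index}",  # hypothetical ID scheme
            "text": chunk,
            "sourceUrl": doc["sourceUrl"],
            "metadata": doc.get("metadata", {}),
        }
        for index, chunk in enumerate(chunk_markdown(doc["markdown"], size, overlap))
    ]


# Stand-in for a file written with `--meta --format json`.
sample = json.loads(
    '{"markdown": "# Title\\n\\nSome body text.", "metadata": {},'
    ' "sourceUrl": "https://example.com/page"}'
)
records = build_records(sample, size=12, overlap=4)

if __name__ == "__main__":
    for record in records:
        print(record["id"], repr(record["text"]))
```

Each record can then be embedded and written to whatever retrieval store you use; keeping `sourceUrl` on every chunk makes it possible to cite the original page at query time.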
- Add golden-fixture regression tests with real pages (docs/blogs/forums/tables-heavy pages).
- Add a `--chunk` mode in CLI (size/overlap/token-estimate) for direct ingestion prep.
- Add a first-party fetch service for the web app (replace public CORS proxies).
- Add quality scoring in JSON output (content ratio, link density, boilerplate ratio).
- Add deterministic normalization profiles (`strict`, `balanced`, `raw`) for different training/indexing use cases.
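
To make the planned quality scoring more concrete, here is one possible reading of "link density" as a Python sketch. Nothing here exists in the project yet: the function name `link_density` and the definition used (share of visible non-whitespace characters that sit inside inline link text) are assumptions, and the regex only handles simple inline links of the form `[text](url)`.

```python
import re

# Matches simple inline Markdown links and captures the link text.
# Image syntax also matches; the leading "!" stays in the visible text.
_INLINE_LINK = re.compile(r"\[([^\]]*)\]\([^)]*\)")
_WHITESPACE = re.compile(r"\s+")


def link_density(markdown: str) -> float:
    """Return the fraction of visible non-whitespace characters inside links."""
    # Replace each link with its text so URLs do not count as content.
    visible = _INLINE_LINK.sub(r"\1", markdown)
    total = len(_WHITESPACE.sub("", visible))
    if total == 0:
        return 0.0
    linked = sum(
        len(_WHITESPACE.sub("", match.group(1)))
        for match in _INLINE_LINK.finditer(markdown)
    )
    return linked / total


if __name__ == "__main__":
    print(link_density("[Home](/) [Docs](/docs) [Blog](/blog)"))
    print(link_density("A paragraph with one [link](https://example.com)."))
```

A value close to 1.0 suggests navigation-style content, while prose with occasional links scores much lower.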
index.html
app.html
css/main.css
js/app.js
js/config.js
cli/
MIT