ask-doc is a 100% local Command Line Interface (CLI) application built with Node.js and TypeScript. It is designed to ingest local documents, process them into chunks, and store them for hybrid search (combining BM25 sparse search and Vector embeddings) without ever sending data to the cloud.
- Local Embeddings: Utilizes @huggingface/transformers (v4) and the ONNX runtime to generate embeddings locally.
- Hybrid Search Ready: Processes documents for both BM25 (sparse) and Vector (dense) search.
- Multi-format & OCR Support: Ingests .md, .txt, .pdf, .docx, .xlsx, and images using local OCR.
- Dynamic Configuration: Manage all runtime settings (models, paths, chunking) directly via the CLI.
- Persistent Storage: Utilizes LanceDB for high-performance, local vector storage.
The CLI is capable of processing a variety of document types for local ingestion:
- Text & Documentation: .md, .txt
- Portable Documents: .pdf
- Microsoft Office: .docx, .xlsx
- Images (via OCR): Supports common image formats through the integrated Tesseract.js engine.
The project uses Tesseract.js for local text extraction from images. The process is fully offline:
- Detection: The File Walker identifies image files by extension.
- Worker Lifecycle: A local OCR worker is instantiated for each image.
- Recognition: The engine analyzes the image and returns structured text strings.
- Memory Management: Workers are terminated immediately after extraction to ensure low memory overhead.
- Standardization: Extracted text is sent to the Chunker, making image content searchable via the same vector/BM25 pipeline as text documents.
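The OCR steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `OcrWorker` interface, the `IMAGE_EXTENSIONS` set, and the function names `isImageFile` and `extractImageText` are all hypothetical, and the worker factory is passed in as a parameter so the lifecycle logic stays independent of any particular OCR library.

```typescript
// Hypothetical minimal shape of an OCR worker (assumption for this sketch).
interface OcrWorker {
  recognize(imagePath: string): Promise<{ data: { text: string } }>;
  terminate(): Promise<unknown>;
}

// Assumed set of image extensions; the list ask-doc actually uses may differ.
const IMAGE_EXTENSIONS = new Set([".png", ".jpg", ".jpeg", ".bmp", ".webp"]);

// Detection: decide by file extension, case-insensitively.
function isImageFile(filePath: string): boolean {
  const dot = filePath.lastIndexOf(".");
  return dot !== -1 && IMAGE_EXTENSIONS.has(filePath.slice(dot).toLowerCase());
}

// Worker lifecycle: one worker per image, terminated even if recognition fails.
async function extractImageText(
  createWorker: () => Promise<OcrWorker>,
  imagePath: string,
): Promise<string> {
  const worker = await createWorker();
  try {
    const { data } = await worker.recognize(imagePath);
    return data.text.trim(); // plain text, ready to hand to the chunker
  } finally {
    await worker.terminate(); // release the worker's memory right away
  }
}
```

With Tesseract.js, the factory would plausibly be something like `() => createWorker("eng")`, where `createWorker` is imported from `tesseract.js`; the language code is an assumption here.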
graph TD
User([User]) --> CLI[ask-doc CLI]
CLI --> CmdRouter{Commander.js}
subgraph "Ingestion Engine"
CmdRouter --> Ingest[Ingest Command]
Ingest --> Walker[File Walker]
Walker --> Docs[(Local Docs)]
Ingest --> Parser[Document Parsers]
Parser --> Chunker[Text Chunker]
Chunker --> Embedder[Embedding Service]
Embedder --> WorkerPool[Worker Pool]
WorkerPool --> Transformers["@huggingface/transformers"]
Transformers --> Model[(Local ONNX Model)]
Chunker --> BM25[BM25 Service]
Embedder --> Storage[Storage Service]
BM25 --> Storage
Storage --> VStore[(LanceDB - Local)]
end
subgraph "Configuration Management"
CmdRouter --> Config[Config Command]
Config --> ConfigFile[(config.json)]
end
- Runtime: Node.js (ESM)
- Language: TypeScript
- CLI Framework: Commander.js
- Machine Learning: @huggingface/transformers
- File System: fs-extra for robust directory and file operations.
- UI: ora for terminal spinners and chalk for colorized output.
├── package.json
├── tsconfig.json
├── config.json      # Central configuration file
├── model/           # Local storage for ONNX models
├── vector-store/    # Local index storage
└── src/
    ├── index.ts     # Entry point and command registration
    ├── commands/    # Ingest and Config command implementations
    ├── services/    # Embedding, BM25, and Storage logic
    ├── scripts/     # Utility scripts (e.g., model download)
    └── utils/       # File system utilities
- Clone the repository and install dependencies:
npm install
- Build the project:
npm run build
- Link the CLI (Optional):
npm link
- Build the project:
npm run build
- Run the download script:
npm run download-models
This command will download the Xenova/all-MiniLM-L6-v2 model (as specified in your config.json) and place its files into the ./model/embeddings/all-MiniLM-L6-v2 directory, making it available for local use by the ask-doc CLI.
Scan a local directory, parse documents, and generate local embeddings and BM25 indices.
- Ingest all files in a directory:
ask-doc ingest --path ./docs
- Ingest specific file types:
ask-doc ingest --path ./docs --filetype .pdf
Retrieve settings from the central config.json file.
- View model configuration:
ask-doc config get models
Update configuration values directly from the CLI.
- Modify chunk size:
ask-doc config set ingestion --key chunk_size --value 800
- Disable a model:
ask-doc config set models --key active --value false --id xenova-minilm
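To make the two `config set` forms above more concrete, here is a minimal sketch of how such an update might be applied to the parsed contents of config.json. The config shape is an assumption inferred from the commands (an object-valued `ingestion` section and a list-valued `models` section whose entries carry an `id`); `setConfigValue` and `coerce` are hypothetical names, not functions from the project.

```typescript
type ConfigValue = string | number | boolean;
type ConfigEntry = Record<string, ConfigValue>;
// Assumed shape: a section is either a flat object or a list of entries with ids.
type Config = Record<string, ConfigEntry | ConfigEntry[]>;

// CLI values arrive as strings; convert "800" to 800 and "false" to false.
function coerce(raw: string): ConfigValue {
  if (raw === "true") return true;
  if (raw === "false") return false;
  const n = Number(raw);
  return raw.trim() !== "" && !Number.isNaN(n) ? n : raw;
}

// Updates the config object in place and returns it.
function setConfigValue(
  config: Config,
  section: string,
  key: string,
  raw: string,
  id?: string,
): Config {
  const target = config[section];
  if (target === undefined) throw new Error(`Unknown section: ${section}`);
  const value = coerce(raw);
  if (Array.isArray(target)) {
    // List sections (e.g. models) need --id to select one entry.
    if (id === undefined) throw new Error(`Section "${section}" requires --id`);
    const entry = target.find((item) => item.id === id);
    if (!entry) throw new Error(`No entry with id "${id}" in "${section}"`);
    entry[key] = value;
  } else {
    target[key] = value;
  }
  return config;
}
```

Coercing the raw string keeps numeric and boolean settings typed correctly in the JSON file instead of storing them as `"800"` or `"false"`.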
Utility to fetch pre-trained models for local use.
npm run download-models
- ask-doc query Command: Implement hybrid search (BM25 + Vector) with reranking support.
- Metadata Filtering: Allow filtering search results by file path, creation date, or custom tags.
- Index Integrity: Enhance validation scripts to auto-repair corrupted or outdated indices.
- Local LLM Integration: Integrate with Ollama or local ONNX-based LLMs (e.g., Llama 3) to provide natural language answers.
- Reranking: Implement a local Cross-Encoder to significantly improve retrieval precision.
- Semantic Chunking: Move beyond fixed-size chunks to intelligent splitting based on document structure and context.
- Desktop GUI: A cross-platform desktop interface for users who prefer a visual workspace.
- API Mode: Headless mode to serve the ask-doc engine as a local REST API.
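For the planned hybrid query, one common way to merge a BM25 result list with a vector result list is reciprocal rank fusion (RRF). The sketch below is only an illustration of that technique, not a description of how ask-doc will implement it; the function name `reciprocalRankFusion` and the use of chunk IDs as strings are assumptions.

```typescript
// Fuse several ranked lists of chunk IDs (best first) into one ranking.
// Each list contributes 1 / (k + rank) per ID, with rank starting at 1.
// k = 60 is the constant commonly used with RRF.
function reciprocalRankFusion(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1;
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  // Highest fused score first.
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}

// Example: IDs ranked well by both retrievers rise to the top.
const bm25Hits = ["a", "b", "c"];
const vectorHits = ["b", "d", "a"];
const fused = reciprocalRankFusion([bm25Hits, vectorHits]);
// fused is ["b", "a", "d", "c"]
```

RRF only needs ranks, so it avoids having to normalize BM25 scores and vector distances onto a shared scale.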
- Walking: The fileWalker utility recursively scans the provided path for the specified file extension.
- Chunking: Documents are split into overlapping segments based on chunk_size and chunk_overlap defined in config.json.
- Embedding: The EmbeddingService loads a local model from the ./model/ directory (using ONNX runtime) to transform text chunks into vectors.
- Storage:
  - Vectors: Persisted in LanceDB, enabling sub-millisecond retrieval of context chunks.
  - BM25: A sparse index is built to support keyword-based retrieval alongside semantic search.
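The chunking step can be illustrated with a short sketch. It assumes that chunk_size and chunk_overlap are measured in characters, which the text above does not specify (the actual chunker may count tokens or words instead), and `chunkText` is a hypothetical name.

```typescript
// Split text into fixed-size windows where each window repeats the last
// `chunkOverlap` characters of the previous one.
function chunkText(text: string, chunkSize: number, chunkOverlap: number): string[] {
  if (chunkSize <= 0) throw new Error("chunkSize must be positive");
  if (chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new Error("chunkOverlap must be >= 0 and smaller than chunkSize");
  }
  const chunks: string[] = [];
  const step = chunkSize - chunkOverlap; // how far each window advances
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    // Stop once a window has reached the end of the text.
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}

const example = chunkText("abcdefghij", 4, 2);
// example is ["abcd", "cdef", "efgh", "ghij"]
```

The overlap means a sentence that straddles a chunk boundary still appears intact in at least one chunk, at the cost of storing some text twice.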
MIT