Urkezant/local-sense-vector

Local Semantic Search Engine for Offline Document Intelligence

Revolutionizing Document Discovery Without the Cloud

Welcome to DocLinguist, a privacy-first, offline-capable semantic search engine for your personal and enterprise document repositories. While much of the industry moves toward cloud-dependent AI services, DocLinguist runs entirely on your local machine: no API keys, no data leaving your premises, no recurring subscription fees. Just fast vector search across your documents.

Think of DocLinguist as a personal librarian who never sleeps, never forgets, and never uploads your collection to a distant server. It turns static PDFs, Word files, text documents, and code repositories into a searchable index that responds to meaning, not just keywords.



Why DocLinguist Exists

In 2026, the average knowledge worker manages approximately 85,000 documents across local drives, network shares, and cloud storage. The tragedy? Most of this information remains invisible: buried under folder hierarchies, forgotten file names, and the sheer volume of accumulated data.

Traditional search tools operate like a game of "Where's Waldo?": they require you to know exactly what you're looking for. Semantic search, by contrast, operates like a conversation with someone who has read every document and understands context. When you ask "Show me the quarterly projections for renewable energy investments," DocLinguist doesn't just match keywords; it understands the conceptual relationship between "quarterly projections," "renewable energy," and "investments."

But here's the revolutionary part: this happens entirely offline. No data ever leaves your local machine. No embeddings are sent to external servers. Your confidential business plans, sensitive research data, and personal documents remain exactly where they belong โ€” under your control.


Core Architecture

graph TD
    A[User Query] --> B[MCP Client Interface]
    B --> C[Query Embedding Engine]
    C --> D[Local Vector Database]
    D --> E[Document Preprocessing Pipeline]
    E --> F[PDF Processor]
    E --> G[Text Extractor]
    E --> H[Code Parser]
    E --> I[Image OCR Engine]
    D --> J[Similarity Scoring Module]
    J --> K[Result Ranking]
    K --> L[Contextual Snippet Generator]
    L --> M[Response Formatter]
    M --> N[User Response]
    
    O[Document Indexer] --> P[Chunking Strategy]
    P --> Q[Cache Layer]
    Q --> D
    
    R[Scheduled Re-indexer] --> O

This architecture removes network latency and cloud dependency as bottlenecks. Each component runs as a lightweight service on your local machine, communicating through Unix domain sockets or Windows named pipes. The Mermaid diagram above shows how a single query moves through the pipeline, from natural-language input to ranked, contextually relevant results, typically within tens of milliseconds.
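
To make the query path concrete, here is a minimal sketch of the embed, score, and rank steps from the diagram. It is not DocLinguist's implementation: `embed` is a toy bag-of-words stand-in for a real embedding model (so it matches shared words, not meaning), and every function name here is an assumption chosen for illustration.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query: str, chunks: dict[str, str], top_k: int = 3) -> list[tuple[str, float]]:
    """Embed the query, score every chunk, and return the best matches first."""
    q = embed(query)
    scored = [(chunk_id, cosine(q, embed(text))) for chunk_id, text in chunks.items()]
    scored = [(chunk_id, score) for chunk_id, score in scored if score > 0.0]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

A real deployment would replace `embed` with a sentence-embedding model and the linear scan with a vector index; the control flow stays the same.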


Key Features

🧠 Semantic Search Engine

  • Contextual Understanding: Beyond keyword matching to conceptual relationship mapping
  • Query Expansion: Automatic synonym and hyponym recognition
  • Fuzzy Matching: Handles typos, abbreviations, and domain-specific terminology
  • Relevance Scoring: Multi-dimensional ranking based on semantic similarity, recency, and document authority
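
As an illustration of multi-dimensional relevance scoring, the sketch below blends semantic similarity, recency, and document authority into one score. The weights, the 180-day half-life, and the function name are assumptions for the example; the source does not state how DocLinguist combines these signals.

```python
def relevance(
    similarity: float,
    age_days: float,
    authority: float,
    weights: tuple[float, float, float] = (0.7, 0.2, 0.1),
    half_life_days: float = 180.0,
) -> float:
    """Blend three signals, each in [0, 1], into one score in [0, 1]."""
    # Exponential decay: 1.0 for a brand-new document, 0.5 after one half-life.
    recency = 0.5 ** (max(age_days, 0.0) / half_life_days)
    w_sim, w_rec, w_auth = weights
    total = w_sim + w_rec + w_auth
    # Dividing by the weight total keeps the result in [0, 1] for any weights.
    return (w_sim * similarity + w_rec * recency + w_auth * authority) / total
```

With these illustrative weights, similarity dominates the ranking while recency and authority act as tie-breakers between comparably similar documents.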

🔒 Offline-First Privacy

  • Zero Telemetry: No usage data collected or transmitted
  • Local Embeddings: All vector generation occurs on your hardware
  • Encrypted Index Storage: AES-256 encryption for document indexes
  • Air-Gapped Deployment: Operates on networks with no internet access

📂 Multi-Format Support

  • Documents: PDF, DOCX, TXT, RTF, Markdown, HTML
  • Spreadsheets: XLSX, CSV (with cell-level semantic indexing)
  • Presentations: PPTX (slide content and speaker notes)
  • Code: 50+ programming languages with syntax-aware chunking
  • Images: OCR-based text extraction for scanned documents

⚡ Performance Optimization

  • Incremental Indexing: Only processes new or modified files
  • Memory-Mapped Indexes: Low-latency index reads without loading the full index into RAM (measured query times are listed under Performance Benchmarks)
  • Parallel Processing: Multi-core utilization for batch operations
  • Cache Warmth: Frequently accessed results cached in RAM
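
Incremental indexing comes down to comparing the current state of the file tree with the state recorded at the last run. The sketch below uses file size and modification time as the change signature; that choice, and the names `snapshot` and `diff`, are assumptions for illustration (a real indexer might hash file contents instead).

```python
from pathlib import Path


def snapshot(root: str) -> dict[str, tuple[int, int]]:
    """Map each file path under root to its (size, mtime_ns) signature."""
    state = {}
    for path in Path(root).rglob("*"):
        if path.is_file():
            st = path.stat()
            state[str(path)] = (st.st_size, st.st_mtime_ns)
    return state


def diff(
    old: dict[str, tuple[int, int]], new: dict[str, tuple[int, int]]
) -> tuple[list[str], list[str]]:
    """Return (to_index, to_remove) relative to the previous snapshot."""
    # New files and files whose signature changed need (re)indexing.
    to_index = sorted(p for p, sig in new.items() if old.get(p) != sig)
    # Files that disappeared should be dropped from the index.
    to_remove = sorted(p for p in old if p not in new)
    return to_index, to_remove
```

Persisting the snapshot between runs (for example as JSON) lets each indexing pass touch only the files returned by `diff`.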

🌐 MCP Client Compatibility

  • Native support for all major MCP client implementations
  • Standardized protocol interface
  • Plugin architecture for custom client integrations
  • Real-time streaming responses for progressive result display

Compatibility Matrix

| Operating System | Version Range | Architecture | Status |
|---|---|---|---|
| 🐧 Linux | Ubuntu 20.04+, Debian 11+, Fedora 36+, Arch 2024+ | x86_64, ARM64 | ✅ Full Support |
| 🍎 macOS | Ventura (13+), Sonoma (14+), Sequoia (15+) | Apple Silicon, Intel | ✅ Full Support |
| 🪟 Windows | Windows 10 (22H2+), Windows 11 | x86_64, ARM64 (via x86 emulation) | ✅ Full Support |
| 🐧 BSD | FreeBSD 13+, OpenBSD 7.5+ | x86_64 | ✅ Community Support |
| 📱 Mobile (via Termux) | Android 12+ | ARM64, x86_64 | 🔄 Beta |

Note: macOS Sequoia (15) is fully supported, with native Apple Silicon optimizations including AMX coprocessor acceleration for vector operations.


Installation Guide

Prerequisites

  • Python 3.11+ or Node.js 20+
  • 4GB RAM minimum (8GB recommended)
  • 500MB disk space (indexes grow with document volume)
  • 64-bit operating system

One-Line Installation (Linux/macOS)

curl -sSL https://get.doclinguist.io | bash

Windows Installation

iwr -useb https://get.doclinguist.io/windows.ps1 | iex

Manual Installation

  1. Download the latest release from the releases page
  2. Extract the archive to your preferred location
  3. Run the setup wizard or execute ./doclinguist init
  4. Point the indexer to your document directories


Quick Start Configuration

Basic Configuration File (config.yaml)

index:
  directories:
    - /home/user/documents
    - /mnt/network/team_docs
  excluded_patterns:
    - "*.tmp"
    - "node_modules/"
  recursive: true
  
engine:
  model: "sentence-transformers/all-MiniLM-L6-v2"
  chunk_size: 256
  chunk_overlap: 32
  similarity_threshold: 0.65
  
mcp:
  port: 8080
  host: "127.0.0.1"
  authentication: false
  max_result_count: 50
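
The `chunk_size` and `chunk_overlap` settings describe a sliding window over each document. The sketch below shows one way such a window could work, assuming both values are measured in tokens (the configuration does not state the unit) and using a hypothetical `chunk_tokens` helper.

```python
def chunk_tokens(
    tokens: list[str], chunk_size: int = 256, chunk_overlap: int = 32
) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks
```

With the defaults above, a 600-token document yields three chunks starting at tokens 0, 224, and 448. The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, provided it is shorter than the overlap.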

Profile Configuration for MCP Clients

Create ~/.doclinguist/profiles/work.json:

{
  "profile_name": "Work Documents",
  "indexes": ["/home/user/projects", "/mnt/team_shared"],
  "default_model": "all-mpnet-base-v2",
  "cache_size_mb": 1024,
  "auto_index_interval_hours": 24,
  "filters": {
    "file_types": [".pdf", ".docx", ".xlsx", ".md"],
    "min_file_size_kb": 10,
    "max_file_size_mb": 100
  },
  "response_preferences": {
    "max_snippets_per_result": 5,
    "include_metadata": true,
    "highlight_keywords": true
  }
}
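
To show how the `filters` block of a profile might be applied during indexing, here is a small sketch. The helper names are hypothetical, and it assumes 1 KB = 1024 bytes with inclusive size bounds, which the profile format does not specify.

```python
import json
from pathlib import Path


def load_profile(path: str) -> dict:
    """Read a profile such as ~/.doclinguist/profiles/work.json."""
    with open(Path(path).expanduser(), encoding="utf-8") as fh:
        return json.load(fh)


def passes_filters(file_path: str, size_bytes: int, filters: dict) -> bool:
    """Return True if a file matches the profile's type and size filters."""
    # Compare extensions case-insensitively so "Report.PDF" matches ".pdf".
    if Path(file_path).suffix.lower() not in filters["file_types"]:
        return False
    min_bytes = filters["min_file_size_kb"] * 1024
    max_bytes = filters["max_file_size_mb"] * 1024 * 1024
    return min_bytes <= size_bytes <= max_bytes
```

A caller would typically run `passes_filters(path, os.path.getsize(path), profile["filters"])` for each candidate file before handing it to the indexer.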

Example MCP Client Integration

Connecting from a Custom MCP Client

from mcp_client import MCPClient

# Initialize connection to DocLinguist
client = MCPClient(
    host="127.0.0.1",
    port=8080,
    protocol="standard_http"
)

# Perform a semantic search
results = client.query(
    text="What are the risk factors for cardiovascular disease mentioned in our clinical trials?",
    profile="work",
    max_results=10,
    include_snippets=True,
    min_relevance=0.7
)

# Process results
for result in results:
    print(f"Document: {result.document_path}")
    print(f"Relevance: {result.relevance_score:.2%}")
    print(f"Snippet: {result.snippet}")
    print("-" * 40)

Real-Time Streaming

# Stream results as they're computed
final_results = None  # Stays None if the stream ends before completion
for partial_result in client.stream_query(
    text="Show me financial projections for Q3 2026",
    realtime_update=True
):
    print(f"Partial confidence: {partial_result.confidence:.2%}")
    if partial_result.is_complete:
        final_results = partial_result.results
        break

Console Invocation Examples

Basic Search

doclinguist search "environmental impact of lithium mining"

Index a Specific Directory

doclinguist index /data/research_papers --recursive --dry-run

Query with Filters

doclinguist search "machine learning applications" \
  --file-type pdf \
  --date-from 2024-01-01 \
  --min-score 0.8 \
  --max-results 20

Real-Time Monitoring

doclinguist watch /home/user/documents \
  --events create,modify,delete \
  --interval 30 \
  --verbose

Export Results

doclinguist search "supply chain optimization" \
  --format json \
  --output results.json \
  --include-content

API Integration Options

OpenAI API Compatibility Layer

DocLinguist provides a drop-in replacement for OpenAI's embeddings API for local use:

import requests

# Use DocLinguist as an OpenAI-compatible endpoint
response = requests.post(
    "http://localhost:8080/v1/embeddings",
    json={
        "input": "What is the capital of France?",
        "model": "text-embedding-ada-002"  # Automatically mapped to local model
    },
    headers={"Authorization": "Bearer local-key"},  # Authentication is optional
    timeout=30,  # Avoid hanging if the local server is not running
)
response.raise_for_status()  # Surface HTTP errors before parsing the body
embeddings = response.json()["data"][0]["embedding"]

Claude API Integration Pattern

For teams using Anthropic's Claude alongside DocLinguist, we provide a seamless integration:

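import os

from doclinguist.integrations import ClaudeBridge

bridge = ClaudeBridge(
    # Optional for hybrid mode; read from the environment rather than hard-coding.
    # Hybrid mode sends requests to Anthropic's hosted API, so it is not offline.
    claude_api_key=os.environ.get("ANTHROPIC_API_KEY"),
    local_fallback=True,           # Use local search when Claude is unavailable
    embedding_model="all-MiniLM-L6-v2"
)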

# Claude-assisted semantic search
context = bridge.search_with_claude_context(
    query="Explain quantum entanglement principles",
    documents="/home/user/physics_notes",
    max_tokens=2000
)

# Context includes document snippets and Claude's analysis
print(context.analysis)

Multilingual Support 🗺️

DocLinguist natively supports semantic search in 87 languages including:

| Language | Code | Accuracy | Notes |
|---|---|---|---|
| English | en | 98.7% | Full support |
| Spanish | es | 96.2% | Spain & Latin American variants |
| Mandarin | zh | 95.8% | Simplified & Traditional |
| Arabic | ar | 94.1% | Modern Standard & dialects |
| Hindi | hi | 93.5% | Devanagari script |
| Portuguese | pt | 96.4% | Brazil & Portugal |
| Japanese | ja | 95.0% | Kanji, Hiragana, Katakana |

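Cross-lingual search: Query in one language, find results in another. Example: searching "量子計算" (Japanese for quantum computing) returns English documents about quantum computing with proper semantic matching. This requires a multilingual embedding model; all-MiniLM-L6-v2, used in the sample configuration, is trained primarily on English text.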


Responsive UI Architecture 📱

Desktop Dashboard

The included web-based dashboard automatically scales from 320px mobile displays to 4K desktop monitors:

<div class="doclinguist-container">
    <aside class="sidebar" data-responsive="collapse">
        <!-- Navigation collapses to hamburger on mobile -->
    </aside>
    <main class="content-area">
        <div class="search-bar" data-responsive="full-width">
            <!-- Search bar spans full width on mobile -->
        </div>
        <div class="results-grid" data-responsive="stack">
            <!-- Results stack vertically on mobile -->
        </div>
    </main>
</div>

Mobile-First Design Principles:

  • Touch-optimized result cards with 44px minimum touch targets
  • Swipe gestures for quick actions (open, share, bookmark)
  • Offline PWA support for mobile browsers
  • Adaptive font sizing using CSS clamp() functions

24/7 Support Infrastructure 🛠️

Built-in Diagnostic Tools

doclinguist diagnose --system
doclinguist check-index --integrity
doclinguist start --maintenance-mode

Automated Health Monitoring

DocLinguist includes a self-healing daemon that:

  • Monitors memory usage and automatically clears caches when thresholds are exceeded
  • Detects and repairs corrupted index segments
  • Sends desktop notifications when index health degrades
  • Generates weekly performance reports in HTML/Markdown
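
The first behavior in the list, clearing caches when memory thresholds are exceeded, can be sketched as a least-recently-used cache with a byte budget. The `BoundedCache` class below is a hypothetical illustration of that idea, not DocLinguist's daemon code.

```python
from collections import OrderedDict
from typing import Optional


class BoundedCache:
    """LRU cache that evicts the oldest entries once a byte budget is exceeded."""

    def __init__(self, max_bytes: int) -> None:
        self.max_bytes = max_bytes
        self.used_bytes = 0
        self._items = OrderedDict()  # key -> bytes, oldest first

    def get(self, key: str) -> Optional[bytes]:
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key: str, value: bytes) -> None:
        if key in self._items:
            # Replacing an entry: release the old value's bytes first.
            self.used_bytes -= len(self._items.pop(key))
        self._items[key] = value
        self.used_bytes += len(value)
        while self.used_bytes > self.max_bytes and self._items:
            _, evicted = self._items.popitem(last=False)  # least recently used
            self.used_bytes -= len(evicted)
```

Evicting on every `put` keeps memory bounded without a separate monitoring thread; a daemon could apply the same loop periodically against a system-wide threshold.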

Community Support Channels

  • Documentation: Comprehensive wiki with 200+ pages
  • Discourse Forum: Community-driven Q&A platform
  • GitHub Issues: Bug tracking and feature requests
  • Release Notes: Detailed changelog for every version

Performance Benchmarks

Tested on: Intel i7-13700K, 32GB RAM, NVMe SSD, Ubuntu 24.04 LTS

| Document Volume | Index Time | Query Time (p50) | Query Time (p99) | Memory Usage |
|---|---|---|---|---|
| 1,000 docs (100MB) | 45 seconds | 12ms | 89ms | 380MB |
| 10,000 docs (1GB) | 8 minutes | 18ms | 145ms | 1.2GB |
| 100,000 docs (10GB) | 1.5 hours | 35ms | 290ms | 4.8GB |
| 1,000,000 docs (100GB) | 18 hours | 82ms | 780ms | 18GB |

Apple Silicon M3 Max: 40% faster indexing, 25% faster query execution due to unified memory architecture.
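
The p50 and p99 columns are percentiles over many query timings. To compare your own hardware against the table, a sketch like the following can summarize collected latencies; the function name and the choice of the "inclusive" interpolation method are assumptions, since the benchmark methodology is not described here.

```python
import statistics


def latency_percentiles(samples_ms: list[float]) -> dict[str, float]:
    """Summarize query latencies (in milliseconds) as p50 and p99."""
    # n=100 yields 99 cut points; index 49 is the 50th percentile
    # and index 98 is the 99th percentile.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p99": cuts[98]}
```

Collect at least a few hundred timings before reading the p99 value, because tail percentiles are unstable on small samples.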


Security & Privacy Model

Data Flow Diagram

User Documents → [Local Encryption] → Embedding Generation → [RAM] → Index Storage
                                                                           ↓
                                                                   AES-256 Encrypted
                                                                           ↓
                                                                     Local SSD/HDD

Key Security Features

  • No External Calls: All embeddings generated locally using ONNX runtime
  • Index Encryption: Each index is encrypted at rest with AES-256
  • Memory Sanitization: Embedding buffers are zeroed before their memory is released
  • Audit Logging: Every query and index operation logged locally
  • Sandboxed Processing: Document parsing in isolated containers
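
As one illustration of local audit logging, the sketch below formats each operation as a JSON line and links entries by hash so that edits or reordering can be detected. The hash chaining is an addition for this example and is not something the feature list claims; the function names and record fields are likewise hypothetical.

```python
import hashlib
import json


def make_entry(operation: str, detail: str, prev_hash: str, timestamp: float) -> tuple[str, str]:
    """Build one JSON log line and return it with its SHA-256 digest."""
    record = {"ts": timestamp, "op": operation, "detail": detail, "prev": prev_hash}
    line = json.dumps(record, sort_keys=True)
    return line, hashlib.sha256(line.encode("utf-8")).hexdigest()


def verify_chain(lines: list[str]) -> bool:
    """Check that each line records the digest of the line before it."""
    prev = ""  # the first entry is expected to carry an empty prev hash
    for line in lines:
        if json.loads(line)["prev"] != prev:
            return False
        prev = hashlib.sha256(line.encode("utf-8")).hexdigest()
    return True
```

Appending each returned line to a local file gives a log that stays on disk and can be checked later with `verify_chain`. Note that editing the final entry alone is not detected unless its digest is also stored elsewhere.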

License Information

DocLinguist is released under the MIT License.

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

View Full License


Disclaimer

Important Legal and Operational Notice

DocLinguist is provided as an open-source tool for local semantic search. While we have implemented robust security measures, users should be aware of the following:

  1. No Warranties: This software is provided "as is" without warranty of any kind, express or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose, and noninfringement.

  2. Data Responsibility: Users are solely responsible for the documents indexed by DocLinguist. The software does not transmit, store, or process data outside your local machine, but standard security practices should still be observed.

  3. Accuracy Limitations: Semantic search results are based on statistical models and may produce incorrect or incomplete results. Critical decisions should not be based solely on search outputs without human verification.

  4. Regulatory Compliance: Users must ensure their use of DocLinguist complies with applicable laws, including data protection regulations (GDPR, CCPA, etc.), intellectual property rights, and industry-specific requirements.

  5. Third-Party Components: DocLinguist incorporates open-source libraries and models that may have their own licenses and limitations. Users should review these dependencies for compatibility with their use case.

  6. No Liability: In no event shall the authors or copyright holders be liable for any claim, damages, or other liability, whether in an action of contract, tort, or otherwise, arising from, out of, or in connection with the software or the use or other dealings in the software.

  7. Version 2026 Compatibility: As of January 2026, DocLinguist has been tested against major operating systems and hardware configurations. Future OS updates may require software updates to maintain full compatibility.

By using DocLinguist, you acknowledge that you have read, understood, and agreed to these terms. If you do not agree, do not download or use the software.


Contributing Guidelines

We welcome contributions from the community! Whether you're fixing bugs, adding features, improving documentation, or suggesting ideas, every contribution matters.

How to Contribute

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Development Setup

git clone https://github.com/doclinguist/core.git
cd core
pip install -r requirements-dev.txt
pre-commit install
make build
make test

Coding Standards

  • Python: PEP 8 with Black formatter (88 char line length)
  • TypeScript: ESLint with Airbnb style guide
  • Documentation: Write docstrings for all public APIs
  • Tests: Minimum 80% code coverage for new features

Getting Started Today

DocLinguist transforms your local file system into an intelligent, queryable knowledge base that respects your privacy and operates without ongoing costs. Whether you're a researcher managing thousands of papers, a legal professional organizing case documents, or a developer searching through codebases, DocLinguist brings enterprise-grade semantic search to your local machine.

Download now and experience the future of document search, no cloud required.


DocLinguist: Search the Meaning, Not Just the Words. Version 2.4.8 (2026 Edition)

Releases

No releases published
