Multi-agent deep research system with citation grounding and configurable output language. Give it a topic, get a professional report from 20-50 quality web sources β with hallucination control, quote verification, and quality metrics.
Built for low-VRAM hardware (8GB GPU) using local Ollama + optional remote LLM (OpenAI-compatible). Outputs English by default; Turkish supported as first-class language.
π See a full report β examples/sample_output_en.md
Most "deep research" tools are either:
- Black-box hosted SaaS (Perplexity, GPT Research) β you can't audit, can't self-host, can't customize
- English-only with poor Turkish output
- Either too generic to be useful, or too narrow to be reused
DeepResearch is transparent, self-hostable, language-flexible, and citation-grounded β you see every step, fix every prompt, and prove every claim.
13-agent pipeline. Each agent has one job (Single Responsibility Principle for prompts):
ORCHESTRATOR β Subtopic & query planning (JSON output)
SCOUT β Tavily search (parallel queries with raw_content)
RELEVANCE FILTER β JSON-based topic check + domain-quality scoring
SCRAPER β Uses Tavily raw_content; BeautifulSoup fallback
ANALYST β Extract findings (β₯7) with inline citations
CoVe-PLANNER β Generate atomic verification questions
CoVe-ANSWERER ΓN β Independent (cold-context) answers from sources
CoVe-JUDGE β Consistency check between analysis β independent answers
FACT-CHECKER β Skeptical labeling (VERIFIED / SUSPICIOUS / UNVERIFIED / OFF_TOPIC)
SYNTHESIZER β Draft report (configurable language)
CRITIC β JSON scoring + gap-detection + drill-down queries
PUBLISHER β Editorial polish (Turkish glossary applied if LANG=tr)
CITATION INSERTER β Generate [CLAIM]/[QUOTE]/[SRC] blocks (strong model)
CITATION VERIFIER β rapidfuzz quote matching against source pool (β₯75% threshold)
POST-PROCESS β URL validation (fuzzy match β₯92%) + language fixes
Two-round adaptive depth: if CRITIC scores < 7/10, the system uses CRITIC's gap analysis to generate focused drill-down queries for round 2.
qwen2.5:7b reliably follows one instruction per call but drops some when given many. Original SYNTHESIZER tried to "write + translate + cite + format" simultaneously β it dropped citations 100% of the time. Splitting into WRITER β CITATION INSERTER restored 100% citation compliance.
"multi-agent" appears in the wild as multi-agent, multi agent, multiagent. derive_topic_keywords() expands all variants. Scrape stage requires at least one compound variant to appear verbatim in source text.
Each result gets a quality score (-3 to +3). Forums/dictionaries/shopping = -3 (auto-filtered). Academic/consulting (arxiv, mckinsey, gartner, .edu, .gov) = +3. Sources are sorted by quality before scraping budget allocation.
Citations are generated AFTER the publisher polish (not before β otherwise translation can corrupt tags). rapidfuzz.partial_ratio with 75% threshold catches paraphrase but rejects fabrication. Falls back to difflib if rapidfuzz isn't installed.
5. Factored Chain-of-Verification (arXiv 2309.11495)
Verification questions are answered in fresh contexts (no chat history) β preventing the model from defending its previous answer. Inconsistencies flagged for FACT-CHECKER and used to penalize the score.
If zero sources pass filtering, PUBLISHER is never called. The system writes an explicit failure message instead of fabricating content. Earlier versions hallucinated fake academic references when given empty input; this guard makes it impossible.
Cheap local models (qwen2.5:7b via Ollama) for mechanical work (filtering, JSON parsing, search planning). Strong remote model (OpenAI-compatible, e.g. gpt-oss-120b) for judgment-heavy work (synthesis, criticism, citation). Configurable per role via strong=True flag.
git clone https://github.com/emreconscience/DeepResearch.git
cd DeepResearch
# Dependencies
pip install -r requirements.txt
# Local LLM (for mechanical roles)
# https://ollama.com
ollama pull qwen2.5:7b
# Configure API keys
cp .env.example .env
# Edit .env β at minimum set TAVILY_API_KEY
# (Tavily free tier: 1000 searches/month, no credit card)
# Run (English output by default)
python deep_research_v2.py "AI agent orchestration architecture"
# Turkish output
OUTPUT_LANG=tr python deep_research_v2.py "AI agent orchestration architecture"Output is written to research_output.md.
All via environment variables (see .env.example):
| Variable | Default | Purpose |
|---|---|---|
TAVILY_API_KEY |
empty | Required for web search (DuckDuckGo fallback if missing) |
OUTPUT_LANG |
en |
Output language: en or tr |
REMOTE_API_KEY |
empty | Optional OpenAI-compatible API for strong roles |
REMOTE_BASE_URL |
empty | e.g. https://api.openai.com/v1 |
REMOTE_MODEL |
gpt-oss-120b |
Model name for remote API |
MODEL_STRONG |
qwen2.5:7b |
Local fallback for strong roles |
MODEL_FAST |
qwen2.5:7b |
Local model for mechanical roles |
python eval_harness.py # run full eval set
python eval_harness.py rag_techniques # run a single testEval set in eval_set.json β 5 golden topics, each with must_cover concepts and expected_min_score. Metrics:
- Faithfulness (35%): claims supported by retrieved sources
- Relevancy (25%): report stays on topic
- Coverage (25%): expected concepts present (substring match)
- Hallucination (15%): detection of known bad patterns (e.g., fake academic references)
Each run is compared to the previous eval_latest.json for regression detection. Exit code is 1 if any test regresses by β₯0.5 points.
| Resource | Amount |
|---|---|
| Remote tokens (gpt-oss-120b-class) | ~17-20K (4 strong calls) |
| Local tokens (qwen2.5:7b) | ~30-50K (8+ mechanical calls) |
| Tavily searches | 4-6 queries |
| Wall time | 180-360s |
| Sources retrieved | 20-50 |
If you only use local models: zero external cost, ~300-450s per report, expect lower quality on judgment-heavy roles (CRITIC, SYNTHESIZER).
A 1-paragraph excerpt from a real run on "AI agent orchestration architecture":
AI agent orchestration is a control layer that brings together multiple specialized agents to automate complex workflows. Centralized or hybrid orchestrators handle task distribution, data sharing, and result aggregation, enabling real-time decision-making across customer service, supply chain, and other domains (https://www.deloitte.com/.../ai-agent-orchestration.html). Market research projects the global AI-agent market growing at 42% CAGR through 2031, reaching USD 57B (https://www.mordorintelligence.com/industry-reports/agentic-ai-market).
With citation block:
[CLAIM 2] AI agent market is projected to grow from $6.96B (2025) to $57.42B (2031).
[QUOTE 2] "The agentic AI market size was valued at USD 6.96 billion in 2025 and estimated to grow from USD 9.89 billion in 2026 to reach USD 57.42 billion by 2031, at a CAGR of 42.14% during the forecast period (2026-2031)."
[SRC 2] https://www.mordorintelligence.com/industry-reports/agentic-ai-market
- Tested on Python 3.13, WSL2 Ubuntu
- 8GB VRAM GPU runs qwen2.5:7b comfortably
- Tavily free tier sufficient for ~50 reports/month
- ~1500 lines of Python, no framework dependencies (no LangChain/LangGraph)
- Released under MIT
- Tavily β search API with content extraction
- Ollama β local LLM serving
- rapidfuzz β fast fuzzy matching
- Chain-of-Verification (Dhuliawala et al., 2023)
- FRONT: Fine-grained Citation
