DeepResearch

Multi-agent deep research system with citation grounding and configurable output language. Give it a topic, get a professional report from 20-50 quality web sources — with hallucination control, quote verification, and quality metrics.

Built for low-VRAM hardware (8GB GPU) using local Ollama plus an optional remote LLM (OpenAI-compatible). Outputs English by default; Turkish is supported as a first-class language.

📄 See a full report → examples/sample_output_en.md

Why this exists

Most "deep research" tools have at least one of these problems:

They are black-box hosted SaaS (Perplexity, GPT Research): you can't audit, self-host, or customize them
They are English-only, with poor Turkish output
They are either too generic to be useful or too narrow to be reused

DeepResearch is transparent, self-hostable, language-flexible, and citation-grounded — you see every step, fix every prompt, and prove every claim.

Architecture

13-agent pipeline. Each agent has one job (Single Responsibility Principle for prompts):

ORCHESTRATOR     → Subtopic & query planning (JSON output)
SCOUT            → Tavily search (parallel queries with raw_content)
RELEVANCE FILTER → JSON-based topic check + domain-quality scoring
SCRAPER          → Uses Tavily raw_content; BeautifulSoup fallback
ANALYST          → Extract findings (≥7) with inline citations
CoVe-PLANNER     → Generate atomic verification questions
CoVe-ANSWERER ×N → Independent (cold-context) answers from sources
CoVe-JUDGE       → Consistency check between analysis ↔ independent answers
FACT-CHECKER     → Skeptical labeling (VERIFIED / SUSPICIOUS / UNVERIFIED / OFF_TOPIC)
SYNTHESIZER      → Draft report (configurable language)
CRITIC           → JSON scoring + gap-detection + drill-down queries
PUBLISHER        → Editorial polish (Turkish glossary applied if LANG=tr)
CITATION INSERTER → Generate [CLAIM]/[QUOTE]/[SRC] blocks (strong model)
CITATION VERIFIER → rapidfuzz quote matching against source pool (≥75% threshold)
POST-PROCESS     → URL validation (fuzzy match ≥92%) + language fixes

Two-round adaptive depth: if CRITIC scores the draft below 7/10, the system uses CRITIC's gap analysis to generate focused drill-down queries for round 2.
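
The round-2 trigger can be sketched as a small control loop. This is a minimal sketch rather than the project's code: `run_adaptive` and its three callables are hypothetical stand-ins, the `drill_down_queries` key is an assumed shape for CRITIC's JSON output, and the research callable is assumed to keep its own source pool between rounds.

```python
from typing import Callable

SCORE_THRESHOLD = 7.0  # round 2 triggers when CRITIC scores below 7/10
MAX_ROUNDS = 2


def run_adaptive(
    topic: str,
    plan_queries: Callable[[str], list],
    research: Callable[[list], str],
    critique: Callable[[str], dict],
) -> tuple:
    """Run up to two research rounds and return (draft, rounds_used)."""
    queries = plan_queries(topic)
    draft = ""
    for round_no in range(1, MAX_ROUNDS + 1):
        draft = research(queries)
        review = critique(draft)
        if review["score"] >= SCORE_THRESHOLD or round_no == MAX_ROUNDS:
            return draft, round_no
        # The next round searches only for the gaps the critic reported.
        queries = review.get("drill_down_queries", [])
    return draft, MAX_ROUNDS
```

Capping the loop at two rounds keeps the search and token budget predictable even when the critic never reaches the threshold.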

Key technical decisions

1. Single-responsibility agents

qwen2.5:7b reliably follows one instruction per call but drops some instructions when given several at once. The original SYNTHESIZER tried to "write + translate + cite + format" simultaneously and dropped citations 100% of the time. Splitting it into WRITER → CITATION INSERTER restored 100% citation compliance.

2. Topic-lock with compound term variants

"multi-agent" appears in the wild as multi-agent, multi agent, and multiagent. derive_topic_keywords() expands all variants. The scrape stage requires at least one compound variant to appear verbatim in the source text.

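The variant expansion described above might look roughly like this. The sketch does not reproduce derive_topic_keywords(); `expand_compound_variants` and `passes_topic_lock` are hypothetical names, and case-insensitive matching is an assumption.

```python
import re


def expand_compound_variants(term: str) -> set:
    """Return the hyphenated, spaced, and joined spellings of a compound term."""
    parts = [p for p in re.split(r"[-\s]+", term.strip().lower()) if p]
    if len(parts) < 2:
        return set(parts)  # single word (or empty input): nothing to expand
    return {"-".join(parts), " ".join(parts), "".join(parts)}


def passes_topic_lock(text: str, term: str) -> bool:
    """True if at least one variant of the term appears verbatim in the text."""
    haystack = text.lower()
    return any(variant in haystack for variant in expand_compound_variants(term))
```

For example, `passes_topic_lock("A survey of multiagent systems", "multi-agent")` accepts the page even though the source spells the term without a hyphen.
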
3. Domain quality scoring

Each result gets a quality score (-3 to +3). Forums/dictionaries/shopping = -3 (auto-filtered). Academic/consulting (arxiv, mckinsey, gartner, .edu, .gov) = +3. Sources are sorted by quality before scraping budget allocation.
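
A scoring pass along these lines could be written as follows. This is an illustrative sketch: the domain lists are placeholders (reddit.com and amazon.com are assumed examples of the forum and shopping categories), and `domain_quality` and `rank_sources` are hypothetical names, not the project's functions.

```python
from urllib.parse import urlparse

# Placeholder lists for illustration; the project's real lists may differ.
HIGH_QUALITY = ("arxiv.org", "mckinsey.com", "gartner.com")
HIGH_QUALITY_SUFFIXES = (".edu", ".gov")
LOW_QUALITY = ("reddit.com", "amazon.com")


def domain_quality(url: str) -> int:
    """Score a URL from -3 (auto-filtered) to +3 (academic/consulting)."""
    host = (urlparse(url).hostname or "").lower()

    def matches(domain: str) -> bool:
        return host == domain or host.endswith("." + domain)

    if any(matches(d) for d in LOW_QUALITY):
        return -3
    if any(matches(d) for d in HIGH_QUALITY) or host.endswith(HIGH_QUALITY_SUFFIXES):
        return 3
    return 0


def rank_sources(urls: list, min_score: int = -2) -> list:
    """Drop auto-filtered domains, then sort best-first (stable for ties)."""
    scored = [(domain_quality(u), u) for u in urls]
    kept = [(score, u) for score, u in scored if score >= min_score]
    return [u for score, u in sorted(kept, key=lambda pair: -pair[0])]
```

Sorting before scraping means a limited scraping budget is spent on the highest-scoring domains first.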

4. Citation grounding with fuzzy match

Citations are generated AFTER the publisher polish, not before, because translation can otherwise corrupt the tags. rapidfuzz.partial_ratio with a 75% threshold tolerates light paraphrase but rejects fabricated quotes. The verifier falls back to difflib if rapidfuzz isn't installed.
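
The quote check can be sketched as below. `fuzz.partial_ratio` is the real rapidfuzz call; everything else is an assumption for illustration: `verify_quote` and `normalize` are hypothetical names, the source pool is assumed to be a URL-to-text dict, and the difflib branch is only a rough approximation of partial matching, not the project's fallback.

```python
import difflib
import re

try:
    from rapidfuzz import fuzz

    def partial_ratio(needle: str, haystack: str) -> float:
        return float(fuzz.partial_ratio(needle, haystack))
except ImportError:
    def partial_ratio(needle: str, haystack: str) -> float:
        """Rough stand-in: align on the longest common block, then score the
        needle against a needle-sized window of the haystack."""
        if len(needle) > len(haystack):
            needle, haystack = haystack, needle
        if not needle:
            return 0.0
        matcher = difflib.SequenceMatcher(None, needle, haystack, autojunk=False)
        block = matcher.find_longest_match(0, len(needle), 0, len(haystack))
        start = max(0, block.b - block.a)
        window = haystack[start:start + len(needle)]
        return 100.0 * difflib.SequenceMatcher(None, needle, window, autojunk=False).ratio()


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so layout differences don't matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def verify_quote(quote: str, sources: dict, threshold: float = 75.0) -> tuple:
    """Return (verified, best_url, best_score) for a quote against the pool."""
    needle = normalize(quote)
    best_url, best_score = None, 0.0
    if needle:
        for url, text in sources.items():
            score = partial_ratio(needle, normalize(text))
            if score > best_score:
                best_url, best_score = url, score
    verified = best_score >= threshold
    return verified, (best_url if verified else None), best_score
```

A quote that appears verbatim in any source scores 100, while a quote with no counterpart in the pool is expected to fall well below the threshold and be rejected.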

5. Factored Chain-of-Verification (arXiv 2309.11495)

Verification questions are answered in fresh contexts (no chat history), which prevents the model from defending its previous answer. Inconsistencies are flagged for FACT-CHECKER and used to penalize the score.
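
The key property is that each answering prompt contains only the sources and one question. A minimal sketch, assuming a stateless `llm` callable (prompt in, text out) and a caller-supplied `consistent` comparison; `answer_cold` and `factored_cove` are hypothetical names, not the project's API:

```python
from typing import Callable


def answer_cold(llm: Callable[[str], str], question: str, sources: str) -> str:
    """Answer one question from the sources alone: no draft, no chat history."""
    prompt = (
        "Answer using only these sources.\n\n"
        f"SOURCES:\n{sources}\n\n"
        f"QUESTION: {question}"
    )
    return llm(prompt)


def factored_cove(
    llm: Callable[[str], str],
    claims: dict,
    sources: str,
    consistent: Callable[[str, str], bool],
) -> list:
    """claims maps each verification question to the answer implied by the
    analysis. Returns the questions whose independent answer disagrees."""
    flagged = []
    for question, claimed in claims.items():
        independent = answer_cold(llm, question, sources)
        if not consistent(claimed, independent):
            flagged.append(question)
    return flagged
```

Because the claimed answer never enters the prompt, the model cannot simply repeat it; any agreement has to come from the sources.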

6. Hallucination guard

If zero sources pass filtering, PUBLISHER is never called. The system writes an explicit failure message instead of fabricating content. Earlier versions hallucinated fake academic references when given empty input; this guard closes that path.
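
In code, the guard amounts to a check that runs before any generation step. A minimal sketch, with a hypothetical `publish_or_fail` wrapper and an assumed failure message (the project's actual wording may differ):

```python
from typing import Callable

# Assumed wording for illustration only.
FAILURE_MESSAGE = (
    "# Research failed\n\n"
    "No sources passed filtering for this topic, so no report was generated.\n"
)


def publish_or_fail(sources: list, publisher: Callable[[list], str]) -> str:
    """Call the publisher only when there is at least one source to ground it."""
    if not sources:
        return FAILURE_MESSAGE
    return publisher(sources)
```

Placing the check outside the publisher means the model never sees an empty source pool, so there is no prompt for it to fill with invented references.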

7. Mixed-model pipeline

Cheap local models (qwen2.5:7b via Ollama) handle mechanical work (filtering, JSON parsing, search planning). A strong remote model (OpenAI-compatible, e.g. gpt-oss-120b) handles judgment-heavy work (synthesis, criticism, citation). The model is configurable per role via the strong=True flag.
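
Per-role routing could be resolved from the environment roughly as follows. The variable names and defaults are the ones listed under Configuration; `pick_model` and `ModelTarget` are hypothetical names, and the rule that a strong role falls back to MODEL_STRONG when no remote API is configured is an assumption based on that table.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelTarget:
    backend: str  # "remote" (OpenAI-compatible) or "ollama" (local)
    model: str


def pick_model(strong: bool = False, env=None) -> ModelTarget:
    """Choose a backend and model name for one agent call."""
    env = os.environ if env is None else env
    remote_ready = bool(env.get("REMOTE_API_KEY")) and bool(env.get("REMOTE_BASE_URL"))
    if strong and remote_ready:
        return ModelTarget("remote", env.get("REMOTE_MODEL", "gpt-oss-120b"))
    # Mechanical roles, or strong roles without a remote API, stay local.
    key = "MODEL_STRONG" if strong else "MODEL_FAST"
    return ModelTarget("ollama", env.get(key, "qwen2.5:7b"))
```

Passing `env` explicitly makes the routing easy to test without touching the real environment.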

Quickstart

git clone https://github.com/emreconscience/DeepResearch.git
cd DeepResearch

# Dependencies
pip install -r requirements.txt

# Local LLM (for mechanical roles)
# https://ollama.com
ollama pull qwen2.5:7b

# Configure API keys
cp .env.example .env
# Edit .env — at minimum set TAVILY_API_KEY
# (Tavily free tier: 1000 searches/month, no credit card)

# Run (English output by default)
python deep_research_v2.py "AI agent orchestration architecture"

# Turkish output
OUTPUT_LANG=tr python deep_research_v2.py "AI agent orchestration architecture"

Output is written to research_output.md.

Configuration

All via environment variables (see .env.example):

Variable	Default	Purpose
`TAVILY_API_KEY`	empty	Needed for Tavily web search (falls back to DuckDuckGo if missing)
`OUTPUT_LANG`	`en`	Output language: `en` or `tr`
`REMOTE_API_KEY`	empty	Optional OpenAI-compatible API for strong roles
`REMOTE_BASE_URL`	empty	e.g. `https://api.openai.com/v1`
`REMOTE_MODEL`	`gpt-oss-120b`	Model name for remote API
`MODEL_STRONG`	`qwen2.5:7b`	Local fallback for strong roles
`MODEL_FAST`	`qwen2.5:7b`	Local model for mechanical roles

Evaluation

python eval_harness.py             # run full eval set
python eval_harness.py rag_techniques  # run a single test

The eval set lives in eval_set.json: 5 golden topics, each with must_cover concepts and an expected_min_score. Metrics and weights:

Faithfulness (35%): claims supported by retrieved sources
Relevancy (25%): report stays on topic
Coverage (25%): expected concepts present (substring match)
Hallucination (15%): detection of known bad patterns (e.g., fake academic references)

Each run is compared to the previous eval_latest.json for regression detection. Exit code is 1 if any test regresses by ≥0.5 points.
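
The weighting and the regression rule can be sketched as below. The weights and the 0.5-point threshold come from the text above; the function names are hypothetical, and the sketch assumes every metric is on a 0-10 scale where higher is better (including the hallucination metric).

```python
WEIGHTS = {
    "faithfulness": 0.35,
    "relevancy": 0.25,
    "coverage": 0.25,
    "hallucination": 0.15,
}
REGRESSION_DELTA = 0.5  # a drop of this many points or more counts as a regression


def coverage_score(report: str, must_cover: list) -> float:
    """Substring-based coverage of expected concepts, scaled to 0-10."""
    if not must_cover:
        return 10.0
    text = report.lower()
    hits = sum(1 for concept in must_cover if concept.lower() in text)
    return 10.0 * hits / len(must_cover)


def overall_score(metrics: dict) -> float:
    """Weighted sum of the four metrics."""
    return sum(WEIGHTS[name] * metrics[name] for name in WEIGHTS)


def regressions(previous: dict, current: dict) -> list:
    """Names of tests whose score dropped by at least REGRESSION_DELTA."""
    return [
        test for test, old in previous.items()
        if test in current and old - current[test] >= REGRESSION_DELTA
    ]


def exit_code(previous: dict, current: dict) -> int:
    return 1 if regressions(previous, current) else 0
```

A non-zero exit code lets a CI job fail automatically when a prompt change lowers quality on any golden topic.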

Per-report cost (typical)

Resource	Amount
Remote tokens (gpt-oss-120b-class)	~17-20K (4 strong calls)
Local tokens (qwen2.5:7b)	~30-50K (8+ mechanical calls)
Tavily searches	4-6 queries
Wall time	180-360s
Sources retrieved	20-50

With local models only: zero external cost and ~300-450s per report, but expect lower quality on judgment-heavy roles (CRITIC, SYNTHESIZER).

Example output

A one-paragraph excerpt from a real run on "AI agent orchestration architecture":

AI agent orchestration is a control layer that brings together multiple specialized agents to automate complex workflows. Centralized or hybrid orchestrators handle task distribution, data sharing, and result aggregation, enabling real-time decision-making across customer service, supply chain, and other domains (https://www.deloitte.com/.../ai-agent-orchestration.html). Market research projects the global AI-agent market growing at 42% CAGR through 2031, reaching USD 57B (https://www.mordorintelligence.com/industry-reports/agentic-ai-market).

With citation block:

[CLAIM 2] AI agent market is projected to grow from $6.96B (2025) to $57.42B (2031).
[QUOTE 2] "The agentic AI market size was valued at USD 6.96 billion in 2025 and estimated to grow from USD 9.89 billion in 2026 to reach USD 57.42 billion by 2031, at a CAGR of 42.14% during the forecast period (2026-2031)."
[SRC 2] https://www.mordorintelligence.com/industry-reports/agentic-ai-market
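
Blocks in this format are straightforward to parse for downstream checks. The sketch below assumes each block is exactly three consecutive lines with a shared index and a double-quoted quote, as in the example; `parse_citations` and `Citation` are hypothetical names.

```python
import re
from dataclasses import dataclass


@dataclass
class Citation:
    index: int
    claim: str
    quote: str
    src: str


# \1 forces the QUOTE and SRC lines to carry the same index as the CLAIM line.
_BLOCK = re.compile(
    r'\[CLAIM (\d+)\]\s*(.+?)\s*\n'
    r'\[QUOTE \1\]\s*"(.+?)"\s*\n'
    r'\[SRC \1\]\s*(\S+)',
    re.DOTALL,
)


def parse_citations(report: str) -> list:
    """Extract every well-formed [CLAIM]/[QUOTE]/[SRC] triple from a report."""
    return [
        Citation(int(index), claim, quote, src)
        for index, claim, quote, src in _BLOCK.findall(report)
    ]
```

Each parsed `quote` can then be checked against the text retrieved from `src`, which is the input the citation verifier needs.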

Project status

Tested on Python 3.13, WSL2 Ubuntu
8GB VRAM GPU runs qwen2.5:7b comfortably
Tavily free tier sufficient for ~50 reports/month
~1500 lines of Python, no framework dependencies (no LangChain/LangGraph)
Released under MIT

Acknowledgments & references

Tavily — search API with content extraction
Ollama — local LLM serving
rapidfuzz — fast fuzzy matching
Chain-of-Verification (Dhuliawala et al., 2023)
FRONT: Fine-grained Citation

License

MIT
