DeepResearch

License: MIT | Python 3.10+ | Ollama | Tavily

Multi-agent deep research system with citation grounding and configurable output language. Give it a topic, get a professional report from 20-50 quality web sources, with hallucination control, quote verification, and quality metrics.

Built for low-VRAM hardware (8GB GPU) using local Ollama plus an optional remote LLM (OpenAI-compatible). Outputs English by default; Turkish is supported as a first-class language.


📄 See a full report → examples/sample_output_en.md


Why this exists

Most "deep research" tools are either:

  • Black-box hosted SaaS (Perplexity, GPT Research): you can't audit, can't self-host, can't customize
  • English-only with poor Turkish output
  • Either too generic to be useful, or too narrow to be reused

DeepResearch is transparent, self-hostable, language-flexible, and citation-grounded: you see every step, fix every prompt, and prove every claim.


Architecture

13-agent pipeline. Each agent has one job (Single Responsibility Principle for prompts):

ORCHESTRATOR      → Subtopic & query planning (JSON output)
SCOUT             → Tavily search (parallel queries with raw_content)
RELEVANCE FILTER  → JSON-based topic check + domain-quality scoring
SCRAPER           → Uses Tavily raw_content; BeautifulSoup fallback
ANALYST           → Extract findings (≥7) with inline citations
CoVe-PLANNER      → Generate atomic verification questions
CoVe-ANSWERER ×N  → Independent (cold-context) answers from sources
CoVe-JUDGE        → Consistency check between analysis ↔ independent answers
FACT-CHECKER      → Skeptical labeling (VERIFIED / SUSPICIOUS / UNVERIFIED / OFF_TOPIC)
SYNTHESIZER       → Draft report (configurable language)
CRITIC            → JSON scoring + gap-detection + drill-down queries
PUBLISHER         → Editorial polish (Turkish glossary applied if OUTPUT_LANG=tr)
CITATION INSERTER → Generate [CLAIM]/[QUOTE]/[SRC] blocks (strong model)
CITATION VERIFIER → rapidfuzz quote matching against source pool (≥75% threshold)
POST-PROCESS      → URL validation (fuzzy match ≥92%) + language fixes

Two-round adaptive depth: if CRITIC scores < 7/10, the system uses CRITIC's gap analysis to generate focused drill-down queries for round 2.
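
The two-round control flow can be sketched roughly as follows. This is a minimal illustration, not the project's code: `RoundResult`, `adaptive_research`, `run_round`, and the stubbed scores are hypothetical names and values. Only the 7/10 threshold and the two-round limit come from the description above.

```python
from dataclasses import dataclass, field
from typing import Callable

SCORE_THRESHOLD = 7.0  # from the README: round 2 runs if CRITIC scores < 7/10
MAX_ROUNDS = 2


@dataclass
class RoundResult:
    """Hypothetical container for what one pipeline round returns."""
    report: str
    score: float
    drill_down_queries: list[str] = field(default_factory=list)


def adaptive_research(topic: str,
                      run_round: Callable[[list[str]], RoundResult]) -> tuple[RoundResult, int]:
    """Run up to MAX_ROUNDS rounds, feeding CRITIC's gap queries into the next one."""
    result = run_round([topic])
    rounds = 1
    while (rounds < MAX_ROUNDS
           and result.score < SCORE_THRESHOLD
           and result.drill_down_queries):
        result = run_round(result.drill_down_queries)
        rounds += 1
    return result, rounds


def _fake_round(queries: list[str]) -> RoundResult:
    # Stub standing in for the real pipeline: the first call scores low.
    if queries == ["AI agents"]:
        return RoundResult("draft", 5.5, ["agent memory benchmarks"])
    return RoundResult("improved draft", 8.0)


final, rounds_used = adaptive_research("AI agents", _fake_round)
print(rounds_used, final.score)
```

Passing the round as a callable keeps the loop testable without a model or a search API.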

Key technical decisions

1. Single-responsibility agents

qwen2.5:7b reliably follows one instruction per call but drops some when given several at once. The original SYNTHESIZER tried to write, translate, cite, and format in a single call, and it dropped citations 100% of the time. Splitting it into WRITER → CITATION INSERTER restored 100% citation compliance.

2. Topic-lock with compound term variants

"multi-agent" appears in the wild as multi-agent, multi agent, and multiagent. derive_topic_keywords() expands all variants. The scrape stage requires at least one compound variant to appear verbatim in the source text.
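
A minimal sketch of the variant expansion, assuming a simple split on hyphens and whitespace. The helper names `compound_variants` and `passes_topic_lock` are hypothetical; the project's actual `derive_topic_keywords()` may work differently.

```python
import re


def compound_variants(term: str) -> set[str]:
    """Return hyphenated, spaced, and joined spellings of a compound term."""
    cleaned = term.lower().strip()
    parts = [p for p in re.split(r"[-\s]+", cleaned) if p]
    if len(parts) < 2:
        return {cleaned}
    return {"-".join(parts), " ".join(parts), "".join(parts)}


def passes_topic_lock(text: str, term: str) -> bool:
    """True if at least one variant appears verbatim (case-insensitive) in text."""
    lowered = text.lower()
    return any(variant in lowered for variant in compound_variants(term))


print(sorted(compound_variants("multi-agent")))
```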

3. Domain quality scoring

Each result gets a quality score (-3 to +3). Forums/dictionaries/shopping = -3 (auto-filtered). Academic/consulting (arxiv, mckinsey, gartner, .edu, .gov) = +3. Sources are sorted by quality before scraping budget allocation.
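
The scoring and sorting step might look like the sketch below. The domain lists are illustrative only (the low-quality entries in particular are hypothetical examples of forum and shopping sites), and the intermediate scores between -3 and +3 are not modeled here.

```python
from urllib.parse import urlparse

# Illustrative lists only; the project's actual lists are not shown in this README.
HIGH_QUALITY = ("arxiv.org", "mckinsey.com", "gartner.com")
HIGH_QUALITY_TLDS = (".edu", ".gov")
LOW_QUALITY = ("reddit.com", "amazon.com")  # hypothetical forum/shopping examples
AUTO_FILTER_SCORE = -3


def _matches(host: str, domains: tuple[str, ...]) -> bool:
    return any(host == d or host.endswith("." + d) for d in domains)


def domain_score(url: str) -> int:
    """Score a URL's host: -3 (auto-filtered), 0 (neutral), or +3 (preferred)."""
    host = (urlparse(url).hostname or "").lower()
    if _matches(host, LOW_QUALITY):
        return -3
    if _matches(host, HIGH_QUALITY) or host.endswith(HIGH_QUALITY_TLDS):
        return 3
    return 0


def rank_sources(urls: list[str]) -> list[str]:
    """Drop auto-filtered sources, then sort best-first (stable for ties)."""
    kept = [u for u in urls if domain_score(u) > AUTO_FILTER_SCORE]
    return sorted(kept, key=domain_score, reverse=True)
```

Sorting before scraping means a limited scraping budget is spent on the highest-scoring sources first.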

4. Citation grounding with fuzzy match

Citations are generated AFTER the publisher polish (not before, because translation can corrupt the tags). rapidfuzz.partial_ratio with a 75% threshold tolerates close paraphrase but rejects fabrication. The verifier falls back to difflib if rapidfuzz isn't installed.
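
A rough sketch of the quote check with the optional rapidfuzz import and a difflib fallback. The function names and the fallback's windowing strategy are assumptions for illustration; the fallback only approximates `rapidfuzz.fuzz.partial_ratio` and may score borderline quotes differently.

```python
import re
from difflib import SequenceMatcher

try:  # rapidfuzz is optional, mirroring the fallback described above
    from rapidfuzz import fuzz
except ImportError:
    fuzz = None

QUOTE_THRESHOLD = 75.0  # percent, from the README


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so layout differences don't matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def _partial_ratio_difflib(a: str, b: str) -> float:
    """Approximate partial ratio: best match of the shorter string inside the longer."""
    short, long_ = (a, b) if len(a) <= len(b) else (b, a)
    if not short:
        return 0.0
    best = 0.0
    blocks = SequenceMatcher(None, short, long_, autojunk=False).get_matching_blocks()
    for block in blocks:
        start = max(block.b - block.a, 0)
        window = long_[start:start + len(short)]
        best = max(best, SequenceMatcher(None, short, window, autojunk=False).ratio())
    return best * 100.0


def quote_score(quote: str, source_text: str) -> float:
    q, s = _normalize(quote), _normalize(source_text)
    if fuzz is not None:
        return float(fuzz.partial_ratio(q, s))
    return _partial_ratio_difflib(q, s)


def verify_quote(quote: str, source_pool: list[str],
                 threshold: float = QUOTE_THRESHOLD) -> bool:
    """A quote passes if it matches any source text at or above the threshold."""
    return any(quote_score(quote, text) >= threshold for text in source_pool)
```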

5. Factored Chain-of-Verification (arXiv 2309.11495)

Verification questions are answered in fresh contexts (no chat history), which prevents the model from defending its previous answer. Inconsistencies are flagged for FACT-CHECKER and used to penalize the score.
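
The "factored" part can be illustrated with a small sketch: each question goes out in its own prompt that contains only the sources, never the draft. The `LLM` callable type, the prompt wording, and the exact-match `agree` function are hypothetical stand-ins, and the stub replaces a real model.

```python
from typing import Callable

LLM = Callable[[str], str]  # hypothetical: takes one prompt, returns one completion


def answer_independently(questions: list[str], sources: str, llm: LLM) -> list[str]:
    """Answer each question in its own call: no draft, no history, only the sources."""
    answers = []
    for question in questions:
        prompt = (
            "Answer using ONLY these sources.\n\n"
            f"SOURCES:\n{sources}\n\nQUESTION: {question}"
        )
        answers.append(llm(prompt))
    return answers


def find_inconsistencies(draft_claims: list[str], independent: list[str],
                         agree: Callable[[str, str], bool]) -> list[int]:
    """Return indices where the draft claim and the cold-context answer disagree."""
    return [i for i, (claim, answer) in enumerate(zip(draft_claims, independent))
            if not agree(claim, answer)]


seen_prompts: list[str] = []


def stub_llm(prompt: str) -> str:
    # Stub model: records the prompt and answers from the question text alone.
    seen_prompts.append(prompt)
    question = prompt.rsplit("QUESTION: ", 1)[1]
    return "42.14%" if "CAGR" in question else "not stated"


sources = "The market grows at a CAGR of 42.14%."
questions = ["What is the CAGR?", "Who founded the market?"]
draft_claims = ["42.14%", "Founded by Acme"]

answers = answer_independently(questions, sources, stub_llm)
flagged = find_inconsistencies(draft_claims, answers, lambda c, a: c == a)
print(flagged)
```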

6. Hallucination guard

If zero sources pass filtering, PUBLISHER is never called. The system writes an explicit failure message instead of fabricating content. Earlier versions hallucinated fake academic references when given empty input; this guard makes it impossible.

7. Mixed-model pipeline

Cheap local models (qwen2.5:7b via Ollama) handle mechanical work (filtering, JSON parsing, search planning). A strong remote model (OpenAI-compatible, e.g. gpt-oss-120b) handles judgment-heavy work (synthesis, criticism, citation). This is configurable per role via the strong=True flag.


Quickstart

git clone https://github.com/emreconscience/DeepResearch.git
cd DeepResearch

# Dependencies
pip install -r requirements.txt

# Local LLM (for mechanical roles)
# https://ollama.com
ollama pull qwen2.5:7b

# Configure API keys
cp .env.example .env
# Edit .env (at minimum, set TAVILY_API_KEY)
# (Tavily free tier: 1000 searches/month, no credit card)

# Run (English output by default)
python deep_research_v2.py "AI agent orchestration architecture"

# Turkish output
OUTPUT_LANG=tr python deep_research_v2.py "AI agent orchestration architecture"

Output is written to research_output.md.

Configuration

All via environment variables (see .env.example):

| Variable | Default | Purpose |
| --- | --- | --- |
| TAVILY_API_KEY | empty | Required for web search (DuckDuckGo fallback if missing) |
| OUTPUT_LANG | en | Output language: en or tr |
| REMOTE_API_KEY | empty | Optional OpenAI-compatible API for strong roles |
| REMOTE_BASE_URL | empty | e.g. https://api.openai.com/v1 |
| REMOTE_MODEL | gpt-oss-120b | Model name for remote API |
| MODEL_STRONG | qwen2.5:7b | Local fallback for strong roles |
| MODEL_FAST | qwen2.5:7b | Local model for mechanical roles |

Evaluation

python eval_harness.py             # run full eval set
python eval_harness.py rag_techniques  # run a single test

The eval set in eval_set.json contains 5 golden topics, each with must_cover concepts and an expected_min_score. Metrics:

  • Faithfulness (35%): claims supported by retrieved sources
  • Relevancy (25%): report stays on topic
  • Coverage (25%): expected concepts present (substring match)
  • Hallucination (15%): detection of known bad patterns (e.g., fake academic references)

Each run is compared to the previous eval_latest.json for regression detection. Exit code is 1 if any test regresses by ≥0.5 points.
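
The weighting and the regression rule can be sketched as below. The weights and the 0.5-point threshold come from this section; the 0-10 scale, the assumption that every metric (including hallucination) is scored higher-is-better, and all function names are illustrative guesses rather than the harness's actual code.

```python
WEIGHTS = {"faithfulness": 0.35, "relevancy": 0.25, "coverage": 0.25, "hallucination": 0.15}
REGRESSION_DELTA = 0.5  # points, from the README


def coverage(report: str, must_cover: list[str]) -> float:
    """Share of expected concepts found by substring match, on an assumed 0-10 scale."""
    if not must_cover:
        return 0.0
    lowered = report.lower()
    hits = sum(1 for concept in must_cover if concept.lower() in lowered)
    return 10.0 * hits / len(must_cover)


def overall_score(metrics: dict[str, float]) -> float:
    """Weighted sum of the four metrics (each assumed to be on a 0-10 scale)."""
    return sum(weight * metrics[name] for name, weight in WEIGHTS.items())


def regressions(previous: dict[str, float], current: dict[str, float]) -> list[str]:
    """Names of tests whose score dropped by at least REGRESSION_DELTA."""
    return sorted(name for name, old in previous.items()
                  if name in current and old - current[name] >= REGRESSION_DELTA)


def exit_code(previous: dict[str, float], current: dict[str, float]) -> int:
    return 1 if regressions(previous, current) else 0


previous = {"rag_techniques": 8.0, "other_topic": 7.0}
current = {"rag_techniques": 7.4, "other_topic": 6.8}
print(regressions(previous, current), exit_code(previous, current))
```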

Per-report cost (typical)

| Resource | Amount |
| --- | --- |
| Remote tokens (gpt-oss-120b-class) | ~17-20K (4 strong calls) |
| Local tokens (qwen2.5:7b) | ~30-50K (8+ mechanical calls) |
| Tavily searches | 4-6 queries |
| Wall time | 180-360s |
| Sources retrieved | 20-50 |

If you only use local models: zero external cost, ~300-450s per report, expect lower quality on judgment-heavy roles (CRITIC, SYNTHESIZER).

Example output

A 1-paragraph excerpt from a real run on "AI agent orchestration architecture":

AI agent orchestration is a control layer that brings together multiple specialized agents to automate complex workflows. Centralized or hybrid orchestrators handle task distribution, data sharing, and result aggregation, enabling real-time decision-making across customer service, supply chain, and other domains (https://www.deloitte.com/.../ai-agent-orchestration.html). Market research projects the global AI-agent market growing at 42% CAGR through 2031, reaching USD 57B (https://www.mordorintelligence.com/industry-reports/agentic-ai-market).

With citation block:

[CLAIM 2] AI agent market is projected to grow from $6.96B (2025) to $57.42B (2031).
[QUOTE 2] "The agentic AI market size was valued at USD 6.96 billion in 2025 and estimated to grow from USD 9.89 billion in 2026 to reach USD 57.42 billion by 2031, at a CAGR of 42.14% during the forecast period (2026-2031)."
[SRC 2] https://www.mordorintelligence.com/industry-reports/agentic-ai-market
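
A block in this format is straightforward to parse back into structured records, for example before running quote verification. The parser below is a sketch that assumes one tag per line in exactly the layout shown above; `Citation` and `parse_citations` are hypothetical names.

```python
import re
from dataclasses import dataclass


@dataclass
class Citation:
    index: int
    claim: str
    quote: str
    source: str


_TAG = re.compile(r"^\[(CLAIM|QUOTE|SRC) (\d+)\]\s*(.*)$")


def parse_citations(text: str) -> list[Citation]:
    """Group [CLAIM n]/[QUOTE n]/[SRC n] lines by n; skip incomplete groups."""
    fields: dict[int, dict[str, str]] = {}
    for line in text.splitlines():
        match = _TAG.match(line.strip())
        if not match:
            continue
        tag, index, body = match.group(1), int(match.group(2)), match.group(3).strip()
        if tag == "QUOTE":
            body = body.strip('"')  # drop the surrounding quotation marks
        fields.setdefault(index, {})[tag] = body
    return [
        Citation(index, group["CLAIM"], group["QUOTE"], group["SRC"])
        for index, group in sorted(fields.items())
        if {"CLAIM", "QUOTE", "SRC"} <= group.keys()
    ]


block = """[CLAIM 2] AI agent market is projected to grow from $6.96B (2025) to $57.42B (2031).
[QUOTE 2] "The agentic AI market size was valued at USD 6.96 billion in 2025."
[SRC 2] https://www.mordorintelligence.com/industry-reports/agentic-ai-market
[CLAIM 3] A claim with no quote or source is skipped."""

citations = parse_citations(block)
print(citations[0].index, citations[0].source)
```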

Project status

  • Tested on Python 3.13, WSL2 Ubuntu
  • 8GB VRAM GPU runs qwen2.5:7b comfortably
  • Tavily free tier sufficient for ~50 reports/month
  • ~1500 lines of Python, no framework dependencies (no LangChain/LangGraph)
  • Released under MIT

License

MIT
