Official code for the ACL 2025 paper "🔍 Retrieval Models Aren’t Tool-Savvy: Benchmarking Tool Retrieval for Large Language Models"
Updated Dec 22, 2025 · JavaScript
Official codebase for the ACL 2025 Findings paper: Optimized Text Embedding Models and Benchmarks for Amharic Passage Retrieval.
smallevals: offline retrieval evaluation for RAG systems using tiny QA models, fast on CPU and blazing fast on GPU.
Published PyPI package for arXiv embedding benchmarks, retrieval evaluation, and scientific RAG experiments.
RAG retrieval benchmark runner with JSON reports, Pareto plots, and regression gates for retrieval quality changes.
Deterministic RAG evaluation toolkit: retrieval metrics (recall, precision, MRR), corpus overlap detection, and CI regression gating without model calls.
Open-source retrieval diagnostics toolkit for enterprise RAG pipelines
Bilingual RAG evaluation benchmark for culturally grounded English/Uzbek retrieval
Local-first memory infrastructure for coding workflows: deterministic retrieval, explainable traces, MCP/REST/SDK interfaces, and standalone browser-first operation.
Open multilingual RAG benchmark for retrieval-grounded educational question answering
Research-grade neuro-symbolic RAG framework where retrieval is a policy, not a vector search, built for evaluation, ablation, and reliability analysis.
Search and retrieval workbench with query planning, multi-source retrieval, citation checking, source-trust tiers, and extractive fallback.
A systems-level analysis of static RAG pipelines, isolating ingestion, retrieval, and ranking boundaries to expose structural failure modes before generation.
A controlled experiment evaluating whether hybrid (dense + sparse) retrieval surfaces evidence that dense-only RAG systems misrank, without changing generation behavior.
RAG retrieval quality evaluation and regression testing toolkit with golden sets, recall/MRR metrics, reports, and CI-friendly outputs.
Reproducible CoREB retrieval benchmark snapshot with CI-backed evaluation artifacts and result provenance.
RAG Evaluation Playground: visualize, compare, and evaluate retrieval performance across different chunking strategies, embeddings, and reranking approaches.
Query performance prediction (QPP) for clarification-need prediction in context-grounded multi-turn conversations. Clean implementations of QPP baselines suitable for multi-turn conversational datasets, optionally with ranked documents. Designed to detect ambiguous search queries.
RAG evaluation framework: hit-rate, MRR, faithfulness scoring, and async batch evaluation with golden question datasets
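Several of the projects above mention hit-rate, recall, and MRR computed against a golden set. As a rough illustration of what those metrics mean, here is a minimal sketch in plain Python. It is not taken from any of the listed repositories; the function names, the `golden` and `runs` dictionaries, and the document IDs are all hypothetical.

```python
from typing import Dict, List, Set


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(runs: Dict[str, List[str]],
             golden: Dict[str, Set[str]],
             k: int = 3) -> Dict[str, float]:
    """Average recall@k, hit-rate@k, and MRR over all golden queries."""
    recalls, hits, rrs = [], [], []
    for query_id, relevant in golden.items():
        ranked = runs.get(query_id, [])  # a missing run counts as no results
        recalls.append(recall_at_k(ranked, relevant, k))
        hits.append(1.0 if set(ranked[:k]) & relevant else 0.0)
        rrs.append(reciprocal_rank(ranked, relevant))
    n = len(golden)
    return {
        "recall@k": sum(recalls) / n,
        "hit_rate@k": sum(hits) / n,
        "mrr": sum(rrs) / n,
    }


# Hypothetical golden set (query -> relevant doc IDs) and retriever output.
golden = {"q1": {"d1", "d4"}, "q2": {"d7"}}
runs = {"q1": ["d3", "d1", "d4"], "q2": ["d2", "d5", "d9"]}

report = evaluate(runs, golden, k=3)
print(report)
```

Because the computation involves no model calls, the same inputs always give the same numbers, which is what makes metrics like these usable as a regression gate in CI: a pipeline could fail a build when, say, `mrr` drops below a stored baseline.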