A production-ready Retrieval-Augmented Generation (RAG) pipeline that lets you upload PDF documents and get accurate, cited answers to natural-language questions — with built-in duplicate prevention and optional result reranking.
Imagine uploading your company's 200-page policy manual and then simply asking "What is the leave encashment policy?" — and getting a precise, cited answer back in seconds.
This system makes that possible. It:
- Ingests PDF documents, splits them into smart chunks, and stores them as searchable vectors.
- Deduplicates automatically — the same file uploaded twice is silently skipped, saving cost and time.
- Retrieves the most relevant text chunks for any question using semantic search.
- Reranks those chunks with a second-pass precision model for sharper relevance.
- Generates a clean, cited answer using a large language model.
graph TD
subgraph Ingestion Pipeline
A[Upload PDF via /ingest] --> B[Select Namespace]
B --> C[Compute SHA-256 Hash]
C --> D{Duplicate Exists?}
D -->|Yes| E[Skip — Already Indexed]
D -->|No| F[Extract Text with PyMuPDF]
F --> G[Clean & Preprocess Text]
G --> H[Chunk — 500 tokens, 60-token overlap]
H --> I[Attach Metadata: page, source, hash, namespace]
I --> J[Embed & Upsert to Pinecone]
end
subgraph Query Pipeline
K[User Query via /query] --> L[Select Namespace]
L --> M[Validate Request]
M --> N{Rerank Enabled?}
N -->|No| O[Semantic Retrieval — Top 5 Chunks]
N -->|Yes| P[Retrieve + Rerank — Top 4 Chunks]
O --> Q[Build Context with Citations]
P --> Q
Q --> R[LLM Generation via openai/gpt-oss-120b]
R --> S[Answer with Inline Citations]
end
J -. Namespace-Isolated Storage .- O
J -. Namespace-Isolated Storage .- P
| What Was Built | Why It Matters |
|---|---|
| SHA-256 deduplication layer | Prevents redundant vector upserts; cuts embedding API costs on repeat ingestions |
| Pinecone native reranker | Boosts precision of top retrieved chunks — fewer irrelevant results reach the LLM |
| Page-level citation metadata | Every LLM answer is traceable to an exact page, making the system auditable and trustworthy |
| Namespace isolation | Multiple document collections can coexist without cross-contamination |
| Async-ready FastAPI backend | Handles concurrent requests and scales horizontally |
Problem: Re-uploading the same PDF created redundant vectors, inflating storage costs and degrading retrieval quality.
Solution: Computed a SHA-256 hash of each document at ingest time and stored it as metadata in Pinecone. Before any chunking or embedding, the system checks if the hash already exists in the target namespace and short-circuits if so.
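The hash-then-check flow described above can be sketched as follows. This is a minimal illustration rather than the project's own code: `file_sha256`, `should_ingest`, and the in-memory `indexed_hashes` mapping are hypothetical names, and the mapping stands in for the metadata lookup against Pinecone.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path, block_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in blocks to bound memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def should_ingest(doc_hash: str, namespace: str, indexed_hashes: dict[str, set[str]]) -> bool:
    """Return False when the hash is already recorded for the namespace.

    `indexed_hashes` is a hypothetical stand-in for a metadata lookup in the
    vector store. A new hash is recorded so later uploads short-circuit.
    """
    seen = indexed_hashes.setdefault(namespace, set())
    if doc_hash in seen:
        return False  # duplicate: skip chunking and embedding entirely
    seen.add(doc_hash)
    return True
```

Because the check is keyed by namespace, the same file can still be indexed once in each collection, which matches the namespace-scoped lookup described in the solution.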
Problem: Vector cosine similarity alone sometimes surfaces chunks that are topically adjacent but not the best answer.
Solution: Integrated Pinecone's native reranking model as a post-retrieval step. The top 10 retrieved candidates are re-scored by a cross-encoder and narrowed to the top 4, so fewer off-target chunks reach the LLM.
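The two-stage shape (retrieve a wider candidate set, re-score it, keep the best few) can be sketched as below. The sketch does not call Pinecone: `rerank` and `token_overlap` are hypothetical names, and the toy overlap score only stands in for a cross-encoder so the example runs on its own.

```python
from typing import Callable

Candidate = tuple[str, float]  # (chunk text, first-pass similarity score)


def rerank(
    query: str,
    candidates: list[Candidate],
    score_fn: Callable[[str, str], float],
    top_n: int = 4,
) -> list[Candidate]:
    """Re-score first-pass candidates with `score_fn` and keep the best `top_n`."""
    rescored = [(text, score_fn(query, text)) for text, _ in candidates]
    # The sort is stable, so ties keep their first-pass order.
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored[:top_n]


def token_overlap(query: str, text: str) -> float:
    """Toy relevance score: fraction of query tokens that appear in the text.

    Purely illustrative. A real cross-encoder scores the (query, text) pair jointly.
    """
    q = set(query.lower().split())
    return len(q & set(text.lower().split())) / len(q) if q else 0.0
```

Note that the first-pass score is discarded after reranking: the second-pass score alone decides the final order.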
Problem: Naively splitting text at fixed character counts broke sentences and paragraphs across PDF pages, losing context.
Solution: Implemented page-aware chunking with a 60-token overlap window, ensuring that each chunk retains enough adjacent context and that no key sentence is silently dropped at a boundary.
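One way to read "page-aware chunking with overlap" is a sliding window over the whole document that remembers which page each token came from. The sketch below takes that reading. It is an assumption, not the project's implementation: `Chunk` and `chunk_pages` are hypothetical names, and whitespace splitting stands in for a real tokenizer.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    page: int  # page number of the chunk's first token (1-based)


def chunk_pages(pages: list[str], chunk_size: int = 500, overlap: int = 60) -> list[Chunk]:
    """Slide a fixed-size window over all pages, tracking each token's page."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    # Flatten to (token, page_number) pairs so a window may span a page break.
    tagged = [(tok, num) for num, page in enumerate(pages, start=1) for tok in page.split()]
    step = chunk_size - overlap
    chunks: list[Chunk] = []
    for start in range(0, len(tagged), step):
        window = tagged[start:start + chunk_size]
        chunks.append(Chunk(" ".join(tok for tok, _ in window), window[0][1]))
        if start + chunk_size >= len(tagged):
            break  # this window already reached the end of the document
    return chunks
```

With the defaults, consecutive windows start 440 tokens apart, so the last 60 tokens of one chunk are repeated as the first 60 of the next, including across a page boundary.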
Problem: LLMs can generate plausible-sounding but fabricated information with no way to trace the source.
Solution: Every chunk upserted into Pinecone carries structured metadata (source, page, namespace). The retrieval step passes this metadata alongside the text into the LLM prompt, which is instructed to cite its sources. The final response contains inline [source: filename, page N] citations.
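A minimal sketch of how chunk metadata might be threaded into the prompt is shown below. The function name `build_prompt`, the dict keys, and the instruction wording are illustrative assumptions. Only the `[source: filename, page N]` citation format comes from the description above.

```python
def build_prompt(question: str, chunks: list[dict]) -> str:
    """Assemble a prompt whose context blocks carry their own citation tags."""
    blocks = [
        f"[source: {c['source']}, page {c['page']}]\n{c['text']}"
        for c in chunks
    ]
    context = "\n\n".join(blocks)
    return (
        "Answer the question using only the context below. "
        "After each claim, cite the supporting block as [source: filename, page N]. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```

Putting the tag directly above each block means the model can copy a citation verbatim instead of reconstructing it from memory.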
- RAG is only as good as its chunking strategy. Chunk size and overlap profoundly impact retrieval quality — too small loses context, too large dilutes relevance scores.
- Reranking is a high-leverage, low-cost upgrade. Adding a second-pass reranker on top of ANN retrieval consistently beats pure semantic search with minimal added latency.
- Metadata is a first-class citizen. Treating citations as a structural requirement — not an afterthought — forces cleaner ingestion design and makes the system genuinely production-trustworthy.
- Namespace isolation unlocks multi-tenancy. Designing around namespaces from day one means the system can serve multiple clients or document domains without re-architecting.
- Cost visibility matters at scale. Deduplication isn't just an operational nicety — at scale, redundant embeddings become a real API cost line item.
- Async ingestion queue — Offload chunking and embedding to Celery/RabbitMQ workers so large multi-document uploads don't block the API thread.
- Hybrid search (BM25 + dense) — Combine keyword search with vector search for documents that contain exact codes, IDs, or jargon that semantic search may miss.
- Document versioning — Track when a document is updated so the system can replace stale vectors rather than requiring a manual re-ingest.
- Evaluation: Integrate `ragas` or a custom eval loop to measure faithfulness, answer relevancy, and context precision automatically on every code change.
- Multi-modal support: Extend ingestion to handle tables, diagrams, and scanned PDFs.
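| Layer | Technology |
|---|---|
| API Framework | FastAPI |
| Vector Database | Pinecone |
| Embedding Model | |
| Reranker | Pinecone native reranker |
| LLM | openai/gpt-oss-120b |
| PDF Processing | PyMuPDF |
| Config Management | Pydantic Settings |
| Deduplication | SHA-256 hashing |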
- Python 3.11+
- Pinecone account with an active API key
- OpenAI API key
# 1. Clone the repository
git clone https://github.com/questinrest/rag-pipeline-reranker
cd rag-pipeline-reranker
# 2. Create a virtual environment
py -3.11 -m venv .venv
.venv\Scripts\activate # Windows
# source .venv/bin/activate # macOS / Linux
# 3. Install dependencies
pip install -r requirements.txt
Create a .env file in the project root:
PINECONE_API_KEY=<your-pinecone-api-key>
OPENAI_API_KEY=<your-openai-api-key>
Configuration is validated at startup via Pydantic Settings (`src/config.py`), so missing or malformed keys raise an explicit error rather than a silent runtime failure.
cd code
uvicorn src.api:app --reload
- API base URL: http://127.0.0.1:8000
- Interactive Swagger docs: http://127.0.0.1:8000/docs
The `/ingest` endpoint parses a PDF, chunks it, embeds it, and stores the vectors in Pinecone. Files that have already been ingested are skipped (deduplication via SHA-256).
{
"file_path": "C:/path/to/document.pdf"
}
Response: Confirmation message with the number of chunks upserted, or a notice that the document was already indexed.
The `/query` endpoint runs a semantic search over the indexed documents and returns an LLM-generated answer with source citations.
{
"query": "What are the rules for employee onboarding?",
"rerank": true
}
`rerank: true` activates the precision reranking pass (recommended for most use cases).
Response: A natural-language answer with inline page-level citations.
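For scripted use, the two request bodies above can be sent from Python with the standard library alone. The sketch below only builds the requests, so it runs without a server. It assumes both endpoints accept JSON via POST and respond with JSON, which the bodies above suggest but which is not stated outright. `build_post` is a hypothetical helper name.

```python
import json
from urllib import request

BASE_URL = "http://127.0.0.1:8000"


def build_post(path: str, payload: dict) -> request.Request:
    """Build (but do not send) a JSON POST request for the local API."""
    return request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",  # assumption: both endpoints are POST routes
    )


ingest_req = build_post("/ingest", {"file_path": "C:/path/to/document.pdf"})
query_req = build_post(
    "/query",
    {"query": "What are the rules for employee onboarding?", "rerank": True},
)

# To send against a running server (assumes a JSON response body):
# with request.urlopen(query_req) as resp:
#     print(json.loads(resp.read()))
```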
precision-rag-with-deduplication/
├── code/
│ └── src/
│ ├── api.py # FastAPI route definitions
│ ├── config.py # Pydantic settings & env loading
│ ├── data_models.py # Request/response Pydantic schemas
│ ├── ingestion.py # PDF extraction, chunking, hashing
│ ├── embedding.py # Pinecone upsert & embed logic
│ ├── retrieval.py # Vector similarity search
│ ├── reranker.py # Post-retrieval reranking pass
│ ├── generation.py # LLM prompt construction & call
│ └── utils.py # Shared helper utilities
├── docs/ # Sample PDFs for testing
├── requirements.txt
├── .env # Local secrets (not committed)
└── README.md
MIT License — feel free to fork, extend, and build on top of this system.