Enterprise Knowledge Base Q&A System (RAG Implementation)

A production-ready Retrieval-Augmented Generation (RAG) application that lets employees query proprietary enterprise documents through a natural language interface and receive grounded, citation-backed answers. Anchoring generation to retrieved passages reduces hallucinations and avoids the limitations of keyword-based enterprise search.

🌐 Live App: https://enterprise-rag-system.streamlit.app

Business Problem

Traditional enterprise search systems are constrained by rigid keyword matching: they return irrelevant results when users do not use the exact terminology indexed in the system. General-purpose Large Language Models (LLMs), on the other hand, reason fluently but are trained on public internet data and have no knowledge of an organization's internal, proprietary, or confidential documents. When queried about such data, they tend to hallucinate, generating plausible-sounding but factually incorrect responses.

This application solves both problems simultaneously by combining semantic vector retrieval with controlled LLM generation, anchoring every response exclusively to private enterprise data.

Solution Overview

This system implements a full RAG (Retrieval-Augmented Generation) pipeline consisting of two primary phases:

Ingestion Phase: Proprietary PDF documents are uploaded, parsed, split into overlapping word-based chunks, and transformed into high-dimensional vector embeddings that are stored in a local FAISS vector database.

Query Phase: A user's natural language question is embedded in real time and compared against the FAISS index via similarity search. The most relevant document segments are retrieved and injected into a strict LLM prompt that instructs the model not to use any knowledge outside the provided context, which keeps responses grounded in the retrieved text and substantially reduces the risk of hallucination.

Tech Stack

| Layer | Technology | Detail |
| --- | --- | --- |
| Language | Python 3.x | Core application language |
| Frontend | Streamlit | Interactive web UI with session state management |
| LLM (Generation) | `gemini-3.1-flash-lite-preview` | Low-latency generative model via Google GenAI |
| Embedding Model | `gemini-embedding-001` | Produces 3072-dimensional dense vector embeddings |
| Vector Database | FAISS (`IndexFlatL2`) | Facebook AI Similarity Search, in-memory L2 index |
| PDF Parsing | PyPDF2 | Stream-based text extraction from uploaded PDFs |
| Vector Math | NumPy | Float32 array construction for FAISS compatibility |
| Styling | Custom CSS + TOML | Enterprise-themed UI with chat bubble differentiation |
| Deployment | Streamlit Community Cloud | Zero-infrastructure cloud deployment |

System Architecture & Workflow

The application operates through a modular, multi-stage pipeline that governs the full lifecycle of enterprise data — from raw document upload to final answer generation.

┌─────────────────────────────────────────────────────────────────┐
│                        INGESTION PHASE                          │
│                                                                 │
│  PDF Upload → PyPDF2 Extraction → Recursive Word Chunking       │
│       → Batch Embedding (gemini-embedding-001, 3072-dim)        │
│               → FAISS IndexFlatL2 Storage                       │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│                          QUERY PHASE                            │
│                                                                 │
│  User Question → Real-time Query Embedding                      │
│       → FAISS Similarity Search (Top-K = 10)                    │
│           → Context Injection into Strict RAG Prompt            │
│               → gemini-3.1-flash-lite-preview Generation        │
│                   → Grounded Answer + Citation Expander         │
└─────────────────────────────────────────────────────────────────┘

Stage 1: PDF Ingestion & Text Extraction

Uploaded PDF files are processed using PyPDF2.PdfReader. The system iterates through all pages, extracts raw text, and sanitizes it using UTF-8 encoding with errors='ignore' to handle malformed or non-standard characters gracefully. Pages that yield no extractable text are skipped automatically.
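
The extraction loop described above might look roughly like the sketch below. The helper names (`clean_page_text`, `collect_page_texts`, `extract_pdf_text`) are hypothetical and not taken from the project; `PdfReader`, `reader.pages`, and `page.extract_text()` are the PyPDF2 calls the stage relies on. The PyPDF2 import is deferred so the cleaning logic can be exercised without the library installed.

```python
import io


def clean_page_text(raw):
    """Drop characters that cannot be encoded as UTF-8; return '' for empty pages."""
    if not raw:
        return ""
    return raw.encode("utf-8", errors="ignore").decode("utf-8").strip()


def collect_page_texts(pages):
    """Join the text of every page that yields extractable text."""
    texts = []
    for page in pages:
        text = clean_page_text(page.extract_text())
        if text:  # skip pages with no extractable text (e.g. scanned images)
            texts.append(text)
    return "\n".join(texts)


def extract_pdf_text(pdf_bytes):
    """Parse an uploaded PDF held in memory (hypothetical helper name)."""
    from PyPDF2 import PdfReader  # deferred import: only needed for real PDFs

    reader = PdfReader(io.BytesIO(pdf_bytes))
    return collect_page_texts(reader.pages)
```

In a Streamlit app, `extract_pdf_text(uploaded_file.getvalue())` would be one plausible way to feed an uploaded file into this helper.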

Stage 2: Recursive Word-Based Chunking

Extracted text is split into overlapping segments using a word-based recursive chunking strategy:

Chunk Size: 800 words per segment
Overlap: 150 words between adjacent chunks

The 150-word overlap is a deliberate architectural choice. It ensures that information spanning a chunk boundary — such as a conclusion that references a premise from the previous segment — is captured in at least one chunk, preserving semantic continuity during retrieval.
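
To make the size and overlap arithmetic concrete, here is a minimal sliding-window sketch. It is an illustration under stated assumptions, not the project's chunker: the function name `chunk_words` is hypothetical, and any recursive splitting logic the real implementation applies is not reproduced.

```python
def chunk_words(text, chunk_size=800, overlap=150):
    """Split text into word windows of chunk_size that share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # with the defaults, each window starts 650 words later
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reaches the end of the text
    return chunks
```

With the defaults, a 2,000-word document yields three chunks starting at words 0, 650, and 1,300, and the last 150 words of each chunk reappear as the first 150 words of the next.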

Stage 3: Batch-Optimized Embedding Generation

To minimize API latency and reduce the number of round-trip network calls, the system groups chunks into batches of up to 100 and sends each batch in a single API request to the gemini-embedding-001 model. Each chunk is transformed into a 3072-dimensional float32 vector. The resulting embeddings are returned as a flat list and paired with their corresponding text chunks.
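
The batching step can be sketched independently of the API client. In the sketch below, `batched` and `embed_chunks` are hypothetical names, and `embed_batch` is an injected callable that maps a list of strings to a list of vectors. With the google-genai SDK it could plausibly wrap `client.models.embed_content(model="gemini-embedding-001", contents=batch)` and read `.values` from each entry of `response.embeddings`, but how the project wires this call is an assumption.

```python
def batched(items, batch_size=100):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_chunks(chunks, embed_batch, batch_size=100):
    """Embed chunks batch by batch and return one flat list of vectors.

    embed_batch is an injected callable: list[str] -> list[list[float]].
    """
    embeddings = []
    for batch in batched(chunks, batch_size):
        vectors = embed_batch(batch)
        if len(vectors) != len(batch):
            raise ValueError("embedding count does not match batch size")
        embeddings.extend(vectors)  # keeps embeddings aligned with chunk order
    return embeddings
```

For 250 chunks this makes three requests (100, 100, and 50 chunks) instead of 250, and the returned list lines up index by index with the input chunks.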

Stage 4: FAISS Vector Indexing

The embeddings are stacked into a NumPy float32 array and added to a FAISS IndexFlatL2 index initialized with dimension 3072. This index performs exact Euclidean distance (L2) comparisons, guaranteeing that the nearest neighbours returned are the mathematically most similar vectors — ensuring retrieval precision without approximation error. The text chunks are stored in a parallel in-memory list, indexed identically to their embedding counterparts in FAISS.
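
To illustrate what an exact L2 search computes, the following pure-Python brute-force sketch compares the query with every stored vector. It is not FAISS and is far slower; the function name `exact_l2_search` is hypothetical. Two details mirror FAISS behaviour: `IndexFlatL2` reports squared L2 distances, and missing results are marked with index -1. The `inf` distance used for padding here is a choice made for this sketch.

```python
def exact_l2_search(vectors, query, top_k=10):
    """Return (index, squared_distance) pairs for the top_k nearest vectors.

    Every stored vector is compared with the query, so the ranking is exact.
    Missing slots are padded with index -1 when fewer than top_k vectors exist.
    """
    scored = []
    for idx, vec in enumerate(vectors):
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((idx, dist))
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    hits = scored[:top_k]
    hits += [(-1, float("inf"))] * (top_k - len(hits))
    return hits
```

Because the index position returned for each hit equals the position at which the vector was added, the same integer can be used to look up the matching text in the parallel chunk list.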

Stage 5: Real-Time Query Retrieval

When a user submits a question, it is embedded on-the-fly using the same gemini-embedding-001 model. The resulting 3072-dimensional query vector is passed to index.search(), which returns the Top-10 nearest neighbours by L2 distance. Invalid indices (returned as -1 by FAISS when the index has fewer entries than top_k) are filtered out before the chunks are assembled into a context string.
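
The index filtering can be sketched as follows. `index.search(query, k)` in FAISS returns a pair of arrays (distances, indices), each with one row per query; the helper below takes one such row of indices. The name `select_chunks` and the exact separator string `"\n---\n"` are assumptions made for this sketch.

```python
def select_chunks(result_indices, chunk_store):
    """Return stored chunks for valid result indices, preserving rank order."""
    selected = []
    for i in result_indices:
        i = int(i)  # FAISS yields NumPy integers; plain ints work too
        if 0 <= i < len(chunk_store):  # drops the -1 placeholders
            selected.append(chunk_store[i])
    return selected


store = ["chunk A", "chunk B", "chunk C"]
# A top-4 search against a 3-vector index might return one or more -1 entries.
context = "\n---\n".join(select_chunks([2, 0, -1, -1], store))
```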

Stage 6: Grounded Answer Generation (The RAG Loop)

The retrieved chunks are joined with separator markers (---) and injected into a structured prompt sent to gemini-3.1-flash-lite-preview. The prompt enforces five strict rules on the model:

Use only the provided context — no external knowledge.
Do not hallucinate or fabricate information.
Structure responses with paragraphs and bullet points for readability.
Extract maximum relevant detail from the context to fully address the query.
Explicitly state when the context does not contain sufficient information.
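
A prompt built from these five rules might be assembled as in the sketch below. The rule text paraphrases the list above; the surrounding wording, the name `build_rag_prompt`, and the `"\n---\n"` separator are assumptions rather than the project's actual prompt.

```python
RULES = [
    "Use only the provided context; do not use external knowledge.",
    "Do not hallucinate or fabricate information.",
    "Structure the response with paragraphs and bullet points for readability.",
    "Extract as much relevant detail from the context as the query requires.",
    "State explicitly when the context does not contain sufficient information.",
]


def build_rag_prompt(question, chunks):
    """Combine the rules, the retrieved chunks, and the question into one prompt."""
    context = "\n---\n".join(chunks)
    rules = "\n".join(f"{n}. {rule}" for n, rule in enumerate(RULES, start=1))
    return (
        "You are an enterprise knowledge base assistant.\n\n"
        f"Rules:\n{rules}\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

Keeping the rules in a list makes it easy to audit or extend them without touching the assembly logic.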

The answer is rendered in the Streamlit chat interface alongside a collapsible "View Retrieved Context Sources" expander showing each retrieved chunk numbered by relevance rank, providing complete auditability of the retrieval process.

Project Structure

Enterprise-RAG-QA-System/
│
├── .streamlit/
│   └── config.toml           # Streamlit theme (primary green #0A5C36) and server config
│
├── backend/
│   └── rag_pipeline.py       # EnterpriseRAG class: extraction, chunking, embedding, FAISS, generation
│
├── static/
│   └── style.css             # Custom CSS: chat bubbles, sidebar, typography overrides
│
├── app.py                    # Streamlit entry point: UI, session state, caching, error handling
│
└── requirements.txt          # Production dependencies

File Responsibilities

backend/rag_pipeline.py — The core intelligence layer. Contains the EnterpriseRAG class with four methods:

get_embeddings() — Handles extraction, chunking, and batch embedding. Returns a (chunks, embeddings) tuple.
add_to_index() — Adds validated embeddings to the FAISS index and extends the parallel chunk store.
retrieve_context() — Embeds a query, searches FAISS, and returns the top-k matching text chunks.
generate_answer() — Assembles the strict RAG prompt and calls the Gemini generation model.

app.py — The presentation and orchestration layer. Manages Streamlit session state, the st.cache_data caching decorator, file deduplication logic, sidebar UI, the main chat loop, and all error handling.

static/style.css — Overrides Streamlit's default component styles for chat messages, sidebar, and expanders to apply the enterprise visual identity.

.streamlit/config.toml — Declares the Streamlit theme (light base, primary green #0A5C36) and enables static file serving for CSS injection.

Key Engineering Decisions

1. Content-Hash-Based Caching (`st.cache_data`)

The most expensive operation in this system is the embedding API call. The get_cached_embeddings() function is decorated with @st.cache_data, and critically, it caches on file bytes content — not the filename. This means:

Re-uploading a file with the same name but different content correctly triggers a fresh API call.
Uploading the same file under a different filename is still served from cache.
The cache persists across Streamlit reruns and is shared by all sessions served by the same app process, which reduces API costs in multi-user or repeated-upload scenarios.

A temporary EnterpriseRAG instance is used inside the cached function to isolate the embedding step from the global FAISS index, ensuring the cache returns only pure data (chunks, embeddings) without side effects.
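
`st.cache_data` derives its cache key by hashing the decorated function's arguments, which is why passing the raw file bytes gives content-based caching. The sketch below mimics that idea with a plain dictionary and SHA-256; it is an illustration only, not Streamlit's implementation, and `cached_embeddings` is a hypothetical name.

```python
import hashlib

_cache = {}


def cached_embeddings(file_bytes, compute):
    """Return compute(file_bytes), reusing the result for identical content."""
    key = hashlib.sha256(file_bytes).hexdigest()  # depends on the bytes, not the filename
    if key not in _cache:
        _cache[key] = compute(file_bytes)  # the expensive embedding step runs once per content
    return _cache[key]
```

Two uploads with different names but identical bytes map to the same key, while an edited file under an unchanged name maps to a new one.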

2. FAISS Index & Session State Persistence

The EnterpriseRAG engine is instantiated once and stored in st.session_state.rag_engine. This architectural choice ensures that the FAISS index, which is held entirely in memory, is not destroyed and rebuilt on every Streamlit UI interaction or rerun. The chat history and document ingestion flag are similarly persisted in session state, enabling multi-turn conversational continuity without re-processing.

3. Duplicate Document Guard

A st.session_state.processed_files set tracks the filenames of all successfully ingested documents. Before processing any uploaded file, the application checks this set and skips files that have already been embedded and added to the FAISS index. This prevents vector duplication, where the same document is indexed twice, appears twice in retrieval results, and artificially inflates its influence on generated answers. Because the guard keys on filename rather than content, a modified file re-uploaded under an already-processed name is skipped within the same session.
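
The guard can be sketched with a plain dictionary standing in for `st.session_state`. The function name `ingest_new_files` and the `ingest` callable are hypothetical; only the filename-keyed set comes from the description above.

```python
def ingest_new_files(uploaded, state, ingest):
    """Ingest only files whose name is not yet in state['processed_files'].

    uploaded: list of (filename, file_bytes) pairs.
    state: dict standing in for st.session_state.
    ingest: callable that embeds and indexes one file's bytes.
    """
    processed = state.setdefault("processed_files", set())
    added = []
    for name, data in uploaded:
        if name in processed:
            continue  # already indexed in this session
        ingest(data)
        processed.add(name)  # record the name only after ingestion succeeded
        added.append(name)
    return added
```

Recording the name after `ingest` returns means a file whose ingestion raised an error is not marked as processed and can be retried.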

4. Graceful Rate-Limit Handling (429 Errors)

All API-calling code paths — both the ingestion pipeline and the generation step — are wrapped in try/except blocks that specifically detect 429 status codes. Rather than surfacing a raw exception, the UI presents a clear, user-friendly message distinguishing quota exhaustion from unexpected system failures, which is essential for a deployed, multi-user application.
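
How the project detects a 429 is not shown here, so the following is a sketch under assumptions: it checks a numeric `code` attribute on the exception (which API error classes commonly expose) and falls back to looking for "429" in the message. The name `describe_api_error` and both message strings are hypothetical.

```python
def describe_api_error(exc):
    """Turn an exception from an API call into a user-facing message."""
    code = getattr(exc, "code", None)
    if code == 429 or "429" in str(exc):
        return "API quota or rate limit reached. Please wait a moment and try again."
    return f"Unexpected system error: {exc}"
```

In the UI layer this would typically sit inside an `except Exception as exc:` branch, with the returned string passed to `st.error`.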

5. `IndexFlatL2` for Exact Retrieval Precision

FAISS offers both exact and approximate nearest-neighbour search indices. This system deliberately uses IndexFlatL2 — an exact L2 search — rather than an approximate index such as IndexIVFFlat. For an enterprise Q&A system where retrieval accuracy is paramount (a missed or wrong chunk directly degrades answer quality), exact search is the correct tradeoff at the document scales typical of enterprise knowledge bases.
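
One way to judge whether exact search fits a given corpus is to estimate the raw storage of a flat index, which keeps every vector uncompressed as float32 and scans all of them on each query. The helper name `flat_index_bytes` is hypothetical, and the figure excludes the text chunks and any library overhead.

```python
def flat_index_bytes(num_vectors, dimension=3072, bytes_per_value=4):
    """Raw storage for a flat float32 index: one full vector per entry."""
    return num_vectors * dimension * bytes_per_value


per_vector = flat_index_bytes(1)          # 3072 * 4 = 12,288 bytes
ten_thousand = flat_index_bytes(10_000)   # 122,880,000 bytes, roughly 117 MiB
```

Both memory and per-query work grow linearly with the number of vectors, so an approximate index becomes worth considering only once the corpus outgrows what a single process can comfortably scan.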

UI & Design System

The interface is built with an Enterprise-First aesthetic using a two-layer styling system: Streamlit's native TOML theming for global configuration, and a custom CSS file for component-level overrides.

Theme Configuration (`.streamlit/config.toml`)

| Property | Value | Purpose |
| --- | --- | --- |
| `base` | `light` | Light mode foundation |
| `primaryColor` | `#0A5C36` | Enterprise green for buttons, sliders, and active elements |
| `backgroundColor` | `#FFFFFF` | Main content area |
| `secondaryBackgroundColor` | `#F4F6F8` | Sidebar and secondary surfaces |
| `textColor` | `#1E293B` | High-contrast dark slate for readability |
| `font` | `sans serif` | Clean, professional typeface |
| `enableStaticServing` (set under `[server]`, not `[theme]`) | `true` | Serves files from the `static/` directory so the custom CSS can be loaded |
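
Assembled into a file, the values in the table above would look roughly like this. It is a reconstruction from the table, not a copy of the project's `config.toml`, which may contain further settings; the section placement follows Streamlit's configuration layout, where theme options live under `[theme]` and `enableStaticServing` under `[server]`.

```toml
[theme]
base = "light"
primaryColor = "#0A5C36"
backgroundColor = "#FFFFFF"
secondaryBackgroundColor = "#F4F6F8"
textColor = "#1E293B"
font = "sans serif"

[server]
enableStaticServing = true
```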

Chat Message Differentiation (`static/style.css`)

Visual distinction between conversational roles is enforced through left-border colour coding on a shared card treatment:

User messages: border-left: 4px solid #0A5C36 (enterprise green) on a #F8FAFC background
Assistant messages: border-left: 4px solid #1E88E5 (professional blue) on white

Both message types share a unified card treatment — border-radius: 10px, subtle box-shadow, and a 1px solid #E2E8F0 border — maintaining visual hierarchy without distraction.

System Diagnostics Panel

A dedicated sidebar section provides real-time observability into the RAG engine's state:

Total Vectors in FAISS: Live count sourced directly from index.ntotal, indicating the current depth and breadth of the active knowledge base.
Active LLM: Displays the exact model identifier powering response generation.

Local Setup & Installation

Prerequisites

Python 3.8 or higher
A valid Google Gemini API Key (obtainable from Google AI Studio)

Steps

1. Clone the repository:

git clone https://github.com/Avik-Das-567/Enterprise-RAG-QA-System.git
cd Enterprise-RAG-QA-System

2. Create and activate a virtual environment:

python -m venv venv
source venv/bin/activate        # On Windows: venv\Scripts\activate

3. Install dependencies:

pip install -r requirements.txt

4. Configure your API key:

Create the Streamlit secrets file:

mkdir -p .streamlit
touch .streamlit/secrets.toml

Add your key to .streamlit/secrets.toml:

GEMINI_API_KEY = "your_gemini_api_key_here"

⚠️ Important: Never commit secrets.toml to version control. Add it to your .gitignore.

5. Run the application:

streamlit run app.py

The application will be available at http://localhost:8501.

Usage

In the sidebar, upload one or more proprietary PDF documents.
Click "Process & Ingest Documents" to extract, chunk, embed, and index the content into FAISS.
Once the success message confirms the number of vectors added, use the chat input to ask natural language questions.
Expand "View Retrieved Context Sources" beneath any answer to inspect the exact document segments used to generate it.

Configuration

All core RAG parameters are centralized in rag_pipeline.py and can be tuned without touching application logic:

| Parameter | Location | Default | Description |
| --- | --- | --- | --- |
| `chunk_size` | `rag_pipeline.py` | `800` | Number of words per text chunk |
| `overlap` | `rag_pipeline.py` | `150` | Word overlap between adjacent chunks |
| `batch_size` | `rag_pipeline.py` | `100` | Max chunks per embedding API call |
| `dimension` | `rag_pipeline.py` | `3072` | Embedding vector dimensionality |
| `top_k` | `rag_pipeline.py` | `10` | Number of chunks retrieved per query |
| `embedding_model` | `rag_pipeline.py` | `gemini-embedding-001` | Embedding model identifier |
| `generation_model` | `rag_pipeline.py` | `gemini-3.1-flash-lite-preview` | Generation model identifier |
| `primaryColor` | `config.toml` | `#0A5C36` | Streamlit UI primary accent colour |
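
One way to keep the pipeline parameters together and validated is a small frozen dataclass, sketched below. The class name `RAGConfig` is hypothetical; the project may instead define these values as constants or constructor arguments in `rag_pipeline.py`. The defaults are the ones listed in the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RAGConfig:
    """Tunable RAG parameters with the defaults from the table above."""
    chunk_size: int = 800
    overlap: int = 150
    batch_size: int = 100
    dimension: int = 3072
    top_k: int = 10
    embedding_model: str = "gemini-embedding-001"
    generation_model: str = "gemini-3.1-flash-lite-preview"

    def __post_init__(self):
        # Reject combinations that would break chunking or retrieval.
        if not 0 <= self.overlap < self.chunk_size:
            raise ValueError("overlap must be smaller than chunk_size")
        if min(self.batch_size, self.dimension, self.top_k) < 1:
            raise ValueError("batch_size, dimension and top_k must be positive")
```

Note that `dimension` must match the output size of the chosen embedding model, so changing `embedding_model` usually means changing `dimension` as well and rebuilding the index.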

Dependencies

streamlit       # Web application framework and UI components
google-genai    # Google Generative AI SDK (Gemini embedding + generation)
faiss-cpu       # Facebook AI Similarity Search — CPU-optimised vector index
numpy           # Numerical arrays for FAISS-compatible float32 vector operations
PyPDF2          # PDF parsing and page-level text extraction
