📚 Bengali & English RAG Retrieval System

This project implements a Retrieval-Augmented Generation (RAG) pipeline that answers questions about a provided PDF document containing Bengali and English text. It combines OCR-based text extraction, recursive chunking, hybrid retrieval (semantic and lexical search), and cross-encoder re-ranking to select relevant context for a Large Language Model (LLM).

🚀 Setup Guide

Follow these steps to set up and run the RAG retrieval system locally.

Prerequisites

Python 3.10 or higher
An OpenAI API key

Installation Steps

Clone the repository:

git clone https://github.com/TanzirR/Multilingual-Retrieval-Augmented-Generation-RAG-System.git
cd Multilingual-Retrieval-Augmented-Generation-RAG-System

Create a virtual environment (recommended):
```
python -m venv venv
```
Activate the virtual environment:
- Windows: .\venv\Scripts\activate
- macOS/Linux: source venv/bin/activate
Install Python dependencies:
```
pip install -r requirements.txt
```
API Setup: Create a file named .env in the root directory of the project. Add your OpenAI API key to this file.
```
OPENAI_API_KEY="your_openai_api_key_here"
```
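For reference, the key can be read back at runtime roughly as follows. This is a minimal sketch using only the standard library; the helper name `load_env_file` is hypothetical, and the project itself may load the file differently (for example through a package such as python-dotenv).

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Hypothetical helper: copy KEY=VALUE lines from a .env file into os.environ."""
    env_path = Path(path)
    if not env_path.exists():
        return {}
    loaded = {}
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        # Drop surrounding whitespace and quotes around the value.
        value = value.strip().strip('"').strip("'")
        # Variables already set in the environment take precedence.
        os.environ.setdefault(key, value)
        loaded[key] = value
    return loaded
```

Calling `load_env_file()` at the start of a script makes `os.environ["OPENAI_API_KEY"]` available to any code that runs afterwards.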

Running the Pipeline

Run main.py to execute the entire RAG pipeline. It takes one required argument: the path to the PDF file (directory and file name).

python main.py ./data/document.pdf
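A script invoked this way needs to check its single argument before starting the pipeline. The sketch below shows one way to do that; `parse_pdf_argument` is a hypothetical helper and is not taken from the repository's main.py.

```python
import sys
from pathlib import Path


def parse_pdf_argument(argv):
    """Hypothetical helper: validate and return the PDF path from the command line."""
    if len(argv) != 2:
        raise SystemExit("Usage: python main.py <path/to/document.pdf>")
    pdf_path = Path(argv[1])
    if pdf_path.suffix.lower() != ".pdf":
        raise SystemExit(f"Expected a .pdf file, got: {pdf_path}")
    if not pdf_path.is_file():
        raise SystemExit(f"File not found: {pdf_path}")
    return pdf_path
```

A script would typically call it as `pdf_path = parse_pdf_argument(sys.argv)` before doing any work.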

🛠️ Used Tools, Libraries, and Packages

This project utilizes the following key tools and Python libraries:

easyocr: A Python library for Optical Character Recognition (OCR) that supports over 80 languages, including Bengali and English. It uses deep learning models for text extraction.
pdf2image: Converts PDF pages into PIL Image objects, enabling OCR.
Pillow (PIL): The maintained fork of the Python Imaging Library, used for image manipulation.
regex: A more powerful regular expression module, used for text cleaning and pattern matching.
json: For reading and writing structured data (page data, chunks with metadata).
numpy: Fundamental package for numerical computation, used with FAISS and EasyOCR.
pickle: For serializing and deserializing Python objects (chunks and metadata).
faiss: Facebook AI Similarity Search library, used for efficient similarity search on embeddings.
sentence-transformers: For generating semantic embeddings (intfloat/multilingual-e5-base) and for re-ranking (cross-encoder/mmarco-mMiniLMv2-L12-H384-v1).
rank_bm25: Implements the BM25 algorithm for lexical (keyword) search.
langchain: Specifically RecursiveCharacterTextSplitter for intelligent text chunking.
streamlit: For building the interactive web-based user interface.
io, sys, time, datetime: Standard Python libraries for system interaction, timing, and date/time handling.
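To illustrate the idea behind recursive character splitting, here is a simplified standard-library sketch. It is not LangChain's implementation: the function names, the separator list (including the Bengali danda "।"), and the absence of chunk overlap are all assumptions made for illustration.

```python
def split_keep_sep(text, sep):
    """Split on sep, keeping the separator attached to the end of each piece."""
    pieces = text.split(sep)
    kept = [p + sep for p in pieces[:-1]]
    if pieces[-1]:
        kept.append(pieces[-1])
    return kept


def recursive_split(text, max_len, separators=("\n\n", "\n", "।", " ")):
    """Split text into chunks of at most max_len characters (max_len must be > 0).

    Coarse separators are tried first; a piece that is still too long is
    split again with the next, finer separator.
    """
    if len(text) <= max_len:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to fixed-width cuts.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in split_keep_sep(text, sep):
        if len(current) + len(piece) <= max_len:
            current += piece  # piece still fits into the open chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= max_len:
            current = piece
        else:
            chunks.extend(recursive_split(piece, max_len, finer))
    if current:
        chunks.append(current)
    return chunks
```

Because separators stay attached to the pieces, joining the returned chunks reproduces the input text exactly, which makes the behavior easy to check.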

📝 Sample Queries and Outputs

Sample PDF

Streamlit UI

Retrieval Analysis

📊 Evaluation Matrix

This codebase does not implement a formal, automated evaluation matrix. However, the Streamlit UI reports several values for qualitative analysis and debugging of retrieval performance:

Re-rank Score: The final score after Cross-Encoder re-ranking, indicating the overall relevance.
Initial Hybrid Score: The combined score from semantic and BM25 search before re-ranking.
Semantic Score: The cosine similarity score from the FAISS vector search.
BM25 Score: The lexical similarity score from the BM25 algorithm.
Chunk Index: The original index of the chunk.
Text Preview & Full Content: Allows visual inspection of the retrieved text.
Metadata: Provides context like page_range, type, segment_id, etc.
Keywords Found: Shows overlapping keywords between the query and the chunk.
Retrieval Time: Measures the time taken for the retrieval process.
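The relationship between the semantic, BM25, and hybrid scores listed above can be illustrated with a small sketch. The source does not state how the two scores are combined, so the min-max normalization, the weight `alpha`, the whitespace tokenization, and all function names here are assumptions for illustration; the repository's retrieve.py may do this differently.

```python
import time


def min_max_normalize(scores):
    """Scale scores to [0, 1]; a constant list maps to all zeros."""
    if not scores:
        return []
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def hybrid_scores(semantic, bm25, alpha=0.5):
    """Weighted sum of normalized semantic and BM25 scores (alpha is an assumed weight)."""
    sem_n = min_max_normalize(semantic)
    bm_n = min_max_normalize(bm25)
    return [alpha * s + (1 - alpha) * b for s, b in zip(sem_n, bm_n)]


def keywords_found(query, chunk):
    """Whitespace-separated tokens that appear in both the query and the chunk."""
    return sorted(set(query.lower().split()) & set(chunk.lower().split()))


def rank_chunks(query, chunks, semantic, bm25, alpha=0.5):
    """Order chunks by hybrid score and report the elapsed time in seconds."""
    start = time.perf_counter()
    fused = hybrid_scores(semantic, bm25, alpha)
    order = sorted(range(len(chunks)), key=lambda i: fused[i], reverse=True)
    results = [
        {
            "chunk_index": i,
            "hybrid_score": fused[i],
            "semantic_score": semantic[i],
            "bm25_score": bm25[i],
            "keywords_found": keywords_found(query, chunks[i]),
        }
        for i in order
    ]
    return results, time.perf_counter() - start
```

In a full pipeline, the top entries of such a ranking would then be passed to the cross-encoder, whose output corresponds to the Re-rank Score.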
