# RAG Demo: AI-Powered Document Search with Generative Response
This project showcases a Retrieval-Augmented Generation (RAG) implementation using SentenceTransformer for semantic search and GPT-2 (or a similar generative model) for response generation. The system combines the power of semantic search with AI-driven text generation, providing relevant answers based on a collection of text documents.
## Project Overview

The Chagu RAG Demo aims to solve the problem of efficient document retrieval and provide contextual responses using generative AI. It supports secure document search and offers additional protection against malicious queries using semantic analysis. The project is built with the following goals:

- **Semantic Search**: Retrieve the most relevant documents for a user query using embeddings.
- **Generative AI Response**: Generate a coherent, context-aware answer using a pre-trained text generation model.
- **Anomaly Detection**: Detect potentially harmful queries (e.g., SQL injections) and block them.
## Features

- **Embedding-based Document Ingestion**: Efficiently processes text documents and stores their embeddings in a local SQLite database.
- **Semantic Search**: Uses cosine similarity over SentenceTransformer embeddings for accurate information retrieval.
- **Text Generation**: Leverages GPT-2 or `distilgpt2` to generate responses based on the retrieved context.
- **Security**: Includes basic query validation to block malicious input (e.g., SQL injection detection).
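The repository's actual validation logic is not reproduced here; the sketch below shows one minimal, keyword-based way such a query check could work (the pattern list and function name are illustrative assumptions, not the project's implementation):

```python
import re

# Hypothetical sketch of the query-validation step: a small deny-list of
# common SQL-injection fragments. A real system would combine this with
# semantic analysis, as the project description suggests.
SUSPICIOUS_PATTERNS = [
    r"(?i)\bunion\s+select\b",
    r"(?i)\bdrop\s+table\b",
    r"(?i)\bor\s+1\s*=\s*1\b",
    r"--",  # SQL comment used to truncate statements
    r";",   # statement chaining
]

def is_query_safe(query: str) -> bool:
    """Return False if the query matches any known injection pattern."""
    return not any(re.search(p, query) for p in SUSPICIOUS_PATTERNS)

print(is_query_safe("How do I index a column?"))  # True
print(is_query_safe("x' OR 1=1 --"))              # False
```

Pattern matching alone is easy to evade, which is why the "Potential Improvements" section below suggests a learned classifier for this step.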
## Technologies Used

- **SentenceTransformer**: Generates semantic embeddings of text documents.
- **Transformers**: Provides the generative model (a wide range of candidates is listed at https://huggingface.co/models?sort=trending&search=distilgpt2).
- **SQLite**: A lightweight database for storing embeddings and document content.
- **Scikit-learn**: Used for calculating cosine similarity.
- **NumPy**: Efficient numerical operations.
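Retrieval in this design reduces to ranking stored document embeddings by cosine similarity against the query embedding. A minimal sketch with scikit-learn, using toy 3-dimensional vectors in place of real SentenceTransformer output:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Toy embeddings standing in for model.encode(texts); real vectors from
# all-MiniLM-L6-v2 would be 384-dimensional.
doc_embeddings = np.array([
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.5, 0.5, 0.7],
])
query_embedding = np.array([[1.0, 0.0, 0.0]])

# cosine_similarity returns a (1, n_docs) matrix of scores in [-1, 1].
scores = cosine_similarity(query_embedding, doc_embeddings)[0]
best = int(np.argmax(scores))
print(f"Best match: document {best} (score={scores[best]:.3f})")
```

The highest-scoring document's text is what gets passed to the generative model as context.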
## Installation

1. **Clone the repository:**

   ```bash
   git clone https://github.com/yourusername/chagu-rag-demo.git
   cd chagu-rag-demo
   ```

2. **Create a virtual environment:**

   ```bash
   python3 -m venv .venv
   source .venv/bin/activate
   ```

3. **Install dependencies:**

   ```bash
   pip install -r requirements.txt
   ```

4. **Authenticate with Hugging Face (if needed):**

   ```bash
   huggingface-cli login
   ```
## Setup and Dataset

**Download and prepare the dataset:**
You can use the IMDB Movie Reviews dataset or any other text files. Place your `.txt` files in the `documents/` directory or specify a custom path.

**Ingest files:**
The script processes all `.txt` files in the specified directory and stores their embeddings in a local SQLite database:

```bash
python embededGeneratorRAG.py
```
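The exact schema of `embeddings.db` is internal to `EmbeddingGenerator`; the sketch below shows one plausible way the ingestion step could persist embeddings in SQLite (the table layout is an assumption, and a deterministic toy embedding stands in for `SentenceTransformer.encode`):

```python
import sqlite3
import numpy as np

def toy_embed(text: str) -> np.ndarray:
    # Stand-in for SentenceTransformer.encode(); deterministic per text
    # within a process, 384 dims like all-MiniLM-L6-v2.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(384).astype(np.float32)

conn = sqlite3.connect(":memory:")  # use "embeddings.db" for a file on disk
conn.execute(
    "CREATE TABLE IF NOT EXISTS documents "
    "(id INTEGER PRIMARY KEY, content TEXT, embedding BLOB)"
)

docs = ["Use parameterized queries.", "Index frequently filtered columns."]
for text in docs:
    vec = toy_embed(text)
    # Store the vector as raw float32 bytes; np.frombuffer restores it later.
    conn.execute(
        "INSERT INTO documents (content, embedding) VALUES (?, ?)",
        (text, vec.tobytes()),
    )
conn.commit()

content, blob = conn.execute(
    "SELECT content, embedding FROM documents WHERE id = 1"
).fetchone()
restored = np.frombuffer(blob, dtype=np.float32)
print(content, restored.shape)  # -> Use parameterized queries. (384,)
```

Storing raw float32 bytes keeps the table compact; at query time every stored vector is loaded and scored, which is what motivates the FAISS suggestion under "Potential Improvements".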
## Usage

### Ingest Documents

Ingest `.txt` files from the `documents/` directory:

```python
embedding_generator = EmbeddingGenerator()
embedding_generator.ingest_files("documents")
```

### Perform a Search Query

Run a semantic search query and generate a response:

```python
query = "How can I secure my database against SQL injection?"
response = embedding_generator.find_most_similar_and_generate(query)
print("Generated Response:")
print(response)
```
### Example Output

```text
Generated Response:
To prevent SQL injection, you should use prepared statements and parameterized queries.
Avoid constructing SQL queries directly using user input.
```
## File Structure

```text
chagu-rag-demo/
├── embeddings.db           # SQLite database for storing embeddings
├── documents/              # Directory containing .txt files for ingestion
├── rag_chagu_demo.py       # Main script with RAG implementation
├── embededGeneratorRAG.py  # Core EmbeddingGenerator class
├── requirements.txt        # Python dependencies
└── README.md               # Project documentation
```
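The contents of `requirements.txt` are not reproduced here; based on the stack described above, a plausible dependency list might look like the following (package names follow PyPI; pinned versions are omitted because they are not specified in this document):

```text
sentence-transformers
transformers
torch            # backend assumed for both model libraries
scikit-learn
numpy
```

`sqlite3` ships with the Python standard library and needs no entry.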
## Configuration

You can update the following settings in the `EmbeddingGenerator` class:

- **Model names**: Change `model_name` or `gen_model` to use different embedding or generative models.
- **Database path**: Specify a custom path for the SQLite database.

```python
embedding_generator = EmbeddingGenerator(
    model_name="all-MiniLM-L6-v2",
    gen_model="distilgpt2",
    db_path="custom_embeddings.db",
)
```
## Potential Improvements

- **FAISS integration for scalability**: Replace the current SQLite-based retrieval with FAISS for efficient and scalable vector search.
- **Enhanced security**: Implement more robust query validation using a fine-tuned BERT model to detect harmful or suspicious inputs.
- **Deployment on Hugging Face Spaces**: Create an interactive demo using Streamlit or Gradio to showcase the project on Hugging Face Spaces.

## Known Issues

- **Input truncation warning**: If the input text is too long, you may see a warning about truncation. This is handled using `truncation=True`, but it may affect very long queries.
- **Model availability**: Ensure you are using a publicly available model from Hugging Face. If you encounter a `404 Not Found` error, check the model identifier.
## Contributing

Contributions are welcome! Please open an issue or submit a pull request if you would like to improve the project.

1. Fork the repository.
2. Create a new feature branch.
3. Submit your changes via a pull request.

## License

This project is licensed under the MIT License - see the LICENSE file for details.
## Acknowledgments

- Hugging Face for the amazing models and NLP tools.
- Scikit-learn for efficient similarity computation.
- SQLite for providing a lightweight database solution.