# RAG Demo: AI-Powered Document Search with Generative Response
This project showcases a Retrieval-Augmented Generation (RAG) implementation using SentenceTransformer for semantic search and GPT-2 (or a similar generative model) for response generation. The system combines the power of semantic search with AI-driven text generation, providing relevant answers based on a collection of text documents.
## Project Overview

The Chagu RAG Demo aims to solve the problem of efficient document retrieval and provide contextual responses using generative AI. It supports secure document search and offers additional protection against malicious queries using semantic analysis. The project is built with the following goals:

- **Semantic Search**: Retrieve the most relevant documents for a user query using embeddings.
- **Generative AI Response**: Generate a coherent, context-aware answer using a pre-trained text generation model.
- **Anomaly Detection**: Detect potentially harmful queries (e.g., SQL injections) and block them.
## Features

- **Embedding-based Document Ingestion**: Efficiently processes text documents and stores their embeddings in a local SQLite database.
- **Semantic Search**: Uses cosine similarity over SentenceTransformer embeddings for accurate information retrieval.
- **Text Generation**: Leverages GPT-2 or `distilgpt2` to generate responses based on the retrieved context.
- **Security**: Includes basic query validation to block malicious input (e.g., SQL injection detection).
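The repository's actual validation logic is not reproduced here; the sketch below shows one minimal, keyword-based way such a query check could work (the pattern list and function name are illustrative assumptions, not the project's implementation):

```python
import re

# Hypothetical sketch of the query-validation step: a small deny-list of
# common SQL-injection fragments. A real system would combine this with
# semantic analysis, as the project description suggests.
SUSPICIOUS_PATTERNS = [
    r"(?i)\bunion\s+select\b",
    r"(?i)\bdrop\s+table\b",
    r"(?i)\bor\s+1\s*=\s*1\b",
    r"--",  # SQL comment used to truncate statements
    r";",   # statement chaining
]

def is_query_safe(query: str) -> bool:
    """Return False if the query matches any known injection pattern."""
    return not any(re.search(p, query) for p in SUSPICIOUS_PATTERNS)

print(is_query_safe("How do I index a column?"))  # True
print(is_query_safe("x' OR 1=1 --"))              # False
```

Pattern matching alone is easy to evade, which is why the "Potential Improvements" section below suggests a learned classifier for this step.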
## Technologies Used

- **SentenceTransformer**: Generates semantic embeddings of text documents.
- **Transformers**: Provides the generative model (a wide range of candidates is listed at https://huggingface.co/models?sort=trending&search=distilgpt2).
- **SQLite**: A lightweight database for storing embeddings and document content.
- **Scikit-learn**: Used for calculating cosine similarity.
- **NumPy**: Efficient numerical operations.
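Retrieval in this design reduces to ranking stored document embeddings by cosine similarity against the query embedding. A minimal sketch with scikit-learn, using toy 3-dimensional vectors in place of real SentenceTransformer output:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Toy embeddings standing in for model.encode(texts); real vectors from
# all-MiniLM-L6-v2 would be 384-dimensional.
doc_embeddings = np.array([
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.5, 0.5, 0.7],
])
query_embedding = np.array([[1.0, 0.0, 0.0]])

# cosine_similarity returns a (1, n_docs) matrix of scores in [-1, 1].
scores = cosine_similarity(query_embedding, doc_embeddings)[0]
best = int(np.argmax(scores))
print(f"Best match: document {best} (score={scores[best]:.3f})")
```

The highest-scoring document's text is what gets passed to the generative model as context.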
## Installation

1. **Clone the repository:**

   ```bash
   git clone https://github.com/yourusername/chagu-rag-demo.git
   cd chagu-rag-demo
   ```

2. **Create a virtual environment:**

   ```bash
   python3 -m venv .venv
   source .venv/bin/activate
   ```

3. **Install dependencies:**

   ```bash
   pip install -r requirements.txt
   ```

4. **Authenticate with Hugging Face (if needed):**

   ```bash
   huggingface-cli login
   ```
## Setup and Dataset

**Download and prepare the dataset:**
You can use the IMDB Movie Reviews dataset or any other text files. Place your `.txt` files in the `documents/` directory or specify a custom path.

**Ingest files:**
The script processes all `.txt` files in the specified directory and stores their embeddings in a local SQLite database:

```bash
python embededGeneratorRAG.py
```
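The exact schema of `embeddings.db` is internal to `EmbeddingGenerator`; the sketch below shows one plausible way the ingestion step could persist embeddings in SQLite (the table layout is an assumption, and a deterministic toy embedding stands in for `SentenceTransformer.encode`):

```python
import sqlite3
import numpy as np

def toy_embed(text: str) -> np.ndarray:
    # Stand-in for SentenceTransformer.encode(); deterministic per text
    # within a process, 384 dims like all-MiniLM-L6-v2.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(384).astype(np.float32)

conn = sqlite3.connect(":memory:")  # use "embeddings.db" for a file on disk
conn.execute(
    "CREATE TABLE IF NOT EXISTS documents "
    "(id INTEGER PRIMARY KEY, content TEXT, embedding BLOB)"
)

docs = ["Use parameterized queries.", "Index frequently filtered columns."]
for text in docs:
    vec = toy_embed(text)
    # Store the vector as raw float32 bytes; np.frombuffer restores it later.
    conn.execute(
        "INSERT INTO documents (content, embedding) VALUES (?, ?)",
        (text, vec.tobytes()),
    )
conn.commit()

content, blob = conn.execute(
    "SELECT content, embedding FROM documents WHERE id = 1"
).fetchone()
restored = np.frombuffer(blob, dtype=np.float32)
print(content, restored.shape)  # -> Use parameterized queries. (384,)
```

Storing raw float32 bytes keeps the table compact; at query time every stored vector is loaded and scored, which is what motivates the FAISS suggestion under "Potential Improvements".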
## Usage

### Ingest Documents

Ingest `.txt` files from the `documents/` directory:

```python
embedding_generator = EmbeddingGenerator()
embedding_generator.ingest_files("documents")
```

### Perform a Search Query

Run a semantic search query and generate a response:

```python
query = "How can I secure my database against SQL injection?"
response = embedding_generator.find_most_similar_and_generate(query)
print("Generated Response:")
print(response)
```
### Example Output

```text
Generated Response:
To prevent SQL injection, you should use prepared statements and parameterized queries.
Avoid constructing SQL queries directly using user input.
```
## File Structure

```text
chagu-rag-demo/
├── embeddings.db           # SQLite database for storing embeddings
├── documents/              # Directory containing .txt files for ingestion
├── rag_chagu_demo.py       # Main script with RAG implementation
├── embededGeneratorRAG.py  # Core EmbeddingGenerator class
├── requirements.txt        # Python dependencies
└── README.md               # Project documentation
```
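The contents of `requirements.txt` are not reproduced here; based on the stack described above, a plausible dependency list might look like the following (package names follow PyPI; pinned versions are omitted because they are not specified in this document):

```text
sentence-transformers
transformers
torch            # backend assumed for both model libraries
scikit-learn
numpy
```

`sqlite3` ships with the Python standard library and needs no entry.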
## Configuration

You can update the following settings in the `EmbeddingGenerator` class:

- **Model names**: Change `model_name` or `gen_model` to use different embedding or generative models.
- **Database path**: Specify a custom path for the SQLite database.

```python
embedding_generator = EmbeddingGenerator(
    model_name="all-MiniLM-L6-v2",
    gen_model="distilgpt2",
    db_path="custom_embeddings.db",
)
```
## Potential Improvements

- **FAISS integration for scalability**: Replace the current SQLite-based retrieval with FAISS for efficient and scalable vector search.
- **Enhanced security**: Implement more robust query validation using a fine-tuned BERT model to detect harmful or suspicious inputs.
- **Deployment on Hugging Face Spaces**: Create an interactive demo using Streamlit or Gradio to showcase the project on Hugging Face Spaces.

## Known Issues

- **Input truncation warning**: If the input text is too long, you may see a warning about truncation. This is handled using `truncation=True`, but it may affect very long queries.
- **Model availability**: Ensure you are using a publicly available model from Hugging Face. If you encounter a `404 Not Found` error, check the model identifier.
## Contributing

Contributions are welcome! Please open an issue or submit a pull request if you would like to improve the project.

1. Fork the repository.
2. Create a new feature branch.
3. Submit your changes via a pull request.

## License

This project is licensed under the MIT License - see the LICENSE file for details.
## Acknowledgments

- Hugging Face for the amazing models and NLP tools.
- Scikit-learn for efficient similarity computation.
- SQLite for providing a lightweight database solution.