### RAG Demo: AI-Powered Document Search with Generative Response
This project showcases a Retrieval-Augmented Generation (RAG) implementation that uses
SentenceTransformer for semantic search and GPT-2 (or a similar generative model)
for response generation. The system combines semantic search with AI-driven text generation
to answer questions from a collection of text documents.
## Project Overview
The Chagu RAG Demo addresses two problems: retrieving documents efficiently and producing
contextual responses with generative AI. It supports secure document search and uses semantic
analysis to add protection against malicious queries. The project is built with the following goals:
- Semantic Search: Retrieve the most relevant documents for a user query using embeddings.
- Generative AI Response: Generate a coherent, context-aware answer using a pre-trained text generation model.
- Anomaly Detection: Detect potentially harmful queries (e.g., SQL injection) and block them.
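
The anomaly-detection goal can be illustrated with a small pattern-based check. This is a minimal sketch, not the project's actual validation logic: the `SUSPICIOUS_PATTERNS` list and the `is_suspicious` helper are hypothetical names, and the patterns are assumptions chosen to catch a few textbook SQL injection shapes.

```python
import re

# Hypothetical patterns for illustration only; not the project's actual rules.
SUSPICIOUS_PATTERNS = [
    re.compile(r"['\"]\s*(?:or|and)\s+\d+\s*=\s*\d+", re.IGNORECASE),  # ' OR 1=1
    re.compile(r";\s*(?:drop|delete|truncate|alter)\s+", re.IGNORECASE),  # ; DROP TABLE
    re.compile(r"\bunion\s+(?:all\s+)?select\b", re.IGNORECASE),  # UNION SELECT
    re.compile(r"--\s*$"),  # trailing SQL comment marker
]


def is_suspicious(query: str) -> bool:
    """Return True if the query matches any known injection-like pattern."""
    return any(pattern.search(query) for pattern in SUSPICIOUS_PATTERNS)


# A query that merely talks about SQL injection should pass the check.
benign = is_suspicious("How can I secure my database against SQL injection?")
# A query shaped like an injection payload should be flagged.
malicious = is_suspicious("name'; DROP TABLE users")
```

Pattern matching like this is a coarse heuristic that can produce both false positives and false negatives, so it complements parameterized database queries rather than replacing them.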
### Features
- Embedding-based Document Ingestion: Processes text documents and stores their embeddings in a local SQLite database.
- Semantic Search: Uses cosine similarity over SentenceTransformer embeddings to retrieve relevant documents.
- Text Generation: Uses GPT-2 or distilgpt2 to generate responses from the retrieved context.
- Security: Includes basic query validation to reject malicious input (e.g., SQL injection detection).
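
Storing embeddings in SQLite, as the ingestion feature describes, comes down to serializing each vector into a BLOB column. The sketch below uses only the standard library; the `documents` table schema and the `to_blob` / `from_blob` helpers are assumptions for illustration and may differ from what the project actually does.

```python
import sqlite3
from array import array


def to_blob(vector):
    """Pack a list of floats into bytes as 32-bit floats."""
    return array("f", vector).tobytes()


def from_blob(blob):
    """Unpack bytes written by to_blob back into a list of floats."""
    values = array("f")
    values.frombytes(blob)
    return values.tolist()


# In-memory database for the demo; the project uses a file such as embeddings.db.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE IF NOT EXISTS documents ("
    "id INTEGER PRIMARY KEY, filename TEXT, content TEXT, embedding BLOB)"
)
# Parameterized query: values are bound, never concatenated into the SQL string.
conn.execute(
    "INSERT INTO documents (filename, content, embedding) VALUES (?, ?, ?)",
    ("example.txt", "Use parameterized queries.", to_blob([0.25, 0.5, -1.0])),
)
conn.commit()

row = conn.execute("SELECT filename, embedding FROM documents").fetchone()
restored = from_blob(row[1])
```

Real embeddings would come from the embedding model rather than a hand-written list, and 32-bit storage rounds values that are not exactly representable in that precision.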
## Technologies Used
- SentenceTransformer: Generates semantic embeddings of text documents.
- Transformers: Provides the generative model (e.g., distilgpt2; many alternatives are listed at https://huggingface.co/models?sort=trending&search=distilgpt2).
- SQLite: A lightweight database for storing embeddings and document content.
- Scikit-learn: Calculates cosine similarity.
- NumPy: Provides efficient numerical operations.
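
To make the cosine-similarity step concrete, here is a dependency-free sketch of ranking documents against a query vector. The project itself relies on scikit-learn for this; `cosine_similarity` and `rank_documents` below are hypothetical stand-ins, and the three-dimensional vectors are toy values rather than real SentenceTransformer embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs, top_k=1):
    """Return (doc_id, score) pairs sorted from most to least similar."""
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy "embeddings"; real ones would come from the embedding model.
doc_vecs = {
    "sql_security.txt": [0.9, 0.1, 0.0],
    "movie_review.txt": [0.0, 0.2, 0.9],
}
query_vec = [1.0, 0.0, 0.0]
best = rank_documents(query_vec, doc_vecs, top_k=1)
```

Because cosine similarity ignores vector length, a long document and a short one pointing in the same direction score the same, which is usually what a semantic search wants.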
## Installation
Clone the Repository:
```bash
git clone https://github.com/yourusername/chagu-rag-demo.git
cd chagu-rag-demo
```
Create a Virtual Environment:
```bash
python3 -m venv .venv
source .venv/bin/activate
```
Install Dependencies:
```bash
pip install -r requirements.txt
```
Authenticate with Hugging Face (if needed):
```bash
huggingface-cli login
```
## Setup and Dataset
Download and Prepare the Dataset:
You can use the IMDB Movie Reviews dataset or any other collection of text files.
Place your .txt files in the documents/ directory or specify a custom path.

Ingest Files:
The script processes all .txt files in the specified directory and stores their embeddings in a local SQLite database.
```bash
python embededGeneratorRAG.py
```
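The file-gathering part of ingestion can be sketched as follows. This only shows how .txt files might be collected from a directory before embedding; `load_text_files` is a hypothetical helper, not a function from embededGeneratorRAG.py, and the temporary directory stands in for documents/.

```python
import tempfile
from pathlib import Path


def load_text_files(directory):
    """Return {filename: text} for every .txt file directly inside directory."""
    docs = {}
    for path in sorted(Path(directory).glob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="replace").strip()
        if text:  # skip empty files, they carry nothing to embed
            docs[path.name] = text
    return docs


# Demo with a throwaway directory standing in for documents/.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("First review.", encoding="utf-8")
    Path(tmp, "b.txt").write_text("", encoding="utf-8")
    Path(tmp, "notes.md").write_text("Ignored.", encoding="utf-8")
    docs = load_text_files(tmp)
```

Each returned text would then be embedded and written to the database together with its filename.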
## Usage
### Ingest Documents
Ingest .txt files from the documents/ directory:
```python
# Assumes the class lives in embededGeneratorRAG.py, as listed under File Structure.
from embededGeneratorRAG import EmbeddingGenerator

embedding_generator = EmbeddingGenerator()
embedding_generator.ingest_files("documents")
```
### Perform a Search Query
Run a semantic search query and generate a response:
```python
query = "How can I secure my database against SQL injection?"
response = embedding_generator.find_most_similar_and_generate(query)
print("Generated Response:")
print(response)
```
### Example Output
```
Generated Response:
To prevent SQL injection, you should use prepared statements and parameterized queries.
Avoid constructing SQL queries directly using user input.
```
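Generating an answer from retrieved context usually means combining the best-matching document text and the user query into a single prompt for the generative model. The sketch below shows one plausible layout; `build_prompt`, its `max_context_chars` parameter, and the "Context / Question / Answer" template are assumptions, not the prompt format used by find_most_similar_and_generate.

```python
def build_prompt(query, context, max_context_chars=1000):
    """Combine retrieved text and the user query into one prompt string."""
    # Cap the context so the prompt stays within a small model's input limit.
    context = context.strip()[:max_context_chars]
    return f"Context: {context}\n\nQuestion: {query.strip()}\n\nAnswer:"


prompt = build_prompt(
    "How can I prevent SQL injection?",
    "Use prepared statements and parameterized queries.",
)
```

Ending the prompt with "Answer:" nudges a plain language model such as distilgpt2 to continue with an answer rather than with more context.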
## File Structure
```
chagu-rag-demo/
├── embeddings.db            # SQLite database for storing embeddings
├── documents/               # Directory containing .txt files for ingestion
├── rag_chagu_demo.py        # Main script with RAG implementation
├── embededGeneratorRAG.py   # Core Embedding Generator class
├── requirements.txt         # Python dependencies
└── README.md                # Project documentation
```
## Configuration
You can update the following configurations in the EmbeddingGenerator class:
- Model Names: Change model_name or gen_model to use different embedding or generative models.
- Database Path: Specify a custom path for the SQLite database.
```python
embedding_generator = EmbeddingGenerator(model_name="all-MiniLM-L6-v2", gen_model="distilgpt2", db_path="custom_embeddings.db")
```
### Potential Improvements
- FAISS Integration for Scalability: Replace the current SQLite-based retrieval with FAISS for efficient, scalable vector search.
- Enhanced Security: Implement more robust query validation using a fine-tuned BERT model to detect harmful or suspicious inputs.
- Deployment on Hugging Face Spaces: Create an interactive demo using Streamlit or Gradio to showcase the project on Hugging Face Spaces.
## Known Issues
- Input Truncation Warning: If the input text is too long, you may see a warning about truncation. This is handled using truncation=True, which drops text beyond the model's input limit, so very long queries may lose content.
- Model Availability: Use a publicly available model from Hugging Face. If you encounter a 404 Not Found error, check the model identifier.
## Contributing
Contributions are welcome! Please open an issue or submit a pull request if you would like to improve the project.
1. Fork the repository.
2. Create a new feature branch.
3. Submit your changes via a pull request.
## License
This project is licensed under the MIT License. See the LICENSE file for details.
## Acknowledgments
- Hugging Face for the amazing models and NLP tools.
- Scikit-learn for efficient similarity computation.
- SQLite for providing a lightweight database solution.