### RAG Demo: AI-Powered Document Search with Generative Response
This project showcases a Retrieval-Augmented Generation (RAG) implementation using
SentenceTransformer for semantic search and GPT-2 (or a similar generative model)
for response generation. The system combines the power of semantic search with AI-driven text generation,
providing relevant answers based on a collection of text documents.
## Project Overview
The Chagu RAG Demo aims to solve the problem of efficient document retrieval and provide contextual
responses using Generative AI. It supports secure document search and offers additional protection
against malicious queries using semantic analysis. The project is built with the following goals:
- **Semantic Search**: Retrieve the most relevant documents based on user queries using embeddings.
- **Generative AI Response**: Generate a coherent and context-aware answer using a pre-trained text generation model.
- **Anomaly Detection**: Detect potentially harmful queries (e.g., SQL injections) and block them.
### Features
- **Embedding-based Document Ingestion**: Efficiently process and store text document embeddings in a local SQLite database.
- **Semantic Search**: Uses cosine similarity with SentenceTransformer embeddings for accurate information retrieval.
- **Text Generation**: Leverages GPT-2 or distilgpt2 for generating responses based on the retrieved context.
- **Security**: Includes basic query validation to prevent malicious input (e.g., SQL injection detection).
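As an illustration of the query-validation idea, a minimal pattern-based check might look like the following. This is a hypothetical sketch (the pattern list and `is_malicious` name are not from the project's code), not the project's actual implementation:

```python
import re

# Hypothetical patterns illustrating basic SQL-injection screening;
# the real project may use different rules or semantic analysis.
SUSPICIOUS_PATTERNS = [
    r"(?i)\b(drop|delete|insert|update)\b\s+\b(table|from|into)\b",
    r"(?i)\bunion\b\s+\bselect\b",
    r"--|;",
    r"(?i)'\s*or\s+'?1'?\s*=\s*'?1",
]

def is_malicious(query: str) -> bool:
    """Return True if the query matches a known SQL-injection pattern."""
    return any(re.search(p, query) for p in SUSPICIOUS_PATTERNS)

print(is_malicious("1' OR '1'='1"))            # a classic injection probe
print(is_malicious("How do database indexes work?"))  # a benign question
```

A regex filter like this catches only the most obvious probes; the Anomaly Detection goal above would combine it with semantic analysis of the query.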
## Technologies Used
- **SentenceTransformer**: For generating semantic embeddings of text documents.
- **Transformers**: Provides the generative model (e.g., distilgpt2; browse alternatives at https://huggingface.co/models?sort=trending&search=distilgpt2).
- **SQLite**: A lightweight database for storing embeddings and document content.
- **Scikit-learn**: Used for calculating cosine similarity.
- **NumPy**: Efficient numerical operations.
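To make the retrieval step concrete, here is a toy example of scoring documents against a query with scikit-learn's `cosine_similarity`. The vectors below are made up for illustration; in the real pipeline the rows would be SentenceTransformer embeddings loaded from the SQLite database:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Three fake 3-dimensional "document embeddings" (real ones are 384-dim).
doc_embeddings = np.array([[0.9, 0.1, 0.0],
                           [0.0, 1.0, 0.2],
                           [0.7, 0.3, 0.1]])
# A fake query embedding, shaped (1, dim) as cosine_similarity expects.
query_embedding = np.array([[1.0, 0.0, 0.0]])

scores = cosine_similarity(query_embedding, doc_embeddings)[0]
best = int(np.argmax(scores))
print(f"Best match: document {best} (score={scores[best]:.3f})")
```

The document with the highest cosine score is handed to the generative model as context.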
## Installation
Clone the Repository:
```bash
git clone https://github.com/yourusername/chagu-rag-demo.git
cd chagu-rag-demo
```
Create a Virtual Environment:
```bash
python3 -m venv .venv
source .venv/bin/activate
```
Install Dependencies:
```bash
pip install -r requirements.txt
```
Authenticate with Hugging Face (if needed):
```bash
huggingface-cli login
```
## Setup and Dataset
Download and Prepare the Dataset:
You can use the IMDB Movie Reviews dataset or any other text files.
Place your .txt files in the documents/ directory or specify a custom path.
Ingest Files:
The script will process all .txt files in the specified directory and store embeddings in a local SQLite database.
```bash
python embededGeneratorRAG.py
```
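The ingestion step can be pictured as follows. This is a self-contained sketch of the plausible flow (embed each document, persist the vector as a BLOB next to its text); the `embed` stub stands in for `SentenceTransformer.encode`, and the table layout is an assumption, not necessarily the script's actual schema:

```python
import sqlite3
import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in for SentenceTransformer.encode, so the sketch runs offline."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.random(384, dtype=np.float64)  # MiniLM models emit 384-dim vectors

conn = sqlite3.connect(":memory:")  # the project persists to embeddings.db on disk
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, content TEXT, embedding BLOB)")

for doc in ["Use prepared statements.", "Index your columns."]:
    vec = embed(doc)
    conn.execute("INSERT INTO docs (content, embedding) VALUES (?, ?)",
                 (doc, vec.tobytes()))
conn.commit()

# Round-trip check: read a row back and rebuild the vector from its bytes.
row = conn.execute("SELECT content, embedding FROM docs WHERE id = 1").fetchone()
restored = np.frombuffer(row[1], dtype=np.float64)
print(row[0], restored.shape)
```

Storing vectors as BLOBs keeps the setup dependency-free; search then loads them back and scores them with cosine similarity.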
## Usage
### Ingest Documents
Ingest .txt files from the documents/ directory:
```python
embedding_generator = EmbeddingGenerator()
embedding_generator.ingest_files("documents")
```
### Perform a Search Query
Run a semantic search query and generate a response:
```python
query = "How can I secure my database against SQL injection?"
response = embedding_generator.find_most_similar_and_generate(query)
print("Generated Response:")
print(response)
```
### Example Output
```text
Generated Response:
To prevent SQL injection, you should use prepared statements and parameterized queries.
Avoid constructing SQL queries directly using user input.
```
## File Structure
```text
chagu-rag-demo/
├── embeddings.db            # SQLite database for storing embeddings
├── documents/               # Directory containing .txt files for ingestion
├── rag_chagu_demo.py        # Main script with RAG implementation
├── embededGeneratorRAG.py   # Core Embedding Generator class
├── requirements.txt         # Python dependencies
└── README.md                # Project documentation
```

## Configuration
You can update the following configurations in the `EmbeddingGenerator` class:
- **Model Names**: Change `model_name` or `gen_model` to use different embedding or generative models.
- **Database Path**: Specify a custom path for the SQLite database.
```python
embedding_generator = EmbeddingGenerator(model_name="all-MiniLM-L6-v2", gen_model="distilgpt2", db_path="custom_embeddings.db")
```
## Potential Improvements
- **FAISS Integration for Scalability**: Replace the current SQLite-based retrieval with FAISS for efficient and scalable vector search.
- **Enhanced Security**: Implement more robust query validation using a fine-tuned BERT model to detect harmful or suspicious inputs.
- **Deployment on Hugging Face Spaces**: Create an interactive demo using Streamlit or Gradio for showcasing the project on Hugging Face Spaces.
## Known Issues
- **Input Truncation Warning**: If the input text is too long, you may see a warning about truncation. This is handled using `truncation=True`, but it may affect very long queries.
- **Model Availability**: Ensure you are using a publicly available model from Hugging Face. If you encounter a 404 Not Found error, check the model identifier.
## Contributing
Contributions are welcome! Please open an issue or submit a pull request if you would like to improve the project.
1. Fork the repository.
2. Create a new feature branch.
3. Submit your changes via a pull request.
## License
This project is licensed under the MIT License - see the LICENSE file for details.
## Acknowledgments
- Hugging Face for the amazing models and NLP tools.
- Scikit-learn for efficient similarity computation.
- SQLite for providing a lightweight database solution.