title: Chagu Demo
emoji: π
colorFrom: pink
colorTo: purple
sdk: streamlit
sdk_version: 1.40.1
app_file: app.py
pinned: false
license: mit
short_description: 'this is demo for chain guard protocol, assistant, RAG '
AI-Powered Document Search with Malicious Query Detection
This project implements a semantic search engine for documents, combined with AI-based malicious query detection. Users can search through movie reviews (IMDB dataset) and additional .txt files, while a pre-trained NLP model identifies and blocks potentially malicious queries.
Features
- Semantic Search: Uses fuzzy matching for normal queries, allowing context-aware searches.
- AI-Based Malicious Query Detection: Uses a pre-trained NLP model (DistilBERT) to detect queries with malicious intent, blocking potential SQL injection and other harmful queries.
- Flexible Document Ingestion: Supports loading documents from the IMDB dataset and additional .txt files.
- Efficient Path Handling: Automatically handles dataset paths using the HOME environment variable.
Technologies Used
- Python 3.8+
- Transformers: For NLP-based malicious query detection.
- Hugging Face Pipeline: Uses the distilbert-base-uncased-finetuned-sst-2-english model for sentiment analysis.
- Pathlib: For robust file and path handling.
Project Structure
├── rag_chagu_demo.py       # Main script containing the DocumentSearcher class
├── README.md               # This file
└── data-sets/              # This part has moved to $HOME
    ├── aclImdb/
    │   └── train/
    │       ├── pos/        # Positive movie reviews
    │       └── neg/        # Negative movie reviews
    └── txt-files/          # Additional .txt files for document search
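The layout above, rooted at $HOME, could be resolved and read with pathlib roughly as follows. This is a minimal sketch, not the code in rag_chagu_demo.py: the function names dataset_root, load_txt_documents and load_all are hypothetical, and the sketch assumes that every document is a plain .txt file directly inside one of the three folders.

```python
import os
from pathlib import Path


def dataset_root(home=None):
    """Return <home>/data-sets; home defaults to the HOME environment variable."""
    if home is None:
        # Fall back to Path.home() only when HOME is unset or empty.
        home = os.environ.get("HOME") or Path.home()
    return Path(home) / "data-sets"


def load_txt_documents(folder, limit=None):
    """Read the .txt files in folder; a missing folder yields an empty list."""
    folder = Path(folder)
    if not folder.is_dir():
        return []
    docs = []
    for path in sorted(folder.glob("*.txt")):
        if limit is not None and len(docs) >= limit:
            break
        docs.append(path.read_text(encoding="utf-8", errors="ignore"))
    return docs


def load_all(home=None, limit_per_folder=None):
    """Collect documents from the pos, neg and txt-files folders, in that order."""
    root = dataset_root(home)
    folders = [
        root / "aclImdb" / "train" / "pos",
        root / "aclImdb" / "train" / "neg",
        root / "txt-files",
    ]
    docs = []
    for folder in folders:
        docs.extend(load_txt_documents(folder, limit_per_folder))
    return docs
```

Treating a missing folder as empty keeps the optional txt-files directory optional, so the loader still works when only the IMDB data is present.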
Installation
Make sure you have Python installed (version 3.8 or higher). Then, install the required dependencies:
pip install transformers
Dataset Setup
Place the IMDB dataset in the following structure:
$HOME/data-sets/aclImdb/train/pos/
$HOME/data-sets/aclImdb/train/neg/
Optionally, place additional .txt files under:
$HOME/data-sets/txt-files/
Usage
Run the script with the following command:
python rag_chagu_demo.py
Example Output
Looking for positive reviews in: /home/user/data-sets/aclImdb/train/pos
Looking for negative reviews in: /home/user/data-sets/aclImdb/train/neg
Loaded 5000 movie reviews from IMDB dataset.
Normal Query Results:
Document: This movie had great acting and a compelling storyline. The characters were well-developed...
Malicious Query Detected - Confidence: 0.95
Malicious Query Results:
Document: ANOMALY: Query blocked due to detected malicious intent.
How It Works
The script initializes the DocumentSearcher class, which loads movie reviews and additional .txt documents. The is_query_malicious() method uses a pre-trained NLP model to detect queries with potential malicious intent based on sentiment analysis. If a query is flagged as malicious, it is blocked and an anomaly message is returned. For normal queries, it performs a fuzzy search through the documents and returns the most relevant matches.
AI Model Used
The project uses the DistilBERT model (distilbert-base-uncased-finetuned-sst-2-english) from Hugging Face for detecting malicious queries based on sentiment analysis.
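The flow described under How It Works (classify the query, block it if flagged, otherwise run a fuzzy search) can be sketched as below. This is an illustration under stated assumptions, not the project's DocumentSearcher. The class name DocumentSearcherSketch, the 0.9 threshold, the rule that a confident NEGATIVE label means "malicious", and the use of difflib for fuzzy matching are all assumptions. The classifier is passed in, so the example runs offline with a hypothetical stub_classifier; the stub returns the same list-of-dicts shape as a Hugging Face sentiment pipeline.

```python
from difflib import SequenceMatcher

ANOMALY_MESSAGE = "ANOMALY: Query blocked due to detected malicious intent."


class DocumentSearcherSketch:
    """Illustrative stand-in for DocumentSearcher, not the project's code."""

    def __init__(self, documents, classifier, threshold=0.9):
        self.documents = list(documents)
        # classifier(query) must return [{"label": ..., "score": ...}].
        self.classifier = classifier
        self.threshold = threshold  # assumed cut-off, not from the project

    def is_query_malicious(self, query):
        result = self.classifier(query)[0]
        # Assumption: a confident NEGATIVE sentiment is treated as malicious.
        return result["label"] == "NEGATIVE" and result["score"] >= self.threshold

    def search(self, query, top_k=3):
        if self.is_query_malicious(query):
            return [ANOMALY_MESSAGE]
        # Fuzzy search: rank documents by character-level similarity to the query.
        scored = [
            (SequenceMatcher(None, query.lower(), doc.lower()).ratio(), doc)
            for doc in self.documents
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [doc for _, doc in scored[:top_k]]


def stub_classifier(query):
    """Hypothetical offline stand-in for the DistilBERT sentiment pipeline."""
    suspicious = "drop table" in query.lower()
    return [{"label": "NEGATIVE" if suspicious else "POSITIVE", "score": 0.95}]


docs = ["This movie had great acting and a compelling storyline.", "A dull plot."]
searcher = DocumentSearcherSketch(docs, stub_classifier)
print(searcher.search("great acting", top_k=1))
print(searcher.search("DROP TABLE users;"))
```

To try the real model instead of the stub, pass pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english") from the transformers package as the classifier. The model is downloaded on first use.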
Why Use AI for Malicious Query Detection?
Traditional pattern matching for detecting malicious queries is limited and can miss more sophisticated or novel attack patterns. By using a pre-trained NLP model, we can leverage the semantic understanding of the text, allowing the system to detect a wider range of harmful queries.
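The limitation of fixed patterns can be seen with a small example. The blocklist below is hypothetical and written only for this illustration. It catches a textbook injection string but misses a trivially obfuscated variant that uses an SQL comment in place of whitespace. The example makes no claim about what the sentiment model itself flags.

```python
import re

# Hypothetical blocklist of the kind a pattern-matching filter might use.
BLOCKLIST = [
    re.compile(r"\bdrop\s+table\b", re.IGNORECASE),
    re.compile(r"\bunion\s+select\b", re.IGNORECASE),
    re.compile(r"\bor\s+1\s*=\s*1\b", re.IGNORECASE),
]


def matches_blocklist(query):
    """Return True if any fixed pattern occurs in the query."""
    return any(pattern.search(query) for pattern in BLOCKLIST)


for query in ["DROP TABLE users;", "DROP/**/TABLE users;", "' OR 2>1 --"]:
    print(repr(query), matches_blocklist(query))
```

Only the first query matches. The other two express the same intent in a form the patterns did not anticipate.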
Improvements and Future Work
- Custom Fine-Tuning: The project currently uses a pre-trained sentiment analysis model. In future versions, a custom model fine-tuned on a dataset of malicious queries could provide better results.
- Integration with Vector Search (FAISS): For larger datasets, integrating a vector search engine like FAISS could speed up document retrieval.
- Real-Time Query Monitoring: Adding a real-time monitoring system to detect and log malicious queries for further analysis.
Contributing
Feel free to fork this repository and submit pull requests. Contributions are welcome!
License
This project is licensed under the MIT License - see the LICENSE file for details.
Contact
For any questions or issues, please contact the project maintainer:
Name: Talex Maxim
Email: taimax13@gmail.com
GitHub: taimax13