---
title: Chagu Demo
emoji: 📊
colorFrom: pink
colorTo: purple
sdk: streamlit
sdk_version: 1.40.1
app_file: app.py
pinned: false
license: mit
short_description: 'this is demo for chain guard protocol, assistant, RAG '
---
# **AI-Powered Document Search with Malicious Query Detection**
This project implements a semantic search engine for documents with **AI-based malicious query detection**. Users can search through movie reviews (IMDB dataset) and additional `.txt` files, while a pre-trained NLP model identifies and blocks potentially malicious queries.
## **Features**
- **Semantic Search**: Uses fuzzy matching for normal queries, allowing context-aware searches.
- **AI-Based Malicious Query Detection**: Utilizes a pre-trained NLP model (`DistilBERT`) to detect queries with malicious intent, blocking potential SQL injection and other harmful queries.
- **Flexible Document Ingestion**: Supports loading documents from the IMDB dataset and additional `.txt` files.
- **Efficient Path Handling**: Automatically handles dataset paths using the `HOME` environment variable.
## **Technologies Used**
- **Python 3.8+**
- **Transformers**: For NLP-based malicious query detection.
- **Hugging Face Pipeline**: Uses the `distilbert-base-uncased-finetuned-sst-2-english` model for sentiment analysis.
- **Pathlib**: For robust file and path handling.
## **Project Structure**
├── rag_chagu_demo.py        # Main script containing the DocumentSearcher class
├── README.md                # This file
├── data-sets/               # Now located under $HOME
│   ├── aclImdb/
│   │   ├── train/
│   │   │   ├── pos/         # Positive movie reviews
│   │   │   └── neg/         # Negative movie reviews
│   └── txt-files/           # Additional .txt files for document search
## **Installation**
Make sure you have Python installed (version 3.8 or higher). Then, install the required dependencies:
```bash
# transformers needs a backend such as PyTorch to run the pipeline
pip install transformers torch
```
## **Dataset Setup**
Place the IMDB dataset in the following structure:
- `$HOME/data-sets/aclImdb/train/pos/`
- `$HOME/data-sets/aclImdb/train/neg/`
Optionally, place additional `.txt` files under:
- `$HOME/data-sets/txt-files/`
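
The directory layout above can be resolved in code with `pathlib`. The following is a minimal sketch, not the script's actual implementation: the helper name `dataset_paths` and the fallback to `Path.home()` when `HOME` is unset are assumptions made for illustration.

```python
import os
from pathlib import Path


def dataset_paths(home=None):
    """Return the expected dataset directories under the home directory.

    Hypothetical helper: the real script may resolve paths differently.
    """
    # Prefer an explicit argument, then $HOME, then the platform default.
    base = Path(home or os.environ.get("HOME") or Path.home()) / "data-sets"
    return {
        "pos": base / "aclImdb" / "train" / "pos",
        "neg": base / "aclImdb" / "train" / "neg",
        "txt": base / "txt-files",
    }


if __name__ == "__main__":
    # Report which of the expected directories are present.
    for name, path in dataset_paths().items():
        status = "found" if path.is_dir() else "missing"
        print(f"{name}: {path} ({status})")
```

Running this before the main script is a quick way to confirm that the dataset was unpacked where the search engine expects it.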
## **Usage**
Run the script with the following command:
```bash
python rag_chagu_demo.py
```
## **Example Output**
```
Looking for positive reviews in: /home/user/data-sets/aclImdb/train/pos
Looking for negative reviews in: /home/user/data-sets/aclImdb/train/neg
Loaded 5000 movie reviews from IMDB dataset.
Normal Query Results:
Document: This movie had great acting and a compelling storyline. The characters were well-developed...
Malicious Query Detected - Confidence: 0.95
Malicious Query Results:
Document: ANOMALY: Query blocked due to detected malicious intent.
```
## How It Works
1. The script initializes the `DocumentSearcher` class, which loads movie reviews and additional `.txt` documents.
2. The `is_query_malicious()` method uses a pre-trained NLP model to detect queries with potential malicious intent based on sentiment analysis.
3. If a query is flagged as malicious, it is blocked and an anomaly message is returned.
4. For normal queries, it performs a fuzzy search through the documents and returns the most relevant matches.
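
The flow described above can be sketched as follows. This is an illustrative stand-in, not the code in `rag_chagu_demo.py`: the constructor arguments, the `search()` method, the `limit` and `cutoff` parameters, and the use of the standard library's `difflib` for fuzzy matching are all assumptions. The malicious-query check is passed in as a plain callable so the example runs without downloading a model.

```python
import difflib


class DocumentSearcher:
    """Illustrative stand-in for the class in rag_chagu_demo.py (assumed API)."""

    def __init__(self, documents, is_malicious):
        self.documents = list(documents)
        # Callable that takes a query string and returns True to block it.
        self.is_malicious = is_malicious

    def search(self, query, limit=3, cutoff=0.1):
        # Step 1: refuse flagged queries before touching the documents.
        if self.is_malicious(query):
            return ["ANOMALY: Query blocked due to detected malicious intent."]
        # Step 2: score each document by fuzzy similarity to the query.
        scored = []
        for doc in self.documents:
            ratio = difflib.SequenceMatcher(None, query.lower(), doc.lower()).ratio()
            if ratio >= cutoff:
                scored.append((ratio, doc))
        # Step 3: return the best matches first.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [doc for _, doc in scored[:limit]]


if __name__ == "__main__":
    docs = ["great acting and storyline", "boring plot"]
    # Toy check used only for this demo; the project uses an NLP model instead.
    searcher = DocumentSearcher(docs, lambda q: "drop table" in q.lower())
    print(searcher.search("great acting"))
    print(searcher.search("DROP TABLE users;"))
```

Keeping the check injectable means the blocking logic and the retrieval logic can be tested separately.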
## AI Model Used
The project uses the DistilBERT model (`distilbert-base-uncased-finetuned-sst-2-english`) from Hugging Face for detecting malicious queries based on sentiment analysis.
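
As a rough sketch of how a sentiment pipeline could drive the check, the example below loads the model through the Hugging Face `pipeline` API and flags a query when it is labelled `NEGATIVE` with high confidence. The decision rule (label `NEGATIVE`, threshold `0.9`) and the `load_classifier` helper are assumptions for illustration; the project's actual `is_query_malicious()` method may use different logic.

```python
def load_classifier():
    """Load the sentiment pipeline (downloads the model on first use)."""
    from transformers import pipeline  # imported lazily: heavy dependency

    return pipeline(
        "sentiment-analysis",
        model="distilbert-base-uncased-finetuned-sst-2-english",
    )


def is_query_malicious(query, classifier, threshold=0.9):
    """Flag a query when the classifier labels it NEGATIVE with high confidence.

    Assumption: the NEGATIVE label and the 0.9 threshold are illustrative
    choices, not values taken from the project.
    """
    # The pipeline returns a list with one dict per input,
    # each holding a "label" and a "score".
    result = classifier(query)[0]
    return result["label"] == "NEGATIVE" and result["score"] >= threshold


if __name__ == "__main__":
    classifier = load_classifier()
    for query in ["great acting and storyline", "DROP TABLE users; --"]:
        print(query, "->", is_query_malicious(query, classifier))
```

Because the classifier is passed in as an argument, the function can be exercised with a stub in tests, and the threshold can be tuned without reloading the model.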
## Why Use AI for Malicious Query Detection?
Traditional pattern matching for detecting malicious queries is limited and can miss more sophisticated or novel attack patterns. By using a pre-trained NLP model, we can leverage the semantic understanding of the text, allowing the system to detect a wider range of harmful queries.
#### Improvements and Future Work
- **Custom Fine-Tuning**: The current implementation relies on a pre-trained sentiment analysis model. In future versions, a custom model fine-tuned on a dataset of malicious queries could provide better results.
- **Integration with Vector Search (FAISS)**: For larger datasets, integrating a vector search engine like FAISS could speed up document retrieval.
- **Real-Time Query Monitoring**: A real-time monitoring system could detect and log malicious queries for further analysis.
#### Contributing
Feel free to fork this repository and submit pull requests. Contributions are welcome!
#### License
This project is licensed under the MIT License - see the LICENSE file for details.
#### Contact
For any questions or issues, please contact the project maintainer:
- Name: Talex Maxim
- Email: taimax13@gmail.com
- GitHub: taimax13