talexm committed on Nov 16, 2024

Commit

aeb8626

1 Parent(s): 595bead

descriptive doc

README.md +95 -2

@@ -11,5 +11,98 @@ license: mit
 short_description: 'this is demo for chain guard protocol, assistant, RAG '
 ---
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
-mm

+# **AI-Powered Document Search with Malicious Query Detection**
+This project implements a document search engine with **AI-based malicious query detection**. Users can search movie reviews (IMDB dataset) and additional `.txt` files, while a pre-trained NLP model identifies and blocks potentially malicious queries.
+## **Features**
+- **Document Search**: Uses fuzzy matching for normal queries, so inexact wording can still match relevant documents.
+- **AI-Based Malicious Query Detection**: Uses a pre-trained DistilBERT sentiment model (`distilbert-base-uncased-finetuned-sst-2-english`) to flag queries with malicious intent, blocking potential SQL injection and other harmful queries.
+- **Flexible Document Ingestion**: Supports loading documents from the IMDB dataset and additional `.txt` files.
+- **Efficient Path Handling**: Automatically handles dataset paths using the `HOME` environment variable.
+## **Technologies Used**
+- **Python 3.8+**
+- **Transformers**: For NLP-based malicious query detection.
+- **Hugging Face Pipeline**: Uses the `distilbert-base-uncased-finetuned-sst-2-english` model for sentiment analysis.
+- **Pathlib**: For robust file and path handling.
+## **Project Structure**
+├── rag_chagu_demo.py # Main script containing the DocumentSearcher class
+├── README.md # This file
+└── data-sets/ # Moved to $HOME
+    ├── aclImdb/
+    │   └── train/
+    │       ├── pos/ # Positive movie reviews
+    │       └── neg/ # Negative movie reviews
+    └── txt-files/ # Additional .txt files for document search
+## **Installation**
+Make sure you have Python installed (version 3.8 or higher). Then, install the required dependencies:
+```bash
+pip install transformers torch
+```
+## **Dataset Setup**
+Place the IMDB dataset in the following structure:
+- `$HOME/data-sets/aclImdb/train/pos/`
+- `$HOME/data-sets/aclImdb/train/neg/`
+
+Optionally, place additional `.txt` files under:
+- `$HOME/data-sets/txt-files/`
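+
+The directory layout above could be resolved and loaded with `pathlib` roughly as follows. This is a minimal sketch, not the code in `rag_chagu_demo.py`: the helper names `dataset_dirs` and `load_txt_documents` are hypothetical, and reading files as UTF-8 is an assumption.

```python
import os
from pathlib import Path


def dataset_dirs(home=None):
    """Return the expected dataset directories under the home directory.

    Hypothetical helper: the names and return shape are illustrative only.
    """
    # Prefer an explicit argument, then the HOME environment variable,
    # then whatever pathlib reports as the home directory.
    base = Path(home or os.environ.get("HOME") or Path.home()) / "data-sets"
    return {
        "pos": base / "aclImdb" / "train" / "pos",
        "neg": base / "aclImdb" / "train" / "neg",
        "txt": base / "txt-files",
    }


def load_txt_documents(directory):
    """Read every .txt file in a directory; a missing directory yields []."""
    directory = Path(directory)
    if not directory.is_dir():
        return []
    # Sort for a deterministic order; undecodable bytes are replaced
    # instead of raising (UTF-8 is an assumption about the files).
    return [
        path.read_text(encoding="utf-8", errors="replace")
        for path in sorted(directory.glob("*.txt"))
    ]
```

+Returning an empty list for a missing directory keeps the optional `txt-files/` folder optional in this sketch.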
+## **Usage**
+Run the script with the following command:
+```bash
+python rag_chagu_demo.py
+```
+## **Example Output**
+```
+Looking for positive reviews in: /home/user/data-sets/aclImdb/train/pos
+Looking for negative reviews in: /home/user/data-sets/aclImdb/train/neg
+Loaded 5000 movie reviews from IMDB dataset.
+Normal Query Results:
+Document: This movie had great acting and a compelling storyline. The characters were well-developed...
+Malicious Query Detected - Confidence: 0.95
+Malicious Query Results:
+Document: ANOMALY: Query blocked due to detected malicious intent.
+```
+## How It Works
+1. The script initializes the `DocumentSearcher` class, which loads the movie reviews and any additional `.txt` documents.
+2. The `is_query_malicious()` method runs each query through a pre-trained sentiment analysis model to detect potential malicious intent.
+3. If a query is flagged as malicious, it is blocked and an anomaly message is returned.
+4. Otherwise, a fuzzy search runs over the documents and returns the most relevant matches.
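+
+The flow above can be sketched as follows. This is an illustration under stated assumptions, not the implementation in `rag_chagu_demo.py`: the rule "confident `NEGATIVE` sentiment means malicious", the `threshold` value, the `fuzzy_score` helper, and the use of `difflib` for fuzzy matching are all assumptions. The classifier is passed in as a callable so the sketch runs without downloading a model.

```python
import difflib


def fuzzy_score(query, document):
    """Average, over query words, of the best similarity to any document word."""
    query_words = query.lower().split()
    doc_words = set(document.lower().split())
    if not query_words or not doc_words:
        return 0.0
    total = 0.0
    for q_word in query_words:
        total += max(
            difflib.SequenceMatcher(None, q_word, d_word).ratio()
            for d_word in doc_words
        )
    return total / len(query_words)


class DocumentSearcher:
    """Illustrative sketch only; not the class from rag_chagu_demo.py."""

    BLOCKED = "ANOMALY: Query blocked due to detected malicious intent."

    def __init__(self, documents, classifier, threshold=0.9):
        self.documents = list(documents)
        # classifier: callable(text) -> [{"label": str, "score": float}],
        # the result shape of a Hugging Face text-classification pipeline.
        self.classifier = classifier
        # Assumed cutoff; the source does not state one.
        self.threshold = threshold

    def is_query_malicious(self, query):
        result = self.classifier(query)[0]
        # Assumption: a confident NEGATIVE sentiment label means "malicious".
        return result["label"] == "NEGATIVE" and result["score"] >= self.threshold

    def search(self, query, limit=3):
        # Blocked queries never reach the document search.
        if self.is_query_malicious(query):
            return [self.BLOCKED]
        scored = [(fuzzy_score(query, doc), doc) for doc in self.documents]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [doc for score, doc in scored[:limit] if score > 0]
```

+With the real model, `classifier` could be `transformers.pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")`, whose results carry a `label` (`POSITIVE` or `NEGATIVE`) and a `score`.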
+## AI Model Used
+The project uses the DistilBERT sentiment analysis model `distilbert-base-uncased-finetuned-sst-2-english` from Hugging Face. The model is not trained on malicious queries; the script uses its sentiment prediction as the signal for flagging a query as malicious.
+## Why Use AI for Malicious Query Detection?
+Traditional pattern matching for detecting malicious queries is limited and can miss more sophisticated or novel attack patterns. By using a pre-trained NLP model, we can leverage the semantic understanding of the text, allowing the system to detect a wider range of harmful queries.
+## Improvements and Future Work
+- **Custom Fine-Tuning**: The current version relies on a pre-trained sentiment analysis model. A custom model fine-tuned on a dataset of malicious queries could provide better results.
+- **Integration with Vector Search (FAISS)**: For larger datasets, integrating a vector search engine such as FAISS could speed up document retrieval.
+- **Real-Time Query Monitoring**: Add a real-time monitoring system that detects and logs malicious queries for further analysis.
+## Contributing
+Feel free to fork this repository and submit pull requests. Contributions are welcome!
+## License
+This project is licensed under the MIT License - see the LICENSE file for details.
+## Contact
+For any questions or issues, please contact the project maintainer:
+- Name: Talex Maxim
+- Email: taimax13@gmail.com
+- GitHub: taimax13