talexm committed · Commit aeb8626 · 1 Parent(s): 595bead
descriptive doc
README.md CHANGED
@@ -11,5 +11,98 @@ license: mit
short_description: 'this is demo for chain guard protocol, assistant, RAG '
---

# **AI-Powered Document Search with Malicious Query Detection**

This project implements a document search engine with **AI-based malicious query detection**. Users can search through movie reviews (IMDB dataset) and additional `.txt` files, while a pre-trained NLP model identifies and blocks potentially malicious queries.

## **Features**
- **Semantic Search**: Uses fuzzy matching for normal queries, allowing context-aware searches.
- **AI-Based Malicious Query Detection**: Uses a pre-trained NLP model (`DistilBERT`) to detect queries with malicious intent, blocking potential SQL injection and other harmful queries.
- **Flexible Document Ingestion**: Supports loading documents from the IMDB dataset and additional `.txt` files.
- **Efficient Path Handling**: Automatically resolves dataset paths using the `HOME` environment variable.

## **Technologies Used**
- **Python 3.8+**
- **Transformers**: For NLP-based malicious query detection.
- **Hugging Face Pipeline**: Uses the `distilbert-base-uncased-finetuned-sst-2-english` model for sentiment analysis.
- **Pathlib**: For robust file and path handling.

## **Project Structure**
├── rag_chagu_demo.py # Main script containing the DocumentSearcher class
├── README.md # This file
├── data-sets/ # This part has moved to $HOME
│   ├── aclImdb/
│   │   ├── train/
│   │   │   ├── pos/ # Positive movie reviews
│   │   │   ├── neg/ # Negative movie reviews
│   ├── txt-files/ # Additional .txt files for document search


## **Installation**
Make sure you have Python 3.8 or higher installed. Then install the required dependency:

```bash
pip install transformers
```

The `transformers` pipeline also needs a deep learning backend such as PyTorch; install one if it is not already present.

## **Dataset Setup**
Place the IMDB dataset in the following structure:

- `$HOME/data-sets/aclImdb/train/pos/`
- `$HOME/data-sets/aclImdb/train/neg/`

Optionally, place additional `.txt` files under:

- `$HOME/data-sets/txt-files/`
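
The layout above can be resolved and read with `pathlib`. The following is a minimal sketch under stated assumptions, not the code in `rag_chagu_demo.py`: the function names `dataset_root`, `load_text_files`, and `load_all_documents` are hypothetical, and the sketch assumes every review is stored as a UTF-8 `.txt` file directly inside its folder.

```python
import os
from pathlib import Path


def dataset_root() -> Path:
    """Return $HOME/data-sets, falling back to Path.home() if HOME is unset."""
    home = os.environ.get("HOME")
    base = Path(home) if home else Path.home()
    return base / "data-sets"


def load_text_files(directory: Path, limit=None) -> list:
    """Read the .txt files directly inside `directory`, in sorted order.

    Returns an empty list when the directory does not exist, so the
    optional txt-files/ folder may be absent.
    """
    if not directory.is_dir():
        return []
    paths = sorted(directory.glob("*.txt"))
    if limit is not None:
        paths = paths[:limit]
    return [p.read_text(encoding="utf-8", errors="replace") for p in paths]


def load_all_documents(root: Path) -> list:
    """Collect positive reviews, negative reviews, and extra .txt documents."""
    train = root / "aclImdb" / "train"
    documents = []
    for folder in (train / "pos", train / "neg", root / "txt-files"):
        documents.extend(load_text_files(folder))
    return documents
```

Calling `load_all_documents(dataset_root())` would gather everything under the three folders; the hypothetical `limit` argument is one way to cap how many reviews are read from a large folder.
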

## **Usage**
Run the script with the following command:

```bash
python rag_chagu_demo.py
```

## **Example Output**
```
Looking for positive reviews in: /home/user/data-sets/aclImdb/train/pos
Looking for negative reviews in: /home/user/data-sets/aclImdb/train/neg
Loaded 5000 movie reviews from IMDB dataset.

Normal Query Results:
Document: This movie had great acting and a compelling storyline. The characters were well-developed...

Malicious Query Detected - Confidence: 0.95
Malicious Query Results:

Document: ANOMALY: Query blocked due to detected malicious intent.
```

## How It Works
- The script initializes the `DocumentSearcher` class, which loads movie reviews and additional `.txt` documents.
- The `is_query_malicious()` method uses a pre-trained NLP model to detect queries with potential malicious intent based on sentiment analysis.
- If a query is flagged as malicious, it is blocked and an anomaly message is returned.
- For normal queries, it performs a fuzzy search through the documents and returns the most relevant matches.
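
The flow described above can be sketched as follows. This is a minimal illustration, not the code in `rag_chagu_demo.py`: the names `fuzzy_score`, `search`, `top_k`, and `min_score` are hypothetical, the standard-library `difflib` stands in for whatever fuzzy matcher the script actually uses, and the malicious-query check is passed in as a plain function so the example runs without a model.

```python
from difflib import SequenceMatcher

BLOCKED_MESSAGE = "ANOMALY: Query blocked due to detected malicious intent."


def fuzzy_score(query: str, document: str) -> float:
    """Average, over query words, of the best character-level match in the document."""
    query_words = query.lower().split()
    doc_words = set(document.lower().split())
    if not query_words or not doc_words:
        return 0.0
    best = [
        max(SequenceMatcher(None, q, w).ratio() for w in doc_words)
        for q in query_words
    ]
    return sum(best) / len(best)


def search(query, documents, is_malicious, top_k=3, min_score=0.6):
    """Block flagged queries; otherwise return the best fuzzy matches."""
    if is_malicious(query):
        return [BLOCKED_MESSAGE]
    scored = [(fuzzy_score(query, doc), doc) for doc in documents]
    # min_score is an assumed cut-off that drops weak matches.
    scored = [(score, doc) for score, doc in scored if score >= min_score]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]
```

Because the score is computed word by word, a misspelled query such as `grate actng` can still rank a review that mentions "great acting" first. Scoring every document on every query is linear in the corpus size, which is acceptable for a demo.
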

## AI Model Used
The project uses the DistilBERT model (`distilbert-base-uncased-finetuned-sst-2-english`) from Hugging Face to detect malicious queries based on sentiment analysis.
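
One way such a check could be wired up is sketched below. The README does not say how the sentiment output is mapped to "malicious", so two things here are assumptions: a `NEGATIVE` label is treated as the malicious signal, and `threshold=0.9` is an arbitrary cut-off. The helper `build_classifier` is a hypothetical name; it imports `transformers` lazily so the decision logic can be exercised with a stub.

```python
from typing import Callable, Dict, List

MODEL_NAME = "distilbert-base-uncased-finetuned-sst-2-english"

# A classifier takes a string and returns a list like
# [{"label": "NEGATIVE", "score": 0.95}], as the Hugging Face pipeline does.
Classifier = Callable[[str], List[Dict[str, object]]]


def build_classifier() -> Classifier:
    """Create the sentiment pipeline (downloads the model on first use)."""
    from transformers import pipeline  # lazy import: only needed for the real model

    return pipeline("sentiment-analysis", model=MODEL_NAME)


def is_query_malicious(query: str, classifier: Classifier, threshold: float = 0.9) -> bool:
    """Assumption: NEGATIVE sentiment at or above `threshold` means malicious."""
    result = classifier(query)[0]
    return result["label"] == "NEGATIVE" and float(result["score"]) >= threshold
```

With the real model this would be called as `is_query_malicious(query, build_classifier())`; building the classifier once and reusing it avoids reloading the model for every query.
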

## Why Use AI for Malicious Query Detection?
Traditional pattern matching is limited: it only catches the attack patterns it was written for and can miss more sophisticated or novel ones. A pre-trained NLP model works from the semantics of the text rather than from fixed signatures, which allows the system to detect a wider range of harmful queries.
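
To make the limitation of pattern matching concrete, here is a small sketch of a signature-based filter. The patterns and the name `matches_known_pattern` are illustrative assumptions and are not part of this project.

```python
import re

# A few hand-written signatures for common SQL injection strings.
SQL_INJECTION_PATTERNS = [
    re.compile(r"\bdrop\s+table\b", re.IGNORECASE),
    re.compile(r"\bunion\s+select\b", re.IGNORECASE),
    re.compile(r"'\s*or\s+1\s*=\s*1", re.IGNORECASE),
]


def matches_known_pattern(query: str) -> bool:
    """Return True if the query matches any of the fixed signatures."""
    return any(pattern.search(query) for pattern in SQL_INJECTION_PATTERNS)
```

The filter flags `'; DROP TABLE users; --`, but a variant that replaces the whitespace with an inline comment, `DROP/**/TABLE users`, passes unnoticed because no signature anticipates it. Each new evasion needs another hand-written rule.
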

## Improvements and Future Work
- **Custom Fine-Tuning**: The current version uses a pre-trained sentiment analysis model. In future versions, a custom model fine-tuned on a dataset of malicious queries could provide better results.
- **Integration with Vector Search (FAISS)**: For larger datasets, integrating a vector search engine like FAISS could speed up document retrieval.
- **Real-Time Query Monitoring**: Adding a real-time monitoring system to detect and log malicious queries for further analysis.
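
The idea behind the vector search item can be sketched without FAISS itself: documents and queries become vectors, and retrieval ranks documents by similarity to the query. In this illustration bag-of-words counts stand in for learned embeddings and a brute-force scan stands in for a FAISS index, so it shows the ranking principle only and none of the speed-up. All names (`embed`, `cosine`, `build_index`, `nearest`) are hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: word counts (a real system would use a learned model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def build_index(documents):
    """Embed each document once so queries do not repeat that work."""
    return [(embed(doc), doc) for doc in documents]


def nearest(query: str, index, top_k: int = 3):
    """Brute-force nearest neighbours; an index such as FAISS would replace this scan."""
    query_vector = embed(query)
    ranked = sorted(index, key=lambda item: cosine(query_vector, item[0]), reverse=True)
    return [doc for _, doc in ranked[:top_k]]
```
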

## Contributing
Feel free to fork this repository and submit pull requests. Contributions are welcome!

## License
This project is licensed under the MIT License. See the LICENSE file for details.

## Contact
For any questions or issues, please contact the project maintainer:

- Name: Talex Maxim
- Email: taimax13@gmail.com
- GitHub: taimax13