talexm committed on
Commit aeb8626 · 1 Parent(s): 595bead

descriptive doc

  1. README.md +95 -2
README.md CHANGED
@@ -11,5 +11,98 @@ license: mit
  short_description: 'this is demo for chain guard protocol, assistant, RAG '
  ---
 
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
- mm
+ # **AI-Powered Document Search with Malicious Query Detection**
+
+ This project implements a document search engine with **AI-based malicious query detection**. Users can search movie reviews (IMDB dataset) and additional `.txt` files, while a pre-trained NLP model flags and blocks queries that look malicious.
+
+ ## **Features**
+ - **Document Search**: Uses fuzzy matching for normal queries, which tolerates typos and small wording differences.
+ - **AI-Based Malicious Query Detection**: Uses a pre-trained `DistilBERT` sentiment model to flag queries that look malicious, such as SQL injection attempts, and blocks them.
+ - **Flexible Document Ingestion**: Loads documents from the IMDB dataset and from additional `.txt` files.
+ - **Path Handling**: Resolves dataset paths from the `HOME` environment variable.
+
+ ## **Technologies Used**
+ - **Python 3.8+**
+ - **Transformers**: For NLP-based malicious query detection.
+ - **Hugging Face Pipeline**: Uses the `distilbert-base-uncased-finetuned-sst-2-english` model for sentiment analysis.
+ - **Pathlib**: For robust file and path handling.
+
+ ## **Project Structure**
+ ├── rag_chagu_demo.py       # Main script containing the DocumentSearcher class
+ ├── README.md               # This file
+ ├── data-sets/              # Located under $HOME, not in the repository
+ │   ├── aclImdb/
+ │   │   ├── train/
+ │   │   │   ├── pos/        # Positive movie reviews
+ │   │   │   └── neg/        # Negative movie reviews
+ │   └── txt-files/          # Additional .txt files for document search
+
+ ## **Installation**
+ Make sure you have Python 3.8 or higher installed. Then install the required dependency:
+
+ ```bash
+ pip install transformers
+ ```
+ The `transformers` pipeline also needs a deep learning backend such as PyTorch (`pip install torch`) if one is not already installed.
+
+ ## **Dataset Setup**
+ Place the IMDB dataset in the following structure:
+
+ - `$HOME/data-sets/aclImdb/train/pos/`
+ - `$HOME/data-sets/aclImdb/train/neg/`
+
+ Optionally, place additional `.txt` files under:
+
+ - `$HOME/data-sets/txt-files/`
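The directory layout above can be resolved and read with `pathlib`. The following is a minimal sketch, not the code from `rag_chagu_demo.py`: the helper names `dataset_paths` and `load_txt_documents` and the `limit` parameter are hypothetical, and it assumes every review is stored as a UTF-8 `.txt` file.

```python
from pathlib import Path
from typing import Dict, List, Optional


def dataset_paths(home: Optional[Path] = None) -> Dict[str, Path]:
    """Build the dataset directories relative to the home directory.

    Hypothetical helper: Path.home() reads the HOME variable on POSIX systems.
    """
    base = (home or Path.home()) / "data-sets"
    return {
        "pos": base / "aclImdb" / "train" / "pos",
        "neg": base / "aclImdb" / "train" / "neg",
        "txt": base / "txt-files",
    }


def load_txt_documents(directory: Path, limit: Optional[int] = None) -> List[str]:
    """Read the .txt files in `directory` in sorted order.

    A missing directory yields an empty list instead of raising.
    """
    if not directory.is_dir():
        return []
    files = sorted(directory.glob("*.txt"))
    if limit is not None:
        files = files[:limit]
    # errors="ignore" skips undecodable bytes rather than aborting the load
    return [f.read_text(encoding="utf-8", errors="ignore") for f in files]


if __name__ == "__main__":
    paths = dataset_paths()
    reviews = load_txt_documents(paths["pos"]) + load_txt_documents(paths["neg"])
    extra = load_txt_documents(paths["txt"])
    print(f"Loaded {len(reviews)} reviews and {len(extra)} extra documents.")
```

Returning an empty list for a missing directory keeps the optional `txt-files/` folder truly optional.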
+
+ ## **Usage**
+ Run the script with the following command:
+
+ ```bash
+ python rag_chagu_demo.py
+ ```
+
+ ## **Example Output**
+ ```
+ Looking for positive reviews in: /home/user/data-sets/aclImdb/train/pos
+ Looking for negative reviews in: /home/user/data-sets/aclImdb/train/neg
+ Loaded 5000 movie reviews from IMDB dataset.
+
+ Normal Query Results:
+ Document: This movie had great acting and a compelling storyline. The characters were well-developed...
+
+ Malicious Query Detected - Confidence: 0.95
+ Malicious Query Results:
+
+ Document: ANOMALY: Query blocked due to detected malicious intent.
+ ```
+
+ ## How It Works
+ - The script initializes the `DocumentSearcher` class, which loads movie reviews and additional `.txt` documents.
+ - The `is_query_malicious()` method uses a pre-trained NLP model to detect queries with potential malicious intent based on sentiment analysis.
+ - If a query is flagged as malicious, it is blocked and an anomaly message is returned.
+ - For normal queries, it performs a fuzzy search through the documents and returns the most relevant matches.
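The fuzzy search step could look roughly like the sketch below. It uses `difflib.SequenceMatcher` from the standard library, which is an assumption: the README does not say which fuzzy-matching method the script uses. The function names, `top_k`, and `min_score` are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def fuzzy_score(query: str, document: str) -> float:
    """Return the best similarity (0.0 to 1.0) between the query and any
    window of the document that has the same number of words as the query."""
    q_words = query.lower().split()
    d_words = document.lower().split()
    if not q_words or not d_words:
        return 0.0
    window = len(q_words)
    q = " ".join(q_words)
    best = 0.0
    # Slide a query-sized window over the document and keep the best ratio
    for i in range(max(1, len(d_words) - window + 1)):
        chunk = " ".join(d_words[i:i + window])
        best = max(best, SequenceMatcher(None, q, chunk).ratio())
    return best


def fuzzy_search(
    query: str,
    documents: List[str],
    top_k: int = 3,
    min_score: float = 0.6,
) -> List[Tuple[float, str]]:
    """Rank documents by fuzzy score and keep the top_k above min_score."""
    scored = [(fuzzy_score(query, doc), doc) for doc in documents]
    hits = [pair for pair in scored if pair[0] >= min_score]
    return sorted(hits, key=lambda pair: pair[0], reverse=True)[:top_k]


if __name__ == "__main__":
    docs = [
        "This movie had great acting and a compelling storyline.",
        "The plot was dull and the pacing was slow.",
    ]
    # The misspelled query still matches the first document
    for score, doc in fuzzy_search("great actng", docs):
        print(f"{score:.2f}  {doc}")
```

This compares the query against every window of every document, so it is only practical for small collections.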
+
+ ## AI Model Used
+ The project uses the DistilBERT model `distilbert-base-uncased-finetuned-sst-2-english` from Hugging Face. This is a sentiment analysis model, so malicious intent is inferred from sentiment rather than from a classifier trained on attack data.
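A sentiment pipeline can be wired into a query check along the following lines. This is a sketch under stated assumptions, not the script's actual `is_query_malicious()` method: the rule that a `NEGATIVE` label with a score of at least 0.9 means "malicious" is a guess, and `load_sentiment_classifier` and `guarded_search` are hypothetical names. The classifier is passed in as a callable so the logic can be tried without downloading the model.

```python
from typing import Callable, Dict, List, Tuple

# A classifier maps text to [{"label": ..., "score": ...}], the result shape
# of a Hugging Face text-classification pipeline.
Classifier = Callable[[str], List[Dict[str, object]]]

MODEL_NAME = "distilbert-base-uncased-finetuned-sst-2-english"


def load_sentiment_classifier() -> Classifier:
    """Build the real pipeline (imported lazily: it downloads the model)."""
    from transformers import pipeline
    return pipeline("sentiment-analysis", model=MODEL_NAME)


def is_query_malicious(
    query: str, classify: Classifier, threshold: float = 0.9
) -> Tuple[bool, float]:
    """ASSUMPTION: NEGATIVE sentiment with score >= threshold means malicious."""
    result = classify(query)[0]
    score = float(result["score"])
    flagged = result["label"] == "NEGATIVE" and score >= threshold
    return flagged, score


def guarded_search(
    query: str, classify: Classifier, search: Callable[[str], List[str]]
) -> List[str]:
    """Block flagged queries; otherwise delegate to the normal search."""
    flagged, score = is_query_malicious(query, classify)
    if flagged:
        print(f"Malicious Query Detected - Confidence: {score:.2f}")
        return ["ANOMALY: Query blocked due to detected malicious intent."]
    return search(query)
```

With the real model, `guarded_search(query, load_sentiment_classifier(), my_search)` would run the same check, where `my_search` stands for any function that returns matching documents.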
+
+ ## Why Use AI for Malicious Query Detection?
+ Traditional pattern matching is limited and can miss sophisticated or novel attack patterns. A pre-trained NLP model works from the meaning of the text rather than from fixed patterns, which lets the system flag a wider range of harmful queries.
+
+ #### Improvements and Future Work
+ - **Custom Fine-Tuning**: The current version relies on a pre-trained sentiment analysis model. A custom model fine-tuned on a dataset of malicious queries could give better results.
+ - **Integration with Vector Search (FAISS)**: For larger datasets, a vector search engine such as FAISS could speed up document retrieval.
+ - **Real-Time Query Monitoring**: A real-time monitoring system could detect and log malicious queries for further analysis.
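As one possible starting point for the query monitoring item, blocked queries could be written as JSON lines with the standard `logging` module. Everything here is hypothetical (the logger name `query_audit`, the record fields, and both function names); it is a sketch of the idea, not part of the current script.

```python
import io
import json
import logging
from datetime import datetime, timezone
from typing import Dict, TextIO


def make_audit_logger(stream: TextIO) -> logging.Logger:
    """Create a logger that writes one message per line to `stream`."""
    logger = logging.getLogger("query_audit")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()      # avoid duplicate handlers on repeated calls
    logger.propagate = False     # keep audit lines out of the root logger
    logger.addHandler(logging.StreamHandler(stream))
    return logger


def log_blocked_query(logger: logging.Logger, query: str, score: float) -> Dict[str, object]:
    """Emit a JSON record describing a blocked query and return it."""
    record = {
        "time": datetime.now(timezone.utc).isoformat(),
        "event": "blocked_query",
        "query": query,
        "score": round(score, 4),
    }
    logger.warning(json.dumps(record))
    return record


if __name__ == "__main__":
    buffer = io.StringIO()
    audit = make_audit_logger(buffer)
    log_blocked_query(audit, "'; DROP TABLE users; --", 0.95)
    print(buffer.getvalue().strip())
```

One JSON object per line keeps the log easy to filter or load into other tools later.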
+
+ #### Contributing
+ Feel free to fork this repository and submit pull requests. Contributions are welcome!
+
+ #### License
+ This project is licensed under the MIT License. See the LICENSE file for details.
+
+ #### Contact
+ For any questions or issues, please contact the project maintainer:
+
+ - Name: Talex Maxim
+ - Email: taimax13@gmail.com
+ - GitHub: taimax13