talexm committed e893d68
Parent(s): 0c3cda8
adding blockchain logger
- rag_sec/README.md +251 -11
- rag_sec/backup.py +79 -0
- rag_sec/document_search_system.py +147 -22
- screenshots/Screenshot from 2024-11-30 19-01-31.png +0 -0
rag_sec/README.md
CHANGED
@@ -1,13 +1,43 @@
|
|
1 |
-
|
2 |
|
3 |
The system follows a well-structured workflow to ensure accurate, secure, and context-aware responses to user queries:
|
4 |
|
5 |
-
### 1.
|
6 |
- A user provides a query, which may be a general question, an ambiguous statement, or a potentially malicious input.
|
7 |
|
8 |
---
|
9 |
|
10 |
-
### 2.
|
11 |
- **Purpose**: Classify the query as "bad" or "good."
|
12 |
- **Steps**:
|
13 |
1. Use a sentiment analysis model (`distilbert-base-uncased-finetuned-sst-2-english`) to detect malicious or inappropriate intent.
|
@@ -16,7 +46,7 @@ The system follows a well-structured workflow to ensure accurate, secure, and co
|
|
16 |
|
17 |
---
|
18 |
|
19 |
-
### 3.
|
20 |
- **Purpose**: Rephrase or enhance ambiguous or poorly structured queries for better retrieval.
|
21 |
- **Steps**:
|
22 |
1. Identify missing context or ambiguous phrasing.
|
@@ -27,7 +57,7 @@ The system follows a well-structured workflow to ensure accurate, secure, and co
|
|
27 |
|
28 |
---
|
29 |
|
30 |
-
### 4.
|
31 |
- **Purpose**: Retrieve relevant data and generate a context-aware response.
|
32 |
- **Steps**:
|
33 |
1. **Document Retrieval**:
|
@@ -40,7 +70,7 @@ The system follows a well-structured workflow to ensure accurate, secure, and co
|
|
40 |
|
41 |
---
|
42 |
|
43 |
-
### 5.
|
44 |
- **Purpose**: Provide a concise and meaningful answer.
|
45 |
- **Steps**:
|
46 |
1. Combine the retrieved documents into a coherent context.
|
@@ -49,9 +79,219 @@ The system follows a well-structured workflow to ensure accurate, secure, and co
|
|
49 |
|
50 |
---
|
51 |
|
52 |
-
###
|
53 |
|
54 |
-
#### Input Query:
|
55 |
-
```plaintext
|
56 |
-
"How to improve acting skills?"
|
57 |
-
````
|
|
|
1 |
+
# **Document Search System**
|
2 |
+
|
3 |
+
## **Overview**
|
4 |
+
The **Document Search System** provides context-aware and secure responses to user queries by combining query analysis, document retrieval, semantic response generation, and blockchain-powered logging. The system also integrates Neo4j for storing and visualizing relationships between queries, documents, and responses.
|
5 |
+
|
6 |
+
---
|
7 |
+
|
8 |
+
## **Features**
|
9 |
+
1. **Query Classification:**
|
10 |
+
- Detects malicious or inappropriate queries using a sentiment analysis model.
|
11 |
+
- Blocks malicious queries and prevents them from further processing.
|
12 |
+
|
13 |
+
2. **Query Transformation:**
|
14 |
+
- Rephrases or enhances ambiguous queries to improve retrieval accuracy.
|
15 |
+
- Uses rule-based transformations and advanced text-to-text models.
|
16 |
+
|
17 |
+
3. **RAG Pipeline:**
|
18 |
+
- Retrieves top-k documents based on semantic similarity.
|
19 |
+
- Generates context-aware responses using generative models.
|
20 |
+
|
21 |
+
4. **Blockchain Integration (Chagu):**
|
22 |
+
- Logs all stages of query processing into a blockchain for integrity and traceability.
|
23 |
+
- Validates blockchain integrity.
|
24 |
+
|
25 |
+
5. **Neo4j Integration:**
|
26 |
+
- Stores and visualizes relationships between queries, responses, and documents.
|
27 |
+
- Allows detailed querying and visualization of the data flow.
|
28 |
+
|
29 |
+
---
|
30 |
+
|
31 |
+
## **Workflow**
|
32 |
|
33 |
The system follows a well-structured workflow to ensure accurate, secure, and context-aware responses to user queries:
|
34 |
|
35 |
+
### **1. Input Query**
|
36 |
- A user provides a query, which may be a general question, an ambiguous statement, or a potentially malicious input.
|
37 |
|
38 |
---
|
39 |
|
40 |
+
### **2. Detection Module**
|
41 |
- **Purpose**: Classify the query as "bad" or "good."
|
42 |
- **Steps**:
|
43 |
1. Use a sentiment analysis model (`distilbert-base-uncased-finetuned-sst-2-english`) to detect malicious or inappropriate intent.
|
|
|
46 |
|
47 |
---
|
48 |
|
49 |
+
### **3. Transformation Module**
|
50 |
- **Purpose**: Rephrase or enhance ambiguous or poorly structured queries for better retrieval.
|
51 |
- **Steps**:
|
52 |
1. Identify missing context or ambiguous phrasing.
|
|
|
57 |
|
58 |
---
|
59 |
|
60 |
+
### **4. RAG Pipeline**
|
61 |
- **Purpose**: Retrieve relevant data and generate a context-aware response.
|
62 |
- **Steps**:
|
63 |
1. **Document Retrieval**:
|
|
|
70 |
|
71 |
---
|
72 |
|
73 |
+
### **5. Semantic Response Generation**
|
74 |
- **Purpose**: Provide a concise and meaningful answer.
|
75 |
- **Steps**:
|
76 |
1. Combine the retrieved documents into a coherent context.
|
|
|
79 |
|
80 |
---
|
81 |
|
82 |
+
### **6. Logging and Storage**
|
83 |
+
- **Blockchain Logging:**
|
84 |
+
- Each query, transformed query, response, and document retrieval stage is logged into the blockchain for traceability.
|
85 |
+
- Protects data integrity by keeping tamper-evident records.
|
86 |
+
- **Neo4j Storage:**
|
87 |
+
- Relationships between queries, responses, and retrieved documents are stored in Neo4j.
|
88 |
+
- Enables detailed analysis and graph-based visualization.
|
89 |
+
|
90 |
+
---
|
91 |
+
|
92 |
+
## **Neo4j Visualization**
|
93 |
+
|
94 |
+
Here is an example of how the relationships between queries, responses, and documents appear in Neo4j:
|
95 |
+
|
96 |
+
![Neo4j Visualization](../screenshots/Screenshot%20from%202024-11-30%2019-01-31.png)
|
97 |
+
|
98 |
+
- **Nodes**:
|
99 |
+
- Query: Represents the user query.
|
100 |
+
- TransformedQuery: Rephrased or improved query.
|
101 |
+
- Document: Relevant documents retrieved based on the query.
|
102 |
+
- Response: The generated response.
|
103 |
+
|
104 |
+
- **Relationships**:
|
105 |
+
- `RETRIEVED`: Links the query to retrieved documents.
|
106 |
+
- `TRANSFORMED_TO`: Links the original query to the transformed query.
|
107 |
+
- `GENERATED`: Links the query to the generated response.
|
108 |
+
|
109 |
+
---
|
110 |
+
|
111 |
+
## **Setup Instructions**
|
112 |
+
1. Clone the repository:
|
113 |
+
```bash
|
114 |
+
git clone https://github.com/your-repo/document-search-system.git
|
115 |
+
```
|
116 |
+
|
117 |
+
|
118 |
+
|
119 |
+
|
120 |
+
|
121 |
+
# **Document Search System**
|
122 |
+
|
123 |
+
## **Overview**
|
124 |
+
The **Document Search System** provides context-aware and secure responses to user queries by combining query analysis, document retrieval, semantic response generation, and blockchain-powered logging. The system also integrates Neo4j for storing and visualizing relationships between queries, documents, and responses.
|
125 |
+
|
126 |
+
---
|
127 |
+
|
128 |
+
## **Features**
|
129 |
+
1. **Query Classification:**
|
130 |
+
- Detects malicious or inappropriate queries using a sentiment analysis model.
|
131 |
+
- Blocks malicious queries and prevents them from further processing.
|
132 |
+
|
133 |
+
2. **Query Transformation:**
|
134 |
+
- Rephrases or enhances ambiguous queries to improve retrieval accuracy.
|
135 |
+
- Uses rule-based transformations and advanced text-to-text models.
|
136 |
+
|
137 |
+
3. **RAG Pipeline:**
|
138 |
+
- Retrieves top-k documents based on semantic similarity.
|
139 |
+
- Generates context-aware responses using generative models.
|
140 |
+
|
141 |
+
4. **Blockchain Integration (Chagu):**
|
142 |
+
- Logs all stages of query processing into a blockchain for integrity and traceability.
|
143 |
+
- Validates blockchain integrity.
|
144 |
+
|
145 |
+
5. **Neo4j Integration:**
|
146 |
+
- Stores and visualizes relationships between queries, responses, and documents.
|
147 |
+
- Allows detailed querying and visualization of the data flow.
|
148 |
+
|
149 |
+
---
|
150 |
+
|
151 |
+
## **Workflow**
|
152 |
+
|
153 |
+
The system follows a well-structured workflow to ensure accurate, secure, and context-aware responses to user queries:
|
154 |
+
|
155 |
+
### **1. Input Query**
|
156 |
+
- A user provides a query, which may be a general question, an ambiguous statement, or a potentially malicious input.
|
157 |
+
|
158 |
+
---
|
159 |
+
|
160 |
+
### **2. Detection Module**
|
161 |
+
- **Purpose**: Classify the query as "bad" or "good."
|
162 |
+
- **Steps**:
|
163 |
+
1. Use a sentiment analysis model (`distilbert-base-uncased-finetuned-sst-2-english`) to detect malicious or inappropriate intent.
|
164 |
+
2. If the query is classified as "bad" (e.g., SQL injection or inappropriate tone), block further processing and provide a warning message.
|
165 |
+
3. If "good," proceed to the **Transformation Module**.
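The gating logic in the steps above can be sketched roughly as follows. This is an illustration only, not the repository's `BadQueryDetector`: the `classify` callable stands in for the sentiment model, and `toy_classifier`, `gate_query`, the `"accepted"` status, and the 0.5 threshold are hypothetical names and values chosen for the example.

```python
from typing import Callable, Dict


def toy_classifier(query: str) -> float:
    """Hypothetical stand-in for the sentiment model.

    Returns 1.0 when the query contains an obviously suspicious token and
    0.0 otherwise. A real system would call the model here instead.
    """
    suspicious = ("drop table", "select * from", "; --")
    return 1.0 if any(token in query.lower() for token in suspicious) else 0.0


def gate_query(query: str, classify: Callable[[str], float], threshold: float = 0.5) -> Dict[str, str]:
    """Block the query when the classifier score reaches the threshold."""
    if classify(query) >= threshold:
        return {"status": "rejected", "message": "Query blocked due to detected malicious intent."}
    # "accepted" is an assumed status used only in this sketch.
    return {"status": "accepted", "query": query}


print(gate_query("DROP TABLE users; SELECT * FROM sensitive_data;", toy_classifier)["status"])
print(gate_query("Tell me about great acting performances.", toy_classifier)["status"])
```

Passing the classifier in as a callable keeps the gate testable without loading a model.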
|
166 |
+
|
167 |
+
---
|
168 |
+
|
169 |
+
### **3. Transformation Module**
|
170 |
+
- **Purpose**: Rephrase or enhance ambiguous or poorly structured queries for better retrieval.
|
171 |
+
- **Steps**:
|
172 |
+
1. Identify missing context or ambiguous phrasing.
|
173 |
+
2. Transform the query using:
|
174 |
+
- Rule-based transformations for simple fixes.
|
175 |
+
- Text-to-text models (e.g., `google/flan-t5-small`) for more sophisticated rephrasing.
|
176 |
+
3. Pass the transformed query to the **RAG Pipeline**.
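As a rough sketch of the rule-based half of this step, the snippet below applies a short list of regular-expression rewrites in order. The rules and the `rule_based_transform` name are invented for illustration and are not taken from the repository's `QueryTransformer`; the model-based rephrasing path is not shown.

```python
import re

# Hypothetical rewrite rules, invented for this example.
RULES = [
    (re.compile(r"\s+"), " "),                          # collapse repeated whitespace
    (re.compile(r"\bpls\b", re.IGNORECASE), "please"),  # expand a common abbreviation
    (re.compile(r"\binfo\b", re.IGNORECASE), "information"),
]


def rule_based_transform(query: str) -> str:
    """Apply each rule in order, then trim leading and trailing spaces."""
    for pattern, replacement in RULES:
        query = pattern.sub(replacement, query)
    return query.strip()


print(rule_based_transform("  pls   give info about acting  "))
```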
|
177 |
+
|
178 |
+
---
|
179 |
+
|
180 |
+
### **4. RAG Pipeline**
|
181 |
+
- **Purpose**: Retrieve relevant data and generate a context-aware response.
|
182 |
+
- **Steps**:
|
183 |
+
1. **Document Retrieval**:
|
184 |
+
- Encode the transformed query and documents into embeddings using `all-MiniLM-L6-v2`.
|
185 |
+
- Compute semantic similarity between the query and stored documents.
|
186 |
+
- Retrieve the top-k documents relevant to the query.
|
187 |
+
2. **Response Generation**:
|
188 |
+
- Use the retrieved documents as context.
|
189 |
+
- Pass the query and context to a generative model (e.g., `distilgpt2`) to synthesize a meaningful response.
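The retrieval half of this step comes down to ranking documents by similarity to the query embedding. The sketch below shows that ranking with plain Python and toy three-dimensional vectors; in the real pipeline the vectors would come from the embedding model, and `cosine`, `top_k`, and the document names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: Sequence[float], doc_vecs: Dict[str, Sequence[float]], k: int = 2) -> List[Tuple[str, float]]:
    """Score every document against the query and keep the k best."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "embeddings" made up for the example.
docs = {
    "acting.txt": [0.9, 0.1, 0.0],
    "cooking.txt": [0.0, 0.2, 0.9],
    "theatre.txt": [0.7, 0.3, 0.1],
}
print([name for name, _ in top_k([1.0, 0.0, 0.0], docs, k=2)])
```

A linear scan like this is fine for small collections; larger ones usually need a vector index.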
|
190 |
+
|
191 |
+
---
|
192 |
+
|
193 |
+
### **5. Semantic Response Generation**
|
194 |
+
- **Purpose**: Provide a concise and meaningful answer.
|
195 |
+
- **Steps**:
|
196 |
+
1. Combine the retrieved documents into a coherent context.
|
197 |
+
2. Generate a response tailored to the query using the generative model.
|
198 |
+
3. Return the response to the user, ensuring clarity and relevance.
|
199 |
+
|
200 |
+
---
|
201 |
+
|
202 |
+
### **6. Logging and Storage**
|
203 |
+
- **Blockchain Logging:**
|
204 |
+
- Each query, transformed query, response, and document retrieval stage is logged into the blockchain for traceability.
|
205 |
+
- Protects data integrity by keeping tamper-evident records.
|
206 |
+
- **Neo4j Storage:**
|
207 |
+
- Relationships between queries, responses, and retrieved documents are stored in Neo4j.
|
208 |
+
- Enables detailed analysis and graph-based visualization.
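To illustrate what logging to a blockchain and validating it can mean in practice, here is a minimal hash-chain sketch. It is not the `chainguard.blockchain_logger` implementation, whose internals are not shown in this commit; `ToyChain`, `block_hash`, and the block layout are assumptions made for the example.

```python
import hashlib
import json
from typing import Any, Dict, List


def block_hash(index: int, data: Any, previous_hash: str) -> str:
    """SHA-256 over a canonical JSON encoding of the block contents."""
    payload = json.dumps({"index": index, "data": data, "previous_hash": previous_hash}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ToyChain:
    """Hypothetical append-only log in which each block commits to the previous one."""

    def __init__(self) -> None:
        self.blocks: List[Dict[str, Any]] = []

    def log(self, data: Any) -> Dict[str, Any]:
        previous_hash = self.blocks[-1]["hash"] if self.blocks else "0" * 64
        index = len(self.blocks)
        block = {"index": index, "data": data, "previous_hash": previous_hash,
                 "hash": block_hash(index, data, previous_hash)}
        self.blocks.append(block)
        return block

    def is_valid(self) -> bool:
        previous_hash = "0" * 64
        for block in self.blocks:
            if block["previous_hash"] != previous_hash:
                return False  # the link to the preceding block is broken
            if block["hash"] != block_hash(block["index"], block["data"], block["previous_hash"]):
                return False  # the block contents changed after logging
            previous_hash = block["hash"]
        return True


chain = ToyChain()
chain.log({"type": "query", "content": "How to improve acting skills?"})
chain.log({"type": "response", "content": "toy response text"})
print(chain.is_valid())
chain.blocks[0]["data"]["content"] = "tampered"
print(chain.is_valid())
```

Editing a logged entry changes its recomputed hash, so validation fails; that is the sense in which such a log is tamper-evident rather than tamper-proof.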
|
209 |
+
|
210 |
+
---
|
211 |
+
|
212 |
+
## **Neo4j Visualization**
|
213 |
+
|
214 |
+
Here is an example of how the relationships between queries, responses, and documents appear in Neo4j:
|
215 |
+
|
216 |
+
![Neo4j Visualization](../screenshots/Screenshot%20from%202024-11-30%2019-01-31.png)
|
217 |
+
|
218 |
+
- **Nodes**:
|
219 |
+
- Query: Represents the user query.
|
220 |
+
- TransformedQuery: Rephrased or improved query.
|
221 |
+
- Document: Relevant documents retrieved based on the query.
|
222 |
+
- Response: The generated response.
|
223 |
+
|
224 |
+
- **Relationships**:
|
225 |
+
- `RETRIEVED`: Links the query to retrieved documents.
|
226 |
+
- `TRANSFORMED_TO`: Links the original query to the transformed query.
|
227 |
+
- `GENERATED`: Links the query to the generated response.
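The node and relationship model listed above can be pictured as a set of edges that all start at the original query. The helper below builds those edges in memory as a sketch; `build_edges` and the `Edge` alias are hypothetical, and the snippet does not talk to Neo4j.

```python
from typing import List, Tuple

Edge = Tuple[str, str, str]  # (source text, relationship type, target text)


def build_edges(query: str, transformed_query: str, response: str, documents: List[str]) -> List[Edge]:
    """Return one edge per relationship described in the data model."""
    edges: List[Edge] = [
        (query, "TRANSFORMED_TO", transformed_query),
        (query, "GENERATED", response),
    ]
    edges.extend((query, "RETRIEVED", doc) for doc in documents)
    return edges


for edge in build_edges("q", "q rephrased", "answer", ["doc_a.txt", "doc_b.txt"]):
    print(edge)
```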
|
228 |
+
|
229 |
+
---
|
230 |
+
|
231 |
+
## **Setup Instructions**
|
232 |
+
1. Clone the repository:
|
233 |
+
```bash
|
234 |
+
git clone https://github.com/your-repo/document-search-system.git
|
235 |
+
```
|
236 |
+
2. Install dependencies:
|
237 |
+
|
238 |
+
```bash
|
239 |
+
|
240 |
+
pip install -r requirements.txt
|
241 |
+
```
|
242 |
+
3. Initialize the Neo4j database:
|
243 |
+
|
244 |
+
   - Connect to your Neo4j Aura instance.
|
245 |
+
   - Set up credentials in the code.
|
246 |
+
4. Load the dataset:
|
247 |
+
|
248 |
+
   - Place your documents in the dataset directory (e.g., `data-sets/aclImdb/train`).
|
249 |
+
5. Run the system:
|
250 |
+
|
251 |
+
```bash
|
252 |
+
|
253 |
+
python document_search_system.py
|
254 |
+
```
|
255 |
+
## **Neo4j Queries**
|
256 |
+
### Retrieve All Queries Logged
|
257 |
+
```cypher
|
258 |
+
|
259 |
+
MATCH (q:Query)
|
260 |
+
RETURN q.text AS query, q.timestamp AS timestamp
|
261 |
+
ORDER BY timestamp DESC
|
262 |
+
```
|
263 |
+
|
264 |
+
### Visualize Query Relationships
|
265 |
+
```cypher
|
266 |
+
|
267 |
+
MATCH (n)-[r]->(m)
|
268 |
+
RETURN n, r, m
|
269 |
+
```
|
270 |
+
|
271 |
+
### Find Documents for a Query
|
272 |
+
|
273 |
+
```cypher
|
274 |
+
|
275 |
+
MATCH (q:Query {text: "How to improve acting skills?"})-[:RETRIEVED]->(d:Document)
|
276 |
+
RETURN d.name AS document_name
|
277 |
+
```
|
278 |
+
|
279 |
+
### Key Technologies
|
280 |
+
- **Machine Learning Models:**
|
281 |
+
  - `distilbert-base-uncased-finetuned-sst-2-english` for sentiment analysis.
|
282 |
+
  - `google/flan-t5-small` for query transformation.
|
283 |
+
  - `distilgpt2` for response generation.
|
284 |
+
- **Vector Similarity Search:**
|
285 |
+
  - `all-MiniLM-L6-v2` embeddings for document retrieval.
|
286 |
+
- **Blockchain Logging:**
|
287 |
+
  - Powered by `chainguard.blockchain_logger`.
|
288 |
+
- **Graph-Based Storage:**
|
289 |
+
  - Relationships visualized and queried via Neo4j.
|
290 |
+
|
291 |
+
|
292 |
+
|
293 |
+
|
294 |
+
|
295 |
+
|
296 |
+
|
297 |
|
|
|
|
|
|
|
|
rag_sec/backup.py
ADDED
@@ -0,0 +1,79 @@
|
|
1 |
+
import os
|
2 |
+
from pathlib import Path
|
3 |
+
|
4 |
+
from .bad_query_detector import BadQueryDetector
|
5 |
+
from .query_transformer import QueryTransformer
|
6 |
+
from .document_retriver import DocumentRetriever
|
7 |
+
from .senamtic_response_generator import SemanticResponseGenerator
|
8 |
+
|
9 |
+
|
10 |
+
class DocumentSearchSystem:
|
11 |
+
def __init__(self):
|
12 |
+
"""
|
13 |
+
Initializes the DocumentSearchSystem with:
|
14 |
+
- BadQueryDetector for identifying malicious or inappropriate queries.
|
15 |
+
- QueryTransformer for improving or rephrasing queries.
|
16 |
+
- DocumentRetriever for semantic document retrieval.
|
17 |
+
- SemanticResponseGenerator for generating context-aware responses.
|
18 |
+
"""
|
19 |
+
self.detector = BadQueryDetector()
|
20 |
+
self.transformer = QueryTransformer()
|
21 |
+
self.retriever = DocumentRetriever()
|
22 |
+
self.response_generator = SemanticResponseGenerator()
|
23 |
+
|
24 |
+
def process_query(self, query):
|
25 |
+
"""
|
26 |
+
Processes a user query through the following steps:
|
27 |
+
1. Detect if the query is malicious.
|
28 |
+
2. Transform the query if needed.
|
29 |
+
3. Retrieve relevant documents based on the query.
|
30 |
+
4. Generate a response using the retrieved documents.
|
31 |
+
|
32 |
+
:param query: The user query as a string.
|
33 |
+
:return: A dictionary with the status and response or error message.
|
34 |
+
"""
|
35 |
+
if self.detector.is_bad_query(query):
|
36 |
+
return {"status": "rejected", "message": "Query blocked due to detected malicious intent."}
|
37 |
+
|
38 |
+
# Transform the query
|
39 |
+
transformed_query = self.transformer.transform_query(query)
|
40 |
+
print(f"Transformed Query: {transformed_query}")
|
41 |
+
|
42 |
+
# Retrieve relevant documents
|
43 |
+
retrieved_docs = self.retriever.retrieve(transformed_query)
|
44 |
+
if not retrieved_docs:
|
45 |
+
return {"status": "no_results", "message": "No relevant documents found for your query."}
|
46 |
+
|
47 |
+
# Generate a response based on the retrieved documents
|
48 |
+
response = self.response_generator.generate_response(retrieved_docs)
|
49 |
+
return {"status": "success", "response": response}
|
50 |
+
|
51 |
+
|
52 |
+
def test_system():
|
53 |
+
"""
|
54 |
+
Test the DocumentSearchSystem with normal and malicious queries.
|
55 |
+
- Load documents from a dataset directory.
|
56 |
+
- Perform a normal query and display results.
|
57 |
+
- Perform a malicious query to ensure proper blocking.
|
58 |
+
"""
|
59 |
+
# Define the path to the dataset directory
|
60 |
+
home_dir = Path(os.getenv("HOME", "/"))
|
61 |
+
data_dir = home_dir / "data-sets/aclImdb/train"
|
62 |
+
|
63 |
+
# Initialize the system
|
64 |
+
system = DocumentSearchSystem()
|
65 |
+
system.retriever.load_documents(data_dir)
|
66 |
+
|
67 |
+
# Perform a normal query
|
68 |
+
normal_query = "Tell me about great acting performances."
|
69 |
+
print("\nNormal Query Result:")
|
70 |
+
print(system.process_query(normal_query))
|
71 |
+
|
72 |
+
# Perform a malicious query
|
73 |
+
malicious_query = "DROP TABLE users; SELECT * FROM sensitive_data;"
|
74 |
+
print("\nMalicious Query Result:")
|
75 |
+
print(system.process_query(malicious_query))
|
76 |
+
|
77 |
+
|
78 |
+
if __name__ == "__main__":
|
79 |
+
test_system()
|
rag_sec/document_search_system.py
CHANGED
@@ -1,25 +1,123 @@
|
|
1 |
import os
|
2 |
from pathlib import Path
|
|
|
|
|
3 |
|
4 |
-
|
5 |
-
from
|
6 |
-
from .document_retriver import DocumentRetriever
|
7 |
-
from .senamtic_response_generator import SemanticResponseGenerator
|
8 |
|
|
|
|
|
|
|
|
|
|
|
9 |
|
10 |
-
|
|
|
11 |
def __init__(self):
|
12 |
"""
|
13 |
Initializes the DocumentSearchSystem with:
|
14 |
- BadQueryDetector for identifying malicious or inappropriate queries.
|
15 |
- QueryTransformer for improving or rephrasing queries.
|
16 |
- DocumentRetriever for semantic document retrieval.
|
17 |
- SemanticResponseGenerator for generating context-aware responses.
|
|
|
|
|
18 |
"""
|
19 |
self.detector = BadQueryDetector()
|
20 |
self.transformer = QueryTransformer()
|
21 |
self.retriever = DocumentRetriever()
|
22 |
self.response_generator = SemanticResponseGenerator()
|
|
|
|
|
23 |
|
24 |
def process_query(self, query):
|
25 |
"""
|
@@ -28,6 +126,7 @@ class DocumentSearchSystem:
|
|
28 |
2. Transform the query if needed.
|
29 |
3. Retrieve relevant documents based on the query.
|
30 |
4. Generate a response using the retrieved documents.
|
|
|
31 |
|
32 |
:param query: The user query as a string.
|
33 |
:return: A dictionary with the status and response or error message.
|
@@ -37,43 +136,69 @@ class DocumentSearchSystem:
|
|
37 |
|
38 |
# Transform the query
|
39 |
transformed_query = self.transformer.transform_query(query)
|
40 |
-
|
|
|
|
|
41 |
|
42 |
# Retrieve relevant documents
|
43 |
retrieved_docs = self.retriever.retrieve(transformed_query)
|
44 |
if not retrieved_docs:
|
45 |
return {"status": "no_results", "message": "No relevant documents found for your query."}
|
46 |
|
|
|
|
|
|
|
47 |
# Generate a response based on the retrieved documents
|
48 |
response = self.response_generator.generate_response(retrieved_docs)
|
49 |
-
return {"status": "success", "response": response}
|
50 |
|
51 |
|
52 |
-
def test_system():
|
53 |
-
"""
|
54 |
-
Test the DocumentSearchSystem with normal and malicious queries.
|
55 |
-
- Load documents from a dataset directory.
|
56 |
-
- Perform a normal query and display results.
|
57 |
-
- Perform a malicious query to ensure proper blocking.
|
58 |
-
"""
|
59 |
-
# Define the path to the dataset directory
|
60 |
home_dir = Path(os.getenv("HOME", "/"))
|
61 |
data_dir = home_dir / "data-sets/aclImdb/train"
|
62 |
|
63 |
-
# Initialize the system
|
64 |
-
system = DocumentSearchSystem()
|
65 |
-
system.retriever.load_documents(data_dir)
|
66 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
67 |
# Perform a normal query
|
68 |
normal_query = "Tell me about great acting performances."
|
69 |
print("\nNormal Query Result:")
|
70 |
-
|
|
|
|
|
|
|
|
|
71 |
|
72 |
# Perform a malicious query
|
73 |
malicious_query = "DROP TABLE users; SELECT * FROM sensitive_data;"
|
74 |
print("\nMalicious Query Result:")
|
75 |
-
|
|
|
|
|
76 |
|
77 |
|
78 |
-
if __name__ == "__main__":
|
79 |
-
test_system()
|
|
|
1 |
import os
|
2 |
from pathlib import Path
|
3 |
+
from chainguard.blockchain_logger import BlockchainLogger
|
4 |
+
from neo4j import GraphDatabase
|
5 |
|
6 |
+
import sys
|
7 |
+
from os import path
|
|
|
|
|
8 |
|
9 |
+
sys.path.append(path.dirname(path.dirname(path.abspath(__file__))))
|
10 |
+
from bad_query_detector import BadQueryDetector
|
11 |
+
from query_transformer import QueryTransformer
|
12 |
+
from document_retriver import DocumentRetriever
|
13 |
+
from senamtic_response_generator import SemanticResponseGenerator
|
14 |
|
15 |
+
|
16 |
+
class DataTransformer:
|
17 |
def __init__(self):
|
18 |
+
"""
|
19 |
+
Initializes a DataTransformer with a blockchain logger instance.
|
20 |
+
"""
|
21 |
+
self.blockchain_logger = BlockchainLogger()
|
22 |
+
|
23 |
+
def secure_transform(self, data):
|
24 |
+
"""
|
25 |
+
Securely transforms the input data by logging it into the blockchain.
|
26 |
+
|
27 |
+
Args:
|
28 |
+
data (dict): The log data or any data to be securely transformed.
|
29 |
+
|
30 |
+
Returns:
|
31 |
+
dict: A dictionary containing the original data, block hash, and blockchain length.
|
32 |
+
"""
|
33 |
+
# Log the data into the blockchain
|
34 |
+
block_details = self.blockchain_logger.log_data(data)
|
35 |
+
|
36 |
+
# Return the block details and blockchain status
|
37 |
+
return {
|
38 |
+
"data": data,
|
39 |
+
**block_details
|
40 |
+
}
|
41 |
+
|
42 |
+
def validate_blockchain(self):
|
43 |
+
"""
|
44 |
+
Validates the integrity of the blockchain.
|
45 |
+
|
46 |
+
Returns:
|
47 |
+
bool: True if the blockchain is valid, False otherwise.
|
48 |
+
"""
|
49 |
+
return self.blockchain_logger.is_blockchain_valid()
|
50 |
+
|
51 |
+
|
52 |
+
class Neo4jHandler:
|
53 |
+
def __init__(self, uri, user, password):
|
54 |
+
"""
|
55 |
+
Initializes a Neo4j handler for storing and querying relationships.
|
56 |
+
"""
|
57 |
+
self.driver = GraphDatabase.driver(uri, auth=(user, password))
|
58 |
+
|
59 |
+
def close(self):
|
60 |
+
self.driver.close()
|
61 |
+
|
62 |
+
def log_relationships(self, query, transformed_query, response, documents):
|
63 |
+
"""
|
64 |
+
Logs the relationships between queries, responses, and documents into Neo4j.
|
65 |
+
"""
|
66 |
+
with self.driver.session() as session:
|
67 |
+
            session.execute_write(self._create_and_link_nodes, query, transformed_query, response, documents)
|
68 |
+
|
69 |
+
@staticmethod
|
70 |
+
def _create_and_link_nodes(tx, query, transformed_query, response, documents):
|
71 |
+
# Create Query node
|
72 |
+
tx.run("MERGE (q:Query {text: $query}) RETURN q", parameters={"query": query})
|
73 |
+
# Create TransformedQuery node
|
74 |
+
tx.run("MERGE (t:TransformedQuery {text: $transformed_query}) RETURN t",
|
75 |
+
parameters={"transformed_query": transformed_query})
|
76 |
+
# Create Response node
|
77 |
+
tx.run("MERGE (r:Response {text: $response}) RETURN r", parameters={"response": response})
|
78 |
+
|
79 |
+
# Link Query to TransformedQuery and Response
|
80 |
+
tx.run(
|
81 |
+
"""
|
82 |
+
MATCH (q:Query {text: $query}), (t:TransformedQuery {text: $transformed_query})
|
83 |
+
MERGE (q)-[:TRANSFORMED_TO]->(t)
|
84 |
+
""", parameters={"query": query, "transformed_query": transformed_query}
|
85 |
+
)
|
86 |
+
tx.run(
|
87 |
+
"""
|
88 |
+
MATCH (q:Query {text: $query}), (r:Response {text: $response})
|
89 |
+
MERGE (q)-[:GENERATED]->(r)
|
90 |
+
""", parameters={"query": query, "response": response}
|
91 |
+
)
|
92 |
+
|
93 |
+
# Create and link Document nodes
|
94 |
+
for doc in documents:
|
95 |
+
tx.run("MERGE (d:Document {name: $doc}) RETURN d", parameters={"doc": doc})
|
96 |
+
tx.run(
|
97 |
+
"""
|
98 |
+
MATCH (q:Query {text: $query}), (d:Document {name: $doc})
|
99 |
+
MERGE (q)-[:RETRIEVED]->(d)
|
100 |
+
""", parameters={"query": query, "doc": doc}
|
101 |
+
)
|
102 |
+
|
103 |
+
|
104 |
+
class DocumentSearchSystem:
|
105 |
+
def __init__(self, neo4j_uri, neo4j_user, neo4j_password):
|
106 |
"""
|
107 |
Initializes the DocumentSearchSystem with:
|
108 |
- BadQueryDetector for identifying malicious or inappropriate queries.
|
109 |
- QueryTransformer for improving or rephrasing queries.
|
110 |
- DocumentRetriever for semantic document retrieval.
|
111 |
- SemanticResponseGenerator for generating context-aware responses.
|
112 |
+
- DataTransformer for blockchain logging of queries and responses.
|
113 |
+
- Neo4jHandler for relationship logging and visualization.
|
114 |
"""
|
115 |
self.detector = BadQueryDetector()
|
116 |
self.transformer = QueryTransformer()
|
117 |
self.retriever = DocumentRetriever()
|
118 |
self.response_generator = SemanticResponseGenerator()
|
119 |
+
self.data_transformer = DataTransformer()
|
120 |
+
self.neo4j_handler = Neo4jHandler(neo4j_uri, neo4j_user, neo4j_password)
|
121 |
|
122 |
def process_query(self, query):
|
123 |
"""
|
|
|
126 |
2. Transform the query if needed.
|
127 |
3. Retrieve relevant documents based on the query.
|
128 |
4. Generate a response using the retrieved documents.
|
129 |
+
5. Log all stages to the blockchain and Neo4j.
|
130 |
|
131 |
:param query: The user query as a string.
|
132 |
:return: A dictionary with the status and response or error message.
|
|
|
136 |
|
137 |
# Transform the query
|
138 |
transformed_query = self.transformer.transform_query(query)
|
139 |
+
|
140 |
+
# Log the original query to the blockchain
|
141 |
+
self.data_transformer.secure_transform({"type": "query", "content": query})
|
142 |
|
143 |
# Retrieve relevant documents
|
144 |
retrieved_docs = self.retriever.retrieve(transformed_query)
|
145 |
if not retrieved_docs:
|
146 |
return {"status": "no_results", "message": "No relevant documents found for your query."}
|
147 |
|
148 |
+
# Log the retrieved documents to the blockchain
|
149 |
+
self.data_transformer.secure_transform({"type": "documents", "content": retrieved_docs})
|
150 |
+
|
151 |
# Generate a response based on the retrieved documents
|
152 |
response = self.response_generator.generate_response(retrieved_docs)
|
|
|
153 |
|
154 |
+
# Log the response to the blockchain
|
155 |
+
blockchain_details = self.data_transformer.secure_transform({"type": "response", "content": response})
|
156 |
+
|
157 |
+
# Log relationships to Neo4j
|
158 |
+
self.neo4j_handler.log_relationships(query, transformed_query, response, retrieved_docs)
|
159 |
+
|
160 |
+
return {
|
161 |
+
"status": "success",
|
162 |
+
"response": response,
|
163 |
+
"retrieved_documents": retrieved_docs,
|
164 |
+
"blockchain_details": blockchain_details
|
165 |
+
}
|
166 |
+
|
167 |
+
def validate_system_integrity(self):
|
168 |
+
"""
|
169 |
+
Validates the integrity of the blockchain.
|
170 |
+
"""
|
171 |
+
return self.data_transformer.validate_blockchain()
|
172 |
+
|
173 |
+
|
174 |
+
if __name__ == "__main__":
|
175 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
176 |
home_dir = Path(os.getenv("HOME", "/"))
|
177 |
data_dir = home_dir / "data-sets/aclImdb/train"
|
178 |
|
|
|
|
|
|
|
179 |
|
180 |
+
# Initialize system with Neo4j credentials
|
181 |
+
system = DocumentSearchSystem(
|
182 |
+
neo4j_uri="neo4j+s://0ca71b10.databases.neo4j.io",
|
183 |
+
neo4j_user="neo4j",
|
184 |
+
neo4j_password="<PINGME ill provide>"
|
185 |
+
)
|
186 |
+
|
187 |
+
system.retriever.load_documents(data_dir)
|
188 |
# Perform a normal query
|
189 |
normal_query = "Tell me about great acting performances."
|
190 |
print("\nNormal Query Result:")
|
191 |
+
result = system.process_query(normal_query)
|
192 |
+
print("Status:", result["status"])
|
193 |
+
    print("Response:", result.get("response"))
|
194 |
+
    print("Retrieved Documents:", result.get("retrieved_documents"))
|
195 |
+
    print("Blockchain Details:", result.get("blockchain_details"))
|
196 |
|
197 |
# Perform a malicious query
|
198 |
malicious_query = "DROP TABLE users; SELECT * FROM sensitive_data;"
|
199 |
print("\nMalicious Query Result:")
|
200 |
+
result = system.process_query(malicious_query)
|
201 |
+
print("Status:", result["status"])
|
202 |
+
print("Message:", result.get("message"))
|
203 |
|
204 |
|
|
|
|
screenshots/Screenshot from 2024-11-30 19-01-31.png
ADDED