# **Document Search System**
## **Overview**
The **Document Search System** provides context-aware and secure responses to user queries by combining query analysis, document retrieval, semantic response generation, and blockchain-powered logging. The system also integrates Neo4j for storing and visualizing relationships between queries, documents, and responses.
---
## **Features**
1. **Query Classification:**
- Detects malicious or inappropriate queries using a sentiment analysis model.
- Blocks malicious queries and prevents them from further processing.
2. **Query Transformation:**
- Rephrases or enhances ambiguous queries to improve retrieval accuracy.
- Uses rule-based transformations and advanced text-to-text models.
3. **RAG Pipeline:**
- Retrieves top-k documents based on semantic similarity.
- Generates context-aware responses using generative models.
4. **Blockchain Integration (Chagu):**
- Logs all stages of query processing into a blockchain for integrity and traceability.
- Validates blockchain integrity.
5. **Neo4j Integration:**
- Stores and visualizes relationships between queries, responses, and documents.
- Allows detailed querying and visualization of the data flow.
---
## **Workflow**
The system follows a well-structured workflow to ensure accurate, secure, and context-aware responses to user queries:
### **1. Input Query**
- A user provides a query, which may be a general question, an ambiguous statement, or an input with potentially malicious intent.
---
### **2. Detection Module**
- **Purpose**: Classify the query as "bad" or "good."
- **Steps**:
1. Use a sentiment analysis model (`distilbert-base-uncased-finetuned-sst-2-english`) to detect malicious or inappropriate intent.
2. If the query is classified as "bad" (e.g., SQL injection or inappropriate tone), block further processing and provide a warning message.
3. If "good," proceed to the **Transformation Module**.
---
### **3. Transformation Module**
- **Purpose**: Rephrase or enhance ambiguous or poorly structured queries for better retrieval.
- **Steps**:
1. Identify missing context or ambiguous phrasing.
2. Transform the query using:
- Rule-based transformations for simple fixes.
- Text-to-text models (e.g., `google/flan-t5-small`) for more sophisticated rephrasing.
3. Pass the transformed query to the **RAG Pipeline**.
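
The model-based rephrasing might look roughly like the following; the prompt wording and the `max_new_tokens` value are assumptions made for illustration:
```python
from transformers import pipeline

# Text-to-text model named in this README for query rephrasing.
rephraser = pipeline("text2text-generation", model="google/flan-t5-small")

def transform_query(query: str) -> str:
    """Rephrase an ambiguous query into a clearer, retrieval-friendly form."""
    prompt = f"Rephrase the following question so it is clear and specific: {query}"
    output = rephraser(prompt, max_new_tokens=64)
    return output[0]["generated_text"].strip()

print(transform_query("acting better how?"))
```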
---
### **4. RAG Pipeline**
- **Purpose**: Retrieve relevant data and generate a context-aware response.
- **Steps**:
1. **Document Retrieval**:
- Encode the transformed query and documents into embeddings using `all-MiniLM-L6-v2`.
- Compute semantic similarity between the query and stored documents.
- Retrieve the top-k documents relevant to the query.
2. **Response Generation**:
- Use the retrieved documents as context.
- Pass the query and context to a generative model (e.g., `distilgpt2`) to synthesize a meaningful response.
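
As a sketch of the retrieval step, the snippet below encodes the query and documents with `sentence-transformers` and returns the top-k matches by cosine similarity; the sample documents and the `retrieve_top_k` helper are illustrative only:
```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "Practice improvisation exercises daily to build acting range.",
    "Neo4j stores data as nodes and relationships.",
    "Method acting encourages drawing on personal experience.",
]

def retrieve_top_k(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k documents most semantically similar to the query."""
    query_emb = encoder.encode(query, convert_to_tensor=True)
    doc_embs = encoder.encode(docs, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, doc_embs)[0]  # one similarity score per document
    top_indices = scores.topk(k=min(k, len(docs))).indices
    return [docs[int(i)] for i in top_indices]

print(retrieve_top_k("How to improve acting skills?", documents))
```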
---
### **5. Semantic Response Generation**
- **Purpose**: Provide a concise and meaningful answer.
- **Steps**:
1. Combine the retrieved documents into a coherent context.
2. Generate a response tailored to the query using the generative model.
3. Return the response to the user, ensuring clarity and relevance.
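
A sketch of the generation step: the retrieved documents are joined into a context block and passed to `distilgpt2`; the prompt template and decoding settings are assumptions, not the project's actual configuration:
```python
from transformers import pipeline

generator = pipeline("text-generation", model="distilgpt2")

def generate_response(query: str, retrieved_docs: list[str]) -> str:
    """Combine retrieved documents into a context block and answer the query."""
    context = "\n".join(retrieved_docs)
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    output = generator(prompt, max_new_tokens=80, do_sample=False)
    # The pipeline returns the prompt plus the continuation; keep only the answer.
    return output[0]["generated_text"][len(prompt):].strip()
```
A larger instruction-tuned model would normally produce cleaner answers; `distilgpt2` simply keeps the example lightweight.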
---
### **6. Logging and Storage**
- **Blockchain Logging:**
- Each query, transformed query, response, and document retrieval stage is logged into the blockchain for traceability.
- Ensures data integrity and tamper-proof records.
- **Neo4j Storage:**
- Relationships between queries, responses, and retrieved documents are stored in Neo4j.
- Enables detailed analysis and graph-based visualization.
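
The Neo4j side of this step might be implemented with the official Python driver roughly as follows; the URI, credentials, and the `log_interaction` helper are placeholders, and the blockchain logging call is omitted because its interface is specific to the project's own logger:
```python
from neo4j import GraphDatabase

# Placeholder connection details; real credentials are configured during setup.
driver = GraphDatabase.driver("neo4j+s://<your-aura-host>", auth=("neo4j", "<password>"))

def log_interaction(query: str, transformed: str, response: str, doc_names: list[str]) -> None:
    """Persist one query/response cycle as nodes and relationships in Neo4j."""
    cypher = """
    MERGE (q:Query {text: $query})
    MERGE (t:TransformedQuery {text: $transformed})
    MERGE (r:Response {text: $response})
    MERGE (q)-[:TRANSFORMED_TO]->(t)
    MERGE (q)-[:GENERATED]->(r)
    WITH q
    UNWIND $docs AS doc_name
    MERGE (d:Document {name: doc_name})
    MERGE (q)-[:RETRIEVED]->(d)
    """
    with driver.session() as session:
        session.run(cypher, query=query, transformed=transformed,
                    response=response, docs=doc_names)
```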
---
## **Neo4j Visualization**
Here is an example of how the relationships between queries, responses, and documents appear in Neo4j:
![Neo4j Visualization](../../screenshots/Screenshot_from_2024-11-30_19-01-31.png)
- **Nodes**:
- Query: Represents the user query.
- TransformedQuery: Rephrased or improved query.
- Document: Relevant documents retrieved based on the query.
- Response: The generated response.
- **Relationships**:
- `RETRIEVED`: Links the query to retrieved documents.
- `TRANSFORMED_TO`: Links the original query to the transformed query.
- `GENERATED`: Links the query to the generated response.
---
## **Setup Instructions**
1. Clone the repository:
```bash
git clone https://github.com/your-repo/document-search-system.git
```
2. Install dependencies:
```bash
pip install -r requirements.txt
```
3. Initialize the Neo4j database:
   - Connect to your Neo4j Aura instance.
   - Set up the connection credentials in the code (see the sketch below).
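
A minimal sketch of what the credential setup might look like, assuming the connection details are read from environment variables (the exact mechanism in this project may differ):
```python
import os
from neo4j import GraphDatabase

# Assumed environment variable names; adjust to match your own configuration.
uri = os.environ["NEO4J_URI"]        # e.g. neo4j+s://<your-aura-host>
user = os.environ["NEO4J_USER"]      # usually "neo4j"
password = os.environ["NEO4J_PASSWORD"]

driver = GraphDatabase.driver(uri, auth=(user, password))
driver.verify_connectivity()         # raises if the Aura instance is unreachable
```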
4. Load the dataset:
   - Place your documents in the dataset directory (e.g., `data-sets/aclImdb/train`).
5. Run the system:
```bash
python document_search_system.py
```
---
## **Neo4j Queries**
### **Retrieve All Logged Queries**
```cypher
MATCH (q:Query)
RETURN q.text AS query, q.timestamp AS timestamp
ORDER BY timestamp DESC
```
### **Visualize Query Relationships**
```cypher
MATCH (n)-[r]->(m)
RETURN n, r, m
```
### **Find Documents for a Query**
```cypher
MATCH (q:Query {text: "How to improve acting skills?"})-[:RETRIEVED]->(d:Document)
RETURN d.name AS document_name
```
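
These queries can also be run programmatically. The snippet below executes the document-lookup query with the official Python driver, passing the query text as a parameter; the `driver` object is assumed to be configured as in the setup section:
```python
def find_documents_for_query(driver, query_text: str) -> list[str]:
    """Return the names of documents retrieved for a given query."""
    cypher = (
        "MATCH (q:Query {text: $text})-[:RETRIEVED]->(d:Document) "
        "RETURN d.name AS document_name"
    )
    with driver.session() as session:
        return [record["document_name"] for record in session.run(cypher, text=query_text)]

print(find_documents_for_query(driver, "How to improve acting skills?"))
```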
---
## **Key Technologies**
- **Machine Learning Models:**
   - `distilbert-base-uncased-finetuned-sst-2-english` for sentiment analysis.
   - `google/flan-t5-small` for query transformation.
   - `distilgpt2` for response generation.
- **Vector Similarity Search:**
   - `all-MiniLM-L6-v2` embeddings for document retrieval.
- **Blockchain Logging:**
   - Powered by `chainguard.blockchain_logger`.
- **Graph-Based Storage:**
   - Relationships visualized and queried via Neo4j.