# **Document Search System**

## **Overview**
The **Document Search System** provides context-aware and secure responses to user queries by combining query analysis, document retrieval, semantic response generation, and blockchain-powered logging. The system also integrates Neo4j for storing and visualizing relationships between queries, documents, and responses.

---

## **Features**
1. **Query Classification:**
   - Detects malicious or inappropriate queries using a sentiment analysis model.
   - Blocks malicious queries and prevents them from further processing.

2. **Query Transformation:**
   - Rephrases or enhances ambiguous queries to improve retrieval accuracy.
   - Uses rule-based transformations and advanced text-to-text models.

3. **RAG Pipeline:**
   - Retrieves top-k documents based on semantic similarity.
   - Generates context-aware responses using generative models.

4. **Blockchain Integration (Chagu):**
   - Logs all stages of query processing into a blockchain for integrity and traceability.
   - Validates blockchain integrity.

5. **Neo4j Integration:**
   - Stores and visualizes relationships between queries, responses, and documents.
   - Allows detailed querying and visualization of the data flow.

---

## **Workflow**

Each query passes through the following stages in order, from input and screening to response generation and logging:

### **1. Input Query**
- A user submits a query, which may be a general question, an ambiguous statement, or a potentially malicious input.

---

### **2. Detection Module**
- **Purpose**: Classify the query as "bad" or "good."
- **Steps**:
  1. Use a sentiment analysis model (`distilbert-base-uncased-finetuned-sst-2-english`) to detect malicious or inappropriate intent.
  2. If the query is classified as "bad" (e.g., SQL injection or inappropriate tone), block further processing and provide a warning message.
  3. If "good," proceed to the **Transformation Module**.
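
The detection steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names, the `threshold` parameter, and the rule that a high-confidence `NEGATIVE` label means "bad" are all assumptions. The classifier is passed in as a callable so the logic can be exercised without downloading the model.

```python
from typing import Callable, Dict, List, Tuple

# A classifier maps text to a list such as [{"label": "NEGATIVE", "score": 0.98}],
# which is the shape returned by a Hugging Face text-classification pipeline.
Classifier = Callable[[str], List[Dict[str, object]]]


def load_classifier() -> Classifier:
    """Build the real classifier (requires the `transformers` package)."""
    from transformers import pipeline  # lazy import: only needed for the real model

    return pipeline(
        "sentiment-analysis",
        model="distilbert-base-uncased-finetuned-sst-2-english",
    )


def is_bad_query(query: str, classifier: Classifier, threshold: float = 0.9) -> bool:
    """Return True when the query should be blocked.

    Assumption: a NEGATIVE label scored at or above `threshold` counts as "bad".
    """
    result = classifier(query)[0]
    return result["label"] == "NEGATIVE" and float(result["score"]) >= threshold


def detect(query: str, classifier: Classifier) -> Tuple[bool, str]:
    """Return (allowed, payload): the query itself if allowed, else a warning."""
    if is_bad_query(query, classifier):
        return False, "Warning: this query was blocked by the detection module."
    return True, query
```

With the real model, usage would look like `allowed, payload = detect(user_query, load_classifier())`; only allowed queries continue to the Transformation Module.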

---

### **3. Transformation Module**
- **Purpose**: Rephrase or enhance ambiguous or poorly structured queries for better retrieval.
- **Steps**:
  1. Identify missing context or ambiguous phrasing.
  2. Transform the query using:
     - Rule-based transformations for simple fixes.
     - Text-to-text models (e.g., `google/flan-t5-small`) for more sophisticated rephrasing.
  3. Pass the transformed query to the **RAG Pipeline**.
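
A possible shape for the two transformation paths is sketched below. The rewrite rules, the `min_words` heuristic for deciding that a query is ambiguous, and the prompt wording are hypothetical; the source names only the model `google/flan-t5-small`.

```python
import re
from typing import Callable, Optional

# Hypothetical rule set: (pattern, replacement) pairs applied in order.
RULES = [
    (re.compile(r"\s+"), " "),                              # collapse whitespace runs
    (re.compile(r"\bpls\b", re.IGNORECASE), "please"),      # expand shorthand
    (re.compile(r"\binfo\b", re.IGNORECASE), "information"),
]


def apply_rules(query: str) -> str:
    """Apply the simple rule-based fixes."""
    text = query.strip()
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text


def load_rephraser() -> Callable[[str], str]:
    """Build a model-backed rephraser (requires `transformers` and `torch`)."""
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")
    model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")

    def rephrase(query: str) -> str:
        # The prompt wording is an assumption, not taken from the project.
        inputs = tokenizer(f"Rephrase this question clearly: {query}", return_tensors="pt")
        output_ids = model.generate(**inputs, max_new_tokens=64)
        return tokenizer.decode(output_ids[0], skip_special_tokens=True)

    return rephrase


def transform_query(
    query: str,
    rephraser: Optional[Callable[[str], str]] = None,
    min_words: int = 4,
) -> str:
    """Clean the query with rules, then rephrase it if it still looks ambiguous."""
    cleaned = apply_rules(query)
    # Assumption: very short queries are the ones worth sending to the model.
    if rephraser is not None and len(cleaned.split()) < min_words:
        return rephraser(cleaned)
    return cleaned
```

Keeping the rules first means most queries never reach the model, which keeps the common path fast.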

---

### **4. RAG Pipeline**
- **Purpose**: Retrieve relevant data and generate a context-aware response.
- **Steps**:
  1. **Document Retrieval**:
     - Encode the transformed query and documents into embeddings using `all-MiniLM-L6-v2`.
     - Compute semantic similarity between the query and stored documents.
     - Retrieve the top-k documents relevant to the query.
  2. **Response Generation**:
     - Use the retrieved documents as context.
     - Pass the query and context to a generative model (e.g., `distilgpt2`) to synthesize a meaningful response.
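
The retrieval step can be illustrated with a small cosine-similarity ranking. This sketch assumes cosine similarity as the similarity measure and uses hypothetical function names; the embedding call is isolated in `embed` so the ranking logic runs on plain lists of floats.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def embed(texts: List[str]) -> List[List[float]]:
    """Encode texts with all-MiniLM-L6-v2 (requires `sentence-transformers`)."""
    from sentence_transformers import SentenceTransformer  # lazy import

    model = SentenceTransformer("all-MiniLM-L6-v2")
    return model.encode(texts).tolist()


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: Vector, doc_vecs: Sequence[Vector], k: int = 3) -> List[Tuple[int, float]]:
    """Return (document index, score) pairs for the k most similar documents."""
    scored = [(i, cosine_similarity(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

In practice the document embeddings would be computed once and cached, and only the query would be embedded per request.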

---

### **5. Semantic Response Generation**
- **Purpose**: Provide a concise and meaningful answer.
- **Steps**:
  1. Combine the retrieved documents into a coherent context.
  2. Generate a response tailored to the query using the generative model.
  3. Return the response to the user, ensuring clarity and relevance.
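
One way to combine the retrieved documents and call the generative model is sketched here. The prompt layout, the `max_chars_per_doc` truncation, and the function names are assumptions made for illustration.

```python
from typing import Callable, List


def build_prompt(query: str, documents: List[str], max_chars_per_doc: int = 500) -> str:
    """Join the retrieved documents into one context block followed by the question."""
    context = "\n\n".join(doc[:max_chars_per_doc] for doc in documents)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


def load_generator() -> Callable[[str], str]:
    """Build a distilgpt2-backed generator (requires `transformers`)."""
    from transformers import pipeline  # lazy import

    pipe = pipeline("text-generation", model="distilgpt2")

    def generate(prompt: str) -> str:
        # return_full_text=False drops the prompt from the returned string.
        output = pipe(prompt, max_new_tokens=60, return_full_text=False)
        return output[0]["generated_text"]

    return generate


def generate_response(query: str, documents: List[str], generator: Callable[[str], str]) -> str:
    """Build the prompt, run the generator, and return a trimmed answer."""
    return generator(build_prompt(query, documents)).strip()
```

Truncating each document keeps the prompt within the small context window of a model like `distilgpt2`.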

---

### **6. Logging and Storage**
- **Blockchain Logging:**
  - Each query, transformed query, response, and document retrieval stage is logged into the blockchain for traceability.
  - Supports data integrity by keeping a tamper-evident record of each stage.
- **Neo4j Storage:**
  - Relationships between queries, responses, and retrieved documents are stored in Neo4j.
  - Enables detailed analysis and graph-based visualization.
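
The idea behind blockchain-style logging can be shown with a small hash chain. This is a generic sketch, not the Chagu implementation: the class name, block fields, and use of SHA-256 are assumptions. Each block stores the hash of the previous one, so changing any earlier entry breaks validation.

```python
import hashlib
import json
import time
from typing import Any, Dict, List

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first block


class HashChainLog:
    """Append-only log in which every block commits to the one before it."""

    def __init__(self) -> None:
        self.blocks: List[Dict[str, Any]] = []

    @staticmethod
    def _digest(block: Dict[str, Any]) -> str:
        # Hash every field except the stored hash itself, in a stable key order.
        body = {key: value for key, value in block.items() if key != "hash"}
        encoded = json.dumps(body, sort_keys=True).encode("utf-8")
        return hashlib.sha256(encoded).hexdigest()

    def append(self, stage: str, payload: str) -> Dict[str, Any]:
        """Add a block for one pipeline stage (e.g. "query", "response")."""
        block = {
            "index": len(self.blocks),
            "timestamp": time.time(),
            "stage": stage,
            "payload": payload,
            "prev_hash": self.blocks[-1]["hash"] if self.blocks else GENESIS_HASH,
        }
        block["hash"] = self._digest(block)
        self.blocks.append(block)
        return block

    def is_valid(self) -> bool:
        """Recompute every hash and check that each link matches its predecessor."""
        expected_prev = GENESIS_HASH
        for block in self.blocks:
            if block["prev_hash"] != expected_prev or block["hash"] != self._digest(block):
                return False
            expected_prev = block["hash"]
        return True
```

A pipeline run would call `append` once per stage (query, transformed query, retrieval, response) and `is_valid` whenever integrity needs to be checked.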

---

## **Neo4j Visualization**

Here is an example of how the relationships between queries, responses, and documents appear in Neo4j:

![Neo4j Visualization](../../screenshots/Screenshot_from_2024-11-30_19-01-31.png)

- **Nodes**:
  - `Query`: The original user query.
  - `TransformedQuery`: The rephrased or improved query.
  - `Document`: A document retrieved for the query.
  - `Response`: The generated response.

- **Relationships**:
  - `RETRIEVED`: Links the query to retrieved documents.
  - `TRANSFORMED_TO`: Links the original query to the transformed query.
  - `GENERATED`: Links the query to the generated response.
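
The graph model above could be written from Python roughly as follows. The node labels and relationship types come from the lists above; the `text` property on `TransformedQuery` and `Response`, the function names, and the single-statement layout are assumptions. The session is passed in, so any object with a compatible `run` method works.

```python
from typing import Any, Iterable

# One Cypher statement that creates (or reuses) all nodes and relationships.
STORE_INTERACTION = """
MERGE (q:Query {text: $query})
MERGE (t:TransformedQuery {text: $transformed})
MERGE (q)-[:TRANSFORMED_TO]->(t)
MERGE (r:Response {text: $response})
MERGE (q)-[:GENERATED]->(r)
WITH q
UNWIND $documents AS name
MERGE (d:Document {name: name})
MERGE (q)-[:RETRIEVED]->(d)
"""


def open_driver(uri: str, user: str, password: str) -> Any:
    """Create a Neo4j driver (requires the `neo4j` package)."""
    from neo4j import GraphDatabase  # lazy import

    return GraphDatabase.driver(uri, auth=(user, password))


def store_interaction(
    session: Any,
    query: str,
    transformed: str,
    documents: Iterable[str],
    response: str,
) -> None:
    """Persist one query, its transformed form, its documents, and its response."""
    # Parameters go in a dict: passing `query=` as a keyword would clash with
    # the first argument of the driver's Session.run method.
    session.run(
        STORE_INTERACTION,
        {
            "query": query,
            "transformed": transformed,
            "documents": list(documents),
            "response": response,
        },
    )
```

With the official driver, the call would sit inside `with driver.session() as session:`. Using `MERGE` rather than `CREATE` avoids duplicate nodes when the same query or document appears again.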

---

## **Setup Instructions**
1. Clone the repository:
   ```bash
   git clone https://github.com/your-repo/document-search-system.git
   ```

2. Install dependencies:
   ```bash
   pip install -r requirements.txt
   ```
3. Initialize the Neo4j database:
   - Connect to your Neo4j Aura instance.
   - Set up credentials in the code.
4. Load the dataset:
   - Place your documents in the dataset directory (e.g., `data-sets/aclImdb/train`).
5. Run the system:
   ```bash
   python document_search_system.py
   ```
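
As an alternative to writing the Neo4j credentials directly into the source, they could be read from environment variables. The sketch below is only a suggestion: the variable names `NEO4J_URI`, `NEO4J_USERNAME`, and `NEO4J_PASSWORD` and the helper name are hypothetical and not part of the project.

```python
import os
from typing import Dict, Mapping

# Hypothetical variable names; adjust them to match your deployment.
REQUIRED_VARS = ("NEO4J_URI", "NEO4J_USERNAME", "NEO4J_PASSWORD")


def load_neo4j_settings(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Collect the connection settings, failing early if any are missing."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

Failing at startup with a clear message is usually easier to debug than a connection error later in the pipeline.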

---

## **Neo4j Queries**

### **Retrieve All Queries Logged**
```cypher
MATCH (q:Query)
RETURN q.text AS query, q.timestamp AS timestamp
ORDER BY timestamp DESC
```

### **Visualize Query Relationships**
```cypher
MATCH (n)-[r]->(m)
RETURN n, r, m
```

### **Find Documents for a Query**
```cypher
MATCH (q:Query {text: "How to improve acting skills?"})-[:RETRIEVED]->(d:Document)
RETURN d.name AS document_name
```
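
The document lookup can also be run from Python with the query text passed as a parameter instead of being hardcoded. This is a sketch with a hypothetical helper name; it assumes a session object whose `run` method returns an iterable of records that support key lookup, as the official `neo4j` driver's does.

```python
from typing import Any, List

# Same lookup as the Cypher query for one query's documents, with $text as a parameter.
FIND_DOCUMENTS = """
MATCH (q:Query {text: $text})-[:RETRIEVED]->(d:Document)
RETURN d.name AS document_name
"""


def find_documents(session: Any, query_text: str) -> List[str]:
    """Return the names of the documents retrieved for the given query text."""
    result = session.run(FIND_DOCUMENTS, {"text": query_text})
    return [record["document_name"] for record in result]
```

Parameterizing the text avoids quoting problems and lets the database reuse the query plan across different inputs.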

---

## **Key Technologies**
- **Machine Learning Models:**
  - `distilbert-base-uncased-finetuned-sst-2-english` for sentiment analysis.
  - `google/flan-t5-small` for query transformation.
  - `distilgpt2` for response generation.
- **Vector Similarity Search:**
  - `all-MiniLM-L6-v2` embeddings for document retrieval.
- **Blockchain Logging:**
  - Powered by `chainguard.blockchain_logger`.
- **Graph-Based Storage:**
  - Relationships visualized and queried via Neo4j.