# **Document Search System**

## **Overview**
The **Document Search System** provides context-aware and secure responses to user queries by combining query analysis, document retrieval, semantic response generation, and blockchain-powered logging. The system also integrates Neo4j for storing and visualizing relationships between queries, documents, and responses.

---
## **Features**
1. **Query Classification:**
   - Detects malicious or inappropriate queries using a sentiment analysis model.
   - Blocks malicious queries and prevents them from further processing.
2. **Query Transformation:**
   - Rephrases or enhances ambiguous queries to improve retrieval accuracy.
   - Uses rule-based transformations and advanced text-to-text models.
3. **RAG Pipeline:**
   - Retrieves top-k documents based on semantic similarity.
   - Generates context-aware responses using generative models.
4. **Blockchain Integration (Chagu):**
   - Logs all stages of query processing into a blockchain for integrity and traceability.
   - Validates blockchain integrity.
5. **Neo4j Integration:**
   - Stores and visualizes relationships between queries, responses, and documents.
   - Allows detailed querying and visualization of the data flow.

---
## **Workflow**
Each query passes through the following stages, which together aim to keep responses accurate, secure, and context-aware:

### **1. Input Query**
- A user submits a query, which may be a general question, an ambiguous statement, or a potentially malicious input.

---
### **2. Detection Module**
- **Purpose**: Classify the query as "bad" or "good."
- **Steps**:
  1. Score the query with a sentiment analysis model (`distilbert-base-uncased-finetuned-sst-2-english`) and use the result to flag malicious or inappropriate intent.
  2. If the query is classified as "bad" (e.g., SQL injection or inappropriate tone), stop processing and return a warning message.
  3. If the query is classified as "good," pass it to the **Transformation Module**.
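
The gating logic above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names, the warning text, and the 0.9 confidence threshold are assumptions, as is the mapping of the model's `NEGATIVE` label to "bad". The classifier is passed in as a callable so the decision logic can be exercised without downloading the model; `load_sentiment_classifier` shows one way to obtain the real one through the Hugging Face `pipeline` helper.

```python
from typing import Callable, Dict, List

# A classifier takes a string and returns [{"label": ..., "score": ...}],
# the shape produced by a Hugging Face text-classification pipeline.
Classifier = Callable[[str], List[Dict[str, object]]]

# Hypothetical warning text; the real system may word this differently.
WARNING_MESSAGE = "Query blocked: it was classified as potentially malicious."


def load_sentiment_classifier() -> Classifier:
    """Load the SST-2 model named above (downloads weights on first use)."""
    from transformers import pipeline  # imported lazily: heavy dependency

    return pipeline(
        "sentiment-analysis",
        model="distilbert-base-uncased-finetuned-sst-2-english",
    )


def classify_query(query: str, classifier: Classifier, threshold: float = 0.9) -> str:
    """Return "bad" or "good" for a query.

    Assumption: a NEGATIVE label at or above `threshold` maps to "bad".
    """
    result = classifier(query)[0]
    is_bad = result["label"] == "NEGATIVE" and float(result["score"]) >= threshold
    return "bad" if is_bad else "good"


def detect(query: str, classifier: Classifier) -> Dict[str, object]:
    """Gate a query: blocked queries carry a warning, others may continue."""
    if classify_query(query, classifier) == "bad":
        return {"allowed": False, "message": WARNING_MESSAGE}
    return {"allowed": True, "message": None}
```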
---

### **3. Transformation Module**
- **Purpose**: Rephrase or enhance ambiguous or poorly structured queries for better retrieval.
- **Steps**:
  1. Identify missing context or ambiguous phrasing.
  2. Transform the query using:
     - Rule-based transformations for simple fixes.
     - Text-to-text models (e.g., `google/flan-t5-small`) for more sophisticated rephrasing.
  3. Pass the transformed query to the **RAG Pipeline**.
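
As a rough sketch of the two-tier approach above: cheap rules run first, and a model is consulted only when the cleaned query still looks underspecified. The specific rules, the word-count heuristic, and all function names here are hypothetical. The model is represented by an optional callable, which in practice could wrap `google/flan-t5-small`.

```python
import re
from typing import Callable, Optional

# Hypothetical rewrite rules: (pattern, replacement) pairs applied in order.
RULES = [
    (re.compile(r"\s+"), " "),                                # collapse whitespace
    (re.compile(r"\bdocs\b", re.IGNORECASE), "documents"),    # expand abbreviation
    (re.compile(r"\binfo\b", re.IGNORECASE), "information"),  # expand abbreviation
]

QUESTION_WORDS = {"how", "what", "why", "when", "where", "who", "which"}


def apply_rules(query: str) -> str:
    """Apply the simple fixes: tidy whitespace, expand terms, add a '?'."""
    text = query.strip()
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    first_word = text.split(" ", 1)[0].lower()
    if first_word in QUESTION_WORDS and not text.endswith("?"):
        text += "?"
    return text


def needs_model_rewrite(query: str, min_words: int = 3) -> bool:
    """Hypothetical ambiguity check: very short queries lack context."""
    return len(query.split()) < min_words


def transform_query(
    query: str, rewrite_model: Optional[Callable[[str], str]] = None
) -> str:
    """Run the rules, then fall back to the model for ambiguous queries."""
    cleaned = apply_rules(query)
    if rewrite_model is not None and needs_model_rewrite(cleaned):
        return rewrite_model(cleaned)
    return cleaned
```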
---

### **4. RAG Pipeline**
- **Purpose**: Retrieve relevant data and generate a context-aware response.
- **Steps**:
  1. **Document Retrieval**:
     - Encode the transformed query and documents into embeddings using `all-MiniLM-L6-v2`.
     - Compute semantic similarity between the query and stored documents.
     - Retrieve the top-k documents relevant to the query.
  2. **Response Generation**:
     - Use the retrieved documents as context.
     - Pass the query and context to a generative model (e.g., `distilgpt2`) to synthesize a meaningful response.
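
The retrieval step can be illustrated with a small sketch. It assumes cosine similarity as the similarity measure (the text above only says "semantic similarity") and uses hypothetical function names. The ranking is written in plain Python so it is easy to follow; `embed_texts` shows how embeddings could be produced with the `sentence-transformers` package.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_documents(
    query_vec: Sequence[float],
    doc_vecs: Dict[str, Sequence[float]],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Rank documents by similarity to the query and keep the best k."""
    scored = [
        (name, cosine_similarity(query_vec, vec)) for name, vec in doc_vecs.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


def embed_texts(texts: List[str]) -> List[List[float]]:
    """Embed texts with all-MiniLM-L6-v2 (downloads weights on first use)."""
    from sentence_transformers import SentenceTransformer  # lazy import

    model = SentenceTransformer("all-MiniLM-L6-v2")
    return model.encode(texts).tolist()
```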
---

### **5. Semantic Response Generation**
- **Purpose**: Provide a concise and meaningful answer.
- **Steps**:
  1. Combine the retrieved documents into a coherent context.
  2. Generate a response tailored to the query using the generative model.
  3. Return the response to the user, ensuring clarity and relevance.
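
One possible way to combine documents into a context and hand it to the generative model is sketched below. The prompt layout, the character budget, and the function names are assumptions made for illustration. The budget exists because small models such as `distilgpt2` accept only a limited input length. `generate_response` uses the Hugging Face `text-generation` pipeline and is not called at import time.

```python
from typing import List


def build_context(documents: List[str], max_chars: int = 1000) -> str:
    """Join retrieved documents, truncating so the prompt stays short."""
    parts: List[str] = []
    used = 0
    for i, doc in enumerate(documents, start=1):
        remaining = max_chars - used
        if remaining <= 0:
            break
        snippet = " ".join(doc.split())[:remaining]  # normalise whitespace, trim
        parts.append(f"[{i}] {snippet}")
        used += len(snippet)
    return "\n".join(parts)


def build_prompt(query: str, documents: List[str]) -> str:
    """Assemble a context-plus-question prompt (hypothetical layout)."""
    context = build_context(documents)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


def generate_response(prompt: str, max_new_tokens: int = 60) -> str:
    """Generate an answer with distilgpt2 (downloads weights on first use)."""
    from transformers import pipeline  # lazy import: heavy dependency

    generator = pipeline("text-generation", model="distilgpt2")
    output = generator(
        prompt, max_new_tokens=max_new_tokens, return_full_text=False
    )
    return output[0]["generated_text"].strip()
```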
---

### **6. Logging and Storage**
- **Blockchain Logging:**
  - The original query, the transformed query, the retrieved documents, and the response are each logged to the blockchain for traceability.
  - Keeps records tamper-evident, which protects data integrity.
- **Neo4j Storage:**
  - Relationships between queries, responses, and retrieved documents are stored in Neo4j.
  - Enables detailed analysis and graph-based visualization.
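
To show the general idea behind blockchain-style logging and integrity validation, here is a minimal hash-chain sketch. It is a stand-in written for illustration and does not reproduce the API of the project's own blockchain logger; the `StageLog` class, its field names, and the stage labels are all hypothetical. Each block stores the hash of its predecessor, so editing any earlier record breaks validation.

```python
import hashlib
import json
import time
from typing import Dict, List

GENESIS_HASH = "0" * 64  # placeholder predecessor for the first block


def _hash_block(block: Dict) -> str:
    """SHA-256 over the block's content fields (excluding its own hash)."""
    payload = json.dumps(
        {k: block[k] for k in ("index", "timestamp", "stage", "data", "prev_hash")},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class StageLog:
    """Append-only log in which every block links to the previous one."""

    def __init__(self) -> None:
        self.chain: List[Dict] = []

    def add(self, stage: str, data: str) -> Dict:
        """Append one pipeline stage (e.g. "query", "response") to the chain."""
        prev_hash = self.chain[-1]["hash"] if self.chain else GENESIS_HASH
        block = {
            "index": len(self.chain),
            "timestamp": time.time(),
            "stage": stage,
            "data": data,
            "prev_hash": prev_hash,
        }
        block["hash"] = _hash_block(block)
        self.chain.append(block)
        return block

    def is_valid(self) -> bool:
        """Recompute every hash and check each link to its predecessor."""
        prev_hash = GENESIS_HASH
        for block in self.chain:
            if block["prev_hash"] != prev_hash or block["hash"] != _hash_block(block):
                return False
            prev_hash = block["hash"]
        return True
```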
---

## **Neo4j Visualization**
Here is an example of how the relationships between queries, responses, and documents appear in Neo4j:

![Neo4j Visualization](../../screenshots/Screenshot_from_2024-11-30_19-01-31.png)

- **Nodes**:
  - `Query`: Represents the user query.
  - `TransformedQuery`: Rephrased or improved query.
  - `Document`: Relevant documents retrieved based on the query.
  - `Response`: The generated response.
- **Relationships**:
  - `RETRIEVED`: Links the query to retrieved documents.
  - `TRANSFORMED_TO`: Links the original query to the transformed query.
  - `GENERATED`: Links the query to the generated response.
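
A sketch of how one interaction might be written into this graph from Python follows. The node labels and relationship types come from the list above, and the `text`, `timestamp`, and `name` properties match the Cypher queries in this README. The `text` property on `TransformedQuery` and `Response`, the function names, and the use of `MERGE` to avoid duplicate nodes are assumptions. `store_interaction` relies on the official `neo4j` Python driver (version 5 or later) and is not called at import time.

```python
from typing import Dict, List

# Parameterised Cypher: one Query node linked to its transformed form,
# its response, and every retrieved document.
WRITE_CYPHER = """
MERGE (q:Query {text: $query})
  ON CREATE SET q.timestamp = timestamp()
MERGE (t:TransformedQuery {text: $transformed})
MERGE (r:Response {text: $response})
MERGE (q)-[:TRANSFORMED_TO]->(t)
MERGE (q)-[:GENERATED]->(r)
WITH q
UNWIND $documents AS name
MERGE (d:Document {name: name})
MERGE (q)-[:RETRIEVED]->(d)
"""


def build_params(
    query: str, transformed: str, response: str, documents: List[str]
) -> Dict[str, object]:
    """Bundle query parameters, dropping duplicate document names in order."""
    return {
        "query": query,
        "transformed": transformed,
        "response": response,
        "documents": list(dict.fromkeys(documents)),
    }


def store_interaction(uri: str, user: str, password: str, params: Dict[str, object]) -> None:
    """Write one interaction to Neo4j inside a managed write transaction."""
    from neo4j import GraphDatabase  # lazy import: optional dependency

    with GraphDatabase.driver(uri, auth=(user, password)) as driver:
        with driver.session() as session:
            # Pass params as a dict: a keyword named "query" would clash
            # with the first argument of tx.run().
            session.execute_write(lambda tx: tx.run(WRITE_CYPHER, params).consume())
```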
---

## **Setup Instructions**
1. Clone the repository:
   ```bash
   git clone https://github.com/your-repo/document-search-system.git
   ```
2. Install dependencies:
   ```bash
   pip install -r requirements.txt
   ```
3. Initialize the Neo4j database:
   - Connect to your Neo4j Aura instance.
   - Set up credentials in the code.
4. Load the dataset:
   - Place your documents in the dataset directory (e.g., `data-sets/aclImdb/train`).
5. Run the system:
   ```bash
   python document_search_system.py
   ```
---

## **Neo4j Queries**

### **Retrieve All Queries Logged**
```cypher
MATCH (q:Query)
RETURN q.text AS query, q.timestamp AS timestamp
ORDER BY timestamp DESC
```

### **Visualize Query Relationships**
```cypher
MATCH (n)-[r]->(m)
RETURN n, r, m
```

### **Find Documents for a Query**
```cypher
MATCH (q:Query {text: "How to improve acting skills?"})-[:RETRIEVED]->(d:Document)
RETURN d.name AS document_name
```
---

## **Key Technologies**
- **Machine Learning Models:**
  - `distilbert-base-uncased-finetuned-sst-2-english` for sentiment analysis.
  - `google/flan-t5-small` for query transformation.
  - `distilgpt2` for response generation.
- **Vector Similarity Search:**
  - `all-MiniLM-L6-v2` embeddings for document retrieval.
- **Blockchain Logging:**
  - Powered by `chainguard.blockchain_logger`.
- **Graph-Based Storage:**
  - Relationships visualized and queried via Neo4j.