Document Search System
Overview
The Document Search System provides context-aware and secure responses to user queries by combining query analysis, document retrieval, semantic response generation, and blockchain-powered logging. The system also integrates Neo4j for storing and visualizing relationships between queries, documents, and responses.
Features
Query Classification:
- Detects malicious or inappropriate queries using a sentiment analysis model.
- Blocks malicious queries and prevents them from further processing.
Query Transformation:
- Rephrases or enhances ambiguous queries to improve retrieval accuracy.
- Uses rule-based transformations and advanced text-to-text models.
RAG Pipeline:
- Retrieves top-k documents based on semantic similarity.
- Generates context-aware responses using generative models.
Blockchain Integration (Chagu):
- Logs all stages of query processing into a blockchain for integrity and traceability.
- Validates blockchain integrity.
Neo4j Integration:
- Stores and visualizes relationships between queries, responses, and documents.
- Allows detailed querying and visualization of the data flow.
Workflow
The system processes each user query in six stages:
1. Input Query
- A user submits a query, which may be a general question, an ambiguous statement, or a potentially malicious input.
2. Detection Module
- Purpose: Classify the query as "bad" or "good."
- Steps:
- Use a sentiment analysis model (distilbert-base-uncased-finetuned-sst-2-english) to detect malicious or inappropriate intent.
- If the query is classified as "bad" (e.g., SQL injection or inappropriate tone), block further processing and provide a warning message.
- If "good," proceed to the Transformation Module.
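The detection steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names, the mapping of the model's NEGATIVE label to "bad", and the 0.9 confidence threshold are all assumptions. The classifier is passed in as a callable so the gate can be exercised without downloading the model.

```python
from typing import Callable, Dict, List

# A classifier maps text to a list of {"label": ..., "score": ...} dicts,
# the shape a transformers text-classification pipeline returns for one string.
Classifier = Callable[[str], List[Dict[str, object]]]


def load_sentiment_classifier() -> Classifier:
    """Load the SST-2 DistilBERT model named above (downloads on first use)."""
    from transformers import pipeline  # imported lazily: heavy dependency

    return pipeline(
        "sentiment-analysis",
        model="distilbert-base-uncased-finetuned-sst-2-english",
    )


def classify_query(query: str, classifier: Classifier, threshold: float = 0.9) -> str:
    """Return "bad" if the model is confidently NEGATIVE, else "good".

    Treating NEGATIVE sentiment as "bad" and the 0.9 cutoff are assumptions.
    """
    result = classifier(query)[0]
    is_bad = result["label"] == "NEGATIVE" and float(result["score"]) >= threshold
    return "bad" if is_bad else "good"


def gate(query: str, classifier: Classifier) -> str:
    """Block "bad" queries with a warning; pass "good" ones through unchanged."""
    if classify_query(query, classifier) == "bad":
        raise ValueError("Query blocked: classified as malicious or inappropriate.")
    return query
```

In use, `gate(user_query, load_sentiment_classifier())` would either return the query for the Transformation Module or raise with the warning message.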
3. Transformation Module
- Purpose: Rephrase or enhance ambiguous or poorly structured queries for better retrieval.
- Steps:
- Identify missing context or ambiguous phrasing.
- Transform the query using:
- Rule-based transformations for simple fixes.
- Text-to-text models (e.g., google/flan-t5-small) for more sophisticated rephrasing.
- Pass the transformed query to the RAG Pipeline.
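One way the two transformation paths could fit together is sketched below. The specific rules (whitespace cleanup, adding a question mark), the prompt wording, and all function names are hypothetical; the README does not say which rules or prompts the system actually uses.

```python
import re
from typing import Callable, Optional


def apply_rules(query: str) -> str:
    """Hypothetical rule-based fixes: collapse whitespace, end with punctuation."""
    cleaned = re.sub(r"\s+", " ", query).strip()
    if cleaned and cleaned[-1] not in ".?!":
        cleaned += "?"
    return cleaned


def load_rephraser(model_name: str = "google/flan-t5-small") -> Callable[[str], str]:
    """Build a rephrasing function backed by a seq2seq model (downloads on first use)."""
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer  # lazy import

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    def rephrase(query: str) -> str:
        # The prompt wording is an assumption, not taken from the project.
        prompt = f"Rephrase this question clearly: {query}"
        inputs = tokenizer(prompt, return_tensors="pt")
        output_ids = model.generate(**inputs, max_new_tokens=48)
        return tokenizer.decode(output_ids[0], skip_special_tokens=True)

    return rephrase


def transform_query(query: str, rephrase: Optional[Callable[[str], str]] = None) -> str:
    """Apply the simple rules first, then the model if one is supplied."""
    fixed = apply_rules(query)
    if rephrase is None:
        return fixed
    candidate = rephrase(fixed).strip()
    return candidate or fixed  # fall back to the rule-based result if the model returns nothing
```

Keeping the rule-based result as a fallback means an empty model output never reaches the RAG Pipeline.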
4. RAG Pipeline
- Purpose: Retrieve relevant data and generate a context-aware response.
- Steps:
- Document Retrieval:
- Encode the transformed query and documents into embeddings using all-MiniLM-L6-v2.
- Compute semantic similarity between the query and stored documents.
- Retrieve the top-k documents relevant to the query.
- Response Generation:
- Use the retrieved documents as context.
- Pass the query and context to a generative model (e.g., distilgpt2) to synthesize a meaningful response.
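The Document Retrieval steps of the RAG Pipeline can be illustrated with a small top-k search. This sketch assumes cosine similarity as the similarity measure and re-encodes the documents on every call for simplicity; the README does not specify either choice, and the function names are hypothetical. The encoder is injectable so the ranking logic can be checked without loading the embedding model.

```python
import math
from typing import Callable, List, Sequence, Tuple

# An encoder maps a list of texts to one embedding vector per text.
Encoder = Callable[[List[str]], List[Sequence[float]]]


def load_encoder(model_name: str = "all-MiniLM-L6-v2") -> Encoder:
    """Wrap a SentenceTransformer model as an Encoder (downloads on first use)."""
    from sentence_transformers import SentenceTransformer  # lazy import

    model = SentenceTransformer(model_name)
    return lambda texts: model.encode(texts).tolist()


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_top_k(
    query: str, documents: List[str], encode: Encoder, k: int = 3
) -> List[Tuple[str, float]]:
    """Return the k documents most similar to the query, best first."""
    if not documents:
        return []
    vectors = encode([query] + documents)
    query_vec, doc_vecs = vectors[0], vectors[1:]
    scored = [(doc, cosine(query_vec, vec)) for doc, vec in zip(documents, doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

For a larger corpus, the document embeddings would normally be computed once and cached rather than recomputed per query.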
5. Semantic Response Generation
- Purpose: Provide a concise and meaningful answer.
- Steps:
- Combine the retrieved documents into a coherent context.
- Generate a response tailored to the query using the generative model.
- Return the response to the user, ensuring clarity and relevance.
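As a rough sketch of the response generation steps, the retrieved documents can be joined into a prompt and passed to a causal language model. The prompt template, the character and token limits, and the function names are assumptions chosen for illustration.

```python
from typing import Callable, List


def build_prompt(query: str, documents: List[str], max_chars: int = 1500) -> str:
    """Join the retrieved documents into one context block followed by the question."""
    context = "\n\n".join(doc.strip() for doc in documents)[:max_chars]
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


def load_generator(model_name: str = "distilgpt2") -> Callable[[str], str]:
    """Build a text generator backed by a causal LM (downloads on first use)."""
    from transformers import AutoModelForCausalLM, AutoTokenizer  # lazy import

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    def generate(prompt: str) -> str:
        inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
        output_ids = model.generate(
            **inputs,
            max_new_tokens=64,
            pad_token_id=tokenizer.eos_token_id,  # GPT-2 models define no pad token
        )
        # Keep only the newly generated tokens, not the echoed prompt.
        new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
        return tokenizer.decode(new_tokens, skip_special_tokens=True)

    return generate


def answer(query: str, documents: List[str], generate: Callable[[str], str]) -> str:
    """Generate a response to the query using the documents as context."""
    return generate(build_prompt(query, documents)).strip()
```

Truncating the context keeps the prompt within the model's input window, at the cost of dropping text from lower-ranked documents.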
6. Logging and Storage
- Blockchain Logging:
- Each query, transformed query, response, and document retrieval stage is logged into the blockchain for traceability.
- Ensures data integrity and tamper-proof records.
- Neo4j Storage:
- Relationships between queries, responses, and retrieved documents are stored in Neo4j.
- Enables detailed analysis and graph-based visualization.
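To show why chained logging makes tampering detectable, here is a generic hash-chain sketch. It is not the Chagu or chainguard.blockchain_logger API, which this README does not document; the class name, block fields, and stage labels are hypothetical.

```python
import hashlib
import json
import time
from typing import Any, Dict, List


class HashChainLog:
    """Append-only log where each block stores the hash of the previous block."""

    def __init__(self) -> None:
        self.blocks: List[Dict[str, Any]] = []
        self._append("genesis", {})

    @staticmethod
    def _digest(block: Dict[str, Any]) -> str:
        # Hash every field except the stored hash itself.
        payload = {k: block[k] for k in ("index", "timestamp", "stage", "data", "prev_hash")}
        encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
        return hashlib.sha256(encoded).hexdigest()

    def _append(self, stage: str, data: Dict[str, Any]) -> Dict[str, Any]:
        block = {
            "index": len(self.blocks),
            "timestamp": time.time(),
            "stage": stage,
            "data": data,
            "prev_hash": self.blocks[-1]["hash"] if self.blocks else "0" * 64,
        }
        block["hash"] = self._digest(block)
        self.blocks.append(block)
        return block

    def log(self, stage: str, data: Dict[str, Any]) -> Dict[str, Any]:
        """Record one processing stage (e.g. query, transformed query, response)."""
        return self._append(stage, data)

    def is_valid(self) -> bool:
        """Recompute every hash and check each link to the previous block."""
        for i, block in enumerate(self.blocks):
            if block["hash"] != self._digest(block):
                return False
            if i > 0 and block["prev_hash"] != self.blocks[i - 1]["hash"]:
                return False
        return True
```

Editing any logged field changes that block's recomputed hash, so `is_valid()` returns `False` until every later block is rewritten as well.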
Neo4j Visualization
Here is an example of how the relationships between queries, responses, and documents appear in Neo4j:
Nodes:
- Query: Represents the user query.
- TransformedQuery: Rephrased or improved query.
- Document: Relevant documents retrieved based on the query.
- Response: The generated response.
Relationships:
- RETRIEVED: Links the query to retrieved documents.
- TRANSFORMED_TO: Links the original query to the transformed query.
- GENERATED: Links the query to the generated response.
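The node and relationship model above could be written to Neo4j with a single Cypher statement, sketched here with the official neo4j Python driver. The `text` property on TransformedQuery and Response nodes, the parameter names, and the function names are assumptions; only `q.text`, `q.timestamp`, and `d.name` appear in this README's own queries.

```python
from typing import Any, Dict, List, Tuple

# MERGE keeps nodes and relationships unique if the same query is stored twice.
STORE_CYPHER = """
MERGE (q:Query {text: $query})
  ON CREATE SET q.timestamp = timestamp()
MERGE (t:TransformedQuery {text: $transformed})
MERGE (q)-[:TRANSFORMED_TO]->(t)
MERGE (r:Response {text: $response})
MERGE (q)-[:GENERATED]->(r)
WITH q
UNWIND $documents AS name
MERGE (d:Document {name: name})
MERGE (q)-[:RETRIEVED]->(d)
"""


def build_store_statement(
    query: str, transformed: str, response: str, documents: List[str]
) -> Tuple[str, Dict[str, Any]]:
    """Return the Cypher text and the parameters for one processed query."""
    params = {
        "query": query,
        "transformed": transformed,
        "response": response,
        "documents": list(documents),
    }
    return STORE_CYPHER, params


def store_interaction(
    uri: str, user: str, password: str,
    query: str, transformed: str, response: str, documents: List[str],
) -> None:
    """Write one query, its transformation, response, and documents to Neo4j."""
    from neo4j import GraphDatabase  # lazy import: requires the neo4j package

    cypher, params = build_store_statement(query, transformed, response, documents)
    driver = GraphDatabase.driver(uri, auth=(user, password))
    try:
        with driver.session() as session:
            session.run(cypher, params).consume()
    finally:
        driver.close()
```

Passing values as query parameters rather than formatting them into the Cypher string avoids injection through user-supplied query text.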
Setup Instructions
- Clone the repository:
git clone https://github.com/your-repo/document-search-system.git
- Install dependencies:
pip install -r requirements.txt
- Initialize the Neo4j database:
Connect to your Neo4j Aura instance and set up credentials in the code.
- Load the dataset:
Place your documents in the dataset directory (e.g., data-sets/aclImdb/train).
- Run the system:
python document_search_system.py
Neo4j Queries
Retrieve All Queries Logged
MATCH (q:Query)
RETURN q.text AS query, q.timestamp AS timestamp
ORDER BY timestamp DESC
Visualize Query Relationships
MATCH (n)-[r]->(m)
RETURN n, r, m
Find Documents for a Query
MATCH (q:Query {text: "How to improve acting skills?"})-[:RETRIEVED]->(d:Document)
RETURN d.name AS document_name
Key Technologies
- Machine Learning Models: distilbert-base-uncased-finetuned-sst-2-english for sentiment analysis, google/flan-t5-small for query transformation, and distilgpt2 for response generation.
- Vector Similarity Search: all-MiniLM-L6-v2 embeddings for document retrieval.
- Blockchain Logging: Powered by chainguard.blockchain_logger.
- Graph-Based Storage: Relationships visualized and queried via Neo4j.