Document Search System

Overview

The Document Search System provides context-aware and secure responses to user queries by combining query analysis, document retrieval, semantic response generation, and blockchain-powered logging. The system also integrates Neo4j for storing and visualizing relationships between queries, documents, and responses.


Features

  1. Query Classification:

    • Detects malicious or inappropriate queries using a sentiment analysis model.
    • Blocks malicious queries and prevents them from further processing.
  2. Query Transformation:

    • Rephrases or enhances ambiguous queries to improve retrieval accuracy.
    • Uses rule-based transformations and advanced text-to-text models.
  3. RAG Pipeline:

    • Retrieves top-k documents based on semantic similarity.
    • Generates context-aware responses using generative models.
  4. Blockchain Integration (Chagu):

    • Logs all stages of query processing into a blockchain for integrity and traceability.
    • Validates blockchain integrity.
  5. Neo4j Integration:

    • Stores and visualizes relationships between queries, responses, and documents.
    • Allows detailed querying and visualization of the data flow.

Workflow

The system follows a well-structured workflow to ensure accurate, secure, and context-aware responses to user queries:

1. Input Query

  • A user submits a query, which may be a general question, an ambiguous statement, or a potentially malicious input.

2. Detection Module

  • Purpose: Classify the query as "bad" or "good."
  • Steps:
    1. Use a sentiment analysis model (distilbert-base-uncased-finetuned-sst-2-english) to detect malicious or inappropriate intent.
    2. If the query is classified as "bad" (e.g., SQL injection or inappropriate tone), block further processing and provide a warning message.
    3. If "good," proceed to the Transformation Module.
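The gating logic in the steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: `detect`, `is_bad`, `stub_classifier`, the warning text, and the 0.9 threshold are all hypothetical names and values. The classifier is passed in as a callable so the sketch runs offline; the stub stands in for the sentiment model.

```python
from typing import Any, Callable, Dict

# Hypothetical type: a classifier maps text to {"label": ..., "score": ...},
# the per-input shape produced by a Hugging Face text-classification pipeline.
Classifier = Callable[[str], Dict[str, Any]]

WARNING_MESSAGE = "Query blocked: it was classified as malicious or inappropriate."


def is_bad(result: Dict[str, Any], threshold: float = 0.9) -> bool:
    """Treat a confident NEGATIVE label as "bad" (the threshold is an assumption)."""
    return result["label"] == "NEGATIVE" and float(result["score"]) >= threshold


def detect(query: str, classify: Classifier) -> Dict[str, Any]:
    """Block "bad" queries with a warning; pass "good" ones on unchanged."""
    if is_bad(classify(query)):
        return {"status": "blocked", "message": WARNING_MESSAGE}
    return {"status": "ok", "query": query}


def stub_classifier(text: str) -> Dict[str, Any]:
    """Offline stand-in for the real model, keyed on a few suspicious markers."""
    suspicious = ("drop table", "' or 1=1", "<script")
    hit = any(marker in text.lower() for marker in suspicious)
    return {"label": "NEGATIVE" if hit else "POSITIVE", "score": 0.99}
```

To plug in the model named above, one option is `clf = transformers.pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")` and then `detect(query, lambda q: clf(q)[0])`, since the pipeline returns a list with one result per input.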

3. Transformation Module

  • Purpose: Rephrase or enhance ambiguous or poorly structured queries for better retrieval.
  • Steps:
    1. Identify missing context or ambiguous phrasing.
    2. Transform the query using:
      • Rule-based transformations for simple fixes.
      • Text-to-text models (e.g., google/flan-t5-small) for more sophisticated rephrasing.
    3. Pass the transformed query to the RAG Pipeline.
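A rough sketch of the two-tier transformation described above. The rewrite rules, the word-count heuristic, and the function names are illustrative assumptions; the README does not list the project's real rules. The text-to-text model is represented by an optional `rewrite` callable so the example runs without downloading a model.

```python
import re
from typing import Callable, Optional

# Hypothetical rewrite rules: (pattern, replacement) pairs applied in order.
RULES = [
    (re.compile(r"\s+"), " "),                                   # collapse whitespace
    (re.compile(r"\bpls\b|\bplz\b", re.IGNORECASE), "please"),   # expand shorthand
    (re.compile(r"\binfo\b", re.IGNORECASE), "information"),
]


def apply_rules(query: str) -> str:
    """Apply the simple rule-based fixes."""
    text = query.strip()
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text


def needs_model_rewrite(query: str, min_words: int = 3) -> bool:
    """Heuristic (an assumption): very short queries probably lack context."""
    return len(query.split()) < min_words


def transform(query: str, rewrite: Optional[Callable[[str], str]] = None) -> str:
    """Clean the query with rules, then hand it to a model only if it still looks thin."""
    cleaned = apply_rules(query)
    if rewrite is not None and needs_model_rewrite(cleaned):
        return rewrite(cleaned)
    return cleaned
```

In a full pipeline, `rewrite` could wrap a text-to-text model such as google/flan-t5-small; here it is left as a parameter so the rule-based path can be exercised on its own.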

4. RAG Pipeline

  • Purpose: Retrieve relevant data and generate a context-aware response.
  • Steps:
    1. Document Retrieval:
      • Encode the transformed query and documents into embeddings using all-MiniLM-L6-v2.
      • Compute semantic similarity between the query and stored documents.
      • Retrieve the top-k documents relevant to the query.
    2. Response Generation:
      • Use the retrieved documents as context.
      • Pass the query and context to a generative model (e.g., distilgpt2) to synthesize a meaningful response.
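The retrieval step above amounts to ranking documents by similarity to the query vector and keeping the best k. The sketch below uses cosine similarity on toy 3-dimensional vectors in place of real embeddings; the file names, the vectors, and the choice of cosine as the similarity measure are assumptions made for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query_vec: Sequence[float],
    doc_vecs: Dict[str, Sequence[float]],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Score every document against the query and return the k best matches."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for real embeddings (hypothetical data).
docs = {
    "acting_tips.txt": [0.9, 0.1, 0.0],
    "film_review.txt": [0.6, 0.6, 0.1],
    "cooking.txt": [0.0, 0.1, 0.9],
}
results = top_k([1.0, 0.0, 0.0], docs, k=2)
print([name for name, _ in results])  # ['acting_tips.txt', 'film_review.txt']
```

With the sentence-transformers library, the vectors would instead come from `SentenceTransformer("all-MiniLM-L6-v2").encode(texts)`; the ranking logic stays the same.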

5. Semantic Response Generation

  • Purpose: Provide a concise and meaningful answer.
  • Steps:
    1. Combine the retrieved documents into a coherent context.
    2. Generate a response tailored to the query using the generative model.
    3. Return the response to the user, ensuring clarity and relevance.
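One way to combine retrieved documents into a single context is to concatenate them under a size budget and wrap the result in a prompt. The template, the character budget, and the `build_prompt` name below are assumptions for illustration; a budget of some kind matters because small generative models such as distilgpt2 accept only a limited input length.

```python
from typing import List


def build_prompt(query: str, documents: List[str], max_chars: int = 1000) -> str:
    """Join retrieved documents into one context block, truncated to a budget."""
    context_parts = []
    used = 0
    for doc in documents:
        remaining = max_chars - used
        if remaining <= 0:
            break  # budget exhausted; skip the remaining documents
        snippet = " ".join(doc.split())[:remaining]  # normalize whitespace, then cut
        context_parts.append(snippet)
        used += len(snippet)
    context = "\n".join(context_parts)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
```

The returned string can then be passed to the generative model, and the text produced after "Answer:" treated as the response.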

6. Logging and Storage

  • Blockchain Logging:
    • Each query, transformed query, response, and document retrieval stage is logged into the blockchain for traceability.
    • Ensures data integrity and provides tamper-evident records.
  • Neo4j Storage:
    • Relationships between queries, responses, and retrieved documents are stored in Neo4j.
    • Enables detailed analysis and graph-based visualization.
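The idea behind stage-by-stage logging with integrity validation can be illustrated with a small hash-chained log. This is a generic sketch using only the standard library; it is not the interface of the project's blockchain logger, which this README does not show. The block layout and function names are hypothetical.

```python
import hashlib
import json
from typing import Any, Dict, List

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first block


def block_hash(index: int, stage: str, payload: Dict[str, Any], prev_hash: str) -> str:
    """SHA-256 over a canonical JSON encoding of the block contents."""
    body = json.dumps(
        {"index": index, "stage": stage, "payload": payload, "prev_hash": prev_hash},
        sort_keys=True,
    )
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append_block(chain: List[Dict[str, Any]], stage: str, payload: Dict[str, Any]) -> None:
    """Append one pipeline stage, linking it to the hash of the previous block."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS_HASH
    index = len(chain)
    chain.append({
        "index": index,
        "stage": stage,
        "payload": payload,
        "prev_hash": prev_hash,
        "hash": block_hash(index, stage, payload, prev_hash),
    })


def is_valid(chain: List[Dict[str, Any]]) -> bool:
    """Recompute every hash and check each link back to the previous block."""
    prev_hash = GENESIS_HASH
    for i, block in enumerate(chain):
        expected = block_hash(i, block["stage"], block["payload"], prev_hash)
        if block["index"] != i or block["prev_hash"] != prev_hash or block["hash"] != expected:
            return False
        prev_hash = block["hash"]
    return True
```

Because each block's hash covers the previous block's hash, editing any logged stage after the fact makes `is_valid` return `False` for the chain.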

Neo4j Visualization

Here is an example of how the relationships between queries, responses, and documents appear in Neo4j:

[Image: Neo4j Visualization]

  • Nodes:

    • Query: Represents the user query.
    • TransformedQuery: Rephrased or improved query.
    • Document: Relevant documents retrieved based on the query.
    • Response: The generated response.
  • Relationships:

    • RETRIEVED: Links the query to retrieved documents.
    • TRANSFORMED_TO: Links the original query to the transformed query.
    • GENERATED: Links the query to the generated response.
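The node and relationship types listed above could be written to the graph with Cypher `MERGE` statements. The helper below only builds the statements and their parameters, so it runs without a database. The `text` property on `Query` and the `name` property on `Document` match the Cypher queries later in this README; the `text` property on `TransformedQuery` and `Response`, and the function name itself, are assumptions.

```python
from typing import Any, Dict, List, Tuple

Statement = Tuple[str, Dict[str, Any]]  # (Cypher text, parameters)


def graph_statements(
    query: str,
    transformed: str,
    doc_names: List[str],
    response: str,
) -> List[Statement]:
    """Build parameterized MERGE statements for one processed query."""
    statements: List[Statement] = [
        (
            "MERGE (q:Query {text: $query}) "
            "MERGE (t:TransformedQuery {text: $transformed}) "
            "MERGE (q)-[:TRANSFORMED_TO]->(t)",
            {"query": query, "transformed": transformed},
        ),
        (
            "MERGE (q:Query {text: $query}) "
            "MERGE (r:Response {text: $response}) "
            "MERGE (q)-[:GENERATED]->(r)",
            {"query": query, "response": response},
        ),
    ]
    # One RETRIEVED relationship per retrieved document.
    for name in doc_names:
        statements.append((
            "MERGE (q:Query {text: $query}) "
            "MERGE (d:Document {name: $name}) "
            "MERGE (q)-[:RETRIEVED]->(d)",
            {"query": query, "name": name},
        ))
    return statements
```

With the official Neo4j Python driver, each pair could be executed as `session.run(cypher, params)`. Using parameters rather than string formatting keeps user-supplied query text out of the Cypher source.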

Setup Instructions

  1. Clone the repository:
    git clone https://github.com/your-repo/document-search-system.git
    

  2. Install dependencies:
    pip install -r requirements.txt

  3. Initialize the Neo4j database:
    • Connect to your Neo4j Aura instance.
    • Set up credentials in the code.

  4. Load the dataset:
    • Place your documents in the dataset directory (e.g., data-sets/aclImdb/train).

  5. Run the system:
    python document_search_system.py

Neo4j Queries

Retrieve All Queries Logged

    MATCH (q:Query)
    RETURN q.text AS query, q.timestamp AS timestamp
    ORDER BY timestamp DESC

Visualize Query Relationships

    MATCH (n)-[r]->(m)
    RETURN n, r, m

Find Documents for a Query

    MATCH (q:Query {text: "How to improve acting skills?"})-[:RETRIEVED]->(d:Document)
    RETURN d.name AS document_name

Key Technologies

  • Machine Learning Models:
    • distilbert-base-uncased-finetuned-sst-2-english for sentiment analysis.
    • google/flan-t5-small for query transformation.
    • distilgpt2 for response generation.
  • Vector Similarity Search: all-MiniLM-L6-v2 embeddings for document retrieval.
  • Blockchain Logging: Powered by chainguard.blockchain_logger.
  • Graph-Based Storage: Relationships visualized and queried via Neo4j.