---
title: NLP RAG
emoji: 🏢
colorFrom: gray
colorTo: green
sdk: docker
pinned: false
license: mit
short_description: NLP Spring 2026 Project 1
---

RAG-based Question-Answering System for Cognitive Behavior Therapy (CBT)

Overview

This project is a Retrieval-Augmented Generation (RAG) system built to answer CBT-related questions using grounded evidence from source manuals instead of relying on generic model knowledge. It combines hybrid retrieval, re-ranking, and strict response constraints so the assistant stays accurate, clinically focused, and less prone to hallucinations.

Live Demo and Repository

Live Web Interface


Tech Stack

  • Frontend: Vercel (Node.js/React)
  • Backend: Hugging Face Spaces (FastAPI)
  • Vector Database: Pinecone
  • Embeddings: jinaai/jina-embeddings-v2-small-en
  • LLMs: Llama-3-8B (Primary), TinyAya, Mistral-7B, Qwen-2.5
  • Re-ranking: Voyage AI (rerank-2.5) and Cross-Encoder (ms-marco-MiniLM-L-6-v2)
  • Retrieval: Hybrid Search (Dense + BM25 Sparse)

System Architecture

The system operates through a high-precision multi-stage pipeline to ensure clinical safety and data grounding:

  • Hybrid Retrieval: Simultaneously queries dense vector indices for semantic intent and sparse BM25 indices for specific clinical terminology such as Socratic Questioning or Cognitive Distortions.
  • Fusion & Re-ranking: Uses Reciprocal Rank Fusion (RRF) to merge results, followed by a Cross-Encoder stage to re-evaluate the relevance of chunks against the user query.
  • Diversity Filtering (MMR): Implements Maximal Marginal Relevance to ensure the context provided to the LLM is not redundant.
  • Prompt Engineering: Employs a specialized persona that acts as an empathetic CBT therapist with strict grounding constraints to prevent the use of outside knowledge.
  • Automated Evaluation: An LLM-as-a-Judge framework calculates:
    • Faithfulness: Verifying claims against the source document.
    • Relevancy: Ensuring the answer directly addresses the user's query.
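To make the fusion step concrete, here is a minimal Reciprocal Rank Fusion sketch. This is an illustration, not the project's actual implementation; the document IDs and the smoothing constant `k=60` (a common default) are placeholders:

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge several ranked result lists into one, scoring each
    document by the sum of 1/(k + rank) over the lists it appears in."""
    scores = defaultdict(float)
    for results in ranked_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

# A chunk ranked well by both retrievers (chunk_b) rises to the top,
# even though neither retriever ranked it first.
dense = ["chunk_a", "chunk_b", "chunk_c"]   # semantic search order
sparse = ["chunk_b", "chunk_d", "chunk_a"]  # BM25 order
fused = reciprocal_rank_fusion([dense, sparse])
```

In this toy example, `chunk_b` (ranked 2nd and 1st) outscores `chunk_a` (ranked 1st and 3rd), which is exactly the behavior that makes RRF a robust, score-free way to merge dense and sparse results before re-ranking.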

Key Features

  • Clinical Domain Focus: Optimized for high-density information found in mental health manuals.
  • Zero Tolerance for Hallucinations: Includes a fallback protocol to state when information is missing rather than inventing therapeutic advice.
  • Advanced Chunking: Uses sentence-level and recursive character splitting to preserve the logical flow of therapeutic guidelines and patient transcripts.
  • Multi-Model Support: Tested across multiple LLMs to find the best balance between latency and grounding.
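To illustrate the recursive character splitting mentioned above, here is a minimal sketch; the project's actual chunker, its size limit, and its separator order may differ, so treat `max_len` and `separators` as assumptions:

```python
def recursive_split(text, max_len=500, separators=("\n\n", "\n", ". ", " ")):
    """Split text on the coarsest separator that keeps pieces under max_len,
    recursing with finer separators for any piece still too long."""
    if len(text) <= max_len:
        return [text]
    for sep in separators:
        parts = text.split(sep)
        if len(parts) > 1:
            chunks, buf = [], ""
            for part in parts:
                candidate = buf + sep + part if buf else part
                if len(candidate) <= max_len:
                    buf = candidate  # keep packing parts into the current chunk
                else:
                    if buf:
                        chunks.append(buf)
                    buf = part
            if buf:
                chunks.append(buf)
            # Recurse in case a single part exceeded max_len on its own
            return [c for chunk in chunks for c in recursive_split(chunk, max_len, separators)]
    # No separator found: fall back to a hard character cut
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]

sample = ("sentence one. " * 10).strip()
chunks = recursive_split(sample, max_len=50)
```

Splitting on paragraph breaks first and only then on sentences and words is what preserves the logical flow of guidelines and transcripts: a chunk boundary falls inside a sentence only when no coarser break fits the size budget.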

Installation and Setup

Backend Setup

The backend handles document processing, Pinecone vector operations, and the hybrid retrieval logic.

  1. Initialize Virtual Environment:

    python -m venv .venv
    # Windows (cmd/PowerShell)
    .venv\Scripts\activate
    # Windows (Git Bash)
    source .venv/Scripts/activate
    # Linux/macOS
    source .venv/bin/activate
    
  2. Install Dependencies:

    pip install -r requirements.txt
    
  3. Launch API Server:

    uvicorn backend.api:app --reload --host 0.0.0.0 --port 8000
    

Frontend Setup

The frontend provides the interactive chat interface and real-time evaluation scores.

  1. Navigate and Install:

    cd frontend
    npm install
    
  2. Start Development Server:

    npm run dev
    

Configuration

To replicate the system, ensure your environment variables contain valid API keys for:

  • Pinecone for vector storage
  • OpenRouter or Hugging Face Inference API for LLM access
  • Voyage AI for re-ranking
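The exact variable names depend on how the backend reads its environment; the `.env` sketch below is an example only, and every variable name in it is an assumption to check against the backend code:

```shell
# .env — example only; variable names are assumptions, verify against the backend
PINECONE_API_KEY=your-pinecone-key        # vector storage
OPENROUTER_API_KEY=your-openrouter-key    # LLM access via OpenRouter
HF_TOKEN=your-hf-token                    # alternative: Hugging Face Inference API
VOYAGE_API_KEY=your-voyage-key            # Voyage AI re-ranking
```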

Testing

Run test.py to benchmark the chunking strategies and retrieval configurations, then generate a complete Markdown report of the results.

python test.py

This script evaluates multiple test queries across the configured chunking techniques and retrieval strategies, then writes the full output to retrieval_report.md. Use that report to choose the best chunking strategy and retrieval configuration.

Key variables you can change in test.py

  • test_queries: the questions used for benchmarking.
  • CHUNKING_TECHNIQUES_FILTERED: the chunking strategies included in the report.
  • RETRIEVAL_STRATEGIES: the retrieval modes and MMR settings being compared.
  • index_name: the Pinecone index that stores the chunked data.
  • top_k and final_k: how many candidates are retrieved and how many are kept in the final context.
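For orientation, an edit near the top of test.py might look like the following. The variable names come from the list above, but the value shapes (plain strings, dicts with an MMR flag) are assumptions about the script's internals:

```python
# Hypothetical test.py configuration — value shapes are assumptions.
test_queries = [
    "What is Socratic questioning?",
    "How does CBT address cognitive distortions?",
]
CHUNKING_TECHNIQUES_FILTERED = ["sentence", "recursive_character"]
RETRIEVAL_STRATEGIES = [
    {"name": "dense", "use_mmr": False},
    {"name": "hybrid_rrf", "use_mmr": True},
]
index_name = "cbt-manuals"  # Pinecone index holding the chunked data
top_k = 20    # candidates retrieved per strategy
final_k = 5   # chunks kept in the final context
```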

Running the Main Pipeline

After testing, run main.py to reproduce the main experiment with the selected configuration and evaluate faithfulness and relevancy across the model set. Because its configuration is explicit in the script, you can rerun the same evaluation under different chunking, retrieval, and model settings to reproduce or extend the results.

python main.py

This step runs the end-to-end comparison flow for all models, measures faithfulness and relevancy for each one, and writes the detailed findings to rag_ablation_findings.md.

Key variables you can change in main.py

  • CHUNKING_TECHNIQUES or the technique filter used in the script: controls which chunking methods are evaluated.
  • test_queries: the query set used for the ablation study.
  • MODEL_MAP: the model lineup being compared.
  • retrieval_strategy: the retrieval mode, MMR setting, and label for each run.
  • top_k and final_k: candidate retrieval depth and final context size.
  • temperature in cfg.gen: generation randomness for the model outputs.
  • output_file: the markdown report written by the run, usually rag_ablation_findings.md.
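For orientation, the knobs above might be set as follows. The model identifiers and value shapes here are assumptions for illustration, not the script's verified contents:

```python
# Hypothetical main.py configuration — identifiers and shapes are assumptions.
MODEL_MAP = {
    "llama-3-8b": "meta-llama/Meta-Llama-3-8B-Instruct",
    "mistral-7b": "mistralai/Mistral-7B-Instruct-v0.3",
    "qwen-2.5": "Qwen/Qwen2.5-7B-Instruct",
}
retrieval_strategy = {"name": "hybrid_rrf", "use_mmr": True, "label": "Hybrid + MMR"}
top_k, final_k = 20, 5
temperature = 0.2  # lower values keep answers closer to the retrieved context
output_file = "rag_ablation_findings.md"
```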

Contributors