---
title: NLP RAG
emoji: 🏢
colorFrom: gray
colorTo: green
sdk: docker
pinned: false
license: mit
short_description: NLP Spring 2026 Project 1
---

RAG-based Question-Answering System for Cognitive Behavior Therapy (CBT)

Overview

This project is a Retrieval-Augmented Generation (RAG) system built to answer CBT-related questions using grounded evidence from source manuals instead of relying on generic model knowledge. It combines hybrid retrieval, re-ranking, and strict response constraints so the assistant stays accurate, clinically focused, and less prone to hallucinations.

Live Demo and Repository

Live Web Interface


Tech Stack

  • Frontend: Vercel (Node.js/React)
  • Backend: Hugging Face Spaces (FastAPI)
  • Vector Database: Pinecone
  • Embeddings: jinaai/jina-embeddings-v2-small-en
  • LLMs: Llama-3-8B (Primary), TinyAya, Mistral-7B, Qwen-2.5
  • Re-ranking: Voyage AI (rerank-2.5) and Cross-Encoder (ms-marco-MiniLM-L-6-v2)
  • Retrieval: Hybrid Search (Dense + BM25 Sparse)

System Architecture

The system operates through a high-precision multi-stage pipeline to ensure clinical safety and data grounding:

  • Hybrid Retrieval: Simultaneously queries dense vector indices for semantic intent and sparse BM25 indices for specific clinical terminology such as Socratic Questioning or Cognitive Distortions.
  • Fusion & Re-ranking: Uses Reciprocal Rank Fusion (RRF) to merge results, followed by a Cross-Encoder stage to re-evaluate the relevance of chunks against the user query.
  • Diversity Filtering (MMR): Implements Maximal Marginal Relevance to ensure the context provided to the LLM is not redundant.
  • Prompt Engineering: Employs a specialized persona that acts as an empathetic CBT therapist with strict grounding constraints to prevent the use of outside knowledge.
  • Automated Evaluation: An LLM-as-a-Judge framework calculates:
    • Faithfulness: Verifying claims against the source document.
    • Relevancy: Ensuring the answer directly addresses the user's query.
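To make the fusion step concrete, here is a minimal Reciprocal Rank Fusion sketch. This is an illustration, not the project's actual implementation; the document IDs and the smoothing constant `k=60` (a common default) are placeholders:

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge several ranked result lists into one, scoring each
    document by the sum of 1/(k + rank) over the lists it appears in."""
    scores = defaultdict(float)
    for results in ranked_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

# A chunk ranked well by both retrievers (chunk_b) rises to the top,
# even though neither retriever ranked it first.
dense = ["chunk_a", "chunk_b", "chunk_c"]   # semantic search order
sparse = ["chunk_b", "chunk_d", "chunk_a"]  # BM25 order
fused = reciprocal_rank_fusion([dense, sparse])
```

In this toy example, `chunk_b` (ranked 2nd and 1st) outscores `chunk_a` (ranked 1st and 3rd), which is exactly the behavior that makes RRF a robust, score-free way to merge dense and sparse results before re-ranking.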

Key Features

  • Clinical Domain Focus: Optimized for high-density information found in mental health manuals.
  • Zero Tolerance for Hallucinations: Includes a fallback protocol to state when information is missing rather than inventing therapeutic advice.
  • Advanced Chunking: Uses sentence-level and recursive character splitting to preserve the logical flow of therapeutic guidelines and patient transcripts.
  • Multi-Model Support: Tested across multiple LLMs to find the best balance between latency and grounding.
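To illustrate the recursive character splitting mentioned above, here is a minimal sketch; the project's actual chunker, its size limit, and its separator order may differ, so treat `max_len` and `separators` as assumptions:

```python
def recursive_split(text, max_len=500, separators=("\n\n", "\n", ". ", " ")):
    """Split text on the coarsest separator that keeps pieces under max_len,
    recursing with finer separators for any piece still too long."""
    if len(text) <= max_len:
        return [text]
    for sep in separators:
        parts = text.split(sep)
        if len(parts) > 1:
            chunks, buf = [], ""
            for part in parts:
                candidate = buf + sep + part if buf else part
                if len(candidate) <= max_len:
                    buf = candidate  # keep packing parts into the current chunk
                else:
                    if buf:
                        chunks.append(buf)
                    buf = part
            if buf:
                chunks.append(buf)
            # Recurse in case a single part exceeded max_len on its own
            return [c for chunk in chunks for c in recursive_split(chunk, max_len, separators)]
    # No separator found: fall back to a hard character cut
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]

sample = ("sentence one. " * 10).strip()
chunks = recursive_split(sample, max_len=50)
```

Splitting on paragraph breaks first and only then on sentences and words is what preserves the logical flow of guidelines and transcripts: a chunk boundary falls inside a sentence only when no coarser break fits the size budget.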

Installation and Setup

Backend Setup

The backend handles document processing, Pinecone vector operations, and the hybrid retrieval logic.

  1. Initialize Virtual Environment:

    python -m venv .venv
    # Windows (cmd/PowerShell)
    .venv\Scripts\activate
    # Windows (Git Bash)
    source .venv/Scripts/activate
    # Linux/macOS
    source .venv/bin/activate
    
  2. Install Dependencies:

    pip install -r requirements.txt
    
  3. Launch API Server:

    uvicorn backend.api:app --reload --host 0.0.0.0 --port 8000
    

Frontend Setup

The frontend provides the interactive chat interface and real-time evaluation scores.

  1. Navigate and Install:

    cd frontend
    npm install
    
  2. Start Development Server:

    npm run dev
    

Configuration

To replicate the system, ensure your environment variables contain valid API keys for:

  • Pinecone for vector storage
  • OpenRouter or Hugging Face Inference API for LLM access
  • Voyage AI for re-ranking
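The exact variable names depend on how the backend reads its environment; the `.env` sketch below is an example only, and every variable name in it is an assumption to check against the backend code:

```shell
# .env — example only; variable names are assumptions, verify against the backend
PINECONE_API_KEY=your-pinecone-key        # vector storage
OPENROUTER_API_KEY=your-openrouter-key    # LLM access via OpenRouter
HF_TOKEN=your-hf-token                    # alternative: Hugging Face Inference API
VOYAGE_API_KEY=your-voyage-key            # Voyage AI re-ranking
```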

Testing

Run test.py to benchmark the chunking strategies and retrieval configurations, then generate a complete Markdown report of the results.

python test.py

This script evaluates multiple test queries across the configured chunking techniques and retrieval strategies, then writes the full output to retrieval_report.md. Use that report to choose the best chunking strategy and retrieval configuration.

Key variables you can change in test.py

  • test_queries: the questions used for benchmarking.
  • CHUNKING_TECHNIQUES_FILTERED: the chunking strategies included in the report.
  • RETRIEVAL_STRATEGIES: the retrieval modes and MMR settings being compared.
  • index_name: the Pinecone index that stores the chunked data.
  • top_k and final_k: how many candidates are retrieved and how many are kept in the final context.
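For orientation, an edit near the top of test.py might look like the following. The variable names come from the list above, but the value shapes (plain strings, dicts with an MMR flag) are assumptions about the script's internals:

```python
# Hypothetical test.py configuration — value shapes are assumptions.
test_queries = [
    "What is Socratic questioning?",
    "How does CBT address cognitive distortions?",
]
CHUNKING_TECHNIQUES_FILTERED = ["sentence", "recursive_character"]
RETRIEVAL_STRATEGIES = [
    {"name": "dense", "use_mmr": False},
    {"name": "hybrid_rrf", "use_mmr": True},
]
index_name = "cbt-manuals"  # Pinecone index holding the chunked data
top_k = 20    # candidates retrieved per strategy
final_k = 5   # chunks kept in the final context
```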

Running the Main Pipeline

After testing, run main.py to reproduce the main experiment with the selected configuration and evaluate faithfulness and relevancy across the model set. Because its configuration is explicit in the script, you can rerun the same evaluation under different chunking, retrieval, and model settings to reproduce or extend the results.

python main.py

This step runs the end-to-end comparison flow for all models, measures faithfulness and relevancy for each one, and writes the detailed findings to rag_ablation_findings.md.

Key variables you can change in main.py

  • CHUNKING_TECHNIQUES or the technique filter used in the script: controls which chunking methods are evaluated.
  • test_queries: the query set used for the ablation study.
  • MODEL_MAP: the model lineup being compared.
  • retrieval_strategy: the retrieval mode, MMR setting, and label for each run.
  • top_k and final_k: candidate retrieval depth and final context size.
  • temperature in cfg.gen: generation randomness for the model outputs.
  • output_file: the markdown report written by the run, usually rag_ablation_findings.md.
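For orientation, the knobs above might be set as follows. The model identifiers and value shapes here are assumptions for illustration, not the script's verified contents:

```python
# Hypothetical main.py configuration — identifiers and shapes are assumptions.
MODEL_MAP = {
    "llama-3-8b": "meta-llama/Meta-Llama-3-8B-Instruct",
    "mistral-7b": "mistralai/Mistral-7B-Instruct-v0.3",
    "qwen-2.5": "Qwen/Qwen2.5-7B-Instruct",
}
retrieval_strategy = {"name": "hybrid_rrf", "use_mmr": True, "label": "Hybrid + MMR"}
top_k, final_k = 20, 5
temperature = 0.2  # lower values keep answers closer to the retrieved context
output_file = "rag_ablation_findings.md"
```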

Contributors