Rom89823974978 committed
Commit fc910c8 · 1 Parent(s): e8c3964

Added relevant workflows

.github/workflows/ci.yml ADDED
@@ -0,0 +1,64 @@
+ name: CI
+
+ on:
+   push:
+     branches: [ main, master ]
+   pull_request:
+     branches: [ main, master ]
+
+ jobs:
+   lint-and-test:
+     runs-on: ubuntu-latest
+     strategy:
+       matrix:
+         python-version: ["3.10", "3.11"]
+
+     steps:
+       - name: ⬇️ Check out repo
+         uses: actions/checkout@v4
+
+       - name: 🐍 Set up Python
+         uses: actions/setup-python@v5
+         with:
+           python-version: ${{ matrix.python-version }}
+           cache: 'pip'
+
+       - name: 📦 Install dependencies
+         run: |
+           python -m pip install --upgrade pip
+           pip install -r requirements.txt
+           pip install -r requirements-dev.txt || true   # optional extra dev file
+
+       - name: 🧹 Pre-commit (black, isort, flake8 …)
+         uses: pre-commit/action@v3.0.1
+
+       - name: ✅ Run tests w/ coverage
+         run: |
+           pytest -q --cov=evaluation --cov-report=xml
+
+       - name: 📊 Upload coverage to GitHub summary
+         uses: irongut/CodeCoverageSummary@v1.3.0
+         with:
+           filename: coverage.xml
+           badge: true
+           fail_below_min: true
+           format: markdown
+           output: both
+           thresholds: '60 80'
+
+       - name: 🗂 Archive test artefacts
+         if: always()
+         uses: actions/upload-artifact@v4
+         with:
+           name: coverage-${{ matrix.python-version }}
+           path: coverage.xml
+
+   # Optional Docker build sanity-check
+   docker-build:
+     runs-on: ubuntu-latest
+     needs: lint-and-test
+     if: github.event_name == 'push' || github.event_name == 'pull_request'
+     steps:
+       - uses: actions/checkout@v4
+       - name: 🐳 Build Docker image
+         run: docker build -t rag-eval-test .
.github/workflows/docs.yml ADDED
@@ -0,0 +1,40 @@
+ name: Docs
+
+ on:
+   push:
+     branches: [ main ]
+     paths:
+       - 'docs/**'
+       - '.github/workflows/docs.yml'
+       - 'mkdocs.yml'        # if you add a root mkdocs config
+   workflow_dispatch:        # manual trigger
+
+ permissions:
+   contents: write
+   pages: write
+   id-token: write
+
+ jobs:
+   build-and-deploy:
+     runs-on: ubuntu-latest
+     steps:
+       - uses: actions/checkout@v4
+
+       - name: 🐍 Set up Python
+         uses: actions/setup-python@v5
+         with:
+           python-version: '3.11'
+
+       - name: 📦 Install MkDocs + theme
+         run: |
+           pip install mkdocs mkdocs-material
+
+       - name: 🛠 Build docs
+         run: |
+           mkdocs build --strict
+
+       - name: 🚀 Deploy to GitHub Pages
+         uses: peaceiris/actions-gh-pages@v4
+         with:
+           github_token: ${{ secrets.GITHUB_TOKEN }}
+           publish_dir: ./site
.github/workflows/limits.yml ADDED
@@ -0,0 +1,14 @@
+ name: Check file size
+ on:
+   pull_request:
+     branches: [main]
+   workflow_dispatch:
+
+ jobs:
+   sync-to-hub:
+     runs-on: ubuntu-latest
+     steps:
+       - name: Check large files
+         uses: ActionsDesk/lfs-warning@v2.0
+         with:
+           filesizelimit: 10485760   # 10MB, huggingface limit
.github/workflows/space.yml ADDED
@@ -0,0 +1,19 @@
+ name: Sync to Hugging Face Space
+ # This workflow syncs the repository to a Hugging Face Space on push to main branch or manually via workflow dispatch.
+ on:
+   push:
+     branches: [main]
+   workflow_dispatch:
+
+ jobs:
+   sync-to-hub:
+     runs-on: ubuntu-latest
+     steps:
+       - uses: actions/checkout@v3
+         with:
+           fetch-depth: 0
+           lfs: true
+       - name: Push to hub
+         env:
+           HF_TOKEN: ${{ secrets.HF_TOKEN }}
+         run: git push --force https://Rom89823974978:$HF_TOKEN@huggingface.co/spaces/Rom89823974978/RAG_Eval main
README.md CHANGED
@@ -1,168 +1,155 @@
- Below is a complete **README.md** you can drop into the repository root.
- It walks through the codebase, explains how each layer aligns with the research-proposal objectives, and gives practical “getting-started” steps for building indexes, running experiments, and producing statistical analyses.
-
- ---
-
  ````markdown
  # Retrieval-Augmented Generation Evaluation Framework
- *(Legal & Financial domains, with full regulatory-grade metrics)*

- > **Project context** – This code implements the software artefacts promised in the research proposal
- > “**Toward Comprehensive Evaluation of Retrieval-Augmented Generation Systems in Regulated Domains**.”
- > Each folder corresponds to a work-package from the proposal: retrieval pipelines, metric library,
- > robustness & statistical analysis, plus automation for Docker / CI.

  ---

- ## 1. Quick start

  ```bash
- # Clone and bootstrap
  git clone https://github.com/<your-org>/rag-eval-framework.git
  cd rag-eval-framework
  python -m venv .venv && source .venv/bin/activate
  pip install -r requirements.txt
- pre-commit install            # optional: local lint hooks

- # Download / prepare a small corpus (makes ~200 docs)
  bash scripts/download_data.sh

- # Build sparse & dense indexes automatically on first run
  python scripts/run_experiments.py \
      --config configs/pipeline_hybrid_ce.yaml \
      --queries data/sample_queries.jsonl
  ````

- The first invocation embeds documents, builds a **FAISS** dense index, and a **Pyserini** (Lucene) sparse index. Subsequent runs reuse them.

  ---

- ## 2. Repository layout

  ```
- evaluation/                ← ⚙️ Core library
- ├── config.py              ⇒ Typed dataclasses (retriever, generator, stats, reranker)
- ├── pipeline.py            ⇒ Orchestrates retrieval → (optional) re-ranking → generation
- │   └── … logs every stage to dict → downstream eval
- ├── retrievers/            ⇒ BM25, Dense (Sentence-Transformers + FAISS), Hybrid
- ├── rerankers/             ⇒ Cross-encoder re-ranker (optional second stage)
- ├── generators/            ⇒ Hugging Face generator wrapper (T5/Flan/BART…)
- ├── metrics/               ⇒ Retrieval, generation, composite RAG score
- └── stats/                 ⇒ Correlation, significance, robustness utilities
- configs/                   ← YAML templates (pipeline & stats settings)
- scripts/                   ← CLI helpers: run_experiments.py, download_data.sh …
- tests/                     ← PyTest smoke tests cover every public module
- .github/workflows/ci.yml   ← Lint + tests on push / PR
- Dockerfile                 ← Slim runtime ready for reproducibility
  ```

  ---

- ## 3. How each module maps to proposal tasks

- | Proposal section | Code artefact | Purpose |
- | --- | --- | --- |
- | **Retrievers** (BM25, dense, hybrid) | `evaluation/retrievers/` | Implements **RQ1** experiments on classic vs. dense retrieval. Auto-builds indexes to ease replication. |
- | **Generator** (fixed seq2seq backbone) | `evaluation/generators/` | Holds the controlled decoding backend so retrieval changes are isolated. |
- | **Cross-encoder re-ranker** | `evaluation/rerankers/` | Optional “advanced RAG” from Fig. 2 of the proposal; improves evidence precision. |
- | **Metric taxonomy** | `evaluation/metrics/` | Classical IR metrics, semantic generation scores, and composite `rag_score` per WP3. |
- | **Statistical tests & sensitivity** | `evaluation/stats/` + `StatsConfig` | Spearman / Kendall correlations (**RQ1, RQ2**), Wilcoxon + Holm-Bonferroni (**RQ2**), error-propagation χ² and robustness deltas (**RQ3, RQ4**). |
- | **Reproducibility** | Dockerfile, CI, pre-commit | Meets the EU AI Act’s “technical documentation & traceability” clauses (Articles 14-15). |

  ---

- ## 4. Configuration at a glance
-
- ```yaml
- # configs/pipeline_hybrid_ce.yaml
- retriever:
-   name: hybrid                 # bm25 | dense | hybrid
-   bm25_index: indexes/legal_bm25
-   faiss_index: indexes/legal_dense.faiss
-   doc_store: data/legal_docs.jsonl
-   top_k: 10
-   alpha: 0.6
-
- reranker:
-   enable: true                 # cross-encoder stage
-   model_name: cross-encoder/ms-marco-MiniLM-L-6-v2
-   first_stage_k: 50
-   final_k: 10
-   device: cuda:0
-
- generator:
-   model_name: google/flan-t5-base
-   device: cuda:0
-   max_new_tokens: 256
-   temperature: 0.0
-
- stats:
-   correlation_method: spearman
-   n_boot: 5000
-   ci: 0.95
-   wilcoxon_alternative: two-sided
-   multiple_correction: holm-bonferroni
-   alpha: 0.05
  ```

- All fields are documented in `evaluation/config.py`. You can override any flag via CLI (`--retriever.top_k 20`) if you parse with Hydra or OmegaConf.

- ---

- ## 5. Index generation details

- * **Sparse (BM25 / Lucene)**
-   If the `bm25_index` dir is absent, `BM25Retriever` calls *Pyserini’s* CLI to build it from `doc_store` (JSONL with `{"id", "text"}`).
- * **Dense (FAISS)**
-   Likewise, `DenseRetriever` embeds every document using the Sentence-Transformers model in the config, normalises vectors, and builds an IP-metric FAISS index.

- Both steps cache artefacts, so future runs start instantly.

  ---

- ## 6. Running the statistical evaluation
-
- Each experiment run dumps a JSONL (`results.jsonl`) with per-query fields:
-
- ```jsonc
- {
-   "question": "...",
-   "answer": "...",
-   "contexts": ["..."],
-   "metrics": {
-     "precision@10": 0.9,
-     "rag_score": 0.71,
-     ...
-   },
-   "human_correct": true,      // optional gold labels
-   "human_faithful": 0.8       // optional expert rating 0-1
- }
  ```

- You can feed that into a notebook or CLI script:

  ```python
- from evaluation.stats import (
-     corr_ci, wilcoxon_signed_rank, holm_bonferroni,
-     delta_metric, conditional_failure_rate
- )
  from evaluation import StatsConfig

- cfg = StatsConfig(n_boot=5000)
- # example: correlation of MRR vs. human correctness
- mrr  = [r["metrics"]["mrr"] for r in rows]
- gold = [1.0 if r["human_correct"] else 0.0 for r in rows]
- rho, (lo, hi), p = corr_ci(mrr, gold, method=cfg.correlation_method, n_boot=cfg.n_boot)
- print(f"Spearman ρ={rho:.2f} 95% CI=({lo:.2f},{hi:.2f}) p={p:.3g}")
  ```

- All statistical primitives are implemented in pure NumPy+SciPy, ensuring compatibility with lightweight Docker images.

  ---

- ### Happy evaluating!

- Questions or suggestions? Open an issue or discussion on the GitHub repo.

  ```
  ```

  ````markdown
  # Retrieval-Augmented Generation Evaluation Framework
+ *(Legal & Financial domains, with full regulatory-grade metrics and dashboard)*

+ > **Project context** – Implementation of the research proposal
+ > **“Toward Comprehensive Evaluation of Retrieval-Augmented Generation Systems in Regulated Domains.”**
+ > Each folder corresponds to a work-package: retrieval pipelines, metric library, robustness & statistical analysis, automation (CI + Docker), and an interactive dashboard.

  ---

+ ## 1 Quick start

  ```bash
+ # ❶ Clone and set up the dev env
  git clone https://github.com/<your-org>/rag-eval-framework.git
  cd rag-eval-framework
  python -m venv .venv && source .venv/bin/activate
  pip install -r requirements.txt
+ pre-commit install

+ # ❷ Fetch a toy corpus (≈200 docs)
  bash scripts/download_data.sh

+ # ❸ First single-config run (indexes auto-build)
  python scripts/run_experiments.py \
      --config configs/pipeline_hybrid_ce.yaml \
      --queries data/sample_queries.jsonl
  ````

+ The first call embeds documents, builds a **FAISS** dense index and a **Pyserini** sparse index; subsequent runs reuse them.

  ---

+ ## 2 Repository layout

  ```
+ evaluation/                   ← Core library
+ ├─ config.py                  • Typed dataclasses (retriever, generator, stats, reranker, logging)
+ ├─ pipeline.py                • Retrieval → (optional) re-rank → generation
+ ├─ retrievers/                • BM25, Dense (Sentence-Transformers + FAISS), Hybrid
+ ├─ rerankers/                 • Cross-encoder re-ranker
+ ├─ generators/                • Hugging Face seq2seq wrapper
+ ├─ metrics/                   • Retrieval, generation, composite RAG score
+ └─ stats/                     • Correlation, significance, robustness utilities
+ scripts/                      ← CLI tools
+ ├─ run_experiments.py         • Single-config runner (logs, metrics, plots)
+ ├─ run_grid_experiments.py    • **Grid runner** – all configs × datasets, RQ1-RQ4 analysis
+ └─ dashboard.py               • **Streamlit dashboard** for interactive exploration
+ tests/                        ← PyTest smoke tests
+ configs/                      ← YAML templates for pipelines & stats
+ .github/workflows/            ← Lint + tests CI
+ Dockerfile                    ← Slim reproducible image
  ```

  ---

+ ## 3 Mapping code ↔ proposal tasks

+ | Research-proposal element | Code artefact | Purpose |
+ | --- | --- | --- |
+ | **RQ1** Classical retrieval ↔ factual correctness | `evaluation/retrievers/`, `run_grid_experiments.py` | Computes Spearman / Kendall ρ with CIs for MRR, MAP, P@k vs *human\_correct*. |
+ | **RQ2** Faithfulness metrics vs expert judgements | `evaluation/metrics/`, `evaluation/stats/`, grid script | Correlates QAGS, FactScore, RAGAS-F etc. with *human\_faithful*; Wilcoxon + Holm. |
+ | **RQ3** Error propagation → hallucination | `evaluation/stats.robustness`, grid script | χ² test, conditional failure rates across corpora / document styles. |
+ | **RQ4** Robustness to adversarial evidence | Perturbed datasets (`*_pert.jsonl`) + grid script | Δ-metrics & Cohen’s *d* between clean and perturbed runs. |
+ | Interactive analysis / decision-making | `scripts/dashboard.py` | Select dataset + configs, explore tables & plots instantly. |
+ | EU AI-Act traceability (Art. 14-15) | Rotating file logging (`evaluation/utils/logger.py`), Docker, CI | Full run provenance (config + log + results + stats) stored under `outputs/`. |
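
+ To make the significance-testing column above concrete, the sketch below runs a pairwise Wilcoxon signed-rank test with a hand-rolled Holm-Bonferroni correction, using SciPy directly. The config names and per-query scores are invented for illustration; the framework's own implementation lives in `evaluation/stats/` and is driven by `run_grid_experiments.py`.

+ ```python
+ # Illustration only: pairwise Wilcoxon + Holm-Bonferroni over per-query rag_score.
+ # Config names and score arrays are made up; the framework does this in evaluation/stats/.
+ from itertools import combinations
+ import numpy as np
+ from scipy.stats import wilcoxon
+
+ scores = {                                   # per-query rag_score, aligned by query id
+     "bm25":   np.array([0.61, 0.55, 0.70, 0.48, 0.66, 0.52]),
+     "dense":  np.array([0.63, 0.59, 0.71, 0.53, 0.69, 0.58]),
+     "hybrid": np.array([0.71, 0.61, 0.74, 0.62, 0.75, 0.59]),
+ }
+
+ pairs = list(combinations(scores, 2))
+ pvals = [wilcoxon(scores[a], scores[b], alternative="two-sided").pvalue for a, b in pairs]
+
+ # Holm-Bonferroni: step down through sorted p-values, testing against alpha / (m - rank)
+ alpha, m = 0.05, len(pvals)
+ order = np.argsort(pvals)
+ reject = {pair: False for pair in pairs}
+ for rank, i in enumerate(order):
+     if pvals[i] > alpha / (m - rank):        # first failure stops the procedure
+         break
+     reject[pairs[i]] = True
+
+ for (a, b), p in zip(pairs, pvals):
+     print(f"{a} vs {b}: p={p:.3f}  reject H0: {reject[(a, b)]}")
+ ```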

  ---

+ ## 4 Running a grid of experiments
+
+ ```bash
+ # Evaluate three configs on two datasets, save everything under outputs/grid
+ python scripts/run_grid_experiments.py \
+     --configs configs/*.yaml \
+     --datasets data/legal.jsonl data/finance.jsonl \
+     --plots
  ```

+ *Per dataset* the script writes:

+ ```
+ outputs/grid/<dataset>/<config>/
+     results.jsonl          ← per-query outputs + metrics
+     aggregates.yaml        ← mean metrics
+     rq1.yaml … rq4.yaml    ← answers to each research question
+     mrr_vs_correct.png     ← diagnostic scatter
+ outputs/grid/<dataset>/wilcoxon_rag_holm.yaml   ← pairwise p-values
+ ```

+ ### Incremental mode

+ Run a *single* new config and automatically compare it to all previous ones:

+ ```bash
+ python scripts/run_grid_experiments.py \
+     --configs configs/my_new.yaml \
+     --datasets data/legal.jsonl \
+     --outdir outputs/grid \
+     --plots
+ ```

  ---

+ ## 5 Interactive dashboard
+
+ ```bash
+ streamlit run scripts/dashboard.py
  ```

+ The UI lets you:
+
+ 1. pick a dataset
+ 2. select any subset of configs
+ 3. view aggregated tables, bar/box/scatter plots, Wilcoxon tables, and RQ1–RQ4 YAMLs
+ 4. download raw `results.jsonl` for external analysis
+
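+ To give a sense of how little code such a page needs, here is a minimal Streamlit sketch that lists datasets under `outputs/grid/`, lets you tick configs, and tabulates their `aggregates.yaml` files. It is not the shipped `scripts/dashboard.py`, which does considerably more (plots, Wilcoxon tables, RQ YAMLs, downloads):

+ ```python
+ # Minimal sketch of a results browser; assumes the outputs/grid layout shown in section 4.
+ # The real scripts/dashboard.py is richer and its options may differ.
+ from pathlib import Path
+ import pandas as pd
+ import streamlit as st
+ import yaml
+
+ grid = Path("outputs/grid")
+ dataset = st.selectbox("Dataset", sorted(p.name for p in grid.iterdir() if p.is_dir()))
+ configs = st.multiselect("Configs", sorted(p.name for p in (grid / dataset).iterdir() if p.is_dir()))
+
+ rows = []
+ for cfg in configs:
+     agg = yaml.safe_load((grid / dataset / cfg / "aggregates.yaml").read_text())
+     rows.append({"config": cfg, **agg})      # one row of mean metrics per config
+
+ if rows:
+     st.dataframe(pd.DataFrame(rows).set_index("config"))
+ ```
+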
+ ---
+
+ ## 6 Index generation details
+
+ * **Sparse (BM25 / Lucene)** – If `bm25_index` is missing, `BM25Retriever` invokes Pyserini’s CLI to build it from `doc_store` JSONL (`{"id","text"}`).
+ * **Dense (FAISS)** – `DenseRetriever` embeds docs with the Sentence-Transformers model in the config, L2-normalises, and writes an IP-metric FAISS index.
+
+ Both artefacts are cached, so the heavy work only happens once.
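
+ For orientation, the dense path boils down to roughly the sketch below. The embedding model name is a placeholder and the real `DenseRetriever` adds batching, caching and config wiring; paths follow the example config (`data/legal_docs.jsonl`, `indexes/legal_dense.faiss`).

+ ```python
+ # Rough sketch of the dense-index build, not the actual DenseRetriever code.
+ # The model name is a placeholder; use whatever Sentence-Transformers model the config names.
+ import json
+ import faiss                                   # pip install faiss-cpu
+ import numpy as np
+ from sentence_transformers import SentenceTransformer
+
+ docs  = [json.loads(line) for line in open("data/legal_docs.jsonl")]
+ model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
+
+ emb = model.encode([d["text"] for d in docs], convert_to_numpy=True).astype(np.float32)
+ emb /= np.linalg.norm(emb, axis=1, keepdims=True)      # L2-normalise so inner product = cosine
+
+ index = faiss.IndexFlatIP(emb.shape[1])                # inner-product (IP) metric
+ index.add(emb)
+ faiss.write_index(index, "indexes/legal_dense.faiss")
+ ```

+ At query time the same model embeds the question and `index.search` returns the top-k inner-product neighbours.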

+ ---
+
+ ## 7 Example: manual statistical scripting

  ```python
+ from evaluation.stats import corr_ci
  from evaluation import StatsConfig
+ import json, pandas as pd
+
+ rows = [json.loads(l) for l in open("outputs/grid/legal/hybrid/results.jsonl")]
+ cfg  = StatsConfig(n_boot=5000)
+
+ mrr  = [r["metrics"]["mrr"] for r in rows]
+ gold = [1 if r["human_correct"] else 0 for r in rows]

+ r, (lo, hi), p = corr_ci(mrr, gold, method=cfg.correlation_method, n_boot=cfg.n_boot)
+ print(f"Spearman ρ={r:.2f} 95% CI=({lo:.2f},{hi:.2f}) p={p:.3g}")
  ```

+ All statistical helpers rely only on **NumPy & SciPy**, so they run in the minimal Docker image.
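
+ As an illustration of that claim, a bootstrap confidence interval for Spearman's ρ needs nothing beyond those two packages. The helper below is a sketch of what a `corr_ci`-style function can look like internally; it is not the framework's actual implementation.

+ ```python
+ # Sketch of a bootstrap CI for Spearman's rho using only NumPy + SciPy.
+ # Mirrors the corr_ci call above, but is not the framework's implementation.
+ import numpy as np
+ from scipy.stats import spearmanr
+
+ def bootstrap_spearman(x, y, n_boot=5000, ci=0.95, seed=0):
+     x, y = np.asarray(x, float), np.asarray(y, float)
+     rho, p = spearmanr(x, y)
+     rng = np.random.default_rng(seed)
+     idx = rng.integers(0, len(x), size=(n_boot, len(x)))        # resample query indices
+     boots = np.array([spearmanr(x[i], y[i])[0] for i in idx])   # rho per bootstrap sample
+     lo, hi = np.nanpercentile(boots, [(1 - ci) / 2 * 100, (1 + ci) / 2 * 100])
+     return rho, (lo, hi), p
+ ```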

  ---

+ ### Happy evaluating & dashboarding!

+ Questions or suggestions? Open an issue or start a discussion.

  ```
  ```