Rom89823974978 committed
Commit fc910c8 · 1 Parent(s): e8c3964

Added relevant workflows

.github/workflows/ci.yml ADDED
@@ -0,0 +1,64 @@
+ name: CI
+
+ on:
+   push:
+     branches: [ main, master ]
+   pull_request:
+     branches: [ main, master ]
+
+ jobs:
+   lint-and-test:
+     runs-on: ubuntu-latest
+     strategy:
+       matrix:
+         python-version: ["3.10", "3.11"]
+
+     steps:
+       - name: ⬇️ Check out repo
+         uses: actions/checkout@v4
+
+       - name: 🐍 Set up Python
+         uses: actions/setup-python@v5
+         with:
+           python-version: ${{ matrix.python-version }}
+           cache: 'pip'
+
+       - name: 📦 Install dependencies
+         run: |
+           python -m pip install --upgrade pip
+           pip install -r requirements.txt
+           pip install -r requirements-dev.txt || true   # optional extra dev file
+
+       - name: 🧹 Pre-commit (black, isort, flake8 …)
+         uses: pre-commit/action@v3.0.1
+
+       - name: ✅ Run tests w/ coverage
+         run: |
+           pytest -q --cov=evaluation --cov-report=xml
+
+       - name: 📊 Upload coverage to GitHub summary
+         uses: irongut/CodeCoverageSummary@v1.3.0
+         with:
+           filename: coverage.xml
+           badge: true
+           fail_below_min: true
+           format: markdown
+           output: both
+           thresholds: '60 80'
+
+       - name: 🗂 Archive test artefacts
+         if: always()
+         uses: actions/upload-artifact@v4
+         with:
+           name: coverage-${{ matrix.python-version }}
+           path: coverage.xml
+
+   # Optional Docker build sanity-check
+   docker-build:
+     runs-on: ubuntu-latest
+     needs: lint-and-test
+     if: github.event_name == 'push' || github.event_name == 'pull_request'
+     steps:
+       - uses: actions/checkout@v4
+       - name: 🐳 Build Docker image
+         run: docker build -t rag-eval-test .
.github/workflows/docs.yml ADDED
@@ -0,0 +1,40 @@
+ name: Docs
+
+ on:
+   push:
+     branches: [ main ]
+     paths:
+       - 'docs/**'
+       - '.github/workflows/docs.yml'
+       - 'mkdocs.yml'        # if you add a root mkdocs config
+   workflow_dispatch:        # manual trigger
+
+ permissions:
+   contents: write
+   pages: write
+   id-token: write
+
+ jobs:
+   build-and-deploy:
+     runs-on: ubuntu-latest
+     steps:
+       - uses: actions/checkout@v4
+
+       - name: 🐍 Set up Python
+         uses: actions/setup-python@v5
+         with:
+           python-version: '3.11'
+
+       - name: 📦 Install MkDocs + theme
+         run: |
+           pip install mkdocs mkdocs-material
+
+       - name: 🛠 Build docs
+         run: |
+           mkdocs build --strict
+
+       - name: 🚀 Deploy to GitHub Pages
+         uses: peaceiris/actions-gh-pages@v4
+         with:
+           github_token: ${{ secrets.GITHUB_TOKEN }}
+           publish_dir: ./site
.github/workflows/limits.yml ADDED
@@ -0,0 +1,14 @@
+ name: Check file size
+ on:
+   pull_request:
+     branches: [main]
+   workflow_dispatch:
+
+ jobs:
+   sync-to-hub:
+     runs-on: ubuntu-latest
+     steps:
+       - name: Check large files
+         uses: ActionsDesk/lfs-warning@v2.0
+         with:
+           filesizelimit: 10485760   # 10MB, huggingface limit
.github/workflows/space.yml ADDED
@@ -0,0 +1,19 @@
+ name: Sync to Hugging Face Space
+ # This workflow syncs the repository to a Hugging Face Space on push to main branch or manually via workflow dispatch.
+ on:
+   push:
+     branches: [main]
+   workflow_dispatch:
+
+ jobs:
+   sync-to-hub:
+     runs-on: ubuntu-latest
+     steps:
+       - uses: actions/checkout@v3
+         with:
+           fetch-depth: 0
+           lfs: true
+       - name: Push to hub
+         env:
+           HF_TOKEN: ${{ secrets.HF_TOKEN }}
+         run: git push --force https://Rom89823974978:$HF_TOKEN@huggingface.co/spaces/Rom89823974978/RAG_Eval main
README.md CHANGED
@@ -1,168 +1,155 @@
- Below is a complete **README.md** you can drop into the repository root.
- It walks through the codebase, explains how each layer aligns with the research-proposal objectives, and gives practical “getting-started” steps for building indexes, running experiments, and producing statistical analyses.
-
- ---
-
  ````markdown
  # Retrieval-Augmented Generation Evaluation Framework
- *(Legal & Financial domains, with full regulatory-grade metrics)*

- > **Project context** – This code implements the software artefacts promised in the research proposal
- > “**Toward Comprehensive Evaluation of Retrieval-Augmented Generation Systems in Regulated Domains**.”
- > Each folder corresponds to a work-package from the proposal: retrieval pipelines, metric library,
- > robustness & statistical analysis, plus automation for Docker / CI.

  ---

- ## 1. Quick start

  ```bash
- # Clone and bootstrap
  git clone https://github.com/<your-org>/rag-eval-framework.git
  cd rag-eval-framework
  python -m venv .venv && source .venv/bin/activate
  pip install -r requirements.txt
- pre-commit install            # optional: local lint hooks

- # Download / prepare a small corpus (makes ~200 docs)
  bash scripts/download_data.sh

- # Build sparse & dense indexes automatically on first run
  python scripts/run_experiments.py \
      --config configs/pipeline_hybrid_ce.yaml \
      --queries data/sample_queries.jsonl
  ````

- The first invocation embeds documents, builds a **FAISS** dense index, and a **Pyserini** (Lucene) sparse index. Subsequent runs reuse them.

  ---

- ## 2. Repository layout

  ```
- evaluation/                ← ⚙️ Core library
- ├── config.py              ⇒ Typed dataclasses (retriever, generator, stats, reranker)
- ├── pipeline.py            ⇒ Orchestrates retrieval → (optional) re-ranking → generation
- │   └── … logs every stage to dict → downstream eval
- ├── retrievers/            ⇒ BM25, Dense (Sentence-Transformers + FAISS), Hybrid
- ├── rerankers/             ⇒ Cross-encoder re-ranker (optional second stage)
- ├── generators/            ⇒ Hugging Face generator wrapper (T5/Flan/BART…)
- ├── metrics/               ⇒ Retrieval, generation, composite RAG score
- └── stats/                 ⇒ Correlation, significance, robustness utilities
- configs/                   ← YAML templates (pipeline & stats settings)
- scripts/                   ← CLI helpers: run_experiments.py, download_data.sh …
- tests/                     ← PyTest smoke tests cover every public module
- .github/workflows/ci.yml   ← Lint + tests on push / PR
- Dockerfile                 ← Slim runtime ready for reproducibility
  ```

  ---

- ## 3. How each module maps to proposal tasks

- | Proposal section | Code artefact | Purpose |
- | --- | --- | --- |
- | **Retrievers** (BM25, dense, hybrid) | `evaluation/retrievers/` | Implements **RQ1** experiments on classic vs. dense retrieval. Auto-builds indexes to ease replication. |
- | **Generator** (fixed seq2seq backbone) | `evaluation/generators/` | Holds the controlled decoding backend so retrieval changes are isolated. |
- | **Cross-encoder re-ranker** | `evaluation/rerankers/` | Optional “advanced RAG” from Fig. 2 of the proposal; improves evidence precision. |
- | **Metric taxonomy** | `evaluation/metrics/` | Classical IR metrics, semantic generation scores, and composite `rag_score` per WP3. |
- | **Statistical tests & sensitivity** | `evaluation/stats/` + `StatsConfig` | Spearman / Kendall correlations (**RQ1, RQ2**), Wilcoxon + Holm-Bonferroni (**RQ2**), error-propagation χ² and robustness deltas (**RQ3, RQ4**). |
- | **Reproducibility** | Dockerfile, CI, pre-commit | Meets the EU AI Act’s “technical documentation & traceability” clauses (Articles 14-15). |

  ---

- ## 4. Configuration at a glance
-
- ```yaml
- # configs/pipeline_hybrid_ce.yaml
- retriever:
-   name: hybrid                 # bm25 | dense | hybrid
-   bm25_index: indexes/legal_bm25
-   faiss_index: indexes/legal_dense.faiss
-   doc_store: data/legal_docs.jsonl
-   top_k: 10
-   alpha: 0.6
-
- reranker:
-   enable: true                 # cross-encoder stage
-   model_name: cross-encoder/ms-marco-MiniLM-L-6-v2
-   first_stage_k: 50
-   final_k: 10
-   device: cuda:0
-
- generator:
-   model_name: google/flan-t5-base
-   device: cuda:0
-   max_new_tokens: 256
-   temperature: 0.0
-
- stats:
-   correlation_method: spearman
-   n_boot: 5000
-   ci: 0.95
-   wilcoxon_alternative: two-sided
-   multiple_correction: holm-bonferroni
-   alpha: 0.05
  ```

- All fields are documented in `evaluation/config.py`. You can override any flag via CLI (`--retriever.top_k 20`) if you parse with Hydra or OmegaConf.

- ---

- ## 5. Index generation details

- * **Sparse (BM25 / Lucene)**
-   If the `bm25_index` dir is absent, `BM25Retriever` calls *Pyserini’s* CLI to build it from `doc_store` (JSONL with `{"id", "text"}`).
- * **Dense (FAISS)**
-   Likewise, `DenseRetriever` embeds every document using the Sentence-Transformers model in the config, normalises vectors, and builds an IP-metric FAISS index.

- Both steps cache artefacts, so future runs start instantly.

  ---

- ## 6. Running the statistical evaluation
-
- Each experiment run dumps a JSONL (`results.jsonl`) with per-query fields:
-
- ```jsonc
- {
-   "question": "...",
-   "answer": "...",
-   "contexts": ["..."],
-   "metrics": {
-     "precision@10": 0.9,
-     "rag_score": 0.71,
-     ...
-   },
-   "human_correct": true,      // optional gold labels
-   "human_faithful": 0.8       // optional expert rating 0-1
- }
  ```

- You can feed that into a notebook or CLI script:

  ```python
- from evaluation.stats import (
-     corr_ci, wilcoxon_signed_rank, holm_bonferroni,
-     delta_metric, conditional_failure_rate
- )
  from evaluation import StatsConfig

- cfg = StatsConfig(n_boot=5000)
- # example: correlation of MRR vs. human correctness
- mrr  = [r["metrics"]["mrr"] for r in rows]
- gold = [1.0 if r["human_correct"] else 0.0 for r in rows]
- rho, (lo, hi), p = corr_ci(mrr, gold, method=cfg.correlation_method, n_boot=cfg.n_boot)
- print(f"Spearman ρ={rho:.2f} 95% CI=({lo:.2f},{hi:.2f}) p={p:.3g}")
  ```

- All statistical primitives are implemented in pure NumPy+SciPy, ensuring compatibility with lightweight Docker images.

  ---

- ### Happy evaluating!

- Questions or suggestions? Open an issue or discussion on the GitHub repo.

  ```
  ```

  ````markdown
  # Retrieval-Augmented Generation Evaluation Framework
+ *(Legal & Financial domains, with full regulatory-grade metrics and dashboard)*

+ > **Project context** – Implementation of the research proposal
+ > **“Toward Comprehensive Evaluation of Retrieval-Augmented Generation Systems in Regulated Domains.”**
+ > Each folder corresponds to a work-package: retrieval pipelines, metric library, robustness & statistical analysis, automation (CI + Docker), and an interactive dashboard.

  ---

+ ## 1 Quick start

  ```bash
+ # ❶ Clone and set up the dev env
  git clone https://github.com/<your-org>/rag-eval-framework.git
  cd rag-eval-framework
  python -m venv .venv && source .venv/bin/activate
  pip install -r requirements.txt
+ pre-commit install

+ # ❷ Fetch a toy corpus (≈200 docs)
  bash scripts/download_data.sh

+ # ❸ First single-config run (indexes auto-build)
  python scripts/run_experiments.py \
      --config configs/pipeline_hybrid_ce.yaml \
      --queries data/sample_queries.jsonl
  ````

+ The first call embeds documents, builds a **FAISS** dense index and a **Pyserini** sparse index; subsequent runs reuse them.

  ---

+ ## 2 Repository layout

  ```
+ evaluation/                   ← Core library
+ ├─ config.py                  • Typed dataclasses (retriever, generator, stats, reranker, logging)
+ ├─ pipeline.py                • Retrieval → (optional) re-rank → generation
+ ├─ retrievers/                • BM25, Dense (Sentence-Transformers + FAISS), Hybrid
+ ├─ rerankers/                 • Cross-encoder re-ranker
+ ├─ generators/                • Hugging Face seq2seq wrapper
+ ├─ metrics/                   • Retrieval, generation, composite RAG score
+ └─ stats/                     • Correlation, significance, robustness utilities
+ scripts/                      ← CLI tools
+ ├─ run_experiments.py         • Single-config runner (logs, metrics, plots)
+ ├─ run_grid_experiments.py    • **Grid runner** – all configs × datasets, RQ1-RQ4 analysis
+ └─ dashboard.py               • **Streamlit dashboard** for interactive exploration
+ tests/                        ← PyTest smoke tests
+ configs/                      ← YAML templates for pipelines & stats
+ .github/workflows/            ← Lint + tests CI
+ Dockerfile                    ← Slim reproducible image
  ```

  ---

+ ## 3 Mapping code ↔ proposal tasks

+ | Research-proposal element | Code artefact | Purpose |
+ | --- | --- | --- |
+ | **RQ1** Classical retrieval ↔ factual correctness | `evaluation/retrievers/`, `run_grid_experiments.py` | Computes Spearman / Kendall ρ with CIs for MRR, MAP, P@k vs *human\_correct*. |
+ | **RQ2** Faithfulness metrics vs expert judgements | `evaluation/metrics/`, `evaluation/stats/`, grid script | Correlates QAGS, FactScore, RAGAS-F etc. with *human\_faithful*; Wilcoxon + Holm. |
+ | **RQ3** Error propagation → hallucination | `evaluation/stats.robustness`, grid script | χ² test, conditional failure rates across corpora / document styles. |
+ | **RQ4** Robustness to adversarial evidence | Perturbed datasets (`*_pert.jsonl`) + grid script | Δ-metrics & Cohen’s *d* between clean and perturbed runs. |
+ | Interactive analysis / decision-making | `scripts/dashboard.py` | Select dataset + configs, explore tables & plots instantly. |
+ | EU AI-Act traceability (Art. 14-15) | Rotating file logging (`evaluation/utils/logger.py`), Docker, CI | Full run provenance (config + log + results + stats) stored under `outputs/`. |
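
+ To make the significance-testing column above concrete, the sketch below runs a pairwise Wilcoxon signed-rank test with a hand-rolled Holm-Bonferroni correction, using SciPy directly. The config names and per-query scores are invented for illustration; the framework's own implementation lives in `evaluation/stats/` and is driven by `run_grid_experiments.py`.

+ ```python
+ # Illustration only: pairwise Wilcoxon + Holm-Bonferroni over per-query rag_score.
+ # Config names and score arrays are made up; the framework does this in evaluation/stats/.
+ from itertools import combinations
+ import numpy as np
+ from scipy.stats import wilcoxon
+
+ scores = {                                   # per-query rag_score, aligned by query id
+     "bm25":   np.array([0.61, 0.55, 0.70, 0.48, 0.66, 0.52]),
+     "dense":  np.array([0.63, 0.59, 0.71, 0.53, 0.69, 0.58]),
+     "hybrid": np.array([0.71, 0.61, 0.74, 0.62, 0.75, 0.59]),
+ }
+
+ pairs = list(combinations(scores, 2))
+ pvals = [wilcoxon(scores[a], scores[b], alternative="two-sided").pvalue for a, b in pairs]
+
+ # Holm-Bonferroni: step down through sorted p-values, testing against alpha / (m - rank)
+ alpha, m = 0.05, len(pvals)
+ order = np.argsort(pvals)
+ reject = {pair: False for pair in pairs}
+ for rank, i in enumerate(order):
+     if pvals[i] > alpha / (m - rank):        # first failure stops the procedure
+         break
+     reject[pairs[i]] = True
+
+ for (a, b), p in zip(pairs, pvals):
+     print(f"{a} vs {b}: p={p:.3f}  reject H0: {reject[(a, b)]}")
+ ```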

  ---

+ ## 4 Running a grid of experiments
+
+ ```bash
+ # Evaluate three configs on two datasets, save everything under outputs/grid
+ python scripts/run_grid_experiments.py \
+     --configs configs/*.yaml \
+     --datasets data/legal.jsonl data/finance.jsonl \
+     --plots
  ```

+ *Per dataset* the script writes:

+ ```
+ outputs/grid/<dataset>/<config>/
+     results.jsonl          ← per-query outputs + metrics
+     aggregates.yaml        ← mean metrics
+     rq1.yaml … rq4.yaml    ← answers to each research question
+     mrr_vs_correct.png     ← diagnostic scatter
+ outputs/grid/<dataset>/wilcoxon_rag_holm.yaml   ← pairwise p-values
+ ```

+ ### Incremental mode

+ Run a *single* new config and automatically compare it to all previous ones:

+ ```bash
+ python scripts/run_grid_experiments.py \
+     --configs configs/my_new.yaml \
+     --datasets data/legal.jsonl \
+     --outdir outputs/grid \
+     --plots
+ ```

  ---

+ ## 5 Interactive dashboard
+
+ ```bash
+ streamlit run scripts/dashboard.py
  ```

+ The UI lets you:
+
+ 1. pick a dataset
+ 2. select any subset of configs
+ 3. view aggregated tables, bar/box/scatter plots, Wilcoxon tables, and RQ1–RQ4 YAMLs
+ 4. download raw `results.jsonl` for external analysis
+
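+ To give a sense of how little code such a page needs, here is a minimal Streamlit sketch that lists datasets under `outputs/grid/`, lets you tick configs, and tabulates their `aggregates.yaml` files. It is not the shipped `scripts/dashboard.py`, which does considerably more (plots, Wilcoxon tables, RQ YAMLs, downloads):

+ ```python
+ # Minimal sketch of a results browser; assumes the outputs/grid layout shown in section 4.
+ # The real scripts/dashboard.py is richer and its options may differ.
+ from pathlib import Path
+ import pandas as pd
+ import streamlit as st
+ import yaml
+
+ grid = Path("outputs/grid")
+ dataset = st.selectbox("Dataset", sorted(p.name for p in grid.iterdir() if p.is_dir()))
+ configs = st.multiselect("Configs", sorted(p.name for p in (grid / dataset).iterdir() if p.is_dir()))
+
+ rows = []
+ for cfg in configs:
+     agg = yaml.safe_load((grid / dataset / cfg / "aggregates.yaml").read_text())
+     rows.append({"config": cfg, **agg})      # one row of mean metrics per config
+
+ if rows:
+     st.dataframe(pd.DataFrame(rows).set_index("config"))
+ ```
+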
+ ---
+
+ ## 6 Index generation details
+
+ * **Sparse (BM25 / Lucene)** – If `bm25_index` is missing, `BM25Retriever` invokes Pyserini’s CLI to build it from `doc_store` JSONL (`{"id","text"}`).
+ * **Dense (FAISS)** – `DenseRetriever` embeds docs with the Sentence-Transformers model in the config, L2-normalises, and writes an IP-metric FAISS index.
+
+ Both artefacts are cached, so the heavy work only happens once.
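
+ For orientation, the dense path boils down to roughly the sketch below. The embedding model name is a placeholder and the real `DenseRetriever` adds batching, caching and config wiring; paths follow the example config (`data/legal_docs.jsonl`, `indexes/legal_dense.faiss`).

+ ```python
+ # Rough sketch of the dense-index build, not the actual DenseRetriever code.
+ # The model name is a placeholder; use whatever Sentence-Transformers model the config names.
+ import json
+ import faiss                                   # pip install faiss-cpu
+ import numpy as np
+ from sentence_transformers import SentenceTransformer
+
+ docs  = [json.loads(line) for line in open("data/legal_docs.jsonl")]
+ model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
+
+ emb = model.encode([d["text"] for d in docs], convert_to_numpy=True).astype(np.float32)
+ emb /= np.linalg.norm(emb, axis=1, keepdims=True)      # L2-normalise so inner product = cosine
+
+ index = faiss.IndexFlatIP(emb.shape[1])                # inner-product (IP) metric
+ index.add(emb)
+ faiss.write_index(index, "indexes/legal_dense.faiss")
+ ```

+ At query time the same model embeds the question and `index.search` returns the top-k inner-product neighbours.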

+ ---
+
+ ## 7 Example: manual statistical scripting

  ```python
+ from evaluation.stats import corr_ci
  from evaluation import StatsConfig
+ import json, pandas as pd
+
+ rows = [json.loads(l) for l in open("outputs/grid/legal/hybrid/results.jsonl")]
+ cfg  = StatsConfig(n_boot=5000)
+
+ mrr  = [r["metrics"]["mrr"] for r in rows]
+ gold = [1 if r["human_correct"] else 0 for r in rows]

+ r, (lo, hi), p = corr_ci(mrr, gold, method=cfg.correlation_method, n_boot=cfg.n_boot)
+ print(f"Spearman ρ={r:.2f} 95% CI=({lo:.2f},{hi:.2f}) p={p:.3g}")
  ```

+ All statistical helpers rely only on **NumPy & SciPy**, so they run in the minimal Docker image.
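
+ As an illustration of that claim, a bootstrap confidence interval for Spearman's ρ needs nothing beyond those two packages. The helper below is a sketch of what a `corr_ci`-style function can look like internally; it is not the framework's actual implementation.

+ ```python
+ # Sketch of a bootstrap CI for Spearman's rho using only NumPy + SciPy.
+ # Mirrors the corr_ci call above, but is not the framework's implementation.
+ import numpy as np
+ from scipy.stats import spearmanr
+
+ def bootstrap_spearman(x, y, n_boot=5000, ci=0.95, seed=0):
+     x, y = np.asarray(x, float), np.asarray(y, float)
+     rho, p = spearmanr(x, y)
+     rng = np.random.default_rng(seed)
+     idx = rng.integers(0, len(x), size=(n_boot, len(x)))        # resample query indices
+     boots = np.array([spearmanr(x[i], y[i])[0] for i in idx])   # rho per bootstrap sample
+     lo, hi = np.nanpercentile(boots, [(1 - ci) / 2 * 100, (1 + ci) / 2 * 100])
+     return rho, (lo, hi), p
+ ```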

  ---

+ ### Happy evaluating & dashboarding!

+ Questions or suggestions? Open an issue or start a discussion.

  ```
  ```