🙏🏽 Lessons from Building a Sikh Scripture AI: Retrieval Outperforms Fine-Tuning for Sri Guru Granth Sahib Exegesis




Shanvir Dhinsa · EqualizeAI · May 2026


🔬 What Is This?

Granth Expert is a four-layer AI system for computational exegesis of the Sri Guru Granth Sahib Ji, the 1,430-page central scripture of Sikhism, composed in Gurmukhi script across Punjabi, Sanskrit, Persian, Hindi, and Braj Bhasha.

This paper documents the complete engineering journey: what worked, what didn't, and why.

https://huggingface.co/datasets/ShanvirDhinsa/sggs-bench


📊 Headline Result

LoRA Baseline (v1)

64.8 ± 1.8 / 100

Pure fine-tuning, no retrieval

➡️

v1 + RAG + Prompt Fix

76.6 ± 0.7 / 100

✅ +11.8 composite uplift

For low-resource scriptural domains, retrieval engineering is the dominant lever, not parameter-efficient fine-tuning.


🏗️ Four Contributions

📚 1. SGGS-Exegesis-1430 Corpus

The first machine-readable trilingual scholarly annotation of the complete Sri Guru Granth Sahib Ji:

  • 1,430 Angs containing 61,985 verses
  • Gurmukhi text + English translation + Punjabi steek
  • Per-Ang scholarly summaries, themes, mood, and reflection questions
  • Grounded in Dr. Sant Singh Khalsa's translation & Prof. Sahib Singh's Guru Granth Darpan

📏 2. SGGS-Bench v0.1

An 8-task, 115-question evaluation framework:

  • Factual recall · Verse retrieval · Scholarly exegesis
  • Life-struggle guidance · Hallucination resistance
  • Theological depth · Cross-reference synthesis · Safety
  • Hybrid scoring: automated + LLM-judge with 5-dim rubric

❌ 3. Negative Result: LoRA Hits a Ceiling

Four LoRA fine-tuning attempts. Zero beat the baseline.

Run   Δ vs v1   What Happened
v2    -11.7     All dimensions degraded
v3    -4.4      Safety +37, but Exegesis -36
v4    -1.4      Hallucination collapsed -27

Every retraining traded away previously strong dimensions: the alignment tax reproduced four times.

✅ 4. Positive Result: RAG Delivers

Three retrieval-layer techniques, plus one prompt-layer fix, deliver deterministic gains without any retraining:

Technique                 Dimension   Δ
🔤 Vocabulary bridge      Retrieval   +7.5
🗺️ Ang cross-ref map      CrossRef    +6.5
📋 Fact card injection    Factual     +36.5
🛡️ Hardened prompt        Safety      +28.5
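
As a rough illustration of the vocabulary bridge, the sketch below appends Gurbani terms to a modern English query before retrieval. The `VOCAB_BRIDGE` table, its four entries, and the `expand_query` helper are hypothetical and are not taken from the Granth Expert code; only the idea of a modern-to-Gurbani expansion comes from the description here.

```python
import re

# Hypothetical mapping from modern English terms to transliterated Gurbani
# vocabulary. The entries are illustrative, not the system's actual table.
VOCAB_BRIDGE = {
    "anger": ["krodh"],
    "ego": ["haumai"],
    "greed": ["lobh"],
    "attachment": ["moh"],
}


def expand_query(query: str) -> str:
    """Append Gurbani equivalents for any bridged term found in the query."""
    words = re.findall(r"[a-z]+", query.lower())
    extra = []
    for word in words:
        for term in VOCAB_BRIDGE.get(word, []):
            if term not in extra:  # keep each added term once, in order
                extra.append(term)
    return query if not extra else f"{query} {' '.join(extra)}"


print(expand_query("How do I deal with anger and ego?"))
# How do I deal with anger and ego? krodh haumai
```

Because the expansion is a plain dictionary lookup, the same query always yields the same retrieval input, which is one way a gain of this kind can be deterministic.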

📊 Full Per-Dimension Results (n=2 mean)

Dimension           Weight   Pure v1      v1 + RAG + Fix   Δ          Source of Gain
🎯 Factual          15%      56.5         93.0             🟢 +36.5   SGGS_FACTS card
🔍 Retrieval        15%      80.0         87.5             🟢 +7.5    TOPIC_ANCHOR expansion
📖 Exegesis         15%      78.2         71.0             🔴 -7.2    Prompt trade-off (§8)
🧭 Guidance         20%      59.2         65.2             🟢 +6.0    Combined effect
🛑 Hallucination    10%      75.5         78.7             ⚪ +3.2    Within noise
🕉️ Theology         5%       75.9         79.0             🟢 +3.1    Rule-7/13 fix
🔗 CrossRef         5%       48.6         55.1             🟢 +6.5    ANG_CROSS_REFS map
🛡️ Safety           15%      46.5         75.0             🟢 +28.5   Hardened system prompt
📈 Composite        100%     64.8 ± 1.8   76.6 ± 0.7       🟢 +11.8   Full layered system
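
The composite row can be checked against the per-dimension rows. The sketch below assumes the composite is a plain weighted average of the eight dimension scores; the source does not spell out the formula, but the numbers in the table are consistent with it. The weights and scores are copied from the table, and the `composite` helper is a name introduced here.

```python
# Task weights and per-dimension means copied from the results table.
WEIGHTS = {
    "Factual": 0.15, "Retrieval": 0.15, "Exegesis": 0.15, "Guidance": 0.20,
    "Hallucination": 0.10, "Theology": 0.05, "CrossRef": 0.05, "Safety": 0.15,
}
PURE_V1 = {
    "Factual": 56.5, "Retrieval": 80.0, "Exegesis": 78.2, "Guidance": 59.2,
    "Hallucination": 75.5, "Theology": 75.9, "CrossRef": 48.6, "Safety": 46.5,
}
RAG_FIX = {
    "Factual": 93.0, "Retrieval": 87.5, "Exegesis": 71.0, "Guidance": 65.2,
    "Hallucination": 78.7, "Theology": 79.0, "CrossRef": 55.1, "Safety": 75.0,
}


def composite(scores: dict, weights: dict = WEIGHTS) -> float:
    """Weighted average of dimension scores (assumed formula)."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 100%"
    return sum(weights[dim] * scores[dim] for dim in weights)


print(round(composite(PURE_V1), 1))  # 64.8
print(round(composite(RAG_FIX), 1))  # 76.6
```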

🏛️ System Architecture

┌────────────────────────────────────────────────────────────┐
│  Layer 4: System prompt                                    │
│    13-rule hardened prompt (post-surgical-fix)             │
├────────────────────────────────────────────────────────────┤
│  Layer 3: RAG context construction                         │
│    • Vocabulary expansion (modern → Gurbani)               │
│    • Explicit-Ang routing                                  │
│    • Topic-anchor routing (life-struggle keywords)         │
│    • Cross-reference anchor routing (ANG_CROSS_REFS)       │
│    • Structural fact card injection (SGGS_FACTS)           │
│    • Per-verse relevance filtering                         │
├────────────────────────────────────────────────────────────┤
│  Layer 2: LoRA adapter (v1, frozen)                        │
│    rank 8, 16 layers, lr 1e-5, 1500 iters                  │
├────────────────────────────────────────────────────────────┤
│  Layer 1: Base model                                       │
│    Qwen3-14B (Alibaba, 4-bit quantized via MLX)            │
│    Apache 2.0 license; ~7.4 GB on disk                     │
└────────────────────────────────────────────────────────────┘

Key insight: Layer 2 is frozen. All gains since v1 are achieved in Layers 3 and 4, that is, in retrieval and prompting. The lever lives in retrieval, not in model weights.
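
To make Layer 3 more concrete, here is a minimal sketch of two of its steps: explicit-Ang routing and fact card injection. It is an assumption-laden illustration, not the project's code. The `SGGS_FACTS` name appears in the diagram, but its content here is a placeholder built from the corpus statistics, and `ANG_PATTERN`, `explicit_angs`, `build_context`, and the toy `corpus` are hypothetical.

```python
import re

# Placeholder fact card; the real SGGS_FACTS content is not shown in this
# document. The two facts below come from the corpus statistics.
SGGS_FACTS = "Total Angs: 1,430. Total verses: 61,985."

# Hypothetical pattern for queries that name an Ang directly, e.g. "Ang 8".
ANG_PATTERN = re.compile(r"\bAng\s+(\d{1,4})\b", re.IGNORECASE)


def explicit_angs(query: str) -> list[int]:
    """Return Ang numbers named in the query, keeping only 1 to 1430."""
    return [int(n) for n in ANG_PATTERN.findall(query) if 1 <= int(n) <= 1430]


def build_context(query: str, corpus: dict[int, str]) -> str:
    """Assemble context: fact card first, then any explicitly named Angs."""
    parts = [f"[FACTS] {SGGS_FACTS}"]
    for ang in explicit_angs(query):
        if ang in corpus:
            parts.append(f"[ANG {ang}] {corpus[ang]}")
    return "\n".join(parts)


# Toy corpus with placeholder text standing in for per-Ang annotations.
corpus = {1: "<annotation for Ang 1>", 8: "<annotation for Ang 8>"}
print(build_context("What is the theme of Ang 8?", corpus))
```

Both steps are pure functions of the query and the corpus, so they can be changed and re-benchmarked without touching the frozen adapter in Layer 2.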

Runtime: Fully offline on Apple Silicon · ~15s model load · ~30–60s per query · ~12 GB memory


💡 Bonus Finding: Prompt Engineering Pitfall

Two specific rule wordings in our 13-rule hardened system prompt induced early-EOS termination on five scholarly questions, costing ~25 composite points until surgically reworded. The diagnostic methodology (write a reproducer before guessing fixes) generalizes beyond this domain.
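
A reproducer of this kind can be very small. The sketch below flags questions whose answers stop suspiciously early under a given system prompt, so that candidate rewordings can be compared one at a time. Everything in it is hypothetical: `generate` stands in for the real model call, `fake_generate` is a stub that imitates the failure, `RULE_X` is a made-up marker, and the 30-word threshold is an arbitrary assumption.

```python
from typing import Callable


def find_early_stops(
    generate: Callable[[str, str], str],
    system_prompt: str,
    questions: list[str],
    min_words: int = 30,  # assumed threshold for "suspiciously short"
) -> list[str]:
    """Return the questions whose answers contain fewer than min_words words."""
    return [
        q for q in questions
        if len(generate(system_prompt, q).split()) < min_words
    ]


def fake_generate(system_prompt: str, question: str) -> str:
    """Stub model: pretends one rule wording triggers an empty answer."""
    if "RULE_X" in system_prompt and "explain" in question.lower():
        return ""
    return "word " * 50


questions = ["Explain the opening of Japji Sahib.", "List the raags."]
print(find_early_stops(fake_generate, "rules with RULE_X", questions))
print(find_early_stops(fake_generate, "rules without it", questions))
```

Running the same question list against the prompt with each rule removed in turn narrows the failure down to specific wordings before any fix is attempted.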


📋 Corpus & Benchmark Details

📚 SGGS-Exegesis-1430 Corpus Statistics

Metric              Value
Total Angs          1,430
Total verses        61,985
Training examples   9,839 (8,400 train + 1,439 validation)
Source languages    Gurmukhi, Punjabi, English
Unique authors      35 (6 Gurus, 15 Bhagats, 11 Bhatts, 3 others)
Raag count          31 standard scholarly raags
Sources             GurbaniNow API, Dr. Sant Singh Khalsa (1996), Prof. Sahib Singh Guru Granth Darpan (1962–64)

📏 SGGS-Bench v0.1 Task Breakdown

Task              Questions   Weight   Scoring Method
Factual           20          15%      Automated (substring match)
Retrieval         20          15%      Automated (Ang-number match)
Exegesis          10          15%      LLM-judge (5-dim rubric)
Guidance          15          20%      Hybrid (anchor-Ang + rubric)
Hallucination     15          10%      Automated (T/F + content)
Theology          10          5%       LLM-judge (5-dim rubric)
Cross-reference   15          5%       Hybrid (Ang set + rubric)
Safety            10          15%      Automated (refusal/crisis)
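
The two simplest automated scorers in the table can be sketched as follows. This is an illustration under stated assumptions, not the benchmark's implementation: the function names, the 0/100 scale per question, the case-insensitive comparison, and the `Ang <number>` citation pattern are all choices made here.

```python
import re


def score_factual(answer: str, expected: str) -> float:
    """Substring match: 100 if the expected string appears in the answer."""
    return 100.0 if expected.lower() in answer.lower() else 0.0


def score_retrieval(answer: str, expected_ang: int) -> float:
    """Ang-number match: 100 if the expected Ang number is cited."""
    cited = {
        int(n)
        for n in re.findall(r"\bAng\s+(\d+)\b", answer, flags=re.IGNORECASE)
    }
    return 100.0 if expected_ang in cited else 0.0


print(score_factual("The scripture spans 1,430 Angs.", "1,430"))
print(score_retrieval("This theme is developed on Ang 14.", 14))
print(score_retrieval("This theme is developed on Ang 14.", 1))
```

Parsing the cited numbers rather than searching for the digit string avoids counting "Ang 14" as a hit for Ang 1.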

📈 All 10 Bench Runs (Chronological)

#    Configuration                 Composite
1    v1 baseline                   66.6
2    v2 LoRA                       54.8
3    v1 + Hardened Prompt          62.8
4    v3 LoRA                       62.2
5    v4 LoRA                       65.2
6    v1 + RAG (run 1)              73.3
7    v1 + RAG (run 2)              72.5
8    v1 + RAG + Fix                77.3
9    Pure v1 (variance check)      63.0
10   v1 + RAG + Fix (variance)     75.9
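
The headline figures can be recovered from these runs. The sketch below assumes that each reported value is the mean of the two matching runs (1 and 9 for pure v1, 8 and 10 for v1 + RAG + Fix) and that the ± term is the population standard deviation, which for two runs equals half the gap between them. The source does not state the formula; this reading is inferred from the numbers, and the `summarize` helper is a name introduced here.

```python
from statistics import mean, pstdev

# Composite scores copied from the bench-run table.
RUNS = {
    "Pure v1": [66.6, 63.0],         # runs 1 and 9
    "v1 + RAG + Fix": [77.3, 75.9],  # runs 8 and 10
}


def summarize(scores: list[float]) -> str:
    """Format mean ± population standard deviation to one decimal place."""
    return f"{mean(scores):.1f} ± {pstdev(scores):.1f}"


for name, scores in RUNS.items():
    print(f"{name}: {summarize(scores)}")
# Pure v1: 64.8 ± 1.8
# v1 + RAG + Fix: 76.6 ± 0.7
```

With only two runs per configuration, the spread is a rough indication of run-to-run noise rather than a confidence interval.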

🙏🏽 The Honest Scope

We do not claim this system is a substitute for traditional Sikh learning, Gurmat, or guidance from a Giani or Granthi. The Sri Guru Granth Sahib Ji is a living Guru in Sikh practice; any computational tool is at best a study companion.

We claim only that the system identifies relevant passages and explains them in a way that is, on our benchmark, closer to scholarly commentary than to general-purpose output, and that the architecture lessons generalize.


📝 Citation

@article{dhinsa2026granth,
  title   = {Lessons from Building a Sikh Scripture {AI}:
             Retrieval Outperforms Fine-Tuning for
             {Sri Guru Granth Sahib} Exegesis},
  author  = {Dhinsa, Shanvir},
  year    = {2026},
  note    = {EqualizeAI}
}

Keywords: Religious NLP · Low-Resource Scripture · RAG · LoRA · Negative Results · Sri Guru Granth Sahib · Edge Deployment


ਵਾਹਿਗੁਰੂ ਜੀ ਕਾ ਖ਼ਾਲਸਾ ਵਾਹਿਗੁਰੂ ਜੀ ਕੀ ਫ਼ਤਹਿ

Waheguru Ji Ka Khalsa, Waheguru Ji Ki Fateh
