Antahkarana-v2 (36.5M)
The accuracy-recovering version of the Antaḥkaraṇa continual-learning architecture — it matches the state-of-the-art on accuracy and forgets ~3× less, on a fair, multi-seed benchmark.
Author: Deepak Soni · License: MIT · Trained from scratch (WideResNet-28-10, 36.5M params — not a fine-tune of any pretrained model; entirely original work).
This is the research-grade vision model that proves the architecture, and the direct ancestor of the
language model deepakdsoni/antahkarana-7B.
📦 Model family
| Model | What |
|---|---|
| antahkarana-v1 | the original architecture + v1 vision models — the most stable continual learner (only positive backward transfer) |
| antahkarana-v2 | accuracy-recovering v2 (36.5M) — matches SOTA accuracy at ~3× less forgetting |
| antahkarana-7B | the architecture scaled to a 7B language model |
Why we built v2 (the accuracy problem)
The original Antaḥkaraṇa-v1 was the most stable continual learner in our benchmark — the lowest forgetting of any method and the only one with positive backward transfer (learning new tasks slightly improves old ones). But that stability came at a cost: its raw accuracy (0.643) trailed the SOTA, DER++ (0.804). It sat at the ultra-stable end of the stability–plasticity frontier.
v2 was built to recover that accuracy without giving up the stability — by adding two more faculties from the Vedic model of mind:
| Added faculty (Vedic) | Mechanism (ML) | Why it helps |
|---|---|---|
| vijñāna-smṛti | dark-knowledge / logit replay | rehearses past tasks' output distributions, transferring understanding (not just labels) → big accuracy lift |
| viveka | selective consolidation (keep top-k% of importance Ω) | discerns the essential from the inessential → protects what matters without over-rigidity |
(plus a guṇa controller that adapts the consolidation strength to how much the model is forgetting.)
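The viveka row above can be sketched as a top-k mask over the importance values, applied inside an EWC-style quadratic penalty. This is a minimal plain-Python illustration, not the released implementation: the names `viveka_mask` and `consolidation_penalty`, the flat-list parameter layout, and the exact form of the penalty are assumptions. The default values `keep=0.2` and `lam=10.0` mirror the configuration reported in this card.

```python
def viveka_mask(omega, keep=0.2):
    """Return a 0/1 mask keeping the top `keep` fraction of importance values.

    `omega` is a flat list of non-negative importance scores (assumed layout).
    Ties at the threshold are all kept, so slightly more than `keep` may survive.
    """
    if not omega:
        return []
    k = max(1, round(keep * len(omega)))  # number of weights to protect
    threshold = sorted(omega, reverse=True)[k - 1]
    return [1.0 if w >= threshold else 0.0 for w in omega]


def consolidation_penalty(theta, theta_star, omega, mask, lam=10.0):
    """EWC-style quadratic penalty restricted to masked weights (assumed form)."""
    return lam * sum(
        m * w * (t - s) ** 2
        for m, w, t, s in zip(mask, omega, theta, theta_star)
    )


# Toy example: five weights, protect the two most important ones.
omega = [0.1, 0.9, 0.3, 0.05, 0.5]
mask = viveka_mask(omega, keep=0.4)
penalty = consolidation_penalty(
    theta=[1.0, 1.0, 1.0, 1.0, 1.0],
    theta_star=[0.0, 0.5, 0.0, 0.0, 0.5],
    omega=omega,
    mask=mask,
)
print(mask, penalty)
```

Weights outside the mask contribute nothing to the penalty, so they remain free to adapt to new tasks. That is one way to read "protects what matters without over-rigidity".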
What v2 achieved
With the breakthrough configuration (buffer=5120, viveka_keep=0.2, α=0.5), v2 recovers accuracy to SOTA
level (0.799 ± .008 vs 0.804 ± .014 for DER++, a gap smaller than either reported ± spread) while keeping
forgetting ~3× lower (0.023 vs 0.067). It improves the trade-off rather than merely moving along it:
| Method | Accuracy ↑ | Forgetting ↓ | BWT |
|---|---|---|---|
| DER++ (SOTA) | 0.804 ± .014 | 0.067 ± .017 | −0.064 |
| Antaḥkaraṇa-v2 (this model) | 0.799 ± .008 | 0.023 ± .002 | −0.015 |
| Antaḥkaraṇa-v1 (stable variant) | 0.643 | 0.017 | +0.008 |
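For readers unfamiliar with the three columns, the definitions commonly used in the continual-learning literature can be computed from an accuracy matrix as sketched below. This assumes the card follows those standard definitions, which it does not state explicitly. Here `R[i][j]` is the accuracy on task `j` after training through task `i`; the function name and the toy numbers are illustrative only.

```python
def continual_metrics(R):
    """Compute (average accuracy, forgetting, backward transfer) from R.

    R is a T x T list of lists; only entries with i >= j are used.
    """
    T = len(R)
    if T < 2:
        raise ValueError("need at least two tasks")
    final = R[T - 1]

    # Average accuracy over all tasks after the last task has been learned.
    avg_acc = sum(final[j] for j in range(T)) / T

    # Forgetting: best accuracy ever reached on an old task minus its final accuracy.
    forgetting = sum(
        max(R[i][j] for i in range(j, T - 1)) - final[j]
        for j in range(T - 1)
    ) / (T - 1)

    # Backward transfer: final accuracy minus accuracy right after learning the task.
    bwt = sum(final[j] - R[j][j] for j in range(T - 1)) / (T - 1)
    return avg_acc, forgetting, bwt


# Toy 3-task accuracy matrix (made-up numbers, not results from this card).
R = [
    [0.90, 0.00, 0.00],
    [0.80, 0.85, 0.00],
    [0.70, 0.80, 0.95],
]
avg_acc, forgetting, bwt = continual_metrics(R)
print(avg_acc, forgetting, bwt)
```

Under these definitions a positive BWT means that later training improved earlier tasks, and forgetting is never smaller than the negative of BWT.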
Among the methods in our benchmark, v2 is the only one that is both accurate and stable: it lands in the "ideal corner" (high accuracy, low forgetting).
What we tested (it generalizes)
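The result is not a single-dataset artifact: the pattern (accuracy close to DER++, lower forgetting) holds across datasets, stream lengths, and model sizes (each a separate, fair, multi-seed run), though the margins vary. The forgetting gap is widest on the headline setting (0.023 vs 0.067) and narrowest on the 20-task stream (0.054 vs 0.060), where v2 also trails DER++ on accuracy (0.782 vs 0.827):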
| Setting | DER++ (acc / forget) | v2 (acc / forget) | v1 (acc / forget) |
|---|---|---|---|
| Split-CIFAR-100, 10 tasks (headline) | 0.804 / 0.067 | 0.799 / 0.023 | 0.643 / 0.017 |
| Split-CIFAR-100, 20-task lifelong | 0.827 / 0.060 | 0.782 / 0.054 | 0.638 / 0.040 |
| Split-Tiny-ImageNet, 200-class | 0.470 / 0.177 | 0.456 / 0.108 | 0.380 / 0.013 |
| Bigger backbone WRN-28-12 (52.6M) | 0.790 / 0.070 | 0.753 / 0.037 | 0.603 / 0.018 |
On the harder Split-Tiny-ImageNet, DER++ shows its largest forgetting (0.177), while v2 stays at 0.108 and v1 at 0.013. Together, v1
(ultra-stable) and v2 (high-accuracy) form a tunable stability↔accuracy family with two knobs: alpha (replay
distillation weight) and viveka_keep (selective-consolidation fraction).
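To make the role of `alpha` concrete, here is a minimal sketch of a replay objective that combines the new-task loss with a logit-matching term on buffered samples. The mean-squared-error form (as used in DER-style replay) and the function names are assumptions. The card only states that past tasks' output distributions are rehearsed and that α weights that term.

```python
def logit_replay_term(current_logits, stored_logits):
    """Mean squared error between current and stored logits for one sample."""
    n = len(current_logits)
    return sum((c - s) ** 2 for c, s in zip(current_logits, stored_logits)) / n


def total_loss(task_loss, replay_pairs, alpha=0.5):
    """Combine the current-task loss with the alpha-weighted replay term.

    replay_pairs: list of (current_logits, stored_logits) for buffered samples.
    """
    if not replay_pairs:
        return task_loss
    replay = sum(logit_replay_term(c, s) for c, s in replay_pairs) / len(replay_pairs)
    return task_loss + alpha * replay


# Toy example with two buffered samples and two logits each.
pairs = [([2.0, 0.0], [1.0, 0.0]), ([0.0, 1.0], [0.0, 3.0])]
loss = total_loss(task_loss=1.0, replay_pairs=pairs, alpha=0.5)
print(loss)
```

Setting `alpha` to 0 recovers plain training on the current task. Larger values pull the outputs on buffered samples toward their recorded logits, which favors stability over plasticity.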
Model details
| Field | Value |
|---|---|
| Architecture | WideResNet-28-10 trunk + per-task linear heads (task-incremental) |
| Params | 36.5M (a 52.6M WRN-28-12 variant was also validated) |
| Method | Antaḥkaraṇa-v2: saṃskāra (EWC + decay) · guṇa · vijñāna-smṛti (logit replay) · viveka · pramāṇa · turīya |
| Config | λ=10 · decay=0.7 · buffer=5120 · α=0.5 · viveka_keep=0.2 · 25 epochs/task |
| Benchmark | Split-CIFAR-100 (10 tasks × 10 classes), trained from scratch, 5 seeds |
| Formats | .pt (weights + saṃskāra Ω/θ*) · model.safetensors · config.json |
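The task-incremental setup in the table (10 tasks × 10 classes, one linear head per task) implies a mapping from each CIFAR-100 label to a task id and a head-local label. A minimal sketch, assuming classes are assigned to tasks in contiguous order; the class ordering actually used for training is not stated here, and `split_label` is a hypothetical helper:

```python
NUM_CLASSES = 100
NUM_TASKS = 10
CLASSES_PER_TASK = NUM_CLASSES // NUM_TASKS


def split_label(label):
    """Map a global class id to (task id, label within that task's head)."""
    if not 0 <= label < NUM_CLASSES:
        raise ValueError(f"label must be in [0, {NUM_CLASSES - 1}], got {label}")
    return divmod(label, CLASSES_PER_TASK)


# Global class ids that belong to each task under the contiguous assumption.
task_classes = [
    list(range(t * CLASSES_PER_TASK, (t + 1) * CLASSES_PER_TASK))
    for t in range(NUM_TASKS)
]

task, local = split_label(37)
print(task, local, task_classes[task])
```

The task id returned here plays the same role as the `task` argument passed to the model at inference time: it selects which head produces the 10-way logits.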
Usage
```python
# load_akn.py is included: self-contained, no repo needed
import torch
from load_akn import load

model, ckpt = load("antahkarana-v2-36.5M-cifar100-wrn28-10.pt")

x = torch.randn(4, 3, 32, 32)     # placeholder batch; replace with real CIFAR-100 images, shape [N,3,32,32]
task = 0                          # task id in [0..9]; selects the per-task head
logits = model(x, task)
print(ckpt["config"]["metrics"])  # per-task metrics stored with the checkpoint
print(ckpt["omega"].keys())       # the saṃskāra importance Ω it chose to protect
```
The bigger picture — where v2 leads
v2's core idea — dark-knowledge (logit) replay — is especially powerful for large models, where knowledge transfers through output distributions. That is exactly why it became the base for the language model: the same mechanisms, ported onto a frozen backbone, produced Antahkarana-7B (continual learning on a 7B LLM, ~3.8× less forgetting than naive LoRA). The scaling path runs 36.5M (this model) → 7B → toward 13B–70B.
License & citation
Released under the MIT License — fully original work, trained from scratch (no pretrained base model). Evaluated on CIFAR-100 (Krizhevsky, 2009) and Tiny-ImageNet.
@misc{antahkaranav2_2026,
title = {Antahkarana-v2: Recovering Accuracy in Vedic-Derived Continual Learning},
author = {Deepak Soni},
year = {2026},
url = {https://huggingface.co/deepakdsoni/antahkarana-v2}
}
Built on the Upaniṣads, Sāṃkhya, Yoga, Nyāya, and modern ML (PyTorch).

