Disentangling Gradient Quality from Architecture in Recursive Reasoning Models
Two TRM variants trained on Sudoku-Extreme to isolate the effect of the gradient method from that of the architecture.
Models
| Variant | Gradient Method | Token Accuracy | Exact Accuracy |
|---|---|---|---|
| trm_fullbp | Full BPTT (O(T) memory) | 71.6% | 18.9% |
| trm_1step | 1-step gradient (O(1) memory) | 65.9% | 2.2% |
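The two accuracy columns measure different things, which is why they can diverge so sharply. The sketch below assumes the usual definitions: token accuracy is the fraction of individual cells predicted correctly, and exact accuracy is the fraction of puzzles with every cell correct. The function names and toy data are hypothetical, and the evaluation script used for this model may differ (for example, by scoring only the blank cells).

```python
def token_accuracy(preds, targets):
    """Fraction of individual cells predicted correctly, pooled over all puzzles."""
    correct = 0
    total = 0
    for p, t in zip(preds, targets):
        correct += sum(a == b for a, b in zip(p, t))
        total += len(t)
    return correct / total


def exact_accuracy(preds, targets):
    """Fraction of puzzles in which every cell is correct."""
    solved = sum(list(p) == list(t) for p, t in zip(preds, targets))
    return solved / len(targets)


# Toy data: 3 "puzzles" of 4 cells each (a real Sudoku grid has 81 cells).
targets = [[1, 2, 3, 4], [4, 3, 2, 1], [2, 1, 4, 3]]
preds = [[1, 2, 3, 4], [4, 3, 2, 2], [2, 1, 3, 3]]

tok = token_accuracy(preds, targets)  # 10 of 12 cells correct
ex = exact_accuracy(preds, targets)   # 1 of 3 puzzles fully correct
print(f"token={tok:.3f} exact={ex:.3f}")
```

A single wrong cell makes a whole puzzle count as unsolved, so a modest drop in token accuracy can translate into a much larger drop in exact accuracy.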
Both share the same flat TRM architecture. The only difference is the gradient method.
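To make the difference between the two gradient methods concrete, here is a minimal sketch on a toy scalar recurrence `z_{t+1} = tanh(w * z_t + x)`. It is not the TRM architecture and not the training code used for these checkpoints. It assumes that "1-step gradient" means differentiating only through the final recursion step and treating the earlier state as a constant; all function names are hypothetical.

```python
import math


def forward(w, x, z0, T):
    """Run T recursion steps and return every state z_0 .. z_T."""
    zs = [z0]
    for _ in range(T):
        zs.append(math.tanh(w * zs[-1] + x))
    return zs


def loss(w, x, z0, y, T):
    """Squared error on the final state."""
    return 0.5 * (forward(w, x, z0, T)[-1] - y) ** 2


def grad_full_bptt(w, x, z0, y, T):
    """dL/dw through all T steps; stores every state, so O(T) memory."""
    zs = forward(w, x, z0, T)
    g = zs[-1] - y  # dL/dz_T
    grad_w = 0.0
    for t in reversed(range(T)):
        local = 1.0 - zs[t + 1] ** 2  # derivative of tanh at step t
        grad_w += g * local * zs[t]   # direct contribution of w at step t
        g = g * local * w             # propagate back to z_t
    return grad_w


def grad_one_step(w, x, z0, y, T):
    """dL/dw through the last step only; keeps one state, so O(1) memory."""
    z_prev = z0
    for _ in range(T - 1):
        z_prev = math.tanh(w * z_prev + x)  # earlier steps carry no gradient
    z_last = math.tanh(w * z_prev + x)
    return (z_last - y) * (1.0 - z_last ** 2) * z_prev


w, x, z0, y, T = 0.8, 0.3, 0.1, -0.5, 6
full = grad_full_bptt(w, x, z0, y, T)
one = grad_one_step(w, x, z0, y, T)
print(f"full BPTT: {full:.4f}  1-step: {one:.4f}")
```

In this toy setting the 1-step gradient equals the last term of the full BPTT sum, so it drops the contribution of every earlier step. The two coincide only when `T = 1`.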
Architecture
- Base: TinyRecursiveModels (TRM)
- Hidden size: 384
- Layers: 2
- Attention heads: 8
- H-cycles: 3, L-cycles: 6
- Parameters: ~7M
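The hyperparameters above can be collected into a small config object for sanity checks. This is only a sketch: the class and field names are hypothetical and do not reproduce the config schema of the TinyRecursiveModels codebase. The inner-update count assumes a plain nested loop (L-cycles inside H-cycles); the actual codebase may count updates differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TRMConfigSketch:
    """Hypothetical container for the hyperparameters listed above."""

    hidden_size: int = 384
    num_layers: int = 2
    num_heads: int = 8
    h_cycles: int = 3
    l_cycles: int = 6

    @property
    def head_dim(self) -> int:
        # The hidden size must split evenly across attention heads.
        assert self.hidden_size % self.num_heads == 0
        return self.hidden_size // self.num_heads

    @property
    def inner_updates(self) -> int:
        # Assumption: each H-cycle runs l_cycles inner updates.
        return self.h_cycles * self.l_cycles


cfg = TRMConfigSketch()
print(cfg.head_dim, cfg.inner_updates)
```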
Training
- Dataset: Sudoku-Extreme — 500 puzzles + 500 augmentations
- Epochs: 10,000
- Batch size: 512 (256 per GPU)
- Optimizer: AdamW (lr=1e-4, β1=0.9, β2=0.95)
- Hardware: 2× NVIDIA Tesla T4 (16GB)
Usage
This model requires the TRM codebase from SamsungSAILMontreal/TinyRecursiveModels.
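    import torch

    # Load the checkpoint onto the CPU.
    ckpt = torch.load("trm_fullbp/step_9764", map_location="cpu")

    # Load into the TRM model (requires the TinyRecursiveModels codebase).
    # strict=False ignores missing or unexpected keys, so check the return
    # value of load_state_dict for mismatches.
    # model = TinyRecursiveReasoningModel_ACTV1(config)
    # model.load_state_dict(ckpt, strict=False)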
Paper
Disentangling Gradient Quality from Architecture in Recursive Reasoning Models | Code
Abstract: We disentangle gradient quality from architecture design in recursive reasoning models by running TRM's flat architecture with HRM's 1-step gradient approximation. Our results show gradient quality, not architecture, is the dominant factor explaining the performance gap.
Citation
@article{jj2026disentangling,
title={Disentangling Gradient Quality from Architecture in Recursive Reasoning Models},
author={Jatin (JJ)},
doi={10.5281/zenodo.20712090},
year={2026}
}