Disentangling Gradient Quality from Architecture in Recursive Reasoning Models

Two TRM variants, trained on Sudoku-Extreme, that isolate the effect of the gradient method from that of the architecture design.

Models

Variant	Gradient Method	Token Accuracy	Exact Accuracy
`trm_fullbp`	Full BPTT (O(T) memory)	71.6%	18.9%
`trm_1step`	1-step gradient (O(1) memory)	65.9%	2.2%
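The gap between the two accuracy columns is easier to read with the metrics spelled out. The sketch below assumes the usual definitions: token accuracy is the fraction of individual cells predicted correctly, and exact accuracy is the fraction of puzzles with every cell correct. This card does not define them explicitly, and the function names and toy data are illustrative only.

```python
def token_accuracy(preds, targets):
    """Fraction of individual cells predicted correctly across all puzzles."""
    correct = sum(p == t for pred, tgt in zip(preds, targets) for p, t in zip(pred, tgt))
    total = sum(len(tgt) for tgt in targets)
    return correct / total


def exact_accuracy(preds, targets):
    """Fraction of puzzles in which every cell is correct."""
    solved = sum(list(pred) == list(tgt) for pred, tgt in zip(preds, targets))
    return solved / len(targets)


# Toy data: 3 "puzzles" of 4 cells each (a real Sudoku grid has 81 cells).
targets = [[1, 2, 3, 4], [4, 3, 2, 1], [2, 1, 4, 3]]
preds = [[1, 2, 3, 4], [4, 3, 2, 2], [2, 1, 3, 3]]

print(token_accuracy(preds, targets))  # 10 of 12 cells correct
print(exact_accuracy(preds, targets))  # 1 of 3 puzzles fully correct
```

A single wrong cell makes the whole puzzle count as unsolved, so token accuracy can sit far above exact accuracy.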

Both share the same flat TRM architecture. The only difference is the gradient method.
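To make the difference concrete, here is a minimal PyTorch sketch of the two gradient methods on a toy recurrent update. It is not the TRM training code: the `step` and `readout` modules, the tensor shapes, and `T` are hypothetical stand-ins, and the 1-step variant is assumed to run all but the last recursion step without building an autograd graph.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-ins for one recursion step and an output head; NOT the TRM network.
step = nn.Sequential(nn.Linear(8, 8), nn.Tanh())
readout = nn.Linear(8, 1)
x = torch.randn(4, 8)
y = torch.randn(4, 1)
T = 6  # number of recursion steps (illustrative)


def loss_full_bptt():
    # Every step stays in the autograd graph, so activation memory grows with T.
    z = torch.zeros_like(x)
    for _ in range(T):
        z = step(z + x)
    return ((readout(z) - y) ** 2).mean()


def loss_one_step():
    # The first T-1 steps build no graph; only the final step is differentiated,
    # so activation memory does not grow with T.
    z = torch.zeros_like(x)
    with torch.no_grad():
        for _ in range(T - 1):
            z = step(z + x)
    z = step(z + x)
    return ((readout(z) - y) ** 2).mean()


def grad_of(loss_fn):
    # Gradient of the loss with respect to the recursion step's weight matrix.
    step.zero_grad()
    readout.zero_grad()
    loss_fn().backward()
    return step[0].weight.grad.clone()


g_full = grad_of(loss_full_bptt)
g_1step = grad_of(loss_one_step)

# The forward pass is identical; only the gradient differs.
print(torch.allclose(loss_full_bptt(), loss_one_step()))
print(torch.nn.functional.cosine_similarity(g_full.flatten(), g_1step.flatten(), dim=0))
```

In this toy setting the 1-step gradient drops every contribution that flows through earlier recursion steps, which is the approximation the memory saving is traded against.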

Architecture

Base: TinyRecursiveModels (TRM)
Hidden size: 384
Layers: 2
Attention heads: 8
H-cycles: 3, L-cycles: 6
Parameters: ~7M
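As a rough guide to how much computation these settings imply, the sketch below counts layer applications per forward pass. It assumes the recursion scheme described in the TRM paper, where each H-cycle performs `l_cycles` latent updates plus one answer update and every update passes through all layers; the function name is illustrative.

```python
def effective_depth(h_cycles: int, l_cycles: int, layers: int) -> int:
    # Assumed scheme: each H-cycle runs `l_cycles` latent updates plus one
    # answer update, and every update passes through all `layers`.
    return h_cycles * (l_cycles + 1) * layers


print(effective_depth(h_cycles=3, l_cycles=6, layers=2))  # 42
```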

Training

Dataset: Sudoku-Extreme — 500 puzzles + 500 augmentations
Epochs: 10,000
Batch size: 512 (256 per GPU)
Optimizer: AdamW (lr=1e-4, β1=0.9, β2=0.95)
Hardware: 2× NVIDIA Tesla T4 (16GB)
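The optimizer settings above map onto PyTorch roughly as follows. This is a sketch rather than the actual training script: the `nn.Linear` module is a toy stand-in for the model, and because weight decay is not stated above, PyTorch's default is left in place.

```python
import torch
import torch.nn as nn

model = nn.Linear(384, 384)  # toy stand-in; the real model comes from the TRM codebase

# lr and betas are taken from the list above; weight decay is not stated there.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, betas=(0.9, 0.95))

per_gpu_batch = 256
num_gpus = 2
global_batch = per_gpu_batch * num_gpus  # matches the stated batch size of 512
```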

Usage

Both checkpoints require the TRM codebase from SamsungSAILMontreal/TinyRecursiveModels.

import torch

# Load checkpoint
ckpt = torch.load("trm_fullbp/step_9764", map_location="cpu")

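# Load into the TRM model (requires the TinyRecursiveModels codebase).
# `config` must match the architecture settings listed above.
# model = TinyRecursiveReasoningModel_ACTV1(config)
# strict=False ignores missing or unexpected keys instead of raising.
# model.load_state_dict(ckpt, strict=False)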
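Because `strict=False` does not raise on key mismatches, it is worth checking what was actually loaded. The sketch below uses a toy `nn.Linear` in place of the TRM model and a hand-built state dict (both hypothetical) to show the `missing_keys` and `unexpected_keys` fields that `load_state_dict` returns.

```python
import torch.nn as nn

model = nn.Linear(4, 2)  # toy stand-in for the TRM model

# Hypothetical checkpoint: "weight" matches, "bias" is absent, "extra.bias" is unknown.
state = {
    "weight": model.weight.detach().clone(),
    "extra.bias": model.bias.detach().clone(),
}

result = model.load_state_dict(state, strict=False)
print(result.missing_keys)     # parameters the checkpoint did not provide
print(result.unexpected_keys)  # checkpoint entries the model did not use
```

If either list is non-empty for a real checkpoint, the model configuration probably does not match the one used for training.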

Paper

Disentangling Gradient Quality from Architecture in Recursive Reasoning Models | Code

Abstract: We disentangle gradient quality from architecture design in recursive reasoning models by running TRM's flat architecture with HRM's 1-step gradient approximation. Our results show gradient quality, not architecture, is the dominant factor explaining the performance gap.

Citation

@article{jj2026disentangling,
  title={Disentangling Gradient Quality from Architecture in Recursive Reasoning Models},
  author={Jatin (JJ)},
  doi={10.5281/zenodo.20712090},
  year={2026}
}
