Disentangling Gradient Quality from Architecture in Recursive Reasoning Models

Two TRM model variants were trained on Sudoku-Extreme to isolate the effect of the gradient method from that of the architecture design.

Models

Variant      Gradient method                 Token accuracy   Exact accuracy
trm_fullbp   Full BPTT (O(T) memory)         71.6%            18.9%
trm_1step    1-step gradient (O(1) memory)   65.9%            2.2%

Both share the same flat TRM architecture. The only difference is the gradient method.
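
To make the difference between the two gradient methods concrete, here is a minimal sketch on a toy scalar recurrence, z_t = tanh(w * z_{t-1} + x). It is not the TRM or HRM training code. The function names and the recurrence are illustrative assumptions, and the 1-step gradient is taken here to mean backpropagating through the final step only while treating the previous state as a constant.

```python
import math


def step(w, z, x):
    """One recurrence step: z_next = tanh(w * z + x)."""
    return math.tanh(w * z + x)


def final_state(w, x, z0, T):
    """Run T recurrence steps and return z_T."""
    z = z0
    for _ in range(T):
        z = step(w, z, x)
    return z


def grad_full_bptt(w, x, z0, T):
    """d z_T / d w by backpropagating through all T steps.

    Every intermediate state is stored, so memory grows as O(T).
    """
    states = [z0]
    for _ in range(T):
        states.append(step(w, states[-1], x))

    grad = 0.0      # accumulated d z_T / d w
    upstream = 1.0  # d z_T / d z_t, starting at t = T
    for t in range(T, 0, -1):
        z_prev, z_t = states[t - 1], states[t]
        local = 1.0 - z_t ** 2              # derivative of tanh at step t
        grad += upstream * local * z_prev   # direct dependence of z_t on w
        upstream *= local * w               # carry the gradient back to z_{t-1}
    return grad


def grad_one_step(w, x, z0, T):
    """Approximate d z_T / d w using only the last step.

    z_{T-1} is treated as a constant, so only one state is kept: O(1) memory.
    """
    z_prev = final_state(w, x, z0, T - 1)
    z_T = step(w, z_prev, x)
    return (1.0 - z_T ** 2) * z_prev


if __name__ == "__main__":
    w, x, z0, T = 0.5, 0.3, 0.1, 6
    print("full BPTT:", grad_full_bptt(w, x, z0, T))
    print("1-step   :", grad_one_step(w, x, z0, T))
```

For T = 1 the two functions agree exactly. For larger T the 1-step value drops every term that flows through earlier states, which is the approximation whose effect the two variants are meant to measure.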

Architecture

  • Base: TinyRecursiveModels (TRM)
  • Hidden size: 384
  • Layers: 2
  • Attention heads: 8
  • H-cycles: 3, L-cycles: 6
  • Parameters: ~7M

Training

  • Dataset: Sudoku-Extreme — 500 puzzles + 500 augmentations
  • Epochs: 10,000
  • Batch size: 512 (256 per GPU)
  • Optimizer: AdamW (lr=1e-4, β1=0.9, β2=0.95)
  • Hardware: 2× NVIDIA Tesla T4 (16GB)
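
The hyperparameters listed under Architecture and Training can be collected in one place for quick sanity checks. The sketch below is illustrative only: `RunConfig` and its field names are hypothetical and do not correspond to the TRM codebase's config schema, and the per-head dimension assumes the usual convention of splitting the hidden size evenly across attention heads.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunConfig:
    # Architecture (values from this card; field names are hypothetical)
    hidden_size: int = 384
    num_layers: int = 2
    num_heads: int = 8
    h_cycles: int = 3
    l_cycles: int = 6
    # Training (values from this card)
    global_batch_size: int = 512
    num_gpus: int = 2
    lr: float = 1e-4
    betas: tuple = (0.9, 0.95)

    @property
    def head_dim(self) -> int:
        """Per-head width, assuming hidden_size is split evenly across heads."""
        if self.hidden_size % self.num_heads != 0:
            raise ValueError("hidden_size must be divisible by num_heads")
        return self.hidden_size // self.num_heads

    @property
    def per_gpu_batch_size(self) -> int:
        """Batch size seen by each GPU under plain data parallelism."""
        if self.global_batch_size % self.num_gpus != 0:
            raise ValueError("global_batch_size must be divisible by num_gpus")
        return self.global_batch_size // self.num_gpus


cfg = RunConfig()
print(cfg.head_dim, cfg.per_gpu_batch_size)
```

With the values above, the per-GPU batch size works out to the 256 stated in the list.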

Usage

These checkpoints require the TRM codebase from SamsungSAILMontreal/TinyRecursiveModels.

import torch

# Load checkpoint
ckpt = torch.load("trm_fullbp/step_9764", map_location="cpu")

# Load into TRM model (requires TinyRecursiveModels codebase)
# model = TinyRecursiveReasoningModel_ACTV1(config)
# model.load_state_dict(ckpt, strict=False)
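
Because `strict=False` skips any keys that do not match, it can be worth comparing the checkpoint's keys with the model's before loading. The helpers below are a hedged sketch in plain Python and are not part of the TRM codebase. The `_orig_mod.` prefix is the one `torch.compile` adds to parameter names of a wrapped module; whether these checkpoints carry it is an assumption to verify by inspecting `ckpt.keys()`.

```python
def strip_prefix(state_dict, prefix="_orig_mod."):
    """Return a copy of state_dict with `prefix` removed from keys that start with it."""
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }


def key_mismatch(ckpt_keys, model_keys):
    """Return (missing, unexpected): keys the model expects but the checkpoint
    lacks, and keys the checkpoint has but the model does not."""
    ckpt_keys, model_keys = set(ckpt_keys), set(model_keys)
    missing = sorted(model_keys - ckpt_keys)
    unexpected = sorted(ckpt_keys - model_keys)
    return missing, unexpected


if __name__ == "__main__":
    # Toy stand-ins for a checkpoint and a model's state-dict keys.
    toy_ckpt = {"_orig_mod.layer.weight": 0, "layer.bias": 1}
    cleaned = strip_prefix(toy_ckpt)
    print(key_mismatch(cleaned.keys(), ["layer.weight", "layer.bias"]))
```

PyTorch's `load_state_dict` also returns an object with `missing_keys` and `unexpected_keys`, so the same check can be made on its return value after loading.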

Paper

Disentangling Gradient Quality from Architecture in Recursive Reasoning Models | Code

Abstract: We disentangle gradient quality from architecture design in recursive reasoning models by running TRM's flat architecture with HRM's 1-step gradient approximation. Our results show gradient quality, not architecture, is the dominant factor explaining the performance gap.

Citation

@article{jj2026disentangling,
  title={Disentangling Gradient Quality from Architecture in Recursive Reasoning Models},
  author={Jatin (JJ)},
  doi={10.5281/zenodo.20712090},
  year={2026}
}