TinyMathReason-1B

TinyMathReason-1B is a 1.12 Billion parameter Llama-style decoder-only transformer trained from scratch specifically for mathematical reasoning. This repository contains the [Base / SFT / GRPO] variant.

Model Description

Developed by: Himanshu Nakrani
Model type: Decoder-only Transformer
Language(s): English, Mathematics, Code
License: Apache 2.0
Architecture: 22 layers, 2048 hidden dimension, 16 Attention heads, 4 KV heads (GQA), SwiGLU activation (5632 intermediate dim).
Parameters: 1.12B total
Context Length: 4096 tokens

Training Details

Pretraining (Base Model)

The base model was trained from a random initialization on Google Cloud TPU v4-32 using the MaxText framework.

Tokens: ~57 Billion (Run 11 Rerun)
Optimizer: AdamW (β1=0.9, β2=0.95, weight_decay=0.1)
Learning Rate: 3e-4 peak, cosine decay
Data Mix:
- 40% FineWeb-Edu
- 35% OpenWebMath
- 15% Proof-Pile-2
- 10% Stack-Edu (Code)

Supervised Fine-Tuning (SFT)

The SFT variant was trained on reasoning traces (MathInstruct, MetaMathQA) formatted in ChatML.

Hardware: 1x AMD MI300X GPU using PyTorch + TRL
Learning Rate: 2e-5 (Cosine schedule)
Epochs: 2

Group Relative Policy Optimization (GRPO)

The GRPO variant was trained using Group Relative Policy Optimization to improve reasoning traces and rule-based correctness.

Hardware: 1x NVIDIA L4 GPU using PyTorch + TRL
Learning Rate: 5e-6
Beta: 0.01
Group Size (G): 8

Evaluation Results

Benchmark	Base Model	SFT Model	GRPO Model
GSM8K (8-shot)	1.0%	1.0%	2.2% (Flex)
Minerva Math (4-shot)	0.0%	0.0%	2.0%
ARC-Challenge (25-shot)	21.7%	24.7%	22.8%
MMLU (5-shot)	23.5%	24.6%	23.6%
HellaSwag (10-shot)	25.8%	26.7%	26.3%

Intended Uses & Limitations

Intended Uses:

Solving step-by-step grade-school to high-school level math problems.
Educational assistance and logic-based chain-of-thought generation.

Limitations:

Being a 1B parameter model, it lacks the broad general knowledge of larger models.
Prone to arithmetic hallucination on very large numbers.
GRPO traces often contain repetitive phrases or mode collapse loops.

Environmental Impact

Hardware Type: TPU v4-32 + NVIDIA A100s
Hours used: ~300 hours (TPU) + ~24 hours (GPU)
Cloud Provider: Google Cloud Platform, Vultr, Modal
Compute Region: us-central
Estimated CO2 Emissions: ~TBD kg CO2 eq.

Citation

@misc{tinymathreason2026,
  author = {Your Name},
  title = {TinyMathReason-1B: A 1 Billion Parameter Mathematical Reasoning LLM Built from Scratch on TPU v4-32},
  year = {2026},
  publisher = {GitHub},
  howpublished = {\url{https://github.com/your-username/TinyMathReason-1B}}
}

Downloads last month: 52

Safetensors

Model size

1B params

Tensor type

BF16

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

himanshunakrani9
/

TinyMathReason-1B-grpo