Lumia 62M

A 62.8M parameter reasoning language model, fine-tuned from Supra-50M-Reasoning on 35,944 curated reasoning samples.

Small enough to run on a phone. Smart enough to reason.


Model Details

Attribute Value
Architecture LlamaForCausalLM
Parameters 62.8M
Hidden size 448
Layers 14
Attention heads 8 (8 KV heads; with equal query and KV head counts, GQA is equivalent to standard multi-head attention)
Head dim 56
Context length 4096 (YaRN extended, factor 4.0)
Vocab size 32,000 (the special tokens listed under Supported Tokens use IDs above this range)
Precision bfloat16 (~125 MB)
License Apache 2.0
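
The head dimension and checkpoint size in the table follow from the other rows. A minimal arithmetic sketch, assuming 2 bytes per bfloat16 parameter and decimal megabytes; the function names are illustrative and not part of the repo:

```python
def head_dim(hidden_size: int, num_heads: int) -> int:
    """Per-head width; the hidden size must divide evenly across heads."""
    if hidden_size % num_heads != 0:
        raise ValueError("hidden_size must be divisible by num_heads")
    return hidden_size // num_heads


def weight_size_mb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate weight size in decimal megabytes (2 bytes for bf16/fp16)."""
    return num_params * bytes_per_param / 1e6


print(head_dim(448, 8))                   # head dim from hidden size and heads
print(round(weight_size_mb(62.8e6), 1))   # approximate bf16 checkpoint size
```

This covers the weights only; activations and the KV cache add to the memory needed at inference time.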

Training Configuration

Hyperparameter Value
Framework TRL SFTTrainer + PEFT LoRA
LoRA rank r=32, α=64 (all linear layers)
Precision fp16, torch.compile enabled
Batch 4 per GPU, gradient accumulation 1
Effective batch 8 (2× T4 DDP)
Learning rate 2e-4 cosine, 5% warmup
Max seq length 4096
Epochs 4 planned, 0.29 completed
Hardware 2× Tesla T4 (16 GB each)
Training time ~55 min
Framework versions TRL 1.7.0, PyTorch 2.x

Training Results

Metric Value
Best eval loss 7.8651 (step 1100)
Final train loss 7.7178
Total steps 1,100
Tokens processed 35.7M
Dataset 35,944 train / 734 eval
Samples/sec ~3.93
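
The batch and token figures above can be cross-checked with a few lines of arithmetic. This is a sketch using only numbers quoted in this card; the variable names are illustrative:

```python
PER_GPU_BATCH = 4
NUM_GPUS = 2
GRAD_ACCUM = 1
MAX_SEQ_LEN = 4096
TOTAL_STEPS = 1100
TOKENS_PROCESSED = 35.7e6


def effective_batch(per_gpu: int, gpus: int, accum: int) -> int:
    """Sequences consumed per optimizer step across all devices."""
    return per_gpu * gpus * accum


eff_batch = effective_batch(PER_GPU_BATCH, NUM_GPUS, GRAD_ACCUM)
tokens_per_step = TOKENS_PROCESSED / TOTAL_STEPS
step_capacity = eff_batch * MAX_SEQ_LEN        # upper bound on tokens per step
fill_ratio = tokens_per_step / step_capacity   # how full each step is on average

print(eff_batch, step_capacity, round(tokens_per_step), round(fill_ratio, 3))
```

A fill ratio close to 1 would be consistent with sequences that are packed or near the maximum length, though the card does not say which applies.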

Loss Curves

Figure: training and eval loss (loss_curve.svg)

Loss decreases steadily across 1,100 steps. Train loss drops from 10.47 → 7.72 (26.3% reduction), eval loss from 10.43 → 7.87 (24.6% reduction). No overfitting observed: train and eval curves track closely.

Learning Rate Schedule

Figure: learning rate schedule (lr_schedule.svg)

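Cosine schedule with 5% warmup. The 5% is measured against the full planned 4-epoch schedule (roughly 18,000 steps at an effective batch of 8), so warmup lasts about 900 steps rather than 55: the peak LR of 2e-4 is reached at step 900, and cosine decay begins after that. Because training stopped at step 1,100, most of this run took place during warmup. The gradual ramp keeps early updates small while the LoRA adapters move away from their initial values.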
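
The shape of the schedule can be sketched as linear warmup followed by cosine decay. This is a stand-alone illustration, not the trainer's own code. `TOTAL_STEPS = 17_972` is an assumption derived from 4 planned epochs over 35,944 samples at an effective batch of 8, and `WARMUP_STEPS = 900` is taken from the step at which the peak LR is reported:

```python
import math

PEAK_LR = 2e-4
WARMUP_STEPS = 900      # assumption: the step where the peak LR is reached
TOTAL_STEPS = 17_972    # assumption: 4 epochs * ceil(35,944 / 8) steps


def lr_at(step: int,
          peak_lr: float = PEAK_LR,
          warmup_steps: int = WARMUP_STEPS,
          total_steps: int = TOTAL_STEPS) -> float:
    """Learning rate at a given optimizer step."""
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak.
        return peak_lr * step / max(1, warmup_steps)
    # Cosine decay from the peak down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for s in (0, 450, 900, 1100, TOTAL_STEPS):
    print(s, lr_at(s))
```

Under these assumptions the learning rate at step 1,100, where training stopped, is still within a fraction of a percent of the peak.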

Gradient Norm

Figure: gradient norm (grad_norm.svg)

Grad norm spikes to 5.4 around steps 400-450, then settles into the 1.5-2.5 range for the remainder of training. A spike of this kind during warmup is not unusual for LoRA fine-tuning.

Loss Progression Table

Step Train Loss Eval Loss Δ Eval
50 10.43 10.43 n/a
100 10.15 10.10 -0.33
200 9.23 9.26 -0.84
300 9.06 9.00 -0.26
400 8.86 8.78 -0.22
500 8.63 8.64 -0.14
600 8.55 8.52 -0.12
700 8.51 8.38 -0.14
800 8.34 8.24 -0.14
900 8.19 8.08 -0.16
1000 8.01 7.96 -0.12
1100 7.72 7.87 -0.09
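
The Δ Eval column and the percentage reductions quoted under Loss Curves can be recomputed from the table. A small sketch; the helper names are illustrative, and the eval reduction uses the best eval loss of 7.8651 from Training Results:

```python
EVAL_LOSS = {
    50: 10.43, 100: 10.10, 200: 9.26, 300: 9.00, 400: 8.78, 500: 8.64,
    600: 8.52, 700: 8.38, 800: 8.24, 900: 8.08, 1000: 7.96, 1100: 7.87,
}


def eval_deltas(losses: dict) -> dict:
    """Change in eval loss relative to the previous logged step."""
    steps = sorted(losses)
    return {
        cur: round(losses[cur] - losses[prev], 2)
        for prev, cur in zip(steps, steps[1:])
    }


def pct_reduction(start: float, end: float) -> float:
    """Relative drop from start to end, in percent."""
    return 100.0 * (start - end) / start


print(eval_deltas(EVAL_LOSS))
print(round(pct_reduction(10.47, 7.72), 1))     # train loss reduction
print(round(pct_reduction(10.43, 7.8651), 1))   # eval loss reduction
```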

Quick Start

Install Dependencies

pip install -r requirements.txt

Interactive Chat

python generate.py

This starts an interactive chat session. Type your messages and get responses from Lumia 62M.

Single Prompt

python generate.py --prompt "Write a Python function to check if a number is prime"

Python API

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("samcheng0/lumia-62m")
tokenizer = AutoTokenizer.from_pretrained("samcheng0/lumia-62m")

prompt = """<|system|>
You are an expert programmer. Think step by step.
<|user|>
Write a Python function to check if a number is prime.
<|assistant|>"""

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)

Evaluation

python eval.py                    # Run all benchmarks
python eval.py --category math    # Run specific category
python eval.py --verbose          # Show full responses
python eval.py --save results.json  # Save results to file

Load LoRA Adapter (Continued Training)

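from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("samcheng0/lumia-62m")
# The adapter lives in the repo's adapter/ subfolder; is_trainable=True keeps the LoRA weights unfrozen.
model = PeftModel.from_pretrained(base, "samcheng0/lumia-62m", subfolder="adapter", is_trainable=True)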

Chat Format

The model supports a chat template with special tokens:

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("samcheng0/lumia-62m")
model = AutoModelForCausalLM.from_pretrained("samcheng0/lumia-62m")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2+2?"},
]

# Apply chat template
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
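
Where the tokenizer is not at hand, the prompt layout from the Python API example can be approximated manually. This is a sketch that assumes each role tag is followed by a newline and the message content, as in that example. The chat template in tokenizer_config.json is authoritative and may differ, and `build_prompt` is a hypothetical helper, not part of the repo:

```python
ROLE_TAGS = {
    "system": "<|system|>",
    "user": "<|user|>",
    "assistant": "<|assistant|>",
}


def build_prompt(messages: list, add_generation_prompt: bool = True) -> str:
    """Join chat messages into a single prompt string (assumed layout)."""
    parts = []
    for message in messages:
        tag = ROLE_TAGS[message["role"]]  # raises KeyError for unknown roles
        parts.append(f"{tag}\n{message['content']}")
    if add_generation_prompt:
        # A trailing assistant tag cues the model to start its reply.
        parts.append(ROLE_TAGS["assistant"])
    return "\n".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2+2?"},
]
print(build_prompt(messages))
```

Comparing this output with `tokenizer.apply_chat_template(..., tokenize=False)` is a quick way to check whether the assumed layout matches the shipped template.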

Supported Tokens

Token ID Purpose
<|system|> 32010 System prompt
<|user|> 32011 User input
<|assistant|> 32012 Model response
<think> 32008 Start reasoning block
</think> 32009 End reasoning block
[INST] 32013 LLaMA-2 instruction start
[/INST] 32014 LLaMA-2 instruction end
<|code|> 32023 Code block marker
<|text|> 32024 Text block marker
<|math|> 32025 Math block marker
<|think|> 32026 Thinking marker
<|answer|> 32027 Answer marker

Note: All 20 special tokens are single-token IDs; the table above lists 12 of them. Each is encoded and decoded as one token rather than being split into sub-word pieces.
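
To separate a reasoning block from the final answer, the `<think>` and `</think>` markers can be parsed out of the decoded text. This sketch assumes the response was decoded with the tags preserved (for example with `skip_special_tokens=False`) and that at most one reasoning block is present; `split_reasoning` is a hypothetical helper:

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple:
    """Return (reasoning, answer); reasoning is empty if no block is found."""
    match = THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    # Everything outside the first <think>...</think> span is the answer.
    answer = (text[: match.start()] + text[match.end():]).strip()
    return reasoning, answer


reasoning, answer = split_reasoning("<think>2+2 is 4</think>The answer is 4.")
print(reasoning)
print(answer)
```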

Generation Parameters

Parameter Default Description
temperature 0.7 Controls randomness (lower = more deterministic)
top_p 0.9 Nucleus sampling threshold
max_new_tokens 512 Maximum tokens to generate
repetition_penalty 1.1 Penalizes repeated tokens
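
These defaults map onto keyword arguments of `model.generate` in transformers. A minimal sketch of collecting them in one place, assuming sampling is wanted (`temperature` and `top_p` only take effect when `do_sample=True`); `generation_kwargs` is a hypothetical helper, not part of the repo:

```python
GENERATION_DEFAULTS = {
    "temperature": 0.7,
    "top_p": 0.9,
    "max_new_tokens": 512,
    "repetition_penalty": 1.1,
}


def generation_kwargs(**overrides) -> dict:
    """Merge caller overrides into the defaults from the table above."""
    unknown = set(overrides) - set(GENERATION_DEFAULTS)
    if unknown:
        raise ValueError(f"unsupported parameters: {sorted(unknown)}")
    kwargs = {**GENERATION_DEFAULTS, **overrides}
    kwargs["do_sample"] = True  # required for temperature/top_p to apply
    return kwargs


print(generation_kwargs(max_new_tokens=128))
# Usage with a loaded model and tokenized inputs:
#   outputs = model.generate(**inputs, **generation_kwargs())
```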

Benchmarks

The model was evaluated on 20 test prompts across 5 categories:

Category Prompts Description
Math 4 Arithmetic, algebra, calculus
Code 4 Python functions, complexity analysis
Reasoning 4 Logic puzzles, pattern recognition
General 4 Knowledge, facts, explanations
Indonesian 4 Translation, comprehension

Run the full benchmark suite:

python eval.py --verbose

Dataset

Fine-tuned on samcheng0/lumia-reasoning-sft-v1: 35,944 train + 734 eval samples.

Data Sources (17 datasets, 12 listed)

Source Type Samples
TeichAI/claude-4.5-opus-high-reasoning-250x Reasoning traces ~2.5K
TeichAI/Claude-Opus-4.6-Reasoning-887x Reasoning traces ~1.8K
nohurry/Opus-4.6-Reasoning-3000x-filtered Reasoning traces ~2.1K
angrygiraffe/claude-opus-4.6-4.7-reasoning-8.7k Code reasoning ~3.5K
Crownelius/Opus-4.6-Reasoning-3300x Reasoning traces ~3K
nvidia/OpenCodeReasoning Code reasoning 10K (sampled)
nvidia/OpenCodeReasoning-2 Code reasoning 8K
open-r1/Mixture-of-Thoughts Mixed reasoning ~5K
open-thoughts/OpenThoughts-114k Reasoning 8K (sampled)
teknium/OpenHermes-2.5 General chat 30K (sampled)
HuggingFaceH4/ultrachat_200k Multi-turn chat 15K (sampled)
cahya/alpaca-id-cleaned Indonesian instruction ~2K

Filter Pipeline

Raw: ~202K lines → Filtered: ~36K (81.6% filtered out)

Filter Threshold
Min total chars 3,000
Min output chars 1,500
Output/input ratio ≥ 1.2
Structural score ≥ 4 (=+3, code block=+2, steps=+2)
Dedup MD5 hash
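
The thresholds above could be applied along the following lines. This is a sketch, not the pipeline actually used. The +3 structural feature is not named in the table, so it is left as a caller-supplied flag. The step-detection regex, the use of character counts for the ratio, and hashing the input and output together for dedup are all assumptions:

```python
import hashlib
import re

MIN_TOTAL_CHARS = 3000
MIN_OUTPUT_CHARS = 1500
MIN_RATIO = 1.2
MIN_STRUCT_SCORE = 4

FENCE = "`" * 3  # marker for a fenced code block
# Assumption: "steps" means lines starting with "Step N" or "1." / "1)".
STEP_RE = re.compile(r"^\s*(?:step\s+\d+|\d+[.)])", re.IGNORECASE | re.MULTILINE)


def structural_score(output: str, has_unnamed_feature: bool = False) -> int:
    """Score an output; the +3 feature is unspecified in the source table."""
    score = 3 if has_unnamed_feature else 0
    if FENCE in output:
        score += 2
    if STEP_RE.search(output):
        score += 2
    return score


def keep(sample_input: str, output: str, seen: set,
         has_unnamed_feature: bool = False) -> bool:
    """Apply the length, ratio, structure, and dedup filters in order."""
    if len(sample_input) + len(output) < MIN_TOTAL_CHARS:
        return False
    if len(output) < MIN_OUTPUT_CHARS:
        return False
    if len(output) < MIN_RATIO * max(1, len(sample_input)):
        return False
    if structural_score(output, has_unnamed_feature) < MIN_STRUCT_SCORE:
        return False
    digest = hashlib.md5((sample_input + "\x00" + output).encode("utf-8")).hexdigest()
    if digest in seen:
        return False  # exact duplicate of an earlier sample
    seen.add(digest)
    return True


seen = set()
out = "Step 1: do it\n" + FENCE + "\nprint(1)\n" + FENCE + "\n" + "x" * 3000
print(keep("Q" * 100, out, seen), keep("Q" * 100, out, seen))
```

The cheap length checks run first so that hashing only happens for samples that pass every other filter.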

Repo Structure

lumia-62m/
├── config.json                # Model architecture
├── model.safetensors          # Merged weights (inference ready)
├── tokenizer.json             # Tokenizer (with special tokens)
├── tokenizer_config.json      # Tokenizer settings + chat template
├── special_tokens_map.json    # Special tokens ID mapping
├── README.md                  # This file
├── requirements.txt           # Python dependencies
├── generate.py                # Interactive inference script
├── eval.py                    # Evaluation benchmark
├── add_special_tokens.py      # Token management script
├── banner.svg                 # Header banner
├── loss_curve.svg             # Training loss chart
├── lr_schedule.svg            # Learning rate chart
├── grad_norm.svg              # Gradient norm chart
└── adapter/                   # LoRA adapter + training state
    ├── adapter_model.safetensors   # LoRA weights (14.7 MB)
    ├── adapter_config.json         # PEFT config
    ├── optimizer.pt                # AdamW state (resume training)
    ├── scheduler.pt                # LR scheduler state
    ├── scaler.pt                   # Gradient scaler
    ├── trainer_state.json          # Full training metrics
    └── train.log                   # Training log

Citation

@misc{lumia-62m,
  title={Lumia 62M: A Small Reasoning Language Model},
  author={samcheng0},
  year={2026},
  howpublished={\url{https://huggingface.co/samcheng0/lumia-62m}},
}

License

Apache 2.0
