Author
Fine-tuned by Agnivarcas - Adarsh Daksh Pandey on Google Colab Free Tier
Nanbeige4.1-3B -- Reasoning Fine-Tune
A QLoRA fine-tuned version of Nanbeige/Nanbeige4.1-3B, trained on 5 high-complexity synthetic reasoning datasets generated by frontier models (Claude Opus 4.6, Gemini 3 Pro, Claude 4.5 Opus).
The base model is a 3B parameter model built on Nanbeige4-3B-Base with SFT + RL
post-training. This fine-tune retains the native <think>...</think> reasoning format.
Training Results
| Metric | Before Fine-Tuning | After Fine-Tuning | Change |
|---|---|---|---|
| Eval Loss | 2.5358 | 1.0964 | -56.76% |
| Perplexity | 12.63 | 2.99 | -76.29% |
| Final Train Loss | -- | 1.0450 | -- |
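The perplexity row follows from the eval loss row if the loss is a mean per-token cross-entropy in nats. That is an assumption here, since the card does not state the loss definition. Under it, perplexity is `exp(loss)`. A minimal sketch that reproduces the table's figures:

```python
import math

def perplexity(loss: float) -> float:
    """Perplexity for a mean per-token cross-entropy loss in nats."""
    return math.exp(loss)

def pct_change(before: float, after: float) -> float:
    """Relative change in percent (negative means a reduction)."""
    return (after - before) / before * 100.0

loss_before, loss_after = 2.5358, 1.0964  # eval losses from the table
ppl_before = perplexity(loss_before)
ppl_after = perplexity(loss_after)

print(f"perplexity: {ppl_before:.2f} -> {ppl_after:.2f}")
print(f"loss change: {pct_change(loss_before, loss_after):.2f}%")
print(f"perplexity change: {pct_change(ppl_before, ppl_after):.2f}%")
```

Because perplexity is exponential in the loss, a 57% drop in loss shows up as a larger relative drop in perplexity.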
Benchmark Results: Base vs Fine-Tuned vs Llama-3.1-8B
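The base and fine-tuned models were benchmarked on MMLU (5 subjects, 10 questions each, 50 total) and MATH (10 problems). Llama-3.1-8B-Instant, queried through the Groq API, was included as an external reference point; at 8B parameters it is roughly 2x the size of the 3.9B-parameter base model.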
MMLU (50 questions across 5 subjects)
| Model | Correct | Accuracy |
|---|---|---|
| Nanbeige4.1-3B (base) | 14 | 28.0% |
| Nanbeige4.1-3B (fine-tuned) | 16 | 32.0% |
| llama-3.1-8b-instant (Groq) | 24 | 48.0% |
Fine-tuning delta: +2 (+4.0 pp)
MMLU Per-Subject Breakdown
| Subject | Base | Fine-Tuned | Llama-3.1-8B |
|---|---|---|---|
| abstract_algebra | 2/10 (20%) | 3/10 (30%) | 1/10 (10%) |
| high_school_mathematics | 1/10 (10%) | 3/10 (30%) | 6/10 (60%) |
| college_physics | 3/10 (30%) | 4/10 (40%) | 5/10 (50%) |
| computer_security | 3/10 (30%) | 3/10 (30%) | 7/10 (70%) |
| formal_logic | 5/10 (50%) | 3/10 (30%) | 5/10 (50%) |
Notable: the fine-tuned 3B model outperforms Llama-3.1-8B (roughly 2x larger) on abstract algebra (3/10 vs 1/10 correct, i.e. 30% vs 10%).
MATH (10 problems)
| Model | Correct | Accuracy |
|---|---|---|
| Nanbeige4.1-3B (base) | 1 | 10.0% |
| Nanbeige4.1-3B (fine-tuned) | 6 | 60.0% |
| llama-3.1-8b-instant (Groq) | 9 | 90.0% |
Fine-tuning delta: +5 (+50.0 pp)
Combined Overall Accuracy
| Benchmark | Questions | Base | Fine-Tuned | Llama-3.1-8B |
|---|---|---|---|---|
| MMLU | 50 | 14 (28.0%) | 16 (32.0%) | 24 (48.0%) |
| MATH | 10 | 1 (10.0%) | 6 (60.0%) | 9 (90.0%) |
| COMBINED | 60 | 15 (25.0%) | 22 (36.7%) | 33 (55.0%) |
Overall delta: +7 questions (+11.7 pp)
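The combined row pools all 60 questions rather than averaging the two benchmark percentages. The sketch below recomputes it from the per-benchmark counts in the tables; the dictionary layout and function name are illustrative only.

```python
# (correct, total) per benchmark, copied from the tables above
results = {
    "base":       {"MMLU": (14, 50), "MATH": (1, 10)},
    "fine_tuned": {"MMLU": (16, 50), "MATH": (6, 10)},
    "llama_8b":   {"MMLU": (24, 50), "MATH": (9, 10)},
}

def combined_accuracy(per_benchmark):
    """Pool correct answers and question counts across benchmarks."""
    correct = sum(c for c, _ in per_benchmark.values())
    total = sum(t for _, t in per_benchmark.values())
    return correct, total, 100.0 * correct / total

summary = {name: combined_accuracy(r) for name, r in results.items()}
for name, (correct, total, acc) in summary.items():
    print(f"{name}: {correct}/{total} = {acc:.1f}%")

delta_pp = summary["fine_tuned"][2] - summary["base"][2]
print(f"fine-tuning delta: {delta_pp:+.1f} pp")
```

Since 50 of the 60 questions come from MMLU, the pooled figure is dominated by the MMLU result; the +50 pp MATH gain contributes only 5 of the 7 additional correct answers.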
Deep Analysis: Why Does the Model Get Questions Wrong?
39 of 45 fine-tuned model responses (87%) were truncated at the 150-token generation limit. In these cases the model opens a <think> block but never reaches the closing tag or a final answer. Of 28 total errors, 24 (86%) occurred on truncated responses rather than on completed but incorrect reasoning. Only 4 failures were genuine capability gaps.
When thinking is cut off, the answer-extraction step picks up the letter "A" from words like "analyzing" or "asked" in the incomplete thinking text, which artificially inflates the rate of wrong "A" answers.
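One way to avoid this extraction artifact is to treat an unclosed <think> block as "no answer" and to search for the choice letter only in the text after the reasoning. The sketch below is not the extraction code used for the benchmark, which the card does not show. It assumes the prompt asks the model to finish with a phrase such as "Answer: B", and the function name `extract_choice` is hypothetical.

```python
import re
from typing import Optional

# Matches "Answer: B", "answer is (C)", etc. Only the word "answer" is
# case-insensitive; the choice letter must be an uppercase A-D that is
# not the first letter of a longer word.
_ANSWER = re.compile(r"(?i:answer)\s*(?:is|:)?\s*\(?([ABCD])\)?(?![A-Za-z])")

def extract_choice(response: str) -> Optional[str]:
    """Return the final multiple-choice letter, or None if it is missing."""
    if "<think>" in response and "</think>" not in response:
        return None  # reasoning was truncated, so there is no final answer
    # Ignore the reasoning block and look only at what follows it.
    answer_text = response.split("</think>")[-1]
    matches = _ANSWER.findall(answer_text)
    return matches[-1] if matches else None

print(extract_choice("<think>I am analyzing what is asked"))      # None
print(extract_choice("<think>Check options.</think>\nAnswer: C"))  # C
```

Returning `None` for truncated responses lets the evaluation report truncation separately from wrong answers instead of counting a spurious "A".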
The fine-tuned model beats Llama-3.1-8B (roughly 2x larger) on 8 of the 45 questions, including polynomial algebra in Z_8[x] and predicate logic translation.
Retry experiment: Re-running the truncated failures with a 1024-token budget showed that the abstract algebra failures were genuine capability gaps (the answers were still wrong), which suggests the model needs domain-specific training data for advanced algebra.
Key Finding: The Chain-of-Thought Tradeoff
Fine-tuning on synthetic reasoning traces produced strong gains on MATH (+50 pp) by teaching deeper step-by-step reasoning. However, the model now produces longer think chains than the base model, so it needs a larger token budget at inference.
Key observations:
- MATH accuracy jumped from 10% to 60% -- largest single improvement
- 3B fine-tuned model outperforms Llama-3.1-8B on abstract algebra (30% vs 10%)
- Computer security showed no change (30% to 30%) -- reasoning data did not cover this domain
- Formal logic regressed (50% to 30%) -- possible catastrophic forgetting from SFT overwriting RL behaviors
- Model converged by step ~200 of 375; second half was diminishing returns
The base Nanbeige4.1-3B was trained with GRPO reinforcement learning that balances thinking length. SFT fine-tuning shifted that balance toward deeper reasoning, which helps on hard problems but requires a larger generation budget.
Inference Examples
The fine-tuned model produces structured <think> reasoning blocks on hard problems.
Math (competition level): Correctly solved the n^2 + 12n - 2007 perfect-square problem by completing the square, factoring 2043 = 3^2 x 227, and finding all valid factor pairs.
Code: Produced a clean class-based min-heap with correct heapify_up / heapify_down logic and proper index arithmetic.
Science: Correctly explained type I vs type II superconductors, Abrikosov vortices, and flux quantization. Note: it incorrectly listed copper as a superconductor, which shows that synthetic distillation data can reinforce plausible but factually wrong associations.
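The approach described for the math example can be checked independently. Completing the square turns n^2 + 12n - 2007 = m^2 into (n + 6)^2 - m^2 = 2043, so (n + 6 - m)(n + 6 + m) = 2043, and each factor pair of 2043 with equal parity gives one candidate n. The sketch below is a verification written for this card, not the model's output, and the function name is hypothetical.

```python
def perfect_square_solutions():
    """Positive n for which n^2 + 12n - 2007 is a perfect square m^2."""
    constant = 2043                    # (n + 6)^2 - m^2 = 2007 + 36
    solutions = []
    for a in range(1, int(constant ** 0.5) + 1):
        if constant % a:
            continue
        b = constant // a              # a * b = 2043 with a <= b
        if (a + b) % 2:                # a and b must share parity
            continue
        n = (a + b) // 2 - 6           # n + 6 = (a + b) / 2
        m = (b - a) // 2
        if n > 0:
            solutions.append((n, m))
    return sorted(solutions)

for n, m in perfect_square_solutions():
    assert n * n + 12 * n - 2007 == m * m
    print(n, m)
# prints: 112 109, then 336 339, then 1016 1021 (one pair per line)
```

The three factor pairs (9, 227), (3, 681), and (1, 2043) match the factorization 2043 = 3^2 x 227 quoted above.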
Base Model Benchmarks (from paper)
| Benchmark | Score |
|---|---|
| LCB-V6 (Pass@1) | 76.9 |
| AIME 2026 I | 87.40 |
| GPQA | 83.8 |
| Arena-Hard-V2 | 73.2 |
| BFCL-V4 | 56.50 |
| IMO-Answer-Bench | 53.38 |
| Multi-Challenge | 52.21 |
| HLE (Text-only) | 12.60 |
Model Details
| Property | Value |
|---|---|
| Base model | Nanbeige/Nanbeige4.1-3B |
| Parameters | 3.9B total |
| Fine-tuning method | QLoRA (4-bit NF4 + LoRA) |
| LoRA rank | 64 |
| LoRA alpha | 128 |
| LoRA target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Trainable parameters | 113,770,496 / 2,506,000,896 (4.54%) |
| Epochs | 1 |
| Effective batch size | 8 (batch=1 x grad_accum=8) |
| Learning rate | 2e-4 (cosine schedule) |
| Optimizer | Paged AdamW 8-bit |
| Precision | FP16 |
| Max sequence length | 1024 tokens |
| Hardware | Google Colab Free Tier -- T4 GPU (15GB VRAM) |
| Training time | 139.8 minutes |
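For readers who want to reproduce a similar setup, the table maps onto `peft` and `transformers` configuration objects roughly as follows. This is a sketch reconstructed from the table, not the author's training script. LoRA dropout, bias handling, and double quantization are not stated in the card and are left at library defaults here.

```python
import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig

# 4-bit NF4 quantization for the frozen base weights (QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,  # table lists FP16 precision
)

# LoRA adapter settings from the table; dropout and bias are not stated
# in the card, so they stay at the peft defaults (an assumption).
lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)

# Keyword names follow transformers.TrainingArguments.
training_kwargs = dict(
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,   # effective batch size 8
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    optim="paged_adamw_8bit",
    fp16=True,
)

MAX_SEQ_LEN = 1024  # how this is passed depends on the trainer used
```

`bnb_config` would be passed as `quantization_config` to `AutoModelForCausalLM.from_pretrained`, and `lora_config` to `peft.get_peft_model`.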
Training Datasets (all MIT licensed)
| Dataset | Samples Used | Description |
|---|---|---|
| TeichAI/claude-4.5-opus-high-reasoning-250x | 250 | Claude 4.5 Opus coding and reasoning traces |
| Roman1111111/gemini-3-pro-10000x-hard-high-reasoning | 2,000 | Gemini 3 Pro extreme-difficulty multi-domain (17.8M tokens) |
| crownelius/Opus-4.6-Reasoning-3300x | 2,000 | Claude Opus 4.6 reasoning with thinking traces |
| crownelius/Opus4.6-No-Reasoning-260x | 260 | Claude Opus 4.6 direct expert solutions |
| LEGENDQ/Claude-Opus-4.6-Reasoning-Dataset | 2,000 | Claude Opus 4.6 multi-domain reasoning |
Total: 6,510 samples. A balanced subset of 3,000 training samples was used for the final run.
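The sample counts and batch settings in this card determine the number of optimizer steps. A short arithmetic check, assuming a single GPU and no dropped partial batch:

```python
import math

samples_per_dataset = [250, 2_000, 2_000, 260, 2_000]  # "Samples Used" column
total_available = sum(samples_per_dataset)

train_samples = 3_000      # balanced subset used for the final run
per_device_batch = 1
grad_accum = 8
epochs = 1

effective_batch = per_device_batch * grad_accum
total_steps = math.ceil(train_samples / effective_batch) * epochs

print(f"available samples: {total_available}")
print(f"optimizer steps: {total_steps}")
print(f"samples seen by step 200: {200 * effective_batch}")
```

This gives 375 steps, which matches the "step ~200 of 375" convergence note; by step 200 the model had seen about 1,600 of the 3,000 samples.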
How to Use
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

base_model_id = "Nanbeige/Nanbeige4.1-3B"
adapter_id = "Agnivarcas/Nanbeige4.1-3B-reasoning-finetuned"

tokenizer = AutoTokenizer.from_pretrained(
    base_model_id, trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(model, adapter_id)
model.eval()

messages = [
    {"role": "user", "content": "Prove that the square root of 2 is irrational."}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=2048,
        temperature=0.6,
        top_p=0.95,
        do_sample=True,
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Limitations
- Training sequences were truncated to 1024 tokens
- Trained for only 1 epoch on 3,000 samples
- Formal logic regressed (50% to 30%), suggesting catastrophic forgetting
- Science responses may contain factual errors (for example, copper listed as a superconductor)
- Requires max_new_tokens of 500 or more for the think chain to complete
- Inherits biases from base model and training datasets
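Because an unclosed <think> block signals a truncated response, one practical workaround for the token-budget limitation is to retry with a larger budget until the closing tag appears. The sketch below is illustrative. `generate(prompt, max_new_tokens)` is a hypothetical wrapper around `model.generate` that returns only the newly generated text, the budget values are arbitrary, and it assumes the <think> tags appear verbatim in the decoded output.

```python
from typing import Callable, Iterable, Tuple

def generate_until_complete(
    generate: Callable[[str, int], str],
    prompt: str,
    budgets: Iterable[int] = (512, 1024, 2048),
) -> Tuple[str, bool]:
    """Retry with larger token budgets until the <think> block is closed.

    Returns the last generated text and whether its reasoning finished.
    """
    text = ""
    for budget in budgets:
        text = generate(prompt, budget)
        if "<think>" not in text or "</think>" in text:
            return text, True   # reasoning finished (or was never opened)
    return text, False          # still truncated at the largest budget
```

Each retry regenerates from scratch, so this trades extra compute for fewer truncated answers; starting directly at a large budget is simpler when latency is not a concern.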
Author
Fine-tuned by Agnivarcas on Google Colab Free Tier.
Citation
If you use this fine-tuned model, please also cite the original Nanbeige4.1-3B paper:
```bibtex
@misc{yang2026nanbeige413bsmallgeneralmodel,
  title={Nanbeige4.1-3B: A Small General Model that Reasons, Aligns, and Acts},
  author={Chen Yang and Guangyue Peng and Jiaying Zhu and Ran Le and Ruixiang Feng
          and Tao Zhang and Xiyun Xu and Yang Song and Yiming Jia and Yuntao Wen
          and Yunzhi Xu and Zekai Wang and Zhenwei An and Zhicong Sun and Zongchao Chen},
  year={2026},
  eprint={2602.13367},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2602.13367}
}
```
Base model repository: https://huggingface.co/Nanbeige/Nanbeige4.1-3B