Author

Fine-tuned by Agnivarcas - Adarsh Daksh Pandey on Google Colab Free Tier

Nanbeige4.1-3B -- Reasoning Fine-Tune

A QLoRA fine-tuned version of Nanbeige/Nanbeige4.1-3B, trained on 5 high-complexity synthetic reasoning datasets generated by frontier models (Claude Opus 4.6, Gemini 3 Pro, Claude 4.5 Opus).

The base model is a 3B-parameter model built on Nanbeige4-3B-Base with SFT and RL post-training. This fine-tune retains the native <think>...</think> reasoning format.

Training Results

Metric	Before Fine-Tuning	After Fine-Tuning	Change
Eval Loss	2.5358	1.0964	-56.76%
Perplexity	12.63	2.99	-76.29%
Final Train Loss	--	1.0450	--
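The perplexity row follows directly from the eval-loss row, since perplexity is the exponential of the mean cross-entropy loss. A minimal check, assuming the reported losses are per-token cross-entropy in nats (the helper names are illustrative and not taken from the training notebook):

```python
import math


def perplexity(loss: float) -> float:
    """Perplexity for a mean per-token cross-entropy loss given in nats."""
    return math.exp(loss)


def pct_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


loss_before, loss_after = 2.5358, 1.0964
ppl_before = perplexity(loss_before)
ppl_after = perplexity(loss_after)

print(f"eval loss change:  {pct_change(loss_before, loss_after):.2f}%")
print(f"perplexity:        {ppl_before:.2f} -> {ppl_after:.2f}")
print(f"perplexity change: {pct_change(ppl_before, ppl_after):.2f}%")
```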

Benchmark Results: Base vs Fine-Tuned vs Llama-3.1-8B

All models were benchmarked on a small MMLU subset (5 subjects, 10 questions each) and on MATH (10 problems). Llama-3.1-8B-Instant, served via the Groq API, was included as an external reference point (a model with roughly 2-3x the parameters).

MMLU (50 questions across 5 subjects)

Model	Correct	Accuracy
Nanbeige4.1-3B (base)	14	28.0%
Nanbeige4.1-3B (fine-tuned)	16	32.0%
llama-3.1-8b-instant (Groq)	24	48.0%

Fine-tuning delta: +2 (+4.0 pp)

MMLU Per-Subject Breakdown

Subject	Base	Fine-Tuned	Llama-3.1-8B
abstract_algebra	2/10 (20%)	3/10 (30%)	1/10 (10%)
high_school_mathematics	1/10 (10%)	3/10 (30%)	6/10 (60%)
college_physics	3/10 (30%)	4/10 (40%)	5/10 (50%)
computer_security	3/10 (30%)	3/10 (30%)	7/10 (70%)
formal_logic	5/10 (50%)	3/10 (30%)	5/10 (50%)

Notable: the fine-tuned 3B model outperforms the larger Llama-3.1-8B on abstract algebra (3/10 vs 1/10), a two-question gap on a 10-question sample.
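With only 10 questions per subject, it helps to see how wide the uncertainty around each per-subject score is. The sketch below computes a Wilson score interval for the two abstract algebra results; the choice of the Wilson interval and the 95% level (z = 1.96) are assumptions made for illustration, not part of the original evaluation.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    p = correct / total
    denom = 1.0 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


fine_tuned = wilson_interval(3, 10)  # abstract_algebra, fine-tuned model
llama = wilson_interval(1, 10)       # abstract_algebra, Llama-3.1-8B

# Two intervals overlap when each one starts before the other ends.
overlap = fine_tuned[0] <= llama[1] and llama[0] <= fine_tuned[1]

print(f"fine-tuned: {fine_tuned[0]:.2f} to {fine_tuned[1]:.2f}")
print(f"llama:      {llama[0]:.2f} to {llama[1]:.2f}")
print(f"intervals overlap: {overlap}")
```

Overlapping intervals mean the per-subject ranking could change with a larger question sample.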

MATH (10 problems)

Model	Correct	Accuracy
Nanbeige4.1-3B (base)	1	10.0%
Nanbeige4.1-3B (fine-tuned)	6	60.0%
llama-3.1-8b-instant (Groq)	9	90.0%

Fine-tuning delta: +5 (+50.0 pp)

Combined Overall Accuracy

Benchmark	Questions	Base	Fine-Tuned	Llama-3.1-8B
MMLU	50	14 (28.0%)	16 (32.0%)	24 (48.0%)
MATH	10	1 (10.0%)	6 (60.0%)	9 (90.0%)
COMBINED	60	15 (25.0%)	22 (36.7%)	33 (55.0%)

Overall delta: +7 questions (+11.7 pp)
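The combined row is a pooled (micro) average: correct answers are summed across both benchmarks and divided by the 60 total questions, so MMLU carries five times the weight of MATH. A short sketch that reproduces it from the two benchmark rows (the dictionary layout is illustrative):

```python
results = {
    "MMLU": {"questions": 50, "base": 14, "fine_tuned": 16, "llama": 24},
    "MATH": {"questions": 10, "base": 1, "fine_tuned": 6, "llama": 9},
}

total_questions = sum(row["questions"] for row in results.values())


def combined(model: str):
    """Return (correct, accuracy in percent) pooled over all benchmarks."""
    correct = sum(row[model] for row in results.values())
    return correct, 100 * correct / total_questions


for model in ("base", "fine_tuned", "llama"):
    correct, accuracy = combined(model)
    print(f"{model}: {correct}/{total_questions} ({accuracy:.1f}%)")

delta_questions = combined("fine_tuned")[0] - combined("base")[0]
delta_pp = combined("fine_tuned")[1] - combined("base")[1]
print(f"delta: +{delta_questions} questions (+{delta_pp:.1f} pp)")
```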

Deep Analysis: Why Does the Model Get Questions Wrong?

Of the 45 fine-tuned responses analyzed, 39 (87%) were truncated at the 150-token generation limit: the model opens a <think> block but never reaches the closing tag or a final answer. Of 28 total errors, 24 (86%) occurred on truncated responses rather than on completed reasoning; only the remaining 4 were clear capability gaps.

When the thinking is cut off, the answer extractor picks up a spurious "A" from words such as "analyzing" or "asked" in the incomplete reasoning text, which artificially inflates the rate of wrong "A" answers.
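One way to avoid this failure mode is to refuse to extract an answer from an unfinished <think> block and to accept only a standalone choice letter from the text after the reasoning. The following is a sketch of that idea, not the extractor used for the numbers above; the function name `extract_choice` and the accepted answer phrasings are assumptions.

```python
import re
from typing import Optional

# "answer"/"option" in any case, optional "is" and ":", then a standalone
# uppercase letter A-D that is not the first letter of a longer word.
ANSWER_PATTERN = re.compile(
    r"\b(?i:answer|option)\b\s*(?i:is)?\s*:?\s*\(?([A-D])\)?(?![A-Za-z])"
)
BARE_LETTER = re.compile(r"\(?([A-D])\)?\.?")


def extract_choice(response: str) -> Optional[str]:
    """Return the chosen letter, or None if truncated or unparseable."""
    if "</think>" in response:
        # Only look at the text that follows the finished reasoning block.
        visible = response.rsplit("</think>", 1)[-1]
    elif "<think>" in response:
        # Reasoning was opened but never closed: treat as truncated.
        return None
    else:
        visible = response
    match = ANSWER_PATTERN.search(visible)
    if match:
        return match.group(1)
    bare = BARE_LETTER.fullmatch(visible.strip())
    return bare.group(1) if bare else None


print(extract_choice("<think>Analyzing what is asked"))
print(extract_choice("<think>Compare A and B.</think>The answer is (B)."))
```

Scoring truncated responses as "no answer" rather than as "A" keeps truncation visible as its own error category.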

The fine-tuned model answers 8 of these 45 questions correctly where the larger Llama-3.1-8B does not, including polynomial algebra in Z_8[x] and predicate logic translation.

Retry experiment: Re-running the truncated failures with a 1024-token budget showed that the abstract algebra failures were genuine capability gaps (the answers were still wrong), which suggests the model needs domain-specific training data for advanced algebra.
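The retry procedure can be sketched as a loop that regenerates with a larger budget whenever the closing </think> tag is missing. `generate_fn` is a hypothetical callable standing in for whatever wraps `model.generate`, and `fake_generate` exists only to make the example runnable without a GPU; the 150 and 1024 budgets come from the experiment described above.

```python
from typing import Callable, Sequence, Tuple


def generate_with_retry(
    generate_fn: Callable[[str, int], str],
    prompt: str,
    budgets: Sequence[int] = (150, 1024),
) -> Tuple[str, int]:
    """Try each token budget in order until the reasoning block is closed."""
    if not budgets:
        raise ValueError("budgets must contain at least one value")
    text, used = "", budgets[0]
    for used in budgets:
        text = generate_fn(prompt, used)
        if "</think>" in text:
            break  # reasoning finished within this budget
    return text, used


def fake_generate(prompt: str, max_new_tokens: int) -> str:
    """Stand-in model: only finishes its reasoning with a large budget."""
    if max_new_tokens < 500:
        return "<think>Analyzing the"
    return "<think>Done.</think>Answer: B"


text, used = generate_with_retry(fake_generate, "Which option is correct?")
print(used, text)
```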

Key Finding: The Chain-of-Thought Tradeoff

Fine-tuning on synthetic reasoning traces produced strong gains on MATH (+50 pp) by teaching deeper step-by-step reasoning. However, the model now produces longer think chains than the base model, requiring adequate token budget at inference.

Key observations:

MATH accuracy jumped from 10% to 60% -- largest single improvement
3B fine-tuned model outperforms Llama-3.1-8B on abstract algebra (30% vs 10%)
Computer security showed no change (30% to 30%) -- reasoning data did not cover this domain
Formal logic regressed (50% to 30%) -- possible catastrophic forgetting from SFT overwriting RL behaviors
Model converged by step ~200 of 375; second half was diminishing returns

The base Nanbeige4.1-3B was trained with GRPO reinforcement learning that balances thinking length. SFT fine-tuning shifted that balance toward deeper reasoning, which helps on hard problems but requires adequate generation budget.

Inference Examples

The fine-tuned model produces structured <think> reasoning blocks on hard problems.

Math (competition level): Correctly solved the problem of when n^2 + 12n - 2007 is a perfect square by completing the square, factoring 2043 = 3^2 x 227, and finding all valid factor pairs.

Code: Produced a clean class-based min-heap with correct heapify_up / heapify_down logic and proper index arithmetic.

Science: Correctly explained type I vs type II superconductors, Abrikosov vortices, and flux quantization. Note: the response incorrectly listed copper as a superconductor, demonstrating that synthetic distillation data can reinforce plausible but factually wrong associations.

Base Model Benchmarks (from paper)

Benchmark	Score
LCB-V6 (Pass@1)	76.9
AIME 2026 I	87.40
GPQA	83.8
Arena-Hard-V2	73.2
BFCL-V4	56.50
IMO-Answer-Bench	53.38
Multi-Challenge	52.21
HLE (Text-only)	12.60

Model Details

Property	Value
Base model	Nanbeige/Nanbeige4.1-3B
Parameters	3.9B total
Fine-tuning method	QLoRA (4-bit NF4 + LoRA)
LoRA rank	64
LoRA alpha	128
LoRA target modules	q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
Trainable parameters	113,770,496 / 2,506,000,896 (4.54%)
Epochs	1
Effective batch size	8 (batch=1 x grad_accum=8)
Learning rate	2e-4 (cosine schedule)
Optimizer	Paged AdamW 8-bit
Precision	FP16
Max sequence length	1024 tokens
Hardware	Google Colab Free Tier -- T4 GPU (15GB VRAM)
Training time	139.8 minutes
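For readers who want to reproduce a similar setup, the table can be collected into keyword arguments. The sketch below uses plain dictionaries whose keys follow the argument names of `peft.LoraConfig`, `transformers.BitsAndBytesConfig`, and `transformers.TrainingArguments`, so each can be unpacked into the matching class. It is not the original training script: `task_type` and the FP16 compute dtype are assumptions, and any setting not listed in the table is left at the library default.

```python
# Keys mirror peft.LoraConfig arguments.
lora_kwargs = {
    "r": 64,
    "lora_alpha": 128,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    "task_type": "CAUSAL_LM",  # assumption: not stated in the table
}

# Keys mirror transformers.BitsAndBytesConfig arguments.
quant_kwargs = {
    "load_in_4bit": True,
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_compute_dtype": "float16",  # assumption: matches the FP16 row
}

# Keys mirror transformers.TrainingArguments arguments.
training_kwargs = {
    "num_train_epochs": 1,
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 8,
    "learning_rate": 2e-4,
    "lr_scheduler_type": "cosine",
    "optim": "paged_adamw_8bit",
    "fp16": True,
}

# Standard LoRA scales the adapter update by alpha / r.
lora_scaling = lora_kwargs["lora_alpha"] / lora_kwargs["r"]
effective_batch = (
    training_kwargs["per_device_train_batch_size"]
    * training_kwargs["gradient_accumulation_steps"]
)
print(f"LoRA scaling: {lora_scaling}, effective batch size: {effective_batch}")
```

With these values the adapter update is scaled by a factor of 2, and eight single-sample micro-batches are accumulated per optimizer step.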

Training Datasets (all MIT licensed)

Dataset	Samples Used	Description
TeichAI/claude-4.5-opus-high-reasoning-250x	250	Claude 4.5 Opus coding and reasoning traces
Roman1111111/gemini-3-pro-10000x-hard-high-reasoning	2,000	Gemini 3 Pro extreme-difficulty multi-domain (17.8M tokens)
crownelius/Opus-4.6-Reasoning-3300x	2,000	Claude Opus 4.6 reasoning with thinking traces
crownelius/Opus4.6-No-Reasoning-260x	260	Claude Opus 4.6 direct expert solutions
LEGENDQ/Claude-Opus-4.6-Reasoning-Dataset	2,000	Claude Opus 4.6 multi-domain reasoning

Total: 6,510 samples. A balanced subset of 3,000 training samples was used for the final run.
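The sample counts and the 375-step run length mentioned earlier follow from simple arithmetic on the numbers in this card. A quick check (the short dictionary keys are abbreviations of the dataset names above):

```python
import math

samples_used = {
    "claude-4.5-opus-high-reasoning-250x": 250,
    "gemini-3-pro-10000x-hard-high-reasoning": 2000,
    "Opus-4.6-Reasoning-3300x": 2000,
    "Opus4.6-No-Reasoning-260x": 260,
    "Claude-Opus-4.6-Reasoning-Dataset": 2000,
}

total_samples = sum(samples_used.values())

train_subset = 3000        # balanced subset used for the final run
effective_batch = 1 * 8    # batch size x gradient accumulation steps
epochs = 1

optimizer_steps = math.ceil(train_subset / effective_batch) * epochs
print(f"total samples: {total_samples}, optimizer steps: {optimizer_steps}")
```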

How to Use

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

base_model_id = "Nanbeige/Nanbeige4.1-3B"
adapter_id    = "Agnivarcas/Nanbeige4.1-3B-reasoning-finetuned"

tokenizer = AutoTokenizer.from_pretrained(
    base_model_id, trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(model, adapter_id)
model.eval()

messages = [
    {"role": "user", "content": "Prove that the square root of 2 is irrational."}
]

text   = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

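# Leave enough room for the model to finish its <think> chain (see Limitations).
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=2048,
        temperature=0.6,
        top_p=0.95,
        do_sample=True,
    )

# Decode only the newly generated tokens, not the echoed prompt.
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))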
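Because the output interleaves reasoning and the final answer, it is often convenient to separate the two after decoding. The helper below is a sketch; the name `split_reasoning` is illustrative, and it assumes the <think> and </think> tags survive decoding as literal text (if the tokenizer treats them as special tokens, decode with skip_special_tokens=False first).

```python
from typing import Optional, Tuple


def split_reasoning(text: str) -> Tuple[str, Optional[str]]:
    """Split decoded output into (reasoning, answer).

    The answer is None when the reasoning block was opened but never closed,
    which usually means generation hit max_new_tokens.
    """
    if "</think>" in text:
        head, _, tail = text.partition("</think>")
        return head.replace("<think>", "", 1).strip(), tail.strip()
    if "<think>" in text:
        return text.replace("<think>", "", 1).strip(), None
    # No reasoning tags at all: the whole text is the answer.
    return "", text.strip()


reasoning, answer = split_reasoning(
    "<think>Assume sqrt(2) = p/q in lowest terms.</think>Therefore sqrt(2) is irrational."
)
print("reasoning:", reasoning)
print("answer:", answer)
```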

Limitations

Sequences truncated to 1024 tokens during training
1 epoch on 3,000 samples
Formal logic regressed (50% to 30%) suggesting catastrophic forgetting
Science responses may contain factual errors (copper listed as a superconductor)
Requires 500+ max_new_tokens for proper think chain completion
Inherits biases from base model and training datasets

Citation

If you use this fine-tuned model, please also cite the original Nanbeige4.1-3B paper:

@misc{yang2026nanbeige413bsmallgeneralmodel,
      title={Nanbeige4.1-3B: A Small General Model that Reasons, Aligns, and Acts},
      author={Chen Yang and Guangyue Peng and Jiaying Zhu and Ran Le and Ruixiang Feng
              and Tao Zhang and Xiyun Xu and Yang Song and Yiming Jia and Yuntao Wen
              and Yunzhi Xu and Zekai Wang and Zhenwei An and Zhicong Sun and Zongchao Chen},
      year={2026},
      eprint={2602.13367},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2602.13367}
}

Base model repository: https://huggingface.co/Nanbeige/Nanbeige4.1-3B
