Author

Fine-tuned by Agnivarcas - Adarsh Daksh Pandey on Google Colab Free Tier

Nanbeige4.1-3B -- Reasoning Fine-Tune

A QLoRA fine-tuned version of Nanbeige/Nanbeige4.1-3B, trained on 5 high-complexity synthetic reasoning datasets generated by frontier models (Claude Opus 4.6, Gemini 3 Pro, Claude 4.5 Opus).

The base model is a 3B parameter model built on Nanbeige4-3B-Base with SFT + RL post-training. This fine-tune retains the native <think>...</think> reasoning format.


Training Results

| Metric | Before Fine-Tuning | After Fine-Tuning | Change |
| --- | --- | --- | --- |
| Eval Loss | 2.5358 | 1.0964 | -56.76% |
| Perplexity | 12.63 | 2.99 | -76.29% |
| Final Train Loss | -- | 1.0450 | -- |
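
The perplexity row follows from the eval-loss row if the reported loss is the mean per-token cross-entropy in nats. That is the usual convention, but it is an assumption here. A minimal sketch that reproduces the table's figures (function and variable names are illustrative):

```python
import math

def perplexity(loss: float) -> float:
    """Perplexity for a mean per-token cross-entropy loss given in nats."""
    return math.exp(loss)

def percent_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0

# Eval losses reported in the Training Results table.
loss_before, loss_after = 2.5358, 1.0964

ppl_before = perplexity(loss_before)
ppl_after = perplexity(loss_after)

print(f"perplexity: {ppl_before:.2f} -> {ppl_after:.2f}")
print(f"loss change: {percent_change(loss_before, loss_after):.2f}%")
print(f"perplexity change: {percent_change(ppl_before, ppl_after):.2f}%")
```

Because perplexity is exponential in the loss, a 57% drop in loss shows up as a larger 76% drop in perplexity.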

Benchmark Results: Base vs Fine-Tuned vs Llama-3.1-8B

Benchmarked on a small subset of MMLU (5 subjects, 10 questions each, 50 total) and MATH (10 problems). Llama-3.1-8B-Instant, a model nearly 3x larger, was run via the Groq API as an external reference point.

MMLU (50 questions across 5 subjects)

| Model | Correct | Accuracy |
| --- | --- | --- |
| Nanbeige4.1-3B (base) | 14 | 28.0% |
| Nanbeige4.1-3B (fine-tuned) | 16 | 32.0% |
| llama-3.1-8b-instant (Groq) | 24 | 48.0% |

Fine-tuning delta: +2 (+4.0 pp)

MMLU Per-Subject Breakdown

| Subject | Base | Fine-Tuned | Llama-3.1-8B |
| --- | --- | --- | --- |
| abstract_algebra | 2/10 (20%) | 3/10 (30%) | 1/10 (10%) |
| high_school_mathematics | 1/10 (10%) | 3/10 (30%) | 6/10 (60%) |
| college_physics | 3/10 (30%) | 4/10 (40%) | 5/10 (50%) |
| computer_security | 3/10 (30%) | 3/10 (30%) | 7/10 (70%) |
| formal_logic | 5/10 (50%) | 3/10 (30%) | 5/10 (50%) |

Notable: The fine-tuned 3B model outperforms Llama-3.1-8B (nearly 3x larger) on abstract algebra (3/10 vs 1/10).

MATH (10 problems)

| Model | Correct | Accuracy |
| --- | --- | --- |
| Nanbeige4.1-3B (base) | 1 | 10.0% |
| Nanbeige4.1-3B (fine-tuned) | 6 | 60.0% |
| llama-3.1-8b-instant (Groq) | 9 | 90.0% |

Fine-tuning delta: +5 (+50.0 pp)

Combined Overall Accuracy

| Benchmark | Questions | Base | Fine-Tuned | Llama-3.1-8B |
| --- | --- | --- | --- | --- |
| MMLU | 50 | 14 (28.0%) | 16 (32.0%) | 24 (48.0%) |
| MATH | 10 | 1 (10.0%) | 6 (60.0%) | 9 (90.0%) |
| COMBINED | 60 | 15 (25.0%) | 22 (36.7%) | 33 (55.0%) |

Overall delta: +7 questions (+11.7 pp)
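
The combined row pools all 60 questions, so every question carries equal weight and MMLU (50 questions) dominates MATH (10). A short sketch of that tally, using only the counts reported above (the dictionary layout and helper name are illustrative):

```python
# Correct-answer counts copied from the benchmark tables: (correct, total).
results = {
    "base": {"MMLU": (14, 50), "MATH": (1, 10)},
    "fine-tuned": {"MMLU": (16, 50), "MATH": (6, 10)},
    "llama-3.1-8b-instant": {"MMLU": (24, 50), "MATH": (9, 10)},
}

def combined_accuracy(per_benchmark: dict) -> float:
    """Pooled accuracy in percent: total correct over total questions."""
    correct = sum(c for c, _ in per_benchmark.values())
    total = sum(t for _, t in per_benchmark.values())
    return 100.0 * correct / total

combined = {name: combined_accuracy(r) for name, r in results.items()}
delta_pp = combined["fine-tuned"] - combined["base"]

for name, acc in combined.items():
    print(f"{name}: {acc:.1f}%")
print(f"fine-tuning delta: {delta_pp:+.1f} pp")
```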


Deep Analysis: Why Does the Model Get Questions Wrong?

87% of fine-tuned model responses (39/45) were truncated at the 150-token generation limit: the model opens a <think> block but never reaches the closing tag or a final answer. Of 28 total errors, 24 (86%) were caused by token truncation rather than incorrect reasoning; only 4 were genuine capability gaps.

When thinking is cut off, the answer extraction picks up the letter "A" from words such as "analyzing" or "asked" in the incomplete thinking text, which artificially inflates the rate of wrong "A" answers.
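
The evaluation script's extraction code is not shown here, so the sketch below is a hypothetical reconstruction of this failure mode rather than the actual implementation. `naive_extract` takes the first A-D letter anywhere in the response. `strict_extract` is one possible fix: it refuses to answer when the <think> block was never closed and otherwise looks only at the text after it. Both function names are illustrative.

```python
import re
from typing import Optional

def naive_extract(text: str) -> Optional[str]:
    """Hypothetical failure mode: take the first A-D letter anywhere in the text."""
    match = re.search(r"[ABCD]", text.upper())
    return match.group(0) if match else None

def strict_extract(text: str) -> Optional[str]:
    """Only accept a standalone A-D letter that appears after a closed think block."""
    if "<think>" in text and "</think>" not in text:
        return None  # thinking was cut off: report "no answer", not a guess
    answer_part = text.split("</think>")[-1]
    match = re.search(r"\b([ABCD])\b", answer_part)
    return match.group(1) if match else None

truncated = "<think>Let me start by analyzing what is being asked"
complete = "<think>Option B fails the closure test.</think>The answer is C."

print(naive_extract(truncated), strict_extract(truncated))  # A None
print(naive_extract(complete), strict_extract(complete))    # B C
```

Returning `None` for truncated responses lets a scorer count them as unanswered, which keeps truncation separate from wrong answers.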

The fine-tuned model beats Llama-3.1-8B (3x larger) on 8 out of 45 questions, including polynomial algebra in Z_8[x] and predicate logic translation.

Retry experiment: Re-running truncated failures with 1024 tokens showed abstract algebra failures were genuine capability gaps (still wrong), confirming the model needs domain-specific training data for advanced algebra.
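
A retry of this kind can be sketched as a loop that raises the token budget until the think block closes. The `generate(prompt, max_new_tokens)` callable is a hypothetical wrapper around whatever inference call is in use. Only the 150 and 1024 budgets come from the experiments above; the other values are illustrative.

```python
from typing import Callable, Optional, Sequence, Tuple

def generate_until_complete(
    generate: Callable[[str, int], str],
    prompt: str,
    budgets: Sequence[int] = (150, 512, 1024, 2048),
) -> Tuple[Optional[str], int]:
    """Retry with larger token budgets until the think block is closed.

    Returns (text, budget_used); text is None if every budget still
    produced a response with an unclosed <think> block.
    """
    for budget in budgets:
        text = generate(prompt, budget)
        if "<think>" not in text or "</think>" in text:
            return text, budget
    return None, budgets[-1]

def fake_generate(prompt: str, max_new_tokens: int) -> str:
    # Stand-in for a model: finishes thinking only with at least 512 tokens.
    if max_new_tokens < 512:
        return "<think>Let me work through this"
    return "<think>Short derivation.</think>The answer is B."

text, used = generate_until_complete(fake_generate, "Which option is correct?")
print(used, text)
```

A response that is still wrong once its think block closes points to a capability gap rather than a budget problem.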


Key Finding: The Chain-of-Thought Tradeoff

Fine-tuning on synthetic reasoning traces produced strong gains on MATH (+50 pp) by teaching deeper step-by-step reasoning. However, the model now produces longer think chains than the base model, requiring adequate token budget at inference.

Key observations:

  • MATH accuracy jumped from 10% to 60%, the largest single improvement
  • The fine-tuned 3B model outperforms Llama-3.1-8B on abstract algebra (30% vs 10%)
  • Computer security showed no change (30% to 30%); the reasoning data did not cover this domain
  • Formal logic regressed (50% to 30%), possibly due to catastrophic forgetting as SFT overwrote RL-trained behaviors
  • The model converged by step ~200 of 375; the second half of training yielded diminishing returns

The base Nanbeige4.1-3B was trained with GRPO reinforcement learning, which balances thinking length. SFT shifted that balance toward deeper reasoning, which helps on hard problems but requires an adequate generation budget.


Inference Examples

The fine-tuned model produces structured <think> reasoning blocks on hard problems.

Math (competition level): Correctly solved the n^2 + 12n - 2007 perfect-square problem by completing the square, factoring 2043 = 3^2 x 227, and finding all valid factor pairs.
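
The full problem statement is not reproduced here, so the sketch below assumes it asks for the positive integers n for which n^2 + 12n - 2007 is a perfect square m^2. Completing the square gives (n + 6)^2 - m^2 = 2043, so (n + 6 - m)(n + 6 + m) = 2043, and each factor pair of 2043 yields one candidate. This is an independent check of the approach, not the model's output:

```python
import math
from typing import List

def perfect_square_solutions() -> List[int]:
    """Positive n with n^2 + 12n - 2007 = m^2, via (n + 6)^2 - m^2 = 2043."""
    target = 2043  # 2007 + 36, from completing the square
    solutions = []
    for a in range(1, math.isqrt(target) + 1):
        if target % a:
            continue
        b = target // a  # a * b = 2043 with a <= b
        # a = (n + 6) - m and b = (n + 6) + m, so a and b must share parity.
        if (a + b) % 2:
            continue
        n = (a + b) // 2 - 6
        m = (b - a) // 2
        if n > 0:
            assert n * n + 12 * n - 2007 == m * m
            solutions.append(n)
    return sorted(solutions)

print(perfect_square_solutions())  # [112, 336, 1016]
```

The three factor pairs (1, 2043), (3, 681), and (9, 227) give n = 1016, 336, and 112 respectively.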

Code: Produced a clean class-based min-heap with correct heapify_up / heapify_down logic and proper index arithmetic.

Science: Correctly explained type I vs type II superconductors, Abrikosov vortices, and flux quantization. Note: the response incorrectly listed copper as a superconductor, demonstrating that synthetic distillation data can reinforce plausible but factually wrong associations.


Base Model Benchmarks (from paper)

| Benchmark | Score |
| --- | --- |
| LCB-V6 (Pass@1) | 76.9 |
| AIME 2026 I | 87.40 |
| GPQA | 83.8 |
| Arena-Hard-V2 | 73.2 |
| BFCL-V4 | 56.50 |
| IMO-Answer-Bench | 53.38 |
| Multi-Challenge | 52.21 |
| HLE (Text-only) | 12.60 |

Model Details

| Property | Value |
| --- | --- |
| Base model | Nanbeige/Nanbeige4.1-3B |
| Parameters | 3.9B total |
| Fine-tuning method | QLoRA (4-bit NF4 + LoRA) |
| LoRA rank | 64 |
| LoRA alpha | 128 |
| LoRA target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Trainable parameters | 113,770,496 / 2,506,000,896 (4.54%) |
| Epochs | 1 |
| Effective batch size | 8 (batch=1 x grad_accum=8) |
| Learning rate | 2e-4 (cosine schedule) |
| Optimizer | Paged AdamW 8-bit |
| Precision | FP16 |
| Max sequence length | 1024 tokens |
| Hardware | Google Colab Free Tier -- T4 GPU (15 GB VRAM) |
| Training time | 139.8 minutes |

Training Datasets (all MIT licensed)

| Dataset | Samples Used | Description |
| --- | --- | --- |
| TeichAI/claude-4.5-opus-high-reasoning-250x | 250 | Claude 4.5 Opus coding and reasoning traces |
| Roman1111111/gemini-3-pro-10000x-hard-high-reasoning | 2,000 | Gemini 3 Pro extreme-difficulty multi-domain (17.8M tokens) |
| crownelius/Opus-4.6-Reasoning-3300x | 2,000 | Claude Opus 4.6 reasoning with thinking traces |
| crownelius/Opus4.6-No-Reasoning-260x | 260 | Claude Opus 4.6 direct expert solutions |
| LEGENDQ/Claude-Opus-4.6-Reasoning-Dataset | 2,000 | Claude Opus 4.6 multi-domain reasoning |

Total: 6,510 samples. A balanced subset of 3,000 training samples was used for the final run.
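
The sample counts and the 375-step run length reported earlier are consistent with each other. The sketch below assumes a single GPU and one optimizer step per accumulated batch; the variable names are illustrative.

```python
import math

# Sample counts per dataset, copied from the Training Datasets table.
samples_used = {
    "TeichAI/claude-4.5-opus-high-reasoning-250x": 250,
    "Roman1111111/gemini-3-pro-10000x-hard-high-reasoning": 2000,
    "crownelius/Opus-4.6-Reasoning-3300x": 2000,
    "crownelius/Opus4.6-No-Reasoning-260x": 260,
    "LEGENDQ/Claude-Opus-4.6-Reasoning-Dataset": 2000,
}
total_samples = sum(samples_used.values())

# Final-run settings from the Model Details table.
train_samples = 3000
per_device_batch = 1
grad_accum = 8
epochs = 1

effective_batch = per_device_batch * grad_accum
optimizer_steps = math.ceil(train_samples / effective_batch) * epochs

print(f"pooled samples: {total_samples}")
print(f"effective batch size: {effective_batch}")
print(f"optimizer steps: {optimizer_steps}")
```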


How to Use

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

base_model_id = "Nanbeige/Nanbeige4.1-3B"
adapter_id    = "Agnivarcas/Nanbeige4.1-3B-reasoning-finetuned"

# Load the base model's tokenizer and weights in FP16.
tokenizer = AutoTokenizer.from_pretrained(
    base_model_id, trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)
# Attach the LoRA adapter on top of the base weights.
model = PeftModel.from_pretrained(model, adapter_id)
model.eval()

messages = [
    {"role": "user", "content": "Prove that the square root of 2 is irrational."}
]

# Render the chat template to text, then tokenize it.
text   = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=2048,  # leave room for the full <think> chain
        temperature=0.6,
        top_p=0.95,
        do_sample=True,
    )

# The decoded string contains the prompt followed by the generated text.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
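
To work with the reasoning and the final answer separately, the decoded string can be split on the think tags. The helper below is a minimal sketch that assumes the <think> and </think> tags survive decoding as plain text. Whether they do depends on the tokenizer's special-token settings, so check a raw output first. `split_reasoning` is an illustrative name, not part of any library.

```python
from typing import Optional, Tuple

def split_reasoning(decoded: str) -> Tuple[Optional[str], Optional[str]]:
    """Split decoded output into (thinking, answer).

    Returns (thinking, None) when the think block was never closed, i.e. the
    generation was cut off before a final answer was produced.
    """
    open_tag, close_tag = "<think>", "</think>"
    start = decoded.find(open_tag)
    if start == -1:
        return None, decoded.strip()  # no reasoning block at all
    body = decoded[start + len(open_tag):]
    end = body.find(close_tag)
    if end == -1:
        return body.strip(), None  # truncated mid-thought
    return body[:end].strip(), body[end + len(close_tag):].strip()

thinking, answer = split_reasoning("<think>2 + 2 is 4.</think>Final answer: 4")
print(answer)
```

A `None` answer signals that max_new_tokens was too small for the think chain to finish.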

Limitations

  • Sequences truncated to 1024 tokens during training
  • 1 epoch on 3,000 samples
  • Formal logic regressed (50% to 30%) suggesting catastrophic forgetting
  • Science responses may contain factual errors (copper listed as a superconductor)
  • Requires 500+ max_new_tokens for proper think chain completion
  • Inherits biases from base model and training datasets


Citation

If you use this fine-tuned model, please also cite the original Nanbeige4.1-3B paper:

@misc{yang2026nanbeige413bsmallgeneralmodel,
      title={Nanbeige4.1-3B: A Small General Model that Reasons, Aligns, and Acts},
      author={Chen Yang and Guangyue Peng and Jiaying Zhu and Ran Le and Ruixiang Feng
              and Tao Zhang and Xiyun Xu and Yang Song and Yiming Jia and Yuntao Wen
              and Yunzhi Xu and Zekai Wang and Zhenwei An and Zhicong Sun and Zongchao Chen},
      year={2026},
      eprint={2602.13367},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2602.13367}
}

Base model repository: https://huggingface.co/Nanbeige/Nanbeige4.1-3B
