Llama 3.2 3B — Claude Reasoning Distill

This model was a second attempt at reasoning distillation, with several fixes from the 1B run — but the core approach was still wrong.

1. Same root problem: SFT copies style, not capability - GRPO is the right approach

2. Dataset truncation caused the stopping problem

The training dataset (angrygiraffe/claude-opus-4.6-4.7-reasoning-8.7k) averages ~1,954 tokens per example, and the p90 assistant response alone reaches ~1,760 tokens. With training at seq_len=2048, a significant portion of examples were silently truncated, which cut off the <|eot_id|> end-of-turn token before it could be written. The model learned from many examples that responses don't need to end. This is a dataset fit problem, not a model problem.

3. Wrong EOS token at inference

Llama 3 has two EOS-like tokens. tokenizer.eos_token_id returns 128001 (<|end_of_text|>), but the model generates 128009 (<|eot_id|>) to end a turn. The default model.generate() call never passes 128009 as a stop token, so generation runs until max_new_tokens. This compounds the truncation issue above.
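The truncation effect described in item 2 can be measured before training. The following is a minimal sketch, assuming each example has already been tokenized into a list of ids with the chat template applied; `truncation_report` is a hypothetical helper, not part of the original training script.

```python
from typing import Sequence

EOT_ID = 128009  # <|eot_id|> in the Llama 3 vocabulary, as stated above


def truncation_report(examples: Sequence[Sequence[int]], max_len: int = 2048) -> dict:
    """Count examples whose final end-of-turn token is lost when cut to max_len."""
    total = len(examples)
    too_long = 0
    lost_eot = 0
    for ids in examples:
        if len(ids) > max_len:
            too_long += 1
        kept = ids[:max_len]
        # An example only teaches the model to stop if the last kept token is <|eot_id|>.
        if not kept or kept[-1] != EOT_ID:
            lost_eot += 1
    return {
        "total": total,
        "too_long": too_long,
        "lost_eot": lost_eot,
        "lost_eot_fraction": lost_eot / total if total else 0.0,
    }
```

A high `lost_eot_fraction` at the chosen sequence length suggests either raising seq_len or filtering out the long examples before training.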

The fix is the same as for the 1B model. If you're using this model, pass both stop tokens explicitly:

model.generate(
    input_ids=inputs,
    eos_token_id=[128001, 128009],
    max_new_tokens=512,
    repetition_penalty=1.3,
    no_repeat_ngram_size=6,
)
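The two ids above are hard-coded for the stock Llama 3 vocabulary. As a hedged alternative, the sketch below looks them up by token string; it assumes a Hugging Face tokenizer object, and `stop_token_ids` is a hypothetical helper name.

```python
def stop_token_ids(tokenizer) -> list[int]:
    """Collect the ids of both Llama 3 end markers from the tokenizer."""
    ids: list[int] = []
    for token in ("<|end_of_text|>", "<|eot_id|>"):
        token_id = tokenizer.convert_tokens_to_ids(token)
        # Unknown tokens come back as None or as the unk id; skip those.
        if token_id is None or token_id == tokenizer.unk_token_id:
            continue
        if token_id not in ids:
            ids.append(token_id)
    return ids
```

The result can be passed as `eos_token_id=stop_token_ids(tokenizer)` in the `model.generate()` call.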

For Ollama, add to your Modelfile:

PARAMETER stop "<|eot_id|>"
PARAMETER stop "<|end_of_text|>"

An updated attempt at distilling Claude Opus 4.6/4.7 reasoning traces into a small-form-factor model. The predecessor, Llama 3.2 1B Claude Opus Reasoning Distill, demonstrated that a 1B model could adopt <think> blocks but suffered from echolalia and a GSM8K regression. This run responds to the root causes identified in that experiment with three changes:

  1. Capacity: 3B sits closer to the parameter floor where structured reasoning adoption is viable, as seen in models like Gemma 4 E2B-IT and Qwen3-1.7B (which has <think> baked into pretraining)
  2. Token boundaries: <think> and </think> are registered as special tokens (vocab 128256 → 128258) with trained embeddings, giving the model a hard mode boundary instead of treating them as plain text
  3. Training on responses only: unlike the 1B run, I used train_on_responses_only from Unsloth to mask out user inputs, with the aim of improving accuracy in multi-turn conversational fine-tuning
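To illustrate what response-only training means for the loss, here is a simplified sketch of the masking idea. It is not Unsloth's implementation of train_on_responses_only. The marker ids and the helper names `find_sub` and `mask_non_response` are assumptions made for the example.

```python
IGNORE_INDEX = -100  # label value that PyTorch cross-entropy skips by default


def find_sub(seq: list[int], sub: list[int], start: int = 0) -> int:
    """Return the first index >= start where sub occurs in seq, or -1."""
    n = len(sub)
    for i in range(start, len(seq) - n + 1):
        if seq[i:i + n] == sub:
            return i
    return -1


def mask_non_response(input_ids: list[int], response_start: list[int], turn_end: int) -> list[int]:
    """Build labels that keep only assistant tokens, including each end-of-turn token."""
    if not response_start:
        raise ValueError("response_start must contain at least one token id")
    labels = [IGNORE_INDEX] * len(input_ids)
    pos = 0
    while True:
        start = find_sub(input_ids, response_start, pos)
        if start == -1:
            break
        i = start + len(response_start)
        # Copy tokens into labels until the assistant turn closes.
        while i < len(input_ids):
            labels[i] = input_ids[i]
            i += 1
            if labels[i - 1] == turn_end:
                break
        pos = i
    return labels
```

System and user tokens receive IGNORE_INDEX, so only the assistant's reasoning, answer, and end-of-turn token contribute to the loss.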

Benchmarks will not be available.


Model Details

| Field | Value |
|---|---|
| Base model | unsloth/Llama-3.2-3B-Instruct-bnb-4bit |
| Model type | Causal LM, LoRA adapter (PEFT) on Llama-3.2-3B-Instruct |
| Language | English |
| License | Meta Llama 3.2 Community License |
| Training framework | Unsloth + TRL SFTTrainer |
| Hardware | Tesla T4 (Kaggle) |
| Max sequence length | 2048 |

Intended Use

Generating step-by-step reasoning traces (<think> blocks) followed by final answers across a broad range of instruction-following tasks. Useful for studying how reasoning distillation scales to sub-4B models and how registered thinking tokens affect small-model behaviour.
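For downstream use it is often convenient to separate the reasoning trace from the final answer. The sketch below is one way to do that. It assumes the output was decoded with the <think> and </think> markers still present (for example with skip_special_tokens=False, since registered special tokens may otherwise be dropped); `split_reasoning` is a hypothetical helper name.

```python
import re

_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Split decoded output into (reasoning, final_answer).

    If no complete <think>...</think> block is found, the reasoning is empty
    and the whole text is returned as the answer.
    """
    match = _THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()
    return reasoning, answer
```

An empty reasoning string is a quick signal that the model did not open or close its thinking block for that prompt.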

Not intended for: production use, mathematical proofs requiring reliability, or replacing a larger reasoning model. Benchmark regressions vs base are expected until verified otherwise.


How to Get Started

From the adapter

The LoRA adapter is available separately — load it on top of the base model without downloading the full merged weights.

Important: load the tokenizer from the adapter directory, not the base model. The adapter tokenizer carries the correct 128258-token vocabulary with <think>/</think> baked in. Using the base model tokenizer (128256) will cause an embedding dimension mismatch.

from unsloth import FastLanguageModel
from transformers import AutoTokenizer, TextStreamer
from peft import PeftModel

ADAPTER_PATH = "codestrate/Llama3.2-3B-Claude-Reasoning-Distill"

model, _ = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-3B-Instruct-bnb-4bit",
    load_in_4bit=True,
    max_seq_length=2048,
)
tokenizer = AutoTokenizer.from_pretrained(ADAPTER_PATH)  # vocab=128258
model.resize_token_embeddings(len(tokenizer))
model = PeftModel.from_pretrained(model, ADAPTER_PATH)
FastLanguageModel.for_inference(model)

SYSTEM_PROMPT = "You are a helpful assistant. Think step by step inside <think>...</think> before giving your final answer."
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "Write a Python function to check if a number is prime."},
]
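# add_generation_prompt=True appends the assistant header so the model starts a new turn.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to("cuda")

streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
_ = model.generate(
    input_ids=inputs,
    streamer=streamer,
    eos_token_id=[128001, 128009],  # stop on <|end_of_text|> or <|eot_id|>
    max_new_tokens=1024,
    temperature=0.7,
    min_p=0.1,
    repetition_penalty=1.3,
    no_repeat_ngram_size=6,
    use_cache=True,
)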
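The tokenizer requirement described above can be checked before generating. This is a minimal sketch, assuming a Hugging Face tokenizer loaded from the adapter directory; `check_think_tokens` is a hypothetical helper, and the expected size of 128258 is taken from this card.

```python
def check_think_tokens(tokenizer, expected_vocab: int = 128258) -> list[str]:
    """Return a list of problems found; an empty list means the tokenizer looks right."""
    problems = []
    if len(tokenizer) != expected_vocab:
        problems.append(f"vocab size is {len(tokenizer)}, expected {expected_vocab}")
    for token in ("<think>", "</think>"):
        # A registered special token should encode to exactly one id.
        ids = tokenizer.encode(token, add_special_tokens=False)
        if len(ids) != 1:
            problems.append(f"{token} encodes to {len(ids)} ids instead of 1")
    return problems
```

If the list is non-empty, the base model tokenizer was probably loaded instead of the adapter tokenizer.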

From GGUF (Ollama / LM Studio)

A Modelfile is included for Ollama. For direct use:

ollama run hf.co/codestrate/Llama3.2-3B-Claude-Reasoning-Distill:Q4_K_M

Training Details

Dataset

angrygiraffe/claude-opus-4.6-4.7-reasoning-8.7k, instruct_train.jsonl split (full instruct + reasoning, ~7,700 examples). The data is already in OpenAI messages format and was mapped directly through apply_chat_template with no additional preprocessing.

The previous 1B run used only the coding + math categories (~2,000 examples). This run uses the full instruct split for broader coverage.

Hyperparameters

| Parameter | Value |
|---|---|
| LoRA Rank / Alpha | 32 / 64 |
| Target Modules | All |
| Sequence Length | 2048 |
| Effective Batch | 16 (2 × grad_accum 8) |
| Steps | 904 (~2 epochs) |
| Learning Rate | 1e-4 / cosine |
| Warmup Steps | 50 |
| Optimizer | adamw_8bit |
| Weight Decay | 0.01 |
| Precision | bfloat16 |
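The batch and step figures in the table imply a rough number of sequences per epoch. The arithmetic below is a sketch that assumes a single GPU and exactly two epochs (the table says "~2"); `sequences_seen` is a hypothetical helper.

```python
def sequences_seen(steps: int, per_device_batch: int, grad_accum: int, num_devices: int = 1) -> int:
    """Total training sequences consumed after the given number of optimizer steps."""
    return steps * per_device_batch * grad_accum * num_devices


total = sequences_seen(steps=904, per_device_batch=2, grad_accum=8)
per_epoch = total / 2  # assumption: exactly two epochs

print(total)      # 14464
print(per_epoch)  # 7232.0
```

Under these assumptions, one epoch ends at step 452 (7,232 / 16).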

Loss Curve


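| Step | Loss | Step | Loss | Step | Loss |
|---|---|---|---|---|---|
| 50 | 2.1372 | 350 | 1.8798 | 650 | 1.7567 |
| 100 | 1.9597 | 400 | 1.8512 | 700 | 1.7530 |
| 150 | 1.9251 | 450 | 1.8493 | 750 | 1.7391 |
| 200 | 1.8972 | 500 | 1.7670 | 800 | 1.7709 |
| 250 | 1.8891 | 550 | 1.7707 | 850 | 1.7401 |
| 300 | 1.8738 | 600 | 1.7668 | 900 | 1.7598 |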

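Drop: 2.14 → 1.74 (~0.40 absolute, measured to the lowest logged values in epoch 2). The cross-epoch improvement is visible at step ~452: the logged loss falls from 1.8493 at step 450 to 1.7670 at step 500 (−0.082). The loss plateaus in epoch 2 from step 750, so a third epoch would not have been beneficial on this dataset.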
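The figures quoted above can be reproduced from the logged values. This is a small sketch over the table's numbers only; the variable and function names are illustrative.

```python
LOSSES = {
    50: 2.1372, 100: 1.9597, 150: 1.9251, 200: 1.8972, 250: 1.8891, 300: 1.8738,
    350: 1.8798, 400: 1.8512, 450: 1.8493, 500: 1.7670, 550: 1.7707, 600: 1.7668,
    650: 1.7567, 700: 1.7530, 750: 1.7391, 800: 1.7709, 850: 1.7401, 900: 1.7598,
}


def delta(losses: dict, step_from: int, step_to: int) -> float:
    """Change in logged loss between two steps (negative means improvement)."""
    return round(losses[step_to] - losses[step_from], 4)


# Total drop from the first logged value to the lowest logged value.
total_drop = round(LOSSES[50] - min(LOSSES.values()), 4)

# Change across the epoch boundary (between the step-450 and step-500 logs).
cross_epoch = delta(LOSSES, 450, 500)

# Spread of the logged loss from step 750 onward, a rough measure of the plateau.
tail = [loss for step, loss in LOSSES.items() if step >= 750]
plateau_spread = round(max(tail) - min(tail), 4)

print(total_drop, cross_epoch, plateau_spread)
```

A plateau spread of about 0.03 from step 750 onward is small next to the 0.08 change at the epoch boundary, which is consistent with the reading that the loss has flattened.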


Known Limitations

  • Benchmarks not yet available — results will be added when the evaluation runs complete
  • Echolalia / repetition — reduced vs the 1B run due to special token boundaries, but not eliminated; repetition_penalty=1.3 and no_repeat_ngram_size=6 are recommended at inference (needs more testing)
  • System prompt required — without the <think>...</think> contract in the system prompt, the model may not cleanly transition from reasoning block to final answer
  • Not a production model — a research artefact studying reasoning distillation at sub-4B scale

Available Files

| File | Format | Use |
|---|---|---|
| Llama-3.2-3B-Claude-Reasoning-Distill.Q4_K_M.gguf | GGUF Q4_K_M | LM Studio / Ollama (recommended) |
| Llama-3.2-3B-Claude-Reasoning-Distill.Q8_0.gguf | GGUF Q8_0 | Higher fidelity inference (near lossless; still lightweight) |
| Llama-3.2-3B-Claude-Reasoning-Distill.F16.gguf | GGUF F16 | Full precision GGUF |
| Adapter (adapter_model.safetensors) | LoRA adapter | PEFT inference / further fine-tuning |

Framework Versions

  • Python 3.12.13
  • Unsloth 2026.5.8
  • PEFT 0.19.1
  • TRL 0.24.0
  • PyTorch 2.10.0+cu128
  • Transformers 4.47.1

Predecessor: Llama3.2-1B-Claude-Opus-Reasoning-Distill
Trained 2x faster with Unsloth
