Llama 3.2 3B — Claude Reasoning Distill

This model was a second attempt at reasoning distillation, with several fixes from the 1B run — but the core approach was still wrong.

1. Same root problem: SFT copies style, not capability. GRPO is the right approach.

2. Dataset truncation caused the stopping problem. The training dataset (angrygiraffe/claude-opus-4.6-4.7-reasoning-8.7k) averages ~1,954 tokens per example, and p90 assistant responses alone reach ~1,760 tokens. With training at seq_len=2048, a significant portion of examples were silently truncated, which cut off the <|eot_id|> end-of-turn token before it could be written. The model therefore learned from many examples that responses do not need to end. This is a dataset fit problem, not a model problem.

3. Wrong EOS token at inference. Llama 3 has two EOS-like tokens. tokenizer.eos_token_id returns 128001 (<|end_of_text|>), but the model generates 128009 (<|eot_id|>) to end a turn. The default model.generate() call never passes 128009 as a stop id, so generation runs until max_new_tokens. This compounds the truncation issue above.
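
The truncation effect in point 2 can be illustrated with a small sketch. It assumes each training example is a list of token ids ending in <|eot_id|> (128009) and that the trainer simply cuts sequences at the maximum length. The helper names and the toy lengths are hypothetical and are not taken from the actual training run.

```python
EOT_ID = 128009  # <|eot_id|>, the Llama 3 end-of-turn token


def fraction_missing_eot(examples, max_seq_len):
    """Share of examples whose final <|eot_id|> is lost when cut at max_seq_len."""
    if not examples:
        return 0.0
    missing = 0
    for ids in examples:
        truncated = ids[:max_seq_len]  # what a naive truncating trainer would see
        if not truncated or truncated[-1] != EOT_ID:
            missing += 1
    return missing / len(examples)


def make_example(n_tokens):
    """Toy example: n_tokens - 1 filler ids followed by the end-of-turn token."""
    return [1] * (n_tokens - 1) + [EOT_ID]


# Hypothetical lengths, chosen for illustration only.
toy_dataset = [make_example(n) for n in (900, 1800, 2048, 2500, 3100)]
print(fraction_missing_eot(toy_dataset, max_seq_len=2048))  # 0.4
```

Running the same kind of check on the real tokenized data before training would show whether the sequence length needs to be raised or the longest examples filtered out.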

The fix is the same as for the 1B model. If you're using this model, pass both stop ids to generate():

model.generate(
    input_ids=inputs,
    eos_token_id=[128001, 128009],
    max_new_tokens=512,
    repetition_penalty=1.3,
    no_repeat_ngram_size=6,
)

For Ollama, add to your Modelfile:

PARAMETER stop "<|eot_id|>"
PARAMETER stop "<|end_of_text|>"
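
Where a runtime only honours a single EOS id, the output can also be trimmed after generation. The sketch below is a hypothetical helper (`trim_at_stop` is not part of any library) that cuts a list of generated token ids at the first of the two stop ids named above. The other ids in the example are arbitrary placeholders.

```python
STOP_IDS = frozenset({128001, 128009})  # <|end_of_text|>, <|eot_id|>


def trim_at_stop(token_ids, stop_ids=STOP_IDS):
    """Return token_ids up to, but not including, the first stop token."""
    for i, tok in enumerate(token_ids):
        if tok in stop_ids:
            return token_ids[:i]
    return token_ids  # no stop token was produced; keep everything


# Arbitrary placeholder ids with an end-of-turn token in the middle.
generated = [791, 4320, 374, 220, 2983, 128009, 791, 4320, 374]
print(trim_at_stop(generated))  # [791, 4320, 374, 220, 2983]
```

Trimming only cleans up the text. It does not save compute, because the model still generates until max_new_tokens, so passing the stop ids at generation time remains the better option where it is supported.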

An updated attempt at distilling Claude Opus 4.6/4.7 reasoning traces into a small-form-factor model. The predecessor, Llama 3.2 1B Claude Opus Reasoning Distill, demonstrated that a 1B model could adopt <think> blocks but suffered from echolalia and a GSM8K regression. This run addresses the two root causes identified from that experiment and adds one training change:

Capacity: 3B sits closer to the parameter floor where structured reasoning adoption is viable, as seen in models like Gemma 4 E2B-IT and Qwen3-1.7B (which has <think> baked into pretraining)
Token boundaries: <think> and </think> are registered as special tokens (vocab 128256 → 128258) with trained embeddings, giving the model a hard mode boundary instead of treating them as plain text
Training on responses only: unlike the 1B run, I used train_on_responses_only from Unsloth to mask out user inputs, with the aim of improving accuracy in multi-turn conversational fine-tuning
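
The idea behind response-only training can be sketched without any library: tokens outside assistant turns get the label -100, which PyTorch's cross-entropy loss ignores by default. This is a simplified illustration and not Unsloth's implementation. The function name and the small integer markers standing in for the assistant header and <|eot_id|> are hypothetical.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_non_assistant(input_ids, assistant_header, eot_id):
    """Copy input_ids into labels, keeping only assistant-turn tokens.

    assistant_header is the token-id sequence that opens an assistant turn.
    Everything after it, up to and including eot_id, stays trainable.
    """
    labels = [IGNORE_INDEX] * len(input_ids)
    h = len(assistant_header)
    i = 0
    while i < len(input_ids):
        if input_ids[i:i + h] == assistant_header:
            j = i + h  # first token of the assistant response
            while j < len(input_ids):
                labels[j] = input_ids[j]
                if input_ids[j] == eot_id:
                    break
                j += 1
            i = j + 1
        else:
            i += 1
    return labels


# Toy ids: 90/91 = user header, 92/93 = assistant header, 99 = end of turn.
ids = [90, 91, 5, 6, 99, 92, 93, 7, 8, 99]
print(mask_non_assistant(ids, assistant_header=[92, 93], eot_id=99))
# [-100, -100, -100, -100, -100, -100, -100, 7, 8, 99]
```

Note that the end-of-turn token stays in the labels, which is what lets the model learn to stop. If truncation removes it, there is nothing left to learn that from.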

Benchmarks will not be available.

Model Details

| Field | Value |
|---|---|
| Base model | `unsloth/Llama-3.2-3B-Instruct-bnb-4bit` |
| Model type | Causal LM, LoRA adapter (PEFT) on Llama-3.2-3B-Instruct |
| Language | English |
| License | Meta Llama 3.2 Community License |
| Training framework | Unsloth + TRL SFTTrainer |
| Hardware | Tesla T4 (Kaggle) |
| Max sequence length | 2048 |

Intended Use

Generating step-by-step reasoning traces (<think> blocks) followed by final answers across a broad range of instruction-following tasks. Useful for studying how reasoning distillation scales to sub-4B models and how registered thinking tokens affect small-model behaviour.

Not intended for: production use, mathematical proofs requiring reliability, or replacing a larger reasoning model. Benchmark regressions vs base are expected until verified otherwise.
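
For studying the traces, it helps to separate the reasoning block from the final answer. The following is a minimal sketch that assumes the decoded output still contains literal <think> and </think> tags. Because the tags are registered as special tokens, decoding with skip_special_tokens=True would likely strip them, so this presumably needs skip_special_tokens=False. The helper name `split_reasoning` is hypothetical.

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text):
    """Split model output into (reasoning, answer).

    Returns an empty reasoning string when no complete <think> block is found.
    """
    match = THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()
    return reasoning, answer


sample = "<think>Test divisors up to sqrt(n).</think>\nUse trial division."
print(split_reasoning(sample))
# ('Test divisors up to sqrt(n).', 'Use trial division.')
```

An unclosed <think> block falls through to the "no reasoning" branch, which makes runaway generations easy to spot when checking outputs in bulk.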

How to Get Started

From the adapter

The LoRA adapter is available separately — load it on top of the base model without downloading the full merged weights.

Important: load the tokenizer from the adapter directory, not the base model. The adapter tokenizer carries the correct 128258-token vocabulary with <think>/</think> baked in. Using the base model tokenizer (128256) will cause an embedding dimension mismatch.

from unsloth import FastLanguageModel
from transformers import AutoTokenizer, TextStreamer
from peft import PeftModel

ADAPTER_PATH = "codestrate/Llama3.2-3B-Claude-Reasoning-Distill"

model, _ = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-3B-Instruct-bnb-4bit",
    load_in_4bit=True,
    max_seq_length=2048,
)
tokenizer = AutoTokenizer.from_pretrained(ADAPTER_PATH)  # vocab=128258
model.resize_token_embeddings(len(tokenizer))
model = PeftModel.from_pretrained(model, ADAPTER_PATH)
FastLanguageModel.for_inference(model)

SYSTEM_PROMPT = "You are a helpful assistant. Think step by step inside <think>...</think> before giving your final answer."
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "Write a Python function to check if a number is prime."},
]
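# add_generation_prompt=True appends the assistant header so the model starts a new turn
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to("cuda")

streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
_ = model.generate(
    input_ids=inputs,
    streamer=streamer,
    eos_token_id=[128001, 128009],  # stop on <|end_of_text|> or <|eot_id|>
    max_new_tokens=1024,
    do_sample=True,  # sampling must be on for temperature and min_p to apply
    temperature=0.7,
    min_p=0.1,
    repetition_penalty=1.3,
    no_repeat_ngram_size=6,
    use_cache=True,
)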

From GGUF (Ollama / LM Studio)

A Modelfile is included for Ollama. For direct use:

ollama run hf.co/codestrate/Llama3.2-3B-Claude-Reasoning-Distill:Q4_K_M

Training Details

Dataset

angrygiraffe/claude-opus-4.6-4.7-reasoning-8.7k — instruct_train.jsonl split (full instruct + reasoning, ~7,700 examples). Data already in OpenAI messages format; mapped directly through apply_chat_template with no additional preprocessing.

The previous 1B run used only the coding + math categories (~2,000 examples). This run uses the full instruct split for broader coverage.
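
To make the role of <|eot_id|> in the formatted training text concrete, here is a simplified stand-in for what a Llama 3 style chat template produces from OpenAI-format messages. It is an assumption-laden sketch, not the official template: the real Llama 3.2 template also injects date lines into the system turn and handles tool calls, and `render_llama3` is a hypothetical name.

```python
def render_llama3(messages, add_generation_prompt=False):
    """Simplified Llama 3 style chat rendering (not the official template)."""
    parts = ["<|begin_of_text|>"]
    for msg in messages:
        parts.append(f"<|start_header_id|>{msg['role']}<|end_header_id|>\n\n")
        parts.append(msg["content"].strip())
        parts.append("<|eot_id|>")  # every turn is closed by an end-of-turn token
    if add_generation_prompt:
        # At inference time the prompt ends with an open assistant header.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


example = [
    {"role": "user", "content": "Is 7 prime?"},
    {"role": "assistant", "content": "<think>7 has no divisors other than 1 and 7.</think>Yes."},
]
rendered = render_llama3(example)
print(rendered)
```

In a training example the assistant turn is last, so its <|eot_id|> is the final token of the sequence and the first thing lost when the sequence is cut at the maximum length.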

Hyperparameters

| Parameter | Value |
|---|---|
| LoRA Rank / Alpha | 32 / 64 |
| Target Modules | All |
| Sequence Length | 2048 |
| Effective Batch | 16 (2 × grad_accum 8) |
| Steps | 904 (~2 epochs) |
| Learning Rate | 1e-4 / cosine |
| Warmup Steps | 50 |
| Optimizer | adamw_8bit |
| Weight Decay | 0.01 |
| Precision | bfloat16 |
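
The batch and step figures can be sanity-checked with a few lines of arithmetic. The example count below is the rounded "~7,700" from the Dataset section, so the resulting epoch count is only approximate.

```python
per_device_batch = 2
grad_accum = 8
total_steps = 904
approx_examples = 7_700  # rounded figure from the dataset description

effective_batch = per_device_batch * grad_accum
samples_seen = total_steps * effective_batch
epochs = samples_seen / approx_examples

print(effective_batch)   # 16
print(samples_seen)      # 14464
print(round(epochs, 2))  # 1.88
```

This is consistent with the "~2 epochs" quoted in the table.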

Loss Curve

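| Step | Loss | Step | Loss | Step | Loss |
|---|---|---|---|---|---|
| 50 | 2.1372 | 350 | 1.8798 | 650 | 1.7567 |
| 100 | 1.9597 | 400 | 1.8512 | 700 | 1.7530 |
| 150 | 1.9251 | 450 | 1.8493 | 750 | 1.7391 |
| 200 | 1.8972 | 500 | 1.7670 | 800 | 1.7709 |
| 250 | 1.8891 | 550 | 1.7707 | 850 | 1.7401 |
| 300 | 1.8738 | 600 | 1.7668 | 900 | 1.7598 |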

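Drop: 2.14 → 1.74 (~0.40 absolute), measured from step 50 to the lowest logged loss at step 750. There is a visible cross-epoch improvement around step ~452: the loss falls by 0.082 between steps 450 and 500. The loss plateaus in epoch 2 from step 750, so a third epoch would not have been beneficial on this dataset.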
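
The figures quoted above can be reproduced directly from the logged values. The sketch below uses only the numbers in the loss table; the variable names are illustrative.

```python
losses = {
    50: 2.1372, 100: 1.9597, 150: 1.9251, 200: 1.8972, 250: 1.8891, 300: 1.8738,
    350: 1.8798, 400: 1.8512, 450: 1.8493, 500: 1.7670, 550: 1.7707, 600: 1.7668,
    650: 1.7567, 700: 1.7530, 750: 1.7391, 800: 1.7709, 850: 1.7401, 900: 1.7598,
}

first_step = min(losses)
best_step = min(losses, key=losses.get)  # step with the lowest logged loss

total_drop = losses[first_step] - losses[best_step]
cross_epoch_delta = losses[500] - losses[450]  # spans the epoch boundary near step 452
plateau = [loss for step, loss in losses.items() if step >= 750]
plateau_spread = max(plateau) - min(plateau)

print(best_step)                    # 750
print(round(total_drop, 4))         # 0.3981
print(round(cross_epoch_delta, 4))  # -0.0823
print(round(plateau_spread, 4))     # 0.0318
```

From step 750 onward the logged loss moves within a band of about 0.03 with no downward trend, which is the plateau referred to above.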

Known Limitations

Benchmarks not available: no evaluation results are published for this model
Echolalia / repetition: reduced vs the 1B run due to special token boundaries, but not eliminated; repetition_penalty=1.3 and no_repeat_ngram_size=6 are recommended at inference (needs more testing)
System prompt required: without the <think>...</think> contract in the system prompt, the model may not cleanly transition from reasoning block to final answer
Not a production model: a research artefact studying reasoning distillation at sub-4B scale
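
As a rough illustration of what no_repeat_ngram_size does, the sketch below computes which next tokens would complete an n-gram that already appears in the sequence. It is a simplified pure-Python version of the idea, not the Transformers implementation, and `banned_next_tokens` is a hypothetical name.

```python
def banned_next_tokens(ids, n):
    """Tokens that would complete an n-gram already present in ids."""
    if n < 1 or len(ids) < n:
        return set()
    # The last n-1 tokens form the prefix of the n-gram about to be completed.
    prefix = tuple(ids[-(n - 1):]) if n > 1 else ()
    banned = set()
    for i in range(len(ids) - n + 1):
        if tuple(ids[i:i + n - 1]) == prefix:
            banned.add(ids[i + n - 1])
    return banned


# The sequence ends in (1, 2), which was followed by 3 earlier on.
history = [1, 2, 3, 4, 1, 2]
print(banned_next_tokens(history, n=3))  # {3}
```

With n=6, as recommended above, only exact six-token repeats are blocked, so shorter loops can still occur. This fits the observation that repetition is reduced but not eliminated.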

Available Files

| File | Format | Use |
|---|---|---|
| `Llama-3.2-3B-Claude-Reasoning-Distill.Q4_K_M.gguf` | GGUF Q4_K_M | LM Studio / Ollama (recommended) |
| `Llama-3.2-3B-Claude-Reasoning-Distill.Q8_0.gguf` | GGUF Q8_0 | Higher-fidelity inference (near lossless; still lightweight) |
| `Llama-3.2-3B-Claude-Reasoning-Distill.F16.gguf` | GGUF F16 | Unquantized 16-bit GGUF |
| Adapter (`adapter_model.safetensors`) | LoRA adapter | PEFT inference / further fine-tuning |

Framework Versions

Python 3.12.13
Unsloth 2026.5.8
PEFT 0.19.1
TRL 0.24.0
PyTorch 2.10.0+cu128
Transformers 4.47.1

Predecessor: Llama3.2-1B-Claude-Opus-Reasoning-Distill
Trained 2x faster with Unsloth
