Grug 35B A3B
Grug 35B A3B is a compact-reasoning fine-tune of Qwen/Qwen3.6-35B-A3B.
It keeps the Qwen MoE/A3B architecture: 35B total parameters with roughly 3B
activated per token, 40 text layers, 256 experts, and 8 routed experts per
token plus the shared expert path.
This repository is published as merged Transformers/safetensors model weights. It was trained with QLoRA/PEFT LoRA, then merged into the base model before upload. You do not need a separate adapter to load this model.
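As a rough illustration of what "merged" means here, the sketch below folds a LoRA update into a base weight matrix using the standard LoRA scaling `alpha / rank`. The shapes and values are toy data, and the helper names `matmul` and `merge_lora` are hypothetical; the actual merge for this release was presumably done with PEFT tooling rather than code like this.

```python
# Sketch of folding a LoRA update into a base weight matrix.
# Assumption: standard LoRA scaling (alpha / rank); all values are toy data.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(base, lora_b, lora_a, alpha, rank):
    """Return base + (alpha / rank) * (lora_b @ lora_a)."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(base, delta)
    ]

# Toy example: a 2x3 weight with a rank-1 adapter.
base = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
lora_b = [[1.0], [2.0]]        # shape (out_features, rank)
lora_a = [[0.5, 0.0, 1.0]]     # shape (rank, in_features)
merged = merge_lora(base, lora_b, lora_a, alpha=2, rank=1)
```

With the settings listed under Training Procedure (rank 4, alpha 8) the same scale factor would be 8 / 4 = 2. After merging, only the combined weights need to be stored, which is why no separate adapter is required at load time.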
What Changed
The training target is a terse internal-reasoning style: short high-density notes, fewer filler phrases, and stronger preservation of constraints, equations, checks, bug causes, decisive branches, edge cases, and final-answer validation.
The goal is lower reasoning-token usage relative to the base model while preserving answer quality. It is not meant to hide uncertainty or remove needed reasoning.
Architecture
The merged release is a Qwen MoE text-generation model:
- Base model: Qwen/Qwen3.6-35B-A3B.
- Released architecture: Qwen3_5MoeForCausalLM.
- Model type: qwen3_5_moe_text.
- Hidden size: 2048.
- Text layers: 40.
- Experts: 256.
- Activated experts: 8 routed experts plus shared expert path.
- Native context length from the base model family: 262,144 tokens.
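To make the "8 routed experts out of 256" figure concrete, here is a minimal sketch of top-k expert selection. It is a generic MoE routing illustration, not the Qwen implementation: the function name `route_top_k`, the toy router scores, and the choice to renormalize the kept weights are all assumptions, and the shared expert path is not modeled.

```python
import math

def route_top_k(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    # Softmax over all experts, shifted by the max for numerical stability.
    peak = max(router_logits)
    exps = [math.exp(v - peak) for v in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the k most probable experts.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # Assumption: renormalize the kept weights so they sum to 1.
    kept = sum(probs[i] for i in top)
    return [(i, probs[i] / kept) for i in top]

NUM_EXPERTS = 256   # from the architecture list above
TOP_K = 8           # routed experts per token

logits = [math.sin(i) for i in range(NUM_EXPERTS)]  # toy router scores
selected = route_top_k(logits, TOP_K)
active_fraction = TOP_K / NUM_EXPERTS  # share of routed experts used per token
```

Only the selected experts run for a given token, which is how a 35B-parameter model ends up with roughly 3B activated parameters per token.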
Training Data
The data pipeline started from a recent, filtered reasoning pool and converted verbose traces into compact traces before SFT packing.
Source gate:
- Run date: June 30, 2026.
- Default freshness cutoff: 45 days. Sources older than May 16, 2026 were rejected unless manually allowed.
- Allowed train licenses: MIT, Apache-2.0, CC-BY-4.0, CC0-1.0.
- Hard reject terms included OpenAI, ChatGPT, GPT-5, Claude, Anthropic, Opus, Sonnet, and Gemini.
- Soft-risk sources marked as synthetic/distill were manually reviewed or rejected depending on provenance and license.
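The source gate above can be sketched as a small filter. This is an illustration under stated assumptions, not the pipeline's actual code: the function name `gate_source`, the fields it inspects (dataset name and description), and the whole-word matching of reject terms are hypothetical choices. The dates, licenses, and terms come from the list above.

```python
import re
from datetime import date, timedelta

RUN_DATE = date(2026, 6, 30)
FRESHNESS_DAYS = 45
ALLOWED_LICENSES = {"mit", "apache-2.0", "cc-by-4.0", "cc0-1.0"}
HARD_REJECT_TERMS = (
    "openai", "chatgpt", "gpt-5", "claude", "anthropic", "opus", "sonnet", "gemini",
)
# Assumption: whole-word, case-insensitive matching, so "octopus" does not hit "opus".
REJECT_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(t) for t in HARD_REJECT_TERMS) + r")\b",
    re.IGNORECASE,
)

def gate_source(name, license_id, last_modified, description="", manually_allowed=False):
    """Return (accepted, reason) for one candidate dataset."""
    cutoff = RUN_DATE - timedelta(days=FRESHNESS_DAYS)  # May 16, 2026
    if last_modified < cutoff and not manually_allowed:
        return False, "stale"
    if license_id.lower() not in ALLOWED_LICENSES:
        return False, "license"
    match = REJECT_PATTERN.search(f"{name} {description}")
    if match:
        return False, f"reject term: {match.group(1).lower()}"
    return True, "ok"
```

The soft-risk review step is manual in the description above, so it is deliberately left out of this sketch.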
Final verified source mix:
| Source | License | Domain | Verified rows |
|---|---|---|---|
| hotdogs/uka-glm-5.2 | MIT | agent code | 1,617 |
| Scale-or-Reason/general-reasoning-ift-pairs | MIT | general reasoning | 1,305 |
| samcheng0/lumia-reasoning-sft-v1 | Apache-2.0 | code reasoning | 1,103 |
| HSH-Intelligence/verified-math-reasoning-3k | Apache-2.0 | math | 672 |
| kd13/CodeDebug-Instruct-v2-Reasoning | MIT | code debug | 600 |
| Madarabr/cortex-adaptive-thinking | Apache-2.0 | adaptive reasoning | 300 |
| CL-From-Nothing/code_rose_initial_1_7B_SFT_10K_rollouts_Qwen3-4B-Thinking-2507_k12_t0.7_maxtok12288 | Apache-2.0 | code reasoning | 143 |
Row counts:
- Normalized recent reasoning pool: 8,680 rows.
- Selected verbose reasoning set: 6,144 rows.
- Compact raw transform output: 6,144 rows.
- Clean packed SFT split: 4,517 train / 245 validation / 247 test.
- Qwen training rows accepted after length filter: 4,385.
- Qwen training rows skipped by length filter: 132.
- Evaluation rows used during training: 64.
The compact reasoning transform was generated with
cyankiwi/Qwen3.6-35B-A3B-AWQ-4bit served by vLLM. Rows were checked for
compression ratio, answer preservation, malformed tags, repetition, fixed
reasoning labels, tone issues, and obvious loss of critical information before
training.
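A few of the row checks named above (compression ratio, answer preservation, malformed tags, fixed reasoning labels) could look roughly like the following. This is a hedged sketch: the function name `check_compact_row`, the word-count ratio, the `max_ratio` threshold of 0.6, and the substring test for answer preservation are all assumptions, not the thresholds or logic actually used.

```python
import re

THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)
FIXED_LABELS = re.compile(r"\b(Goal|Rule|Logic|Edge)\s*:")

def check_compact_row(verbose_trace, compact_output, expected_answer, max_ratio=0.6):
    """Return a list of issue names for one transformed row (empty means clean)."""
    # Exactly one well-formed <think>...</think> block is expected.
    if compact_output.count("<think>") != 1 or compact_output.count("</think>") != 1:
        return ["malformed_tags"]
    match = THINK_BLOCK.search(compact_output)
    if match is None:  # tags present but in the wrong order
        return ["malformed_tags"]

    issues = []
    compact_trace = match.group(1).strip()
    final_answer = compact_output[match.end():].strip()

    # Compression ratio measured in whitespace-separated words (an assumption).
    ratio = len(compact_trace.split()) / max(1, len(verbose_trace.split()))
    if ratio > max_ratio:
        issues.append("weak_compression")
    # Answer preservation as a simple substring test (an assumption).
    if expected_answer not in final_answer:
        issues.append("answer_changed")
    # Fixed section labels inside the reasoning are treated as a defect.
    if FIXED_LABELS.search(compact_trace):
        issues.append("fixed_labels")
    return issues
```

Checks such as repetition, tone, and loss of critical information are harder to express as simple rules and are not covered here.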
Training Procedure
Training was completion-only SFT: prompt tokens were masked with -100, and
only the assistant completion was trained.
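A minimal sketch of that masking, using plain Python lists and made-up token IDs. The helper name `build_labels` is hypothetical; the value -100 is the default `ignore_index` of PyTorch's cross-entropy loss, which is why masked positions contribute nothing to the training loss.

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the loss

def build_labels(prompt_ids, completion_ids):
    """Concatenate prompt and completion, masking the prompt in the labels."""
    input_ids = list(prompt_ids) + list(completion_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(completion_ids)
    return input_ids, labels

# Toy token IDs: four prompt tokens followed by three completion tokens.
input_ids, labels = build_labels([101, 7, 8, 9], [42, 43, 2])
```

The model still attends to the prompt tokens as context; only the gradient signal is restricted to the assistant completion.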
Core settings:
- Base model: Qwen/Qwen3.6-35B-A3B.
- Method: QLoRA / PEFT LoRA, merged into full model weights for upload.
- Quantization during training: 4-bit NF4 with BF16 compute.
- Hardware: 8x NVIDIA Tesla V100-SXM2 16GB.
- Max sequence length: 1,792.
- LoRA rank: 4.
- LoRA alpha: 8.
- Batch size: 1.
- Gradient accumulation: 4.
- Learning rate: 8e-5.
- Max steps: 100.
- Eval steps: 25.
- Save steps: 25.
- Train runtime: about 1 hour 41 minutes 44 seconds.
- Train samples per second: 0.066.
- Train steps per second: 0.016.
- Train loss: 1.051.
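The throughput figures above can be cross-checked with a little arithmetic. The sketch assumes a single data-parallel replica, so that one optimizer step consumes batch size times gradient accumulation samples; that assumption is not stated in the card, but it reproduces the reported rates.

```python
per_device_batch = 1
grad_accum = 4
max_steps = 100
runtime_s = 1 * 3600 + 41 * 60 + 44        # about 1 h 41 min 44 s

# Assumption: one data-parallel replica, so no multiplier for the 8 GPUs.
samples_per_step = per_device_batch * grad_accum
samples_seen = samples_per_step * max_steps

steps_per_s = max_steps / runtime_s         # compare with the reported 0.016
samples_per_s = samples_seen / runtime_s    # compare with the reported 0.066

accepted_rows = 4385                        # rows accepted after the length filter
epoch_fraction = samples_seen / accepted_rows
```

Under that assumption the run saw 400 samples, which is well under one pass over the 4,385 accepted training rows and matches the description of the run as intentionally small.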
Validation loss:
| Step | Eval loss |
|---|---|
| 25 | 1.1510 |
| 50 | 1.0793 |
| 75 | 1.0487 |
| 100 | 1.0399 |
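If the eval loss is a mean per-token cross-entropy in nats (the usual convention for causal language model training, assumed here rather than stated in the card), it can be converted to perplexity with `exp(loss)`. The sketch below also computes the change between consecutive evaluations.

```python
import math

# Eval losses copied from the table above.
eval_loss = {25: 1.1510, 50: 1.0793, 75: 1.0487, 100: 1.0399}

# Assumption: loss is mean token-level cross-entropy in nats.
perplexity = {step: math.exp(loss) for step, loss in eval_loss.items()}

# Change in loss between consecutive evaluation steps.
steps = sorted(eval_loss)
deltas = {b: eval_loss[b] - eval_loss[a] for a, b in zip(steps, steps[1:])}
```

The loss falls at every evaluation, and each drop is smaller than the one before it, so the curve was flattening by step 100.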
Local Smoke Test
Final checkpoint smoke outputs were compact and did not use fixed
Goal/Rule/Logic/Edge labels.
Example:
<think>
Shirt 80, 25% off. 25% of 80 = 20. 80 - 20 = 60. Check: 60/80 = 0.75. Correct. Answer 60.
</think>
The sale price is $60.
Bug-fix smoke output correctly identified len(x) as the issue and changed it
to len(xs).
This is a smoke test, not a broad benchmark. Run your own evals before relying on the model in sensitive or production settings.
Usage
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "kai-os/Grug-35B-A3B"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
model.eval()

messages = [
    {"role": "user", "content": "If a shirt is $80 and goes 25% off, what is the sale price?"}
]

# Request a dict so the result can be unpacked into generate() regardless of
# what apply_chat_template returns by default in the installed version.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# Greedy decoding; the printed text includes the prompt as well as the reply.
with torch.no_grad():
    output = model.generate(**inputs, do_sample=False, max_new_tokens=512)

print(tokenizer.decode(output[0], skip_special_tokens=True))
For token-efficiency tests, compare against the base model with the same prompt and decoding settings. Do not use an artificial generation cap for benchmark claims unless the deployment itself requires one.
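One way to run such a comparison is to count the tokens inside the `<think>` block of each model's output. The helpers below are a sketch: the names `reasoning_token_count` and `token_savings` are hypothetical, and the default whitespace tokenizer is only a stand-in so the example runs without downloading a model.

```python
import re

THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def reasoning_token_count(generated_text, tokenize=str.split):
    """Count tokens inside the first <think>...</think> block (0 if absent)."""
    match = THINK_BLOCK.search(generated_text)
    if match is None:
        return 0
    return len(tokenize(match.group(1)))

def token_savings(base_text, tuned_text, tokenize=str.split):
    """Fraction of reasoning tokens saved relative to the base model output."""
    base = reasoning_token_count(base_text, tokenize)
    tuned = reasoning_token_count(tuned_text, tokenize)
    return 1.0 - tuned / base if base else 0.0
```

For real measurements, pass a model tokenizer instead of the default, for example `tokenize=lambda s: tokenizer.encode(s, add_special_tokens=False)`, and use the same tokenizer for both models so the counts are comparable. Savings only mean something alongside an answer-quality check on the same prompts.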
Limitations
- This is an experimental compact-reasoning fine-tune.
- The training run was intentionally small and should be treated as a first Qwen/A3B checkpoint, not a fully benchmarked production model.
- It may over-compress reasoning on tasks that need longer derivations.
- It inherits the base model's limitations and safety behavior.
- The reported evaluation is local and limited.
- The dataset includes synthetic and distilled reasoning traces from the listed open datasets; review source licenses and provenance before using this in commercial or sensitive settings.
Acknowledgements
Thanks to Lambda, the inference provider, for compute credits that supported the dataset work, training, and evaluation.
