Qwen3-Coder-64B
REAP-pruned Qwen/Qwen3-Coder-Next.
At a glance
| Property | Value |
|---|---|
| Base model | Qwen/Qwen3-Coder-Next |
| Format | BF16 |
| Total params | 64B |
| Active / token | ~4.2B |
| Experts / layer | 410 |
| Layers | 48 |
| Hidden size | 2048 |
| Context | 262,144 |
| On-disk size | 129 GB |
This is a 20% expert-pruned version of Qwen/Qwen3-Coder-Next, produced with Cerebras REAP (Router-weighted Expert Activation Pruning).
| | Original | This Model |
|---|---|---|
| Total params | ~80B | 64.26B |
| Experts | 512 | 410 |
| Active params/tok | ~4.2B | ~4.2B |
| Experts/tok | 10 | 10 |
| Format | BF16 | BF16 |
| Disk size | ~149 GB | ~129 GB |
REAP removes 20% of the MoE experts in each layer (102 of 512). The active parameter count per token is unchanged because the router still selects 10 experts per token from the remaining pool of 410. The result is a ~14% reduction in disk and memory footprint. On the benchmark suite below, the overall average drops by 2.1 points, with the largest losses on math tasks.
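The arithmetic behind these figures can be checked with a few lines of Python. This is only a sanity check on the rounded numbers from the comparison table; the constant names are illustrative, not part of any tool.

```python
# Figures copied from the comparison table above (disk sizes are rounded).
ORIGINAL_EXPERTS = 512
REMAINING_EXPERTS = 410
ORIGINAL_DISK_GB = 149
PRUNED_DISK_GB = 129

# Number and fraction of experts removed per layer.
removed = ORIGINAL_EXPERTS - REMAINING_EXPERTS
expert_reduction = removed / ORIGINAL_EXPERTS

# Relative reduction in on-disk size.
disk_reduction = (ORIGINAL_DISK_GB - PRUNED_DISK_GB) / ORIGINAL_DISK_GB

print(f"experts removed per layer: {removed} ({expert_reduction:.1%})")
print(f"disk reduction: {disk_reduction:.1%}")
```

With the rounded table values the disk reduction comes out slightly below the ~14% quoted above, which is presumably a rounding effect. The footprint shrinks by less than 20% because attention layers, embeddings, and shared experts are not pruned.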
Method
REAP (ICLR 2026) prunes Mixture-of-Experts models by scoring each expert's importance on calibration data and removing the lowest-scoring experts. The score draws on:
- Router gate values -- how often and how strongly the router selects each expert
- Expert activation norms -- magnitude of each expert's output contribution
- Frequency-weighted saliency -- combining routing frequency with activation importance

Two further choices keep the pruned model stable:
- Router logit renormalization -- maintains output distribution after expert removal
- Layerwise application -- independent per-layer pruning decisions for stability
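The scoring idea can be sketched for a single layer with NumPy. This is a minimal illustration, not the Cerebras implementation: it assumes the saliency of an expert is the mean, over the tokens routed to it, of the router gate value times the L2 norm of the expert's output. The function names and array layout are hypothetical.

```python
import numpy as np

def reap_style_scores(gates, out_norms):
    """Per-expert saliency for one layer (illustrative sketch).

    gates:     [tokens, experts] router weights, zero where the expert
               was not selected for that token.
    out_norms: [tokens, experts] L2 norm of each expert's output.
    """
    routed = gates > 0                      # which tokens reached each expert
    counts = routed.sum(axis=0)             # tokens routed to each expert
    weighted = (gates * out_norms * routed).sum(axis=0)
    # Average over routed tokens; experts that were never selected score 0.
    return np.where(counts > 0, weighted / np.maximum(counts, 1), 0.0)

def experts_to_prune(scores, compression_ratio):
    """Indices of the lowest-scoring experts for the given ratio."""
    n_prune = int(compression_ratio * scores.shape[0])
    return np.sort(np.argsort(scores, kind="stable")[:n_prune])

# Toy layer: 4 tokens, 4 experts, 2 experts selected per token.
gates = np.array([
    [0.6, 0.4, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.9, 0.0, 0.1],
    [0.7, 0.0, 0.0, 0.3],
])
out_norms = np.tile([2.0, 1.0, 3.0, 1.0], (4, 1))

scores = reap_style_scores(gates, out_norms)
print(scores, experts_to_prune(scores, 0.25))
```

In this toy layer the last expert is selected rarely and with small gate values, so it receives the lowest score and is the one pruned. Scoring each layer separately corresponds to the layerwise application listed above.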
Calibration Dataset
22,000 samples (no-refusal subset: 21,000), packed into 16,384-token sequences:
| Category | Samples | Source |
|---|---|---|
| Coding (general) | 4,096 | theblackcat102/evol-codealpaca-v1 |
| Reasoning (code) | ~2,680 | open-r1/Mixture-of-Thoughts[code] |
| Reasoning (math) | ~2,778 | open-r1/Mixture-of-Thoughts[math] |
| Reasoning (science) | ~2,776 | open-r1/Mixture-of-Thoughts[science] |
| Tool calling | 4,096 | Salesforce/xlam-function-calling-60k |
| Agentic coding | 4,096 | SWE-bench/SWE-smith-trajectories |
| + extended domains | ~1,478 | Scientific, CUDA kernels, browser, advanced math, code correctness |
Total tokens observed: ~90.5M across 6,391 packed sequences.
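As a quick consistency check, the per-category counts and the packing statistics can be recomputed from the numbers above. The snippet only does arithmetic on the reported figures; the variable names are illustrative.

```python
# Sample counts copied from the calibration table (approximate where marked ~).
samples = {
    "Coding (general)": 4096,
    "Reasoning (code)": 2680,
    "Reasoning (math)": 2778,
    "Reasoning (science)": 2776,
    "Tool calling": 4096,
    "Agentic coding": 4096,
    "Extended domains": 1478,
}
total_samples = sum(samples.values())

# Packing statistics as reported.
TOTAL_TOKENS = 90.5e6
PACKED_SEQUENCES = 6391
SEQUENCE_LENGTH = 16384

# Average number of real tokens per packed sequence, and how full each one is.
avg_tokens = TOTAL_TOKENS / PACKED_SEQUENCES
fill_rate = avg_tokens / SEQUENCE_LENGTH

print(f"total samples: {total_samples}")
print(f"avg tokens per packed sequence: {avg_tokens:.0f} ({fill_rate:.0%} full)")
```

The category counts sum to the stated 22,000 samples, and the packed sequences are on average a little under 90% full.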
Pruning Configuration
| Parameter | Value |
|---|---|
| Compression ratio | 0.20 (20% expert removal) |
| Original experts per layer | 512 |
| Remaining experts per layer | 410 |
| Pruning method | REAP |
| Distance measure | Angular (cosine) |
| Router weight renormalization | Yes |
| Seed | 42 |
| Observation batch size | 8 |
| Calibration batches | 128 per category |
Benchmark Results
10-task lm-eval suite, 200 samples per task, tensor_parallel_size=4, vLLM eager mode:
| Task | Metric | Original | REAP 0.20 | Delta |
|---|---|---|---|---|
| ARC-Challenge | acc_norm | 58.5% | 64.0% | +5.5 |
| BoolQ | acc | 93.0% | 91.0% | -2.0 |
| CommonsenseQA | acc | 89.0% | 88.0% | -1.0 |
| GSM8K | flexible_extract | 35.0% | 28.5% | -6.5 |
| HellaSwag | acc_norm | 72.0% | 66.0% | -6.0 |
| MathQA | acc_norm | 60.5% | 53.5% | -7.0 |
| OpenBookQA | acc_norm | 48.5% | 49.0% | +0.5 |
| PIQA | acc_norm | 80.0% | 80.5% | +0.5 |
| TruthfulQA MC2 | acc | 60.2% | 55.2% | -5.0 |
| WinoGrande | acc | 70.0% | 70.0% | +0.0 |
Aggregate:
- Overall average (all 10 tasks): 66.7% -> 64.6% (-2.1 pts)
- Reasoning average (the 8 non-math tasks): 71.4% -> 70.5% (-0.9 pts)
- Math average (GSM8K, MathQA): 47.8% -> 41.0% (-6.8 pts)
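The aggregates can be recomputed from the per-task table. The sketch below assumes the math group is GSM8K plus MathQA and the reasoning group is the other eight tasks; that grouping is inferred from the numbers rather than stated by the author. It also includes a rough binomial standard error for a 200-sample accuracy, as a hedge on how much weight single-task deltas can bear.

```python
import math

# (original, REAP 0.20) scores in percent, copied from the table above.
RESULTS = {
    "ARC-Challenge": (58.5, 64.0),
    "BoolQ": (93.0, 91.0),
    "CommonsenseQA": (89.0, 88.0),
    "GSM8K": (35.0, 28.5),
    "HellaSwag": (72.0, 66.0),
    "MathQA": (60.5, 53.5),
    "OpenBookQA": (48.5, 49.0),
    "PIQA": (80.0, 80.5),
    "TruthfulQA MC2": (60.2, 55.2),
    "WinoGrande": (70.0, 70.0),
}
MATH_TASKS = ("GSM8K", "MathQA")  # assumed grouping

def average(tasks):
    """Mean (original, pruned) score over the given task names."""
    original = sum(RESULTS[t][0] for t in tasks) / len(tasks)
    pruned = sum(RESULTS[t][1] for t in tasks) / len(tasks)
    return original, pruned

def binomial_stderr_pts(accuracy_pct, n_samples):
    """Approximate standard error, in points, of an accuracy over n samples."""
    p = accuracy_pct / 100
    return 100 * math.sqrt(p * (1 - p) / n_samples)

all_tasks = list(RESULTS)
other_tasks = [t for t in all_tasks if t not in MATH_TASKS]

for label, tasks in [("Overall", all_tasks),
                     ("Reasoning", other_tasks),
                     ("Math", list(MATH_TASKS))]:
    original, pruned = average(tasks)
    print(f"{label}: {original:.1f}% -> {pruned:.1f}% ({pruned - original:+.1f} pts)")

print(f"stderr at 64% with 200 samples: {binomial_stderr_pts(64.0, 200):.1f} pts")
```

Under the binomial approximation, a single 200-sample accuracy carries a standard error of roughly 3 points, so per-task differences of a few points (including the ARC-Challenge gain) may be within sampling noise. The averaged figures are more stable.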
Architecture
Qwen3-Coder-Next uses a hybrid linear/full attention architecture with 48 layers:
- Full attention every 4th layer (12 layers)
- Linear attention for remaining layers (36 layers)
- MoE FFN with 410 remaining experts per layer, 10 active per token
- Shared expert (intermediate size 512) in every layer
- Context window: 262,144 tokens
- Vocab size: 151,936
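The layer layout can be illustrated with a short sketch. It assumes the full-attention layers sit at every 4th position (layers 4, 8, ..., 48); the card gives only the counts, so the exact placement and the label strings here are assumptions.

```python
NUM_LAYERS = 48
FULL_ATTENTION_INTERVAL = 4  # one full-attention layer per group of four

def layer_types(num_layers, interval):
    """Attention type for each layer, assuming every `interval`-th layer is full."""
    return [
        "full_attention" if (index + 1) % interval == 0 else "linear_attention"
        for index in range(num_layers)
    ]

types = layer_types(NUM_LAYERS, FULL_ATTENTION_INTERVAL)
print(types[:8])
print(types.count("full_attention"), "full /",
      types.count("linear_attention"), "linear")
```

The counts match the 12 full-attention and 36 linear-attention layers listed above. Only the MoE FFN experts are pruned; the attention layout is unchanged from the base model.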
Usage
Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "0xSero/Qwen3-Coder-64B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Write a quicksort in Python."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```
vLLM
```bash
vllm serve 0xSero/Qwen3-Coder-64B \
  --tensor-parallel-size 4 \
  --enforce-eager \
  --gpu-memory-utilization 0.9 \
  --max-model-len 32768
```
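Once the server is running, it can be queried over vLLM's OpenAI-compatible API. The sketch below uses only the Python standard library and assumes the server listens on the default `http://localhost:8000`; the helper names `build_chat_request` and `chat` are hypothetical.

```python
import json
import urllib.request

# Assumed defaults: vLLM's OpenAI-compatible server on localhost, port 8000.
BASE_URL = "http://localhost:8000/v1"
MODEL_ID = "0xSero/Qwen3-Coder-64B"

def build_chat_request(prompt, max_tokens=512):
    """Build the JSON payload for a single-turn chat completion."""
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt, max_tokens=512, timeout=600):
    """Send the prompt to the running server and return the reply text."""
    body = json.dumps(build_chat_request(prompt, max_tokens)).encode("utf-8")
    request = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        reply = json.load(response)
    return reply["choices"][0]["message"]["content"]
```

With the server up, `print(chat("Write a quicksort in Python."))` should return the model's answer. Keep the prompt plus `max_tokens` within the `--max-model-len` set when starting the server.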
Reproducing
```bash
git clone https://github.com/cerebras/reap
cd reap
python -m reap.layerwise_prune \
  --model-name Qwen/Qwen3-Coder-Next \
  --dataset-name combined \
  --compression-ratio 0.20 \
  --prune-method reap \
  --seed 42 \
  --renormalize_router_weights true \
  --batch_size 8 \
  --batches_per_category 128
```
Links
- REAP paper: arxiv.org/abs/2510.13999
- REAP code: github.com/cerebras/reap
- Cerebras REAP collection: huggingface.co/collections/cerebras/cerebras-reap
- Base model: Qwen/Qwen3-Coder-Next
- 30% pruned variant: 0xSero/Qwen3-Coder-57B
License & citation
License inherited from the base model.
```bibtex
@misc{lasby2025reap,
  title         = {REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author        = {Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
  year          = {2025},
  eprint        = {2510.13999},
  archivePrefix = {arXiv}
}
```
Sponsors
Made possible by NVIDIA · TNG Technology · Lambda · Prime Intellect · Hot Aisle.