Support this work → · X · GitHub · REAP paper · Cerebras REAP

Qwen3-Coder-64B

REAP-pruned Qwen/Qwen3-Coder-Next.

At a glance


Base model	Qwen/Qwen3-Coder-Next
Format	BF16
Total params	64B
Active / token	—
Experts / layer	410
Layers	48
Hidden size	2048
Context	262,144
On-disk size	129 GB

Which variant should I pick?

Variant	Format	Link
`Qwen3-Coder-57B`	BF16	link
`Qwen3-Coder-64B` (this)	BF16	link

A 20% expert-pruned version of Qwen/Qwen3-Coder-Next, produced with Cerebras REAP (Router-weighted Expert Activation Pruning).

Metric	Original	This Model
Total params	~80B	64.26B
Experts	512	410
Active params/tok	~4.2B	~4.2B
Experts/tok	10	10
Format	BF16	BF16
Disk size	~149 GB	~129 GB

REAP removes 20% of the MoE experts (102 of 512 per layer). The active parameter count per token is unchanged, since the router still selects 10 experts per token from the remaining pool. Disk and memory footprint drop by ~13% (from ~149 GB to ~129 GB), at a cost of 2.1 points on the overall benchmark average and 6.8 points on the math tasks (see Benchmark Results).
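The figures in the comparison table can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers quoted in this card; the "implied parameters per expert" value is a derived estimate that treats the approximate ~80B original count as exact, so it should be read as a rough figure only.

```python
# Back-of-the-envelope check of the pruning figures quoted in this card.
ORIGINAL_EXPERTS = 512
COMPRESSION_RATIO = 0.20
LAYERS = 48

removed_per_layer = int(ORIGINAL_EXPERTS * COMPRESSION_RATIO)  # 102.4 truncated to 102
remaining_per_layer = ORIGINAL_EXPERTS - removed_per_layer

# Relative reductions from the table (sizes in GB, parameters in billions).
disk_reduction = (149 - 129) / 149
param_reduction = (80 - 64.26) / 80

# Rough size of one expert implied by the drop in total parameters.
# Assumes the "~80B" original count is exact, so this is an estimate only.
removed_experts_total = removed_per_layer * LAYERS
params_per_expert = (80e9 - 64.26e9) / removed_experts_total

print(f"experts removed per layer: {removed_per_layer}")
print(f"experts remaining per layer: {remaining_per_layer}")
print(f"disk reduction: {disk_reduction:.1%}")
print(f"total-parameter reduction: {param_reduction:.1%}")
print(f"implied parameters per expert: {params_per_expert / 1e6:.1f}M")
```

The total parameter count falls by close to 20%, while the on-disk size quoted in the table falls by a smaller fraction.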

Method

REAP (ICLR 2026) prunes Mixture-of-Experts models by scoring the importance of each expert and removing the lowest-scoring ones. The scoring and pruning procedure combines:

Router gate values -- how often and how strongly the router selects each expert
Expert activation norms -- magnitude of each expert's output contribution
Frequency-weighted saliency -- combining routing frequency with activation importance
Router logit renormalization -- maintains output distribution after expert removal
Layerwise application -- independent per-layer pruning decisions for stability
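The scoring idea in the list above can be illustrated with a toy example. This is a minimal sketch, not the cerebras/reap implementation: it assumes the saliency of an expert is the average of (router gate value × L2 norm of the expert's output) over the tokens routed to that expert, and the function names `reap_saliency` and `experts_to_prune` are hypothetical.

```python
from collections import defaultdict

def reap_saliency(routed_tokens, num_experts):
    """Mean of gate_value * output_norm over the tokens routed to each expert."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for token in routed_tokens:
        for expert_id, gate_value, output_norm in token:
            totals[expert_id] += gate_value * output_norm
            counts[expert_id] += 1
    # Experts that were never selected get a saliency of 0.0.
    return [totals[e] / counts[e] if counts[e] else 0.0 for e in range(num_experts)]

def experts_to_prune(saliency, compression_ratio):
    """Indices of the lowest-saliency experts for one layer."""
    num_pruned = int(len(saliency) * compression_ratio)
    ranked = sorted(range(len(saliency)), key=lambda e: saliency[e])
    return sorted(ranked[:num_pruned])

# Toy layer: 5 experts, 2 active per token. Each entry is
# (expert_id, router gate value, L2 norm of that expert's output).
tokens = [
    [(0, 0.7, 2.0), (1, 0.3, 1.0)],
    [(0, 0.6, 2.5), (2, 0.4, 0.5)],
    [(3, 0.9, 3.0), (1, 0.1, 1.0)],
]
scores = reap_saliency(tokens, num_experts=5)
pruned = experts_to_prune(scores, compression_ratio=0.20)
print(scores)
print(pruned)
```

In this toy layer the expert that is never routed to scores lowest and is the one selected for removal. Applying the same procedure independently to each layer corresponds to the layerwise application listed above.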

Calibration Dataset

22,000 samples (no-refusal subset: 21,000), packed into 16,384-token sequences:

Category	Samples	Source
Coding (general)	4,096	`theblackcat102/evol-codealpaca-v1`
Reasoning (code)	~2,680	`open-r1/Mixture-of-Thoughts[code]`
Reasoning (math)	~2,778	`open-r1/Mixture-of-Thoughts[math]`
Reasoning (science)	~2,776	`open-r1/Mixture-of-Thoughts[science]`
Tool calling	4,096	`Salesforce/xlam-function-calling-60k`
Agentic coding	4,096	`SWE-bench/SWE-smith-trajectories`
+ extended domains	~1,478	Scientific, CUDA kernels, browser, advanced math, code correctness

Total tokens observed: ~90.5M across 6,391 packed sequences.
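Sequence packing can be sketched as follows. The card does not say how the calibration samples were packed, so the greedy strategy, the truncation of over-long samples, and the function name `pack_sequences` are all assumptions made for illustration.

```python
def pack_sequences(tokenized_samples, max_len=16_384):
    """Greedily concatenate samples into sequences of at most max_len tokens."""
    packed, current = [], []
    for sample in tokenized_samples:
        # Assumption: samples longer than max_len are truncated in this sketch.
        sample = sample[:max_len]
        if len(current) + len(sample) > max_len:
            packed.append(current)
            current = []
        current = current + sample
    if current:
        packed.append(current)
    return packed

# Toy example with a small max_len so the behaviour is easy to follow.
samples = [[1] * 5, [2] * 4, [3] * 3, [4] * 8]
packed = pack_sequences(samples, max_len=10)
print([len(seq) for seq in packed])

# Sanity check on the figures above: average tokens per packed sequence.
average_tokens = 90.5e6 / 6_391
print(round(average_tokens))
```

The average works out to somewhat less than 16,384 tokens per sequence, which is what one would expect if sequences are filled up to the limit rather than exactly to it.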

Pruning Configuration

Parameter	Value
Compression ratio	0.20 (20% expert removal)
Original experts per layer	512
Remaining experts per layer	410
Pruning method	REAP
Distance measure	Angular (cosine)
Router weight renormalization	Yes
Seed	42
Observation batch size	8
Calibration batches	128 per category
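One way to picture router weight renormalization after pruning is shown below. This is a sketch under assumptions, not the behaviour of the REAP code: it assumes the router applies a softmax over the surviving experts only and then rescales the top-k weights to sum to 1. The function name `route` is hypothetical.

```python
import math

def route(router_logits, kept_experts, top_k):
    """Pick top_k surviving experts and renormalize their weights to sum to 1."""
    # Softmax over the surviving experts only (pruned logits are dropped).
    max_logit = max(router_logits[e] for e in kept_experts)
    exp = {e: math.exp(router_logits[e] - max_logit) for e in kept_experts}
    total = sum(exp.values())
    probs = {e: v / total for e, v in exp.items()}

    # Keep the top_k most probable experts and rescale their weights.
    chosen = sorted(probs, key=probs.get, reverse=True)[:top_k]
    chosen_total = sum(probs[e] for e in chosen)
    return {e: probs[e] / chosen_total for e in chosen}

logits = [2.0, 1.0, 0.5, -1.0, 0.0]
# Expert 1 has been pruned, so it can no longer be selected.
weights = route(logits, kept_experts=[0, 2, 3, 4], top_k=2)
print(weights)
```

Before pruning, this token would have been routed to experts 0 and 1; after pruning, the next-best surviving expert takes the second slot and the two weights still sum to 1.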

Benchmark Results

10-task lm-eval suite, 200 samples per task, tensor_parallel_size=4, vLLM eager mode:

Task	Metric	Original	REAP 0.20	Delta
ARC-Challenge	acc_norm	58.5%	64.0%	+5.5
BoolQ	acc	93.0%	91.0%	-2.0
CommonsenseQA	acc	89.0%	88.0%	-1.0
GSM8K	flexible_extract	35.0%	28.5%	-6.5
HellaSwag	acc_norm	72.0%	66.0%	-6.0
MathQA	acc_norm	60.5%	53.5%	-7.0
OpenBookQA	acc_norm	48.5%	49.0%	+0.5
PIQA	acc_norm	80.0%	80.5%	+0.5
TruthfulQA MC2	acc	60.2%	55.2%	-5.0
WinoGrande	acc	70.0%	70.0%	+0.0

Aggregate:

Overall average: 66.7% -> 64.6% (-2.1 pts)
Reasoning average: 71.4% -> 70.5% (-0.9 pts)
Math average: 47.8% -> 41.0% (-6.8 pts)
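The aggregate figures can be recomputed from the per-task table. The grouping used here (GSM8K and MathQA as "math", the other eight tasks as "reasoning") is an assumption inferred from the numbers; the card does not state it. The standard-error line is a rough guide to sampling noise at 200 samples per task, not a figure reported by the author.

```python
import math
from statistics import mean

# (original %, REAP 0.20 %) per task, copied from the table above.
results = {
    "ARC-Challenge": (58.5, 64.0),
    "BoolQ": (93.0, 91.0),
    "CommonsenseQA": (89.0, 88.0),
    "GSM8K": (35.0, 28.5),
    "HellaSwag": (72.0, 66.0),
    "MathQA": (60.5, 53.5),
    "OpenBookQA": (48.5, 49.0),
    "PIQA": (80.0, 80.5),
    "TruthfulQA MC2": (60.2, 55.2),
    "WinoGrande": (70.0, 70.0),
}
# Assumed grouping: GSM8K and MathQA are "math", the other eight are "reasoning".
MATH_TASKS = {"GSM8K", "MathQA"}

def averages(tasks):
    """Return (original mean, pruned mean, delta) for a set of task names."""
    original = mean(results[t][0] for t in tasks)
    pruned = mean(results[t][1] for t in tasks)
    return original, pruned, pruned - original

groups = {
    "Overall": set(results),
    "Reasoning": set(results) - MATH_TASKS,
    "Math": MATH_TASKS,
}
for name, tasks in groups.items():
    original, pruned, delta = averages(tasks)
    print(f"{name}: {original:.2f}% -> {pruned:.2f}% ({delta:+.2f} pts)")

# Binomial standard error of one accuracy estimate from 200 samples (worst case p = 0.5).
standard_error = math.sqrt(0.5 * 0.5 / 200) * 100
print(f"max standard error per task: {standard_error:.1f} pts")
```

With only 200 samples per task, a difference of a few points on a single task is within sampling noise, so the aggregate rows are the more reliable signal.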

Architecture

Qwen3-Coder-Next uses a hybrid linear/full attention architecture with 48 layers:

Full attention every 4th layer (12 layers)
Linear attention for remaining layers (36 layers)
MoE FFN with 410 remaining experts per layer, 10 active per token
Shared expert (intermediate size 512) in every layer
Context window: 262,144 tokens
Vocab size: 151,936
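The attention layout described above can be written out explicitly. This is an illustrative sketch: it assumes layers are counted from 1, so that layers 4, 8, ..., 48 use full attention. The card only says "every 4th layer", and the labels `full_attention` and `linear_attention` are chosen here for readability.

```python
NUM_LAYERS = 48
FULL_ATTENTION_INTERVAL = 4  # full attention every 4th layer

# Assumption: layers are counted from 1, so layers 4, 8, ..., 48 use full attention.
layer_types = [
    "full_attention" if (index + 1) % FULL_ATTENTION_INTERVAL == 0 else "linear_attention"
    for index in range(NUM_LAYERS)
]

print(layer_types.count("full_attention"), layer_types.count("linear_attention"))
print(layer_types[:8])
```

The counts match the 12 full-attention and 36 linear-attention layers listed above.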

Usage

Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "0xSero/Qwen3-Coder-64B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# The BF16 checkpoint is ~129 GB; device_map="auto" places it across the available devices.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

# Render the chat template to text, then tokenize it.
messages = [{"role": "user", "content": "Write a quicksort in Python."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))

vLLM

vllm serve 0xSero/Qwen3-Coder-64B \
    --tensor-parallel-size 4 \
    --enforce-eager \
    --gpu-memory-utilization 0.9 \
    --max-model-len 32768
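Once the server is running, it can be queried through its OpenAI-compatible chat completions endpoint. The sketch below uses only the Python standard library. It assumes the server listens on vLLM's default port 8000 on localhost; `build_chat_request` and `chat` are hypothetical helper names, and the `max_tokens` value is arbitrary.

```python
import json
import urllib.request

def build_chat_request(prompt, base_url="http://localhost:8000", model="0xSero/Qwen3-Coder-64B"):
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 512,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat(prompt):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]

# Requires the vLLM server started above to be running:
# print(chat("Write a quicksort in Python."))
```

Keeping request construction separate from the network call makes it easy to inspect the payload before sending it.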

Reproducing

git clone https://github.com/cerebras/reap
cd reap

python -m reap.layerwise_prune \
    --model-name Qwen/Qwen3-Coder-Next \
    --dataset-name combined \
    --compression-ratio 0.20 \
    --prune-method reap \
    --seed 42 \
    --renormalize_router_weights true \
    --batch_size 8 \
    --batches_per_category 128

License & citation

License inherited from the base model.

@misc{lasby2025reap,
  title  = {REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author = {Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
  year   = {2025},
  eprint = {2510.13999},
  archivePrefix = {arXiv}
}


Paper: REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression (arXiv 2510.13999, published Oct 15, 2025)
