Qwen3.5-76B
REAP-pruned Qwen/Qwen3.5-122B-A10B.
At a glance
| Property | Value |
|---|---|
| Base model | Qwen/Qwen3.5-122B-A10B |
| Format | BF16 |
| Total params | 76B |
| Active / token | 10B |
| Experts / layer | — |
| Layers | — |
| Hidden size | — |
| Context | — |
| On-disk size | 152 GB |
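The on-disk figure follows directly from the parameter count and the BF16 format (2 bytes per parameter). The sketch below reproduces that arithmetic and adds a rough per-GPU share of the weights. The 4-way split is taken from the `--tensor-parallel-size 4` setting in the Usage section, and the even split is an assumption; KV cache and activation memory are not included, and `weight_gigabytes` is an illustrative helper, not part of any library.

```python
def weight_gigabytes(num_params: float, bytes_per_param: float) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


TOTAL_PARAMS = 76e9   # "Total params" row of the table
BF16_BYTES = 2        # BF16 stores each parameter in 16 bits
NUM_GPUS = 4          # assumption: matches --tensor-parallel-size 4 in Usage

total_gb = weight_gigabytes(TOTAL_PARAMS, BF16_BYTES)
per_gpu_gb = total_gb / NUM_GPUS  # assumes weights are split evenly across GPUs

print(f"weights: {total_gb:.0f} GB total, about {per_gpu_gb:.0f} GB per GPU")
```

Under these assumptions the weights alone need about 38 GB per GPU, so each device needs headroom beyond that for the KV cache.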
Which variant should I pick?
| Variant | Format | Link |
|---|---|---|
| Qwen3.5-264B | BF16 | link |
| Qwen3.5-264B-FP8 | FP8 | link |
| Qwen3.5-264B-W4A16 | W4A16 | link |
| Qwen3.5-28B | BF16 | link |
| Qwen3.5-35B-EXL3-4bpw | EXL3-4bpw | link |
| Qwen3.5-76B (this) | BF16 | link |
| Qwen3.5-76B-GGUF | GGUF | link |
| Qwen3.5-88B | BF16 | link |
| Qwen3.5-99B | BF16 | link |
| Qwen3.5-99B-GGUF | GGUF | link |
40% expert-pruned variant of Qwen3.5-122B-A10B using REAP (Router-weighted Expert Activation Pruning).
Model Details
| Property | Value |
|---|---|
| Base Model | Qwen/Qwen3.5-122B-A10B |
| Architecture | Qwen3.5 MoE (GDN + Full Attention) |
| Original Experts | 256 per layer |
| Experts after pruning | 154 per layer (102 removed, ~40%) |
| Active Parameters | ~10B per token |
| Pruning Method | REAP with targeted refusal preservation |
| Preserve Threshold | 80% (super-expert protection) |
| Calibration | reap-calibration-data-v1 — 23k benchmark-free samples |
| Maintainer | 0xSero |
| Organization | Sybil Solutions |
| Project | REAP PR17 |
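The expert counts in the table can be checked with a few lines of arithmetic. The sketch below also compares the expert reduction with the drop in total parameters. The 122B base figure is read off the base model's name and is an assumption about its exact size.

```python
ORIGINAL_EXPERTS = 256    # per layer, from the table
REMAINING_EXPERTS = 154   # per layer, from the table

removed = ORIGINAL_EXPERTS - REMAINING_EXPERTS
removed_fraction = removed / ORIGINAL_EXPERTS
print(f"experts removed per layer: {removed} ({removed_fraction:.1%})")

BASE_PARAMS_B = 122       # assumption: taken from the name Qwen3.5-122B-A10B
PRUNED_PARAMS_B = 76      # total params of this checkpoint

param_fraction = (BASE_PARAMS_B - PRUNED_PARAMS_B) / BASE_PARAMS_B
print(f"total parameters removed: {param_fraction:.1%}")
```

The total-parameter reduction comes out slightly smaller than the expert reduction, which is what one would expect if only expert weights are pruned while attention, router, and embedding weights are kept.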
Usage
vllm serve 0xSero/Qwen3.5-76B \
--tensor-parallel-size 4 \
--enable-expert-parallel \
--max-model-len 8192 \
--trust-remote-code \
--language-model-only \
--dtype bfloat16
Important: pass the --language-model-only flag. This is a text-only checkpoint pruned from the multimodal base model.
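Once the server is running, it can be queried over vLLM's OpenAI-compatible HTTP API. The following is a minimal client sketch using only the Python standard library. It assumes the server listens on vLLM's default `http://localhost:8000`; `build_chat_request` and `ask` are illustrative helper names, not part of vLLM.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumption: vLLM default host and port
MODEL = "0xSero/Qwen3.5-76B"


def build_chat_request(prompt: str, max_tokens: int = 128) -> urllib.request.Request:
    """Build a POST request for the /chat/completions endpoint."""
    payload = {
        "model": MODEL,
        # Text-only message: this checkpoint has no vision input.
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str) -> str:
    """Send the prompt to the running server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server up, `ask("What is the capital of France?")` returns the model's reply as a string. Prompt and completion tokens together have to fit within the `--max-model-len 8192` limit set in the serve command.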
What is REAP?
REAP (Router-weighted Expert Activation Pruning) removes the lowest-scoring experts from MoE models while preserving critical capabilities. It runs a calibration dataset through the model and scores each expert from its router gate values and activation magnitudes to identify dispensable experts; this release adds special protection for safety-critical behaviors.
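The scoring idea can be sketched on toy data. This is an illustration under stated assumptions, not the project's implementation: the saliency used here (gate weight times output norm, averaged over the tokens routed to an expert) is a simplified reading of the criterion in the cited REAP paper. The function names and input layout are hypothetical, and the super-expert and refusal-preservation steps listed under Model Details are not modeled.

```python
from math import sqrt


def reap_scores(routed_tokens, num_experts):
    """Average gate-weighted output norm per expert.

    routed_tokens: one list per calibration token, holding
    (expert_id, gate_weight, expert_output_vector) for each selected expert.
    """
    totals = [0.0] * num_experts
    counts = [0] * num_experts
    for token in routed_tokens:
        for expert_id, gate, output in token:
            norm = sqrt(sum(v * v for v in output))
            totals[expert_id] += gate * norm
            counts[expert_id] += 1
    # Experts that were never routed to get a score of 0.0.
    return [t / c if c else 0.0 for t, c in zip(totals, counts)]


def experts_to_keep(scores, keep):
    """Return the ids of the `keep` highest-scoring experts, in ascending order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


# Toy calibration data: 2 tokens, 4 experts, top-2 routing (hypothetical values).
toy_tokens = [
    [(0, 0.7, [3.0, 4.0]), (1, 0.3, [1.0, 0.0])],
    [(0, 0.5, [0.0, 2.0]), (2, 0.5, [6.0, 8.0])],
]
toy_scores = reap_scores(toy_tokens, num_experts=4)
print(experts_to_keep(toy_scores, keep=2))
```

For the configuration described in this card, the equivalent call would keep 154 of 256 experts in each MoE layer.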
License & citation
License inherited from the base model.
@misc{lasby2025reap,
title = {REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
author = {Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
year = {2025}, eprint = {2510.13999}, archivePrefix = {arXiv}
}
Sponsors
Made possible by NVIDIA · TNG Technology · Lambda · Prime Intellect · Hot Aisle.