Instructions for using enCoder/qwen3-5-4b-mlp7808-distilled with libraries and local apps.

Libraries

How to use enCoder/qwen3-5-4b-mlp7808-distilled with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

# The messages below contain an image, so use the image-text-to-text task
pipe = pipeline("image-text-to-text", model="enCoder/qwen3-5-4b-mlp7808-distilled")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("enCoder/qwen3-5-4b-mlp7808-distilled")
model = AutoModelForImageTextToText.from_pretrained("enCoder/qwen3-5-4b-mlp7808-distilled")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use enCoder/qwen3-5-4b-mlp7808-distilled with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "enCoder/qwen3-5-4b-mlp7808-distilled"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "enCoder/qwen3-5-4b-mlp7808-distilled",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/enCoder/qwen3-5-4b-mlp7808-distilled

SGLang

How to use enCoder/qwen3-5-4b-mlp7808-distilled with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "enCoder/qwen3-5-4b-mlp7808-distilled" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "enCoder/qwen3-5-4b-mlp7808-distilled",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "enCoder/qwen3-5-4b-mlp7808-distilled" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "enCoder/qwen3-5-4b-mlp7808-distilled",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use enCoder/qwen3-5-4b-mlp7808-distilled with Docker Model Runner:
```
docker model run hf.co/enCoder/qwen3-5-4b-mlp7808-distilled
```

Qwen3.5-4B — Wanda-Pruned + Distilled (MLP 9216→7808), Sharded

A structurally MLP-pruned and knowledge-distilled derivative of Qwen/Qwen3.5-4B, saved as 16 safetensors shards (~7.9 GB, bf16). The SwiGLU MLP intermediate dimension is reduced from 9216 to 7808 (~15.3% of MLP channels removed; 7808 is a multiple of 128, chosen for kernel alignment) using Wanda importance scoring. Quality is then recovered with logit-level knowledge distillation from the original bf16 model, plus a LoRA recovery pass.

⚠️ Research artifact. Pruning measurably reduced instruction-following ability (IFEval ~0.87 → ~0.71; see Limitations). Best used as a smaller, faster base for further fine-tuning — not as a faithful drop-in replacement for the parent on instruction- or format-strict tasks.

Model Details

Model Description

This checkpoint takes the hybrid (linear-attention + full-attention) Qwen3.5-4B, narrows every SwiGLU MLP block's intermediate width from 9216 to 7808 by keeping the top-scoring channels under the Wanda criterion (|W| · ‖X‖₂), and re-trains the slimmed model to track the original via knowledge distillation. Attention layers, the embedding, and the tied LM head are left unpruned.
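The narrowing step can be illustrated with a minimal, framework-free sketch. It assumes a bias-free SwiGLU block of the form down(silu(gate·x) * up·x) and a list of kept channel indices that has already been chosen; the function names (`swiglu_mlp`, `prune_mlp`) and the toy sizes are hypothetical and are not taken from the author's code.

```python
import math

def silu(v):
    # SiLU / swish activation used inside SwiGLU
    return v / (1.0 + math.exp(-v))

def matvec(w, x):
    # w is a list of rows; returns the matrix-vector product w @ x
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def swiglu_mlp(gate, up, down, x):
    # gate, up: [intermediate][hidden]; down: [hidden][intermediate]
    h = [silu(g) * u for g, u in zip(matvec(gate, x), matvec(up, x))]
    return matvec(down, h)

def prune_mlp(gate, up, down, keep):
    # Use the same channel indices everywhere:
    # output rows of gate/up, input columns of down
    gate_p = [gate[j] for j in keep]
    up_p = [up[j] for j in keep]
    down_p = [[row[j] for j in keep] for row in down]
    return gate_p, up_p, down_p

# Toy block: hidden size 2, intermediate size 3 (illustrative values only)
gate = [[0.5, -1.0], [1.0, 0.25], [-0.75, 0.5]]
up = [[1.0, 0.0], [0.5, 0.5], [0.0, -1.0]]
down = [[1.0, -0.5, 0.25], [0.0, 2.0, -1.0]]
x = [1.0, 2.0]

keep = [0, 2]  # pretend channel 1 received the lowest score
gate_p, up_p, down_p = prune_mlp(gate, up, down, keep)
y_pruned = swiglu_mlp(gate_p, up_p, down_p, x)

# Reference: the full block with channel 1's contribution zeroed in down
down_masked = [[v if j in keep else 0.0 for j, v in enumerate(row)] for row in down]
y_masked = swiglu_mlp(gate, up, down_masked, x)
```

Because the three matrices are sliced with one shared index list, the pruned block computes the same output as the full block with the dropped channels masked out, while storing fewer weights.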

Developed by: Suraj Sharan
Funded by [optional]: N/A
Shared by [optional]: Suraj Sharan
Model type: Decoder-only causal LM, hybrid attention (linear + full), structurally pruned + distilled
Language(s) (NLP): English (pruning and distillation data were English; other languages are inherited from the parent and untested here)
License: Inherits the Qwen3.5-4B license as a derivative work (the base model's terms apply; do not relicense to MIT without confirming compatibility)
Finetuned from model [optional]: Qwen/Qwen3.5-4B

Model Sources [optional]

Repository: [More Information Needed]
Paper [optional]: N/A
Demo [optional]: N/A

Uses

Direct Use

General-purpose text generation and research on structured pruning + distillation of small hybrid LLMs.

Downstream Use [optional]

A lighter base for task-specific fine-tuning. A short instruction-following SFT/DPO pass is recommended before assistant-style use to recover the pruning regression.

Out-of-Scope Use

Not a faithful substitute for Qwen3.5-4B on strict instruction-following, exact-format output, or safety-critical applications, given the measured IFEval regression.

Bias, Risks, and Limitations

Instruction-following regression (key limitation): on IFEval (inst_level_strict_acc), the original Qwen3.5-4B scores ≈0.87 while this pruned+distilled checkpoint scores ≈0.71. MLP pruning removed capacity that instruction-following relies on, and generic-corpus distillation did not recover it.
Inherited biases: carries the biases, knowledge cutoff, and safety profile of the parent.
Generic fidelity ≠ task parity: ~98% next-token argmax agreement with the teacher does not imply parity on instruction- or reasoning-heavy benchmarks.

Recommendations

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. Run an instruction-following SFT (or DPO) pass before assistant-style deployment, and evaluate on your own target tasks rather than relying on the parent's reported numbers.

How to Get Started with the Model

Use the code below to get started with the model.

from transformers import AutoTokenizer, Qwen3_5ForConditionalGeneration
import torch

model_id = "enCoder/qwen3-5-4b-mlp7808-distilled"  # or a local path to this folder
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = Qwen3_5ForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, trust_remote_code=True
).eval().cuda()

msgs = [{"role": "user", "content": "Explain Wanda pruning in two sentences."}]
# return_dict=True yields input_ids plus attention_mask, which generate() accepts as keyword arguments
inputs = tok.apply_chat_template(
    msgs, add_generation_prompt=True, return_dict=True, return_tensors="pt"
).to(model.device)
out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt
print(tok.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True))

Training Details

Training Data

yahma/alpaca-cleaned (open instruction corpus) served as both the Wanda calibration source and the distillation corpus. No proprietary or private data was used.

Training Procedure

Preprocessing [optional]

Calibration activations were collected over the corpus to compute a Wanda score (|W| · ‖X‖₂) for each MLP intermediate channel. Sequences were truncated to the training sequence length.
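As a rough illustration of per-channel scoring, the sketch below computes one score per intermediate channel from a down-projection matrix and a batch of calibration activations. The card does not say how per-weight Wanda scores were aggregated into channel scores, so summing |W| · ‖X‖₂ over the down-projection column that reads each channel is an assumption, as are the function names and the toy numbers.

```python
import math

def wanda_channel_scores(down_weight, activations):
    """Score each intermediate channel.

    down_weight: [hidden][intermediate] down-projection weights
    activations: [tokens][intermediate] calibration inputs to down-proj
    """
    n_channels = len(down_weight[0])
    # L2 norm of each channel's activation across all calibration tokens
    act_norm = [
        math.sqrt(sum(tok[j] ** 2 for tok in activations))
        for j in range(n_channels)
    ]
    # Assumed aggregation: sum |W_ij| over output rows i, times ||X_j||_2
    return [
        sum(abs(row[j]) for row in down_weight) * act_norm[j]
        for j in range(n_channels)
    ]

def top_k_channels(scores, keep):
    # Indices of the `keep` highest-scoring channels, in ascending index order
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:keep])

# Toy example: hidden size 2, intermediate size 3, two calibration tokens
down_weight = [[1.0, -2.0, 0.5], [0.0, 1.0, -0.5]]
activations = [[3.0, 0.0, 1.0], [4.0, 1.0, 0.0]]

scores = wanda_channel_scores(down_weight, activations)
kept = top_k_channels(scores, keep=2)
```

In this toy case channel 0 has small weights but large activations and still ranks first, which is the behavior that separates Wanda from plain magnitude pruning.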

Training Hyperparameters

Training regime: bf16 mixed precision
Pruning: Wanda scoring → MLP intermediate 9216 → 7808 (gate/up output rows + down-proj input columns pruned consistently; attention, embeddings, LM head untouched)
Distillation objective: KL(student ‖ frozen bf16 teacher) over full-vocabulary logits, computed in fp32, chunked over the sequence dimension
Distillation steps: ~12,000
LoRA recovery: rank 32 on gate/up/down + q/k/v/o projections, distilled then merged
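The distillation objective listed above can be sketched in plain Python. The sketch follows the card's notation KL(student ‖ teacher) literally; many distillation setups use the opposite direction, and the card gives no further detail, so treat the direction as an assumption. Python floats stand in for the fp32 upcast, and `chunk_size` and the function names are hypothetical.

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over one vocabulary vector
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]

def kl_student_teacher(student_logits, teacher_logits):
    # KL(student || teacher) at a single sequence position
    log_p_s = log_softmax(student_logits)
    log_p_t = log_softmax(teacher_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(log_p_s, log_p_t))

def chunked_kd_loss(student_seq, teacher_seq, chunk_size=2):
    # Process the sequence in chunks so only one chunk of full-vocabulary
    # probabilities is materialized at a time; return the mean over positions
    total = 0.0
    for start in range(0, len(student_seq), chunk_size):
        s_chunk = student_seq[start:start + chunk_size]
        t_chunk = teacher_seq[start:start + chunk_size]
        total += sum(kl_student_teacher(s, t) for s, t in zip(s_chunk, t_chunk))
    return total / len(student_seq)

# Toy logits: 3 positions, vocabulary of 3 tokens
student = [[2.0, 0.0, -1.0], [0.5, 0.5, 0.5], [1.0, -1.0, 0.0]]
teacher = [[1.5, 0.2, -0.5], [0.0, 1.0, 0.0], [1.0, -1.0, 0.0]]

loss = chunked_kd_loss(student, teacher, chunk_size=2)
```

Chunking does not change the value of the loss; with a vocabulary of 248,320 tokens it only bounds how much of the full-vocabulary distribution has to be held in memory at once.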

Speeds, Sizes, Times [optional]

Parameters: ~4B (MLP intermediate slimmed ~15% vs parent)
Format: bf16, 16 safetensors shards (~7.9 GB total)
Teacher–student argmax agreement: ~98% on the distillation corpus

Evaluation

Testing Data, Factors & Metrics

Testing Data

IFEval prompt set; the distillation corpus for argmax-agreement checks.

Factors

Generic next-token fidelity vs instruction-following capability.

Metrics

Argmax agreement vs the bf16 teacher (generic fidelity)
IFEval inst_level_strict_acc (instruction-following)
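The argmax-agreement metric is simple enough to state as code. This is a minimal sketch under the glossary definition (fraction of positions where the student's top token matches the teacher's); the function names are hypothetical, and ties are broken by taking the first maximum, which is an assumption.

```python
def argmax(values):
    # Index of the largest value; the first index wins on ties
    return max(range(len(values)), key=lambda i: values[i])

def argmax_agreement(student_logits, teacher_logits):
    # Both inputs: [positions][vocab]; returns a fraction in [0, 1]
    matches = sum(
        argmax(s) == argmax(t) for s, t in zip(student_logits, teacher_logits)
    )
    return matches / len(student_logits)

# Toy logits: 4 positions, vocabulary of 3 tokens
student = [[0.1, 2.0, -1.0], [1.5, 0.2, 0.3], [0.0, 0.1, 0.9], [2.0, 1.9, 0.0]]
teacher = [[0.0, 1.0, 0.5], [2.0, 0.1, 0.1], [0.3, 0.2, 0.8], [1.0, 1.2, 0.0]]

agreement = argmax_agreement(student, teacher)  # 3 of 4 positions match -> 0.75
```

The metric only compares top-1 tokens, so it says nothing about how closely the rest of the distribution tracks the teacher.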

Results

| Metric | Original Qwen3.5-4B | This model |
| --- | --- | --- |
| Teacher argmax agreement | 100% (by definition) | ~98% |
| IFEval (inst_level_strict_acc) | ~0.87 | ~0.71 |

Summary

High generic-text fidelity to the parent, with a clear instruction-following regression introduced by MLP pruning. Treat as a research checkpoint / fine-tuning base.

Model Examination [optional]

Pruning was concentrated in the SwiGLU MLP intermediate dimension; attention (linear + full), embeddings, and the tied LM head were preserved.

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

Hardware Type: NVIDIA H100 (80GB)
Hours used: [More Information Needed]
Cloud Provider: [More Information Needed]
Compute Region: [More Information Needed]
Carbon Emitted: [More Information Needed]

Technical Specifications [optional]

Model Architecture and Objective

Hybrid Qwen3.5-4B backbone (Qwen3_5ForConditionalGeneration): 32 decoder layers (mix of linear-attention and full-attention), hidden size 2560, MLP intermediate 7808 (pruned from 9216), grouped-query attention (16 query / 4 KV heads, head_dim 256), vocabulary 248,320, tied input/output embeddings. Objective: knowledge distillation toward the bf16 parent.
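The figures above allow a back-of-the-envelope estimate of how many weights the pruning removed. The sketch assumes that each of the 32 layers has one SwiGLU MLP with three bias-free projections (gate, up, down); the card does not state the bias setting, so that part is an assumption.

```python
hidden_size = 2560
num_layers = 32
old_intermediate = 9216
new_intermediate = 7808

removed_channels = old_intermediate - new_intermediate   # 1408
pruned_fraction = removed_channels / old_intermediate    # about 0.153

# Assumption: gate, up and down each hold hidden_size * intermediate weights, no biases
params_removed_per_layer = 3 * hidden_size * removed_channels
params_removed_total = params_removed_per_layer * num_layers
bytes_saved_bf16 = params_removed_total * 2  # bf16 stores 2 bytes per weight

print(f"channels removed per layer: {removed_channels}")
print(f"fraction of MLP channels pruned: {pruned_fraction:.1%}")
print(f"weights removed in total: {params_removed_total:,}")
print(f"bf16 storage saved: {bytes_saved_bf16 / 1e9:.2f} GB")
print(f"new width is a multiple of 128: {new_intermediate % 128 == 0}")
```

Under these assumptions the pruning removes roughly 0.35 billion weights, or about 0.69 GB in bf16.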

Compute Infrastructure

Hardware

NVIDIA H100 (80GB), multi-GPU data-parallel.

Software

PyTorch, Hugging Face Transformers, PEFT, Accelerate.

Citation [optional]

BibTeX:

@misc{qwen3_5,
  title  = {Qwen3.5},
  author = {Qwen Team},
  year   = {2025}
}

APA:

Qwen Team. (2025). Qwen3.5. (Base model for this pruned + distilled derivative.)

Glossary [optional]

Wanda: pruning by weight magnitude × input activation norm (|W| · ‖X‖₂).
MLP intermediate dimension: the inner width of the SwiGLU feed-forward block, reduced here from 9216 to 7808.
Argmax agreement: fraction of positions where this model's top token matches the teacher's.

More Information [optional]

Pruning targeted only the MLP intermediate width; for stronger instruction-following, fine-tune downstream.

Model Card Authors [optional]

[More Information Needed]

Model Card Contact

[More Information Needed]


Model tree for enCoder/qwen3-5-4b-mlp7808-distilled

Base model: Qwen/Qwen3.5-4B-Base
Finetuned: Qwen/Qwen3.5-4B
Finetuned: this model

Paper for enCoder/qwen3-5-4b-mlp7808-distilled

Quantifying the Carbon Emissions of Machine Learning

Lacoste et al. (2019), arXiv:1910.09700