Qwen3-3B — 25% Compressed from Qwen3-4B (English · Chat)
This repository is part of the Efficient and Robust AI System (E-AI) Project by Vincent-Daniel Yun, which publicly releases compressed large language models. This model is a compressed edition of Qwen/Qwen3-4B with 9 of 36 transformer layers removed (27 layers remain, ≈3B parameters), then instruction-tuned for chat so it generates coherent responses at lower memory and latency.
🔗 Project: https://www.worldwidedaniel.com/eai-project 📅 Release date: 2026-06-28 · Version: V1
⚠️ Language support — English only. Tuned on English data. Other languages may produce degraded or broken output. For open-domain factual questions, use retrieval (RAG) — the compressed model can hallucinate specific facts.
💡 Best used as a fast, low-cost discrimination / classification engine (text classification, safety / content moderation, reading comprehension, domain QA, preference scoring) rather than an open-ended long-form generator. Compressed models preserve discrimination far better than generation.
About E-AI
Modern AI is powerful but heavy. State-of-the-art models are enormous and their inference is slow — still far from human intuition, and far too slow and unreliable to trust in urgent, high-stakes moments.
Two obstacles stand between today's models and AI we can trust in the field. Individually, each model is too large and too slow to run where it is actually needed. Collectively, when many models or agents work together, a single faulty or adversarial member can quietly derail the whole system. E-AI attacks both — making every model lightweight and fast, and keeping teams of agents reliable even when some of them fail.
I started the E-AI (Efficient-AI) project to build compact yet powerful AI that can assist people in disaster scenarios — responding to dangerous accidents quickly and reliably when every second counts.
Method
The pruning and recovery methods used to build this model are proprietary methods created by Vincent-Daniel Yun and are not released. After compression, the model is instruction-tuned for chat (distilled from the base model). Only the resulting model is shared here.
Results (measured)
All numbers below were measured by us. Perplexity (PPL) is computed with a 2048-token context (lower is better); downstream tasks and MMLU are 0-shot accuracy via lm-eval-harness (higher is better).
| Metric | Qwen3-4B (base) | This model (25%) |
|---|---|---|
| PPL · WikiText2 ↓ | 13.64 | 45.12 |
| PPL · C4 ↓ | 18.14 | 40.59 |
| PPL · PTB ↓ | 24.49 | 63.04 |
| ARC-c ↑ | 0.5401 | 0.3430 |
| ARC-e ↑ | 0.7849 | 0.5539 |
| BoolQ ↑ | 0.8508 | 0.7917 |
| COPA ↑ | 0.8100 | 0.7000 |
| HellaSwag ↑ | 0.6845 | 0.4767 |
| OpenBookQA ↑ | 0.4000 | 0.3120 |
| RACE ↑ | 0.4182 | 0.3608 |
| RTE ↑ | 0.7581 | 0.7653 |
| WinoGrande ↑ | 0.6614 | 0.6338 |
| Avg. downstream (9) ↑ | 0.6564 | 0.5486 |
| MMLU ↑ | 0.6839 | 0.6540 |
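The "Avg. downstream (9)" row can be re-derived from the nine per-task rows. The sketch below uses only the numbers in the table; the `retention` ratio (compressed score divided by base score) is an illustrative metric added here, not one reported by the authors.

```python
# Scores copied from the results table: (Qwen3-4B base, this model).
scores = {
    "ARC-c": (0.5401, 0.3430),
    "ARC-e": (0.7849, 0.5539),
    "BoolQ": (0.8508, 0.7917),
    "COPA": (0.8100, 0.7000),
    "HellaSwag": (0.6845, 0.4767),
    "OpenBookQA": (0.4000, 0.3120),
    "RACE": (0.4182, 0.3608),
    "RTE": (0.7581, 0.7653),
    "WinoGrande": (0.6614, 0.6338),
}

# Unweighted mean over the nine tasks, for each model.
base_avg = sum(b for b, _ in scores.values()) / len(scores)
compressed_avg = sum(c for _, c in scores.values()) / len(scores)

# Fraction of the base score that the compressed model keeps, per task.
retention = {task: c / b for task, (b, c) in scores.items()}

print(f"avg base={base_avg:.4f} compressed={compressed_avg:.4f}")
for task, ratio in sorted(retention.items(), key=lambda kv: kv[1]):
    print(f"{task:<11} {ratio:.1%}")
```

Sorting by retention makes it easy to see which tasks lose the most relative to the base model and which are essentially unchanged.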
Task-suitability evaluation (vs dense Qwen3-4B)
This 25% compressed model is best used as a fast, low-cost discrimination / classification engine. Full-test-set scores vs the dense Qwen3-4B base (higher is better):
| Task | Dense Qwen3-4B | This model (25%) |
|---|---|---|
| MMLU | 0.684 | 0.654 |
| Avg DS (9) | 0.656 | 0.549 |
| AG News (classif.) | 0.852 | 0.854 |
| SST-2 | 0.899 | 0.882 |
| BoolQ | 0.851 | 0.792 |
| MRPC | 0.762 | 0.757 |
| CB (NLI) | 0.679 | 0.661 |
| WiC | 0.594 | 0.505 |
| MultiRC | 0.158 | 0.572 |
| MedQA | 0.572 | 0.560 |
| MedMCQA | 0.532 | 0.522 |
| PubMedQA | 0.768 | 0.700 |
| Belebele-en | 0.897 | 0.870 |
| TruthfulQA | 0.548 | 0.529 |
| LLM-judge | 0.842 | 0.838 |
| SafetyBench | 0.771 | 0.771 |
Takeaway. At 25% compression the model retains most of the dense Qwen3-4B's discrimination ability (classification, reading comprehension, NLI, medical QA, judging, safety screening) at smaller size and lower latency, while open-ended generation/commonsense degrade more. Use it as a discrimination engine, not a free-form generator.
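Using the model as a discrimination engine usually means scoring a fixed set of candidate labels and picking the best one, rather than sampling free text. The sketch below shows only that selection step. It assumes a `score_fn` that returns a higher number for a better label (in practice this could be the model's log-likelihood of the label given a prompt); `classify` and `toy_score` are hypothetical names for illustration, and this is not the authors' evaluation code.

```python
import math
from typing import Callable, Dict, Sequence, Tuple


def classify(
    text: str,
    labels: Sequence[str],
    score_fn: Callable[[str, str], float],
) -> Tuple[str, Dict[str, float]]:
    """Return the best label and a softmax distribution over all labels."""
    scores = [score_fn(text, label) for label in labels]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    probs = {label: e / total for label, e in zip(labels, exps)}
    return max(probs, key=probs.get), probs


def toy_score(text: str, label: str) -> float:
    """Hypothetical stand-in for a model log-likelihood: counts cue words."""
    cues = {"positive": ("great", "love"), "negative": ("awful", "hate")}
    return float(sum(text.lower().count(word) for word in cues[label]))


label, probs = classify("I love this, it is great", ["positive", "negative"], toy_score)
print(label, probs)
```

Keeping the scorer pluggable lets the same selection logic be reused with any backend (Transformers, vLLM, or a served endpoint) once a real scoring function is supplied.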
Model family — pick your size
| Model | Layers | Params | Peak inference memory (fp16) |
|---|---|---|---|
| Qwen3-4B (base) | 36 | 4.02B | 11.79 GB |
| Qwen3-3B-20pct-Compressed-4B-EN-V1 | 29 | 3.32B | 10.15 GB |
| Qwen3-3B-25pct-Compressed-4B-EN-V1 (this model) | 27 | 3.11B | 9.67 GB |
| Qwen3-3B-30pct-Compressed-4B-EN-V1 | 25 | 2.91B | 9.2 GB |
Efficiency (measured, fp16, single GPU)
| | Qwen3-4B (base) | This model (25%) |
|---|---|---|
| Layers | 36 | 27 |
| Parameters | 4.02B | 3.11B |
| Peak inference memory (fp16) | 11.79 GB | 9.67 GB (−18%) |
| Forward latency (fp16) | 793 ms | 602 ms (−24%) |
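The percentages in the table follow directly from the measured values. A minimal sketch of the arithmetic, using only the numbers above; the `speedup` figure is derived here and is not stated in the table.

```python
# Measured values copied from the efficiency table (fp16, single GPU).
base = {"memory_gb": 11.79, "latency_ms": 793.0}
compressed = {"memory_gb": 9.67, "latency_ms": 602.0}


def reduction(before: float, after: float) -> float:
    """Relative reduction, e.g. 0.18 means 18% lower than before."""
    return (before - after) / before


mem_saving = reduction(base["memory_gb"], compressed["memory_gb"])
lat_saving = reduction(base["latency_ms"], compressed["latency_ms"])
speedup = base["latency_ms"] / compressed["latency_ms"]

print(f"memory -{mem_saving:.0%}, latency -{lat_saving:.0%}, speedup {speedup:.2f}x")
```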
Quantization
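Apart from the small custom decoder layer described under Usage, the model follows the standard Qwen3 architecture, so bitsandbytes 4-bit / 8-bit loading and other post-training quantization (PTQ) methods can be applied on top of the compression for further memory savings.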
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
m = AutoModelForCausalLM.from_pretrained(
    "daniel-eai/Qwen3-3B-25pct-Compressed-4B-EN-V1", trust_remote_code=True, device_map="cuda",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True))  # use load_in_8bit=True for 8-bit
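To get a feel for how much quantization could save, the weights-only footprint can be estimated from the parameter count in the model-family table. This is a rough lower bound under stated assumptions: it ignores quantization constants, any modules left unquantized, activations, and the KV cache, so real memory use will be higher (the measured fp16 peak of 9.67 GB already exceeds the fp16 weights-only figure).

```python
PARAMS = 3.11e9  # parameter count of this model, from the model-family table

# Bytes per parameter for each storage format.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "4-bit": 0.5}


def weight_gb(params: float, bytes_per_param: float) -> float:
    """Weights-only footprint in GB (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


estimates = {name: weight_gb(PARAMS, b) for name, b in BYTES_PER_PARAM.items()}
for name, gb in estimates.items():
    print(f"{name:>5}: ~{gb:.2f} GB of weights")
```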
Usage — Transformers
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
repo = "daniel-eai/Qwen3-3B-25pct-Compressed-4B-EN-V1"
m = AutoModelForCausalLM.from_pretrained(
    repo, trust_remote_code=True, dtype=torch.float16, device_map="cuda")
tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
ids = tok("The capital of France is", return_tensors="pt").to("cuda")
print(tok.decode(m.generate(**ids, max_new_tokens=20)[0]))
trust_remote_code=True is required: the model ships a small custom decoder layer in
modeling_qwen3_recovered.py.
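The example above does plain text completion. Because the model is instruction-tuned for chat, a chat-formatted prompt may be a better fit. The sketch below assumes the repository's tokenizer ships a chat template (not confirmed by this card); `build_messages` and `chat` are hypothetical helper names, and `chat` needs a CUDA GPU plus a model download, so it is defined but not called here.

```python
from typing import Dict, List, Optional

MODEL_ID = "daniel-eai/Qwen3-3B-25pct-Compressed-4B-EN-V1"


def build_messages(user_text: str, system_text: Optional[str] = None) -> List[Dict[str, str]]:
    """Build a chat message list in the role/content format Transformers expects."""
    messages = []
    if system_text:
        messages.append({"role": "system", "content": system_text})
    messages.append({"role": "user", "content": user_text})
    return messages


def chat(user_text: str, max_new_tokens: int = 64) -> str:
    """Generate one reply. Imports are local so the helper above works without torch."""
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, trust_remote_code=True, dtype=torch.float16, device_map="cuda")
    # Assumption: the tokenizer defines a chat template.
    inputs = tok.apply_chat_template(
        build_messages(user_text),
        add_generation_prompt=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Drop the prompt tokens so only the newly generated reply is decoded.
    reply_ids = output[0][inputs["input_ids"].shape[1]:]
    return tok.decode(reply_ids, skip_special_tokens=True)


print(build_messages("Classify the sentiment: 'I love this.'", "Answer with one word."))
```

Calling `chat("...")` on a machine with a GPU would then return only the model's reply, without the prompt or special tokens.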
Usage — vLLM
A tiny plugin (in vllm_plugin/) registers the custom decoder layer. Install it from a local copy of this repository, then load the model:
pip install ./vllm_plugin
from vllm import LLM, SamplingParams
llm = LLM(model="daniel-eai/Qwen3-3B-25pct-Compressed-4B-EN-V1", trust_remote_code=True, dtype="float16")
print(llm.generate(["The capital of France is"], SamplingParams(max_tokens=20))[0].outputs[0].text)
License
Apache-2.0, inherited from the base model Qwen/Qwen3-4B.
Acknowledgements
Sincere thanks to Prof. Sai Praneeth Karimireddy (University of Southern California) for his invaluable advice and feedback throughout this work, and to Prof. Sunwoo Lee (Inha University) for his guidance and support. We are also grateful to Alibaba (the Qwen team) for openly releasing the Qwen3-4B base model that made this work possible.


