Qwen3-11B — 25% Compressed from Qwen3-14B (English · Chat)
This repository is part of the Efficient and Robust AI System (E-AI) Project by Vincent-Daniel Yun, which publicly releases compressed large language models. This model is a compressed edition of Qwen/Qwen3-14B with 10 of 40 transformer layers removed (30 layers remain, ≈11B parameters), then instruction-tuned for chat so it generates coherent responses at lower memory and latency.
🔗 Project: https://www.worldwidedaniel.com/eai-project 📅 Release date: 2026-06-22 · Version: V1
⚠️ Language support — English only. This model is tuned on English data and is English-focused. Other languages (e.g., Korean, Chinese, Japanese) are not officially supported and may produce degraded or broken output. For open-domain factual questions, use retrieval (RAG) — the compressed model can hallucinate specific facts.
About E-AI
Modern AI is powerful but heavy. State-of-the-art models are enormous and their inference is slow — still far from human intuition, and far too slow and unreliable to trust in urgent, high-stakes moments.
Two obstacles stand between today's models and AI we can trust in the field. Individually, each model is too large and too slow to run where it is actually needed. Collectively, when many models or agents work together, a single faulty or adversarial member can quietly derail the whole system. E-AI attacks both — making every model lightweight and fast, and keeping teams of agents reliable even when some of them fail.
I started the E-AI (Efficient-AI) project to build compact yet powerful AI that can assist people in disaster scenarios — responding to dangerous accidents quickly and reliably when every second counts.
Method
The pruning and recovery methods used to build this model are proprietary to Vincent-Daniel Yun and are not released. After compression, the model is instruction-tuned for chat (distilled from the base model). Only the resulting model is shared here.
Results (measured)
All numbers below were measured by us. Perplexity (PPL) is computed with a 2048-token context (lower is better). Downstream tasks and MMLU report 0-shot accuracy from lm-eval-harness (higher is better).
| Metric | Qwen3-14B (base) | This model (25%) |
|---|---|---|
| PPL · WikiText2 ↓ | 8.64 | 23.34 |
| PPL · C4 ↓ | 13.00 | 26.31 |
| PPL · PTB ↓ | 14.79 | 35.54 |
| ARC-c ↑ | 0.6024 | 0.4556 |
| ARC-e ↑ | 0.8279 | 0.6894 |
| BoolQ ↑ | 0.8933 | 0.6263 |
| COPA ↑ | 0.9000 | 0.8000 |
| HellaSwag ↑ | 0.7881 | 0.6443 |
| OpenBookQA ↑ | 0.4620 | 0.3740 |
| RACE ↑ | 0.4325 | 0.3933 |
| RTE ↑ | 0.7762 | 0.7545 |
| WinoGrande ↑ | 0.7317 | 0.6488 |
| Avg. downstream (9) ↑ | 0.7127 | 0.5985 |
| MMLU ↑ | 0.7729 | 0.6801 |
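The averaged row, and how much of the base model's accuracy survives compression, can be recomputed directly from the table. The sketch below copies the nine downstream scores and the MMLU row from above; the `retained` helper and the variable names are illustrative and are not part of the evaluation code.

```python
from statistics import mean

# 0-shot accuracies copied from the results table: (base, compressed).
DOWNSTREAM = {
    "ARC-c": (0.6024, 0.4556),
    "ARC-e": (0.8279, 0.6894),
    "BoolQ": (0.8933, 0.6263),
    "COPA": (0.9000, 0.8000),
    "HellaSwag": (0.7881, 0.6443),
    "OpenBookQA": (0.4620, 0.3740),
    "RACE": (0.4325, 0.3933),
    "RTE": (0.7762, 0.7545),
    "WinoGrande": (0.7317, 0.6488),
}
MMLU = (0.7729, 0.6801)


def retained(base: float, compressed: float) -> float:
    """Fraction of the base model's score kept by the compressed model."""
    return compressed / base


# Unweighted mean over the nine downstream tasks, as in the "Avg." row.
avg_base = mean(b for b, _ in DOWNSTREAM.values())
avg_comp = mean(c for _, c in DOWNSTREAM.values())

print(f"Avg. downstream: base={avg_base:.4f}, compressed={avg_comp:.4f}")
print(f"Downstream retained: {retained(avg_base, avg_comp):.1%}")
print(f"MMLU retained: {retained(*MMLU):.1%}")
```

By this measure the compressed model keeps roughly 84% of the average downstream accuracy and roughly 88% of the MMLU accuracy of the base model.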
Model family — pick your size
All sizes in this release. Memory is the measured peak during inference, in fp16 and in 4-bit, at batch 4 × 2048 on a single 48 GB GPU.
| Model | Layers | Params | MMLU ↑ | Avg DS ↑ | Mem fp16 | Mem 4-bit |
|---|---|---|---|---|---|---|
| Qwen3-14B (base, uncompressed) | 40 | 14.77B | 0.773 | 0.713 | 33.5 GB | 13.9 GB |
| Qwen3-12B-20pct-Compressed-14B-EN-V1 | 32 | 12.13B | 0.722 | 0.639 | 27.9 GB | 12.25 GB |
| ➡ 25% (this model) | 30 | 11.47B | 0.680 | 0.598 | 26.5 GB | 11.84 GB |
| Qwen3-11B-30pct-Compressed-14B-EN-V1 | 28 | 10.80B | 0.626 | 0.560 | 25.1 GB | 11.44 GB |
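One way to read the table is as a lookup: given a memory budget, take the variant with the highest MMLU whose measured peak memory fits. The sketch below hard-codes the rows above; `Variant` and `pick` are hypothetical helpers, and the memory figures hold only for the measured setting (batch 4 × 2048), so a different workload may need more or less headroom.

```python
from typing import NamedTuple, Optional


class Variant(NamedTuple):
    name: str
    mmlu: float
    mem_fp16_gb: float
    mem_4bit_gb: float


# Rows copied from the model family table above.
VARIANTS = [
    Variant("Qwen3-14B (base)", 0.773, 33.5, 13.9),
    Variant("Qwen3-12B-20pct-Compressed-14B-EN-V1", 0.722, 27.9, 12.25),
    Variant("Qwen3-11B-25pct-Compressed-14B-EN-V1", 0.680, 26.5, 11.84),
    Variant("Qwen3-11B-30pct-Compressed-14B-EN-V1", 0.626, 25.1, 11.44),
]


def pick(budget_gb: float, four_bit: bool = False) -> Optional[Variant]:
    """Return the highest-MMLU variant whose peak memory fits the budget."""
    fits = [
        v for v in VARIANTS
        if (v.mem_4bit_gb if four_bit else v.mem_fp16_gb) <= budget_gb
    ]
    # default=None covers the case where nothing fits the budget.
    return max(fits, key=lambda v: v.mmlu, default=None)


choice = pick(27.0)
print(choice.name if choice else "no variant fits")
```

With a 27 GB fp16 budget this selects the 25% model; with a 12 GB budget and `four_bit=True` it also selects the 25% model, since both the 25% and 30% variants fit and the 25% variant scores higher.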
Compressed + Quantization — GPU memory vs dense
Peak GPU memory of each configuration relative to the original dense fp16 model (lower is better). Combining compression with 4-bit quantization gives the largest savings.
| Configuration | Peak GPU memory | vs dense fp16 |
|---|---|---|
| Qwen3-14B dense (fp16) | 33.5 GB | 100% |
| 20% compressed (fp16) | 27.9 GB | 83% |
| 25% compressed (fp16) ⬅ | 26.5 GB | 79% |
| 30% compressed (fp16) | 25.1 GB | 75% |
| 20% compressed + 4-bit | 12.25 GB | 37% |
| 25% compressed + 4-bit ⬅ | 11.84 GB | 35% |
| 30% compressed + 4-bit | 11.44 GB | 34% |
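The "vs dense fp16" column is each configuration's peak memory divided by the dense fp16 peak (33.5 GB), rounded to a whole percent. A short sketch that reproduces the column from the measured values above; the dictionary keys and the function name are illustrative.

```python
DENSE_FP16_GB = 33.5  # peak memory of Qwen3-14B dense in fp16, from the table

# Measured peak GPU memory in GB, copied from the table above.
CONFIGS = {
    "20% compressed (fp16)": 27.9,
    "25% compressed (fp16)": 26.5,
    "30% compressed (fp16)": 25.1,
    "20% compressed + 4-bit": 12.25,
    "25% compressed + 4-bit": 11.84,
    "30% compressed + 4-bit": 11.44,
}


def percent_of_dense(peak_gb: float) -> int:
    """Peak memory as a whole-number percentage of the dense fp16 peak."""
    return round(100 * peak_gb / DENSE_FP16_GB)


for name, gb in CONFIGS.items():
    print(f"{name:<24} {gb:>6.2f} GB  {percent_of_dense(gb):>3d}%")
```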
Efficiency (measured, fp16, batch 4 × 2048, single 48 GB GPU)
| | Qwen3-14B (base) | This model (25%) |
|---|---|---|
| Layers | 40 | 30 |
| Parameters | 14.77B | 11.47B |
| Peak inference memory (fp16) | 33.5 GB | 26.5 GB (−21%) |
| Peak inference memory (4-bit) | 13.9 GB | 11.84 GB (−15% vs dense 4-bit; −65% vs dense fp16) |
| Forward latency (fp16) | 2246 ms | 1748 ms (−22%) |
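The latency row can also be read as a speedup factor and an approximate token throughput. The sketch below assumes the reported latency is one forward pass over the full measured batch (4 sequences × 2048 tokens); the card does not state this explicitly, so treat the tokens-per-second figures as a rough derived estimate rather than a measured result.

```python
# Measured forward latency in milliseconds (fp16), from the table above.
BASE_MS = 2246
COMPRESSED_MS = 1748

# Assumption: each latency covers one forward pass over the whole batch.
TOKENS_PER_FORWARD = 4 * 2048


def tokens_per_second(latency_ms: float) -> float:
    """Tokens processed per second for one forward pass of the full batch."""
    return TOKENS_PER_FORWARD / (latency_ms / 1000.0)


speedup = BASE_MS / COMPRESSED_MS
reduction = 1 - COMPRESSED_MS / BASE_MS

print(f"Speedup: {speedup:.2f}x (latency reduced by {reduction:.0%})")
print(f"Base:       {tokens_per_second(BASE_MS):,.0f} tokens/s")
print(f"Compressed: {tokens_per_second(COMPRESSED_MS):,.0f} tokens/s")
```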
Quantization
4-bit (and other) quantization can be used with this model — it is a standard Qwen3
architecture, so bitsandbytes 4-bit / 8-bit loading and other PTQ methods apply on top of
the compression. Verified: this model loads and generates correctly in 4-bit, with peak
inference memory ~11.84 GB (vs 13.9 GB for the dense model in 4-bit, and
33.5 GB for the dense model in fp16).
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load the compressed model with bitsandbytes 4-bit weights on a single GPU.
m = AutoModelForCausalLM.from_pretrained(
    "daniel-eai/Qwen3-11B-25pct-Compressed-14B-EN-V1",
    trust_remote_code=True, device_map="cuda",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True))
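The 4-bit result above was verified with the default settings shown. A common alternative is NF4 with double quantization, which `BitsAndBytesConfig` exposes through `bnb_4bit_quant_type`, `bnb_4bit_use_double_quant`, and `bnb_4bit_compute_dtype`. The sketch below is an untested variation: the card does not report memory or quality for it, and `nf4_options` and `load_nf4` are hypothetical helper names.

```python
MODEL_ID = "daniel-eai/Qwen3-11B-25pct-Compressed-14B-EN-V1"


def nf4_options() -> dict:
    """Keyword arguments for BitsAndBytesConfig (NF4 + double quantization).

    Assumption: these settings were not evaluated in the model card.
    """
    return {
        "load_in_4bit": True,
        "bnb_4bit_quant_type": "nf4",         # NormalFloat4 instead of the default fp4
        "bnb_4bit_use_double_quant": True,    # also quantize the quantization constants
        "bnb_4bit_compute_dtype": "float16",  # dtype used for matmuls at inference
    }


def load_nf4():
    """Load the model in NF4; needs transformers, bitsandbytes, and a CUDA GPU."""
    # Imported lazily so the options above can be inspected without a GPU stack.
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    return AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        trust_remote_code=True,
        device_map="cuda",
        quantization_config=BitsAndBytesConfig(**nf4_options()),
    )
```

Calling `load_nf4()` downloads the weights, so it is left out of the snippet itself.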
Usage — Transformers
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "daniel-eai/Qwen3-11B-25pct-Compressed-14B-EN-V1"

# Load the fp16 model and its tokenizer onto a single GPU.
m = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, trust_remote_code=True, dtype=torch.float16, device_map="cuda")
tok = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# Complete a plain-text prompt with up to 20 new tokens.
ids = tok("The capital of France is", return_tensors="pt").to("cuda")
print(tok.decode(m.generate(**ids, max_new_tokens=20)[0]))
trust_remote_code=True is required: the model ships a small custom decoder layer in
modeling_qwen3_recovered.py.
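The example above is a plain text completion. Since the model is instruction-tuned for chat, a chat-formatted prompt may be the more natural way to call it. The sketch below assumes the tokenizer in this repo ships a chat template (as Qwen3 tokenizers do) and uses the standard `apply_chat_template` method; `build_messages` and `chat` are hypothetical helper names, and the `dtype=` argument follows the snippet above (older Transformers releases use `torch_dtype=` instead).

```python
MODEL_ID = "daniel-eai/Qwen3-11B-25pct-Compressed-14B-EN-V1"


def build_messages(user_text, system_text=None):
    """Build a chat message list in the role/content format Transformers expects."""
    messages = []
    if system_text:
        messages.append({"role": "system", "content": system_text})
    messages.append({"role": "user", "content": user_text})
    return messages


def chat(user_text, max_new_tokens=128):
    """Generate one chat reply; needs torch, transformers, and a CUDA GPU."""
    # Imported lazily so build_messages can be used without the GPU stack.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, trust_remote_code=True, dtype=torch.float16, device_map="cuda")

    # Assumption: the repo's tokenizer provides a chat template.
    prompt = tok.apply_chat_template(
        build_messages(user_text), tokenize=False, add_generation_prompt=True)
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # Drop the prompt tokens so that only the reply is decoded.
    reply_ids = output[0][inputs["input_ids"].shape[1]:]
    return tok.decode(reply_ids, skip_special_tokens=True)
```

A call such as `chat("Summarize what layer pruning does in one sentence.")` loads the model on each invocation; in a real application the tokenizer and model would be loaded once and reused.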
Usage — vLLM
vLLM uses its own model implementations, so the custom decoder layer is loaded via a tiny
plugin (provided in this repo under vllm_plugin/). Install it once, then serve normally:
pip install ./vllm_plugin # from a checkout of this repo's vllm_plugin/ folder
from vllm import LLM, SamplingParams
llm = LLM(model="daniel-eai/Qwen3-11B-25pct-Compressed-14B-EN-V1", trust_remote_code=True, dtype="float16")
print(llm.generate(["The capital of France is"], SamplingParams(max_tokens=20))[0].outputs[0].text)
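For serving, vLLM also provides an OpenAI-compatible HTTP server. The sketch below is a standard-library client for it. It assumes the plugin is installed and a server was started separately with something like `vllm serve daniel-eai/Qwen3-11B-25pct-Compressed-14B-EN-V1 --trust-remote-code`, listening on vLLM's default `localhost:8000`. The card does not show this server command, so treat it as an assumption; `build_payload` and `chat` are hypothetical helper names.

```python
import json
import urllib.request

MODEL_ID = "daniel-eai/Qwen3-11B-25pct-Compressed-14B-EN-V1"
BASE_URL = "http://localhost:8000/v1"  # assumption: vLLM's default host and port


def build_payload(user_text: str, max_tokens: int = 64) -> dict:
    """Request body for the OpenAI-compatible /chat/completions endpoint."""
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": user_text}],
        "max_tokens": max_tokens,
    }


def chat(user_text: str) -> str:
    """POST one chat request to a running vLLM server and return the reply text."""
    request = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_payload(user_text)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `chat("What is the capital of France?")` returns the generated reply as a string.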
Other backends: TGI / SGLang / llama.cpp each use their own model graphs and would need an analogous custom decoder layer; they are not supported out of the box.
License
Apache-2.0, inherited from the base model Qwen/Qwen3-14B.
Acknowledgements
Sincere thanks to Prof. Sai Praneeth Karimireddy (University of Southern California) for his invaluable advice and feedback throughout this work, and to Prof. Sunwoo Lee (Inha University) for his guidance and support. We are also grateful to Alibaba (the Qwen team) for openly releasing the Qwen3-14B base model that made this work possible.

