Qwen3-11B — 25% Compressed from Qwen3-14B (English · Chat)

This repository is part of the Efficient and Robust AI System (E-AI) Project by Vincent-Daniel Yun, which publicly releases compressed large language models. This model is a compressed edition of Qwen/Qwen3-14B with 10 of 40 transformer layers removed (30 layers remain, ≈11B parameters), then instruction-tuned for chat so it generates coherent responses at lower memory and latency.

🔗 Project: https://www.worldwidedaniel.com/eai-project
📅 Release date: 2026-06-22 · Version: V1

⚠️ Language support — English only. This model is tuned on English data and is English-focused. Other languages (e.g., Korean, Chinese, Japanese) are not officially supported and may produce degraded or broken output. For open-domain factual questions, use retrieval (RAG) — the compressed model can hallucinate specific facts.
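
The retrieval recommendation above can be sketched as a prompt-building step. This is a minimal illustration, not part of the released model: `build_rag_messages` is a hypothetical helper, the passages are assumed to come from a retriever of your choice, and the system prompt wording is only an example.

```python
from typing import Dict, List


def build_rag_messages(question: str, passages: List[str]) -> List[Dict[str, str]]:
    """Build chat messages that ground the answer in retrieved passages.

    Hypothetical helper: `passages` is assumed to come from an external
    retriever; no retrieval is performed here.
    """
    # Number the passages so the model can refer to them explicitly.
    context = "\n".join(f"[{i}] {p.strip()}" for i, p in enumerate(passages, start=1))
    system = (
        "Answer in English using only the numbered context passages. "
        "If the context does not contain the answer, say you do not know."
    )
    user = f"Context:\n{context}\n\nQuestion: {question}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]


messages = build_rag_messages(
    "What is the capital of France?",
    ["Paris is the capital and largest city of France."],
)
print(messages[1]["content"])
```

The resulting list can be passed to any chat interface that accepts role/content messages.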

About E-AI

Modern AI is powerful but heavy. State-of-the-art models are enormous and their inference is slow — still far from human intuition, and far too slow and unreliable to trust in urgent, high-stakes moments.

Two obstacles stand between today's models and AI we can trust in the field. Individually, each model is too large and too slow to run where it is actually needed. Collectively, when many models or agents work together, a single faulty or adversarial member can quietly derail the whole system. E-AI attacks both — making every model lightweight and fast, and keeping teams of agents reliable even when some of them fail.

I started the E-AI (Efficient-AI) project to build compact yet powerful AI that can assist people in disaster scenarios — responding to dangerous accidents quickly and reliably when every second counts.

Method

The pruning and recovery methods used to build this model are proprietary methods created by Vincent-Daniel Yun and are not released. After compression, the model is instruction-tuned for chat (distilled from the base model). Only the resulting model is shared here.

Results (measured)

All numbers below were measured by us. PPL is on a 2048-token context (lower is better); downstream tasks and MMLU are 0-shot accuracy via lm-eval-harness (higher is better).

Metric	Qwen3-14B (base)	This model (25%)
PPL · WikiText2 ↓	8.64	23.34
PPL · C4 ↓	13.00	26.31
PPL · PTB ↓	14.79	35.54
ARC-c ↑	0.6024	0.4556
ARC-e ↑	0.8279	0.6894
BoolQ ↑	0.8933	0.6263
COPA ↑	0.9000	0.8000
HellaSwag ↑	0.7881	0.6443
OpenBookQA ↑	0.4620	0.3740
RACE ↑	0.4325	0.3933
RTE ↑	0.7762	0.7545
WinoGrande ↑	0.7317	0.6488
Avg. downstream (9) ↑	0.7127	0.5985
MMLU ↑	0.7729	0.6801
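
As a quick check on the table, the short script below recomputes the 9-task average and the fraction of base accuracy retained per task. The scores are copied from the table above; the "retention" ratio is an illustrative metric added here, not one reported by the authors.

```python
# 0-shot accuracies copied from the results table: (base, this model).
scores = {
    "ARC-c": (0.6024, 0.4556),
    "ARC-e": (0.8279, 0.6894),
    "BoolQ": (0.8933, 0.6263),
    "COPA": (0.9000, 0.8000),
    "HellaSwag": (0.7881, 0.6443),
    "OpenBookQA": (0.4620, 0.3740),
    "RACE": (0.4325, 0.3933),
    "RTE": (0.7762, 0.7545),
    "WinoGrande": (0.7317, 0.6488),
}

# Unweighted mean over the nine downstream tasks.
base_avg = sum(b for b, _ in scores.values()) / len(scores)
comp_avg = sum(c for _, c in scores.values()) / len(scores)
print(f"avg downstream: base={base_avg:.4f} compressed={comp_avg:.4f}")

# Fraction of the base model's accuracy that the compressed model keeps.
retention = {task: c / b for task, (b, c) in scores.items()}
for task, r in sorted(retention.items(), key=lambda kv: kv[1]):
    print(f"{task:12s} {r:.1%}")
```

The averages reproduce the 0.7127 and 0.5985 reported in the table. By this ratio BoolQ degrades the most (about 70% retained) and RTE the least (about 97% retained).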

Model family — pick your size

All sizes in this release. Memory is peak inference memory, measured in fp16 and 4-bit at batch 4 × 2048 on a single 48 GB GPU.

Model	Layers	Params	MMLU ↑	Avg DS ↑	Mem fp16	Mem 4-bit
Qwen3-14B (base, uncompressed)	40	14.77B	0.773	0.713	33.5 GB	13.9 GB
Qwen3-12B-20pct-Compressed-14B-EN-V1	32	12.13B	0.722	0.639	27.9 GB	12.25 GB
➡ 25% (this model)	30	11.47B	0.680	0.598	26.5 GB	11.84 GB
Qwen3-11B-30pct-Compressed-14B-EN-V1	28	10.80B	0.626	0.560	25.1 GB	11.44 GB
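
To relate the parameter counts to the measured fp16 memory, the sketch below estimates weight-only memory at 2 bytes per parameter and compares it with the measured peak. It assumes "GB" in the table means 10^9 bytes, which the card does not state, and it treats the difference as non-weight overhead (activations, KV cache, allocator), which is an interpretation rather than a measured breakdown.

```python
BYTES_PER_PARAM_FP16 = 2

# (parameter count, measured peak fp16 memory in GB) from the table above.
models = {
    "Qwen3-14B (base)": (14.77e9, 33.5),
    "20% compressed": (12.13e9, 27.9),
    "25% compressed (this model)": (11.47e9, 26.5),
    "30% compressed": (10.80e9, 25.1),
}


def fp16_weight_gb(params: float) -> float:
    """Weight-only memory in GB, assuming 1 GB = 1e9 bytes."""
    return params * BYTES_PER_PARAM_FP16 / 1e9


for name, (params, peak_gb) in models.items():
    weights_gb = fp16_weight_gb(params)
    print(f"{name:30s} weights ~{weights_gb:5.2f} GB, "
          f"measured peak {peak_gb:4.1f} GB, remainder ~{peak_gb - weights_gb:4.2f} GB")
```

Under these assumptions the weights account for most of the measured peak, which is consistent with memory falling roughly in step with the parameter count.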

Compressed + Quantization — GPU memory vs dense

Peak GPU memory for each option relative to the original dense fp16 model (lower is better). Combining compression with 4-bit quantization gives the largest savings.

Configuration	Peak GPU memory	vs dense fp16
Qwen3-14B dense (fp16)	33.5 GB	100%
20% compressed (fp16)	27.9 GB	83%
25% compressed (fp16) ⬅	26.5 GB	79%
30% compressed (fp16)	25.1 GB	75%
20% compressed + 4-bit	12.25 GB	37%
25% compressed + 4-bit ⬅	11.84 GB	35%
30% compressed + 4-bit	11.44 GB	34%
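
The "vs dense fp16" column can be recomputed from the peak-memory figures as below. The dense 4-bit figure (13.9 GB) is taken from the model family table and added for comparison; it is not a row in the table above.

```python
DENSE_FP16_GB = 33.5

# Peak GPU memory in GB, copied from this card.
configs = {
    "20% compressed (fp16)": 27.9,
    "25% compressed (fp16)": 26.5,
    "30% compressed (fp16)": 25.1,
    "dense + 4-bit": 13.9,
    "20% compressed + 4-bit": 12.25,
    "25% compressed + 4-bit": 11.84,
    "30% compressed + 4-bit": 11.44,
}


def pct_of_dense(gb: float) -> int:
    """Peak memory as a whole-number percentage of the dense fp16 model."""
    return round(100 * gb / DENSE_FP16_GB)


for name, gb in configs.items():
    print(f"{name:26s} {gb:6.2f} GB  {pct_of_dense(gb):3d}%")
```

On these numbers, dense 4-bit alone comes to about 41% of the dense fp16 footprint, so 25% compression on top of 4-bit saves a further six percentage points.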

Efficiency (measured, fp16, batch 4 × 2048, single 48 GB GPU)

Metric	Qwen3-14B (base)	This model (25%)
Layers	40	30
Parameters	14.77B	11.47B
Peak inference memory (fp16)	33.5 GB	26.5 GB (−21%)
Peak inference memory (4-bit)	13.9 GB	11.84 GB (−65% vs dense fp16)
Forward latency (fp16)	2246 ms	1748 ms (−22%)
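
The latency figures can be turned into an approximate throughput, as sketched below. This assumes "forward latency" is the time for one forward pass over the full 4 × 2048 batch, which the card implies but does not state outright; the tokens-per-second values are derived here and were not reported by the authors.

```python
BATCH_TOKENS = 4 * 2048  # batch size x sequence length used for the measurement

# Forward latency in milliseconds, copied from the table above.
latency_ms = {
    "Qwen3-14B (base)": 2246,
    "This model (25%)": 1748,
}


def tokens_per_second(ms: float) -> float:
    """Tokens processed per second for one forward pass over the batch."""
    return BATCH_TOKENS / (ms / 1000.0)


for name, ms in latency_ms.items():
    print(f"{name:18s} {tokens_per_second(ms):8.0f} tokens/s")

speedup = latency_ms["Qwen3-14B (base)"] / latency_ms["This model (25%)"]
print(f"speedup: {speedup:.2f}x")
```

Under that assumption the base model processes roughly 3,650 tokens/s and this model roughly 4,690 tokens/s, a speedup of about 1.28x.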

Quantization

4-bit (and other) quantization can be used with this model. Apart from the small custom decoder layer described under Usage, it follows the Qwen3 architecture, so bitsandbytes 4-bit / 8-bit loading and other PTQ methods apply on top of the compression. Verified: this model loads and generates correctly in 4-bit, with peak inference memory ~11.84 GB (vs 13.9 GB for the dense model in 4-bit, and 33.5 GB for the dense model in fp16).

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load the weights in 4-bit via bitsandbytes (requires the bitsandbytes package and a CUDA GPU).
m = AutoModelForCausalLM.from_pretrained(
    "daniel-eai/Qwen3-11B-25pct-Compressed-14B-EN-V1", trust_remote_code=True, device_map="cuda",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True))

Usage — Transformers

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the fp16 weights onto the GPU; trust_remote_code pulls in the custom decoder layer.
m = AutoModelForCausalLM.from_pretrained(
    "daniel-eai/Qwen3-11B-25pct-Compressed-14B-EN-V1", trust_remote_code=True, dtype=torch.float16, device_map="cuda")
tok = AutoTokenizer.from_pretrained("daniel-eai/Qwen3-11B-25pct-Compressed-14B-EN-V1", trust_remote_code=True)

# Plain text completion: tokenize the prompt, generate up to 20 new tokens, and decode.
ids = tok("The capital of France is", return_tensors="pt").to("cuda")
print(tok.decode(m.generate(**ids, max_new_tokens=20)[0]))

trust_remote_code=True is required: the model ships a small custom decoder layer in modeling_qwen3_recovered.py.
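
Because the model is instruction-tuned for chat, it may be more natural to prompt it through the tokenizer's chat template than with a raw completion. The sketch below assumes the repository's tokenizer ships a chat template, which this card does not confirm; `as_chat` and `chat_once` are hypothetical helper names. The heavy imports sit inside `chat_once` so that nothing is downloaded until it is called.

```python
from typing import Dict, List

MODEL_ID = "daniel-eai/Qwen3-11B-25pct-Compressed-14B-EN-V1"


def as_chat(user_text: str) -> List[Dict[str, str]]:
    """Wrap a single user turn in the role/content message format."""
    return [{"role": "user", "content": user_text}]


def chat_once(user_text: str, max_new_tokens: int = 128) -> str:
    """Generate one reply. Downloads the model and needs a CUDA GPU."""
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, trust_remote_code=True, dtype=torch.float16, device_map="cuda")

    # Assumption: the tokenizer provides a chat template for this model.
    inputs = tok.apply_chat_template(
        as_chat(user_text), add_generation_prompt=True,
        return_dict=True, return_tensors="pt").to("cuda")

    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # Decode only the newly generated tokens, not the prompt.
    new_tokens = out[0][inputs["input_ids"].shape[1]:]
    return tok.decode(new_tokens, skip_special_tokens=True)


# Example (not executed here): print(chat_once("Who are you?"))
```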

Usage — vLLM

vLLM uses its own model implementations, so the custom decoder layer is loaded via a tiny plugin (provided in this repo under vllm_plugin/). Install it once, then serve normally:

pip install ./vllm_plugin   # from a checkout of this repo's vllm_plugin/ folder

from vllm import LLM, SamplingParams
llm = LLM(model="daniel-eai/Qwen3-11B-25pct-Compressed-14B-EN-V1", trust_remote_code=True, dtype="float16")
print(llm.generate(["The capital of France is"], SamplingParams(max_tokens=20))[0].outputs[0].text)
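
If the model is served through vLLM's OpenAI-compatible server instead of the offline `LLM` class, a client request could look like the sketch below. It assumes the plugin is already installed and a server was started separately (for example with `vllm serve` and remote code trusted) on vLLM's default port 8000; the helper names and the `max_tokens` value are illustrative. Only the standard library is used.

```python
import json
import urllib.request

MODEL_ID = "daniel-eai/Qwen3-11B-25pct-Compressed-14B-EN-V1"


def build_chat_request(prompt: str, base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, base_url: str = "http://localhost:8000") -> str:
    """Send the request and return the assistant's reply (needs a running server)."""
    with urllib.request.urlopen(build_chat_request(prompt, base_url), timeout=60) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Building the request does not contact the server.
req = build_chat_request("What is the capital of France?")
print(req.full_url)
```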

Other backends: TGI / SGLang / llama.cpp each use their own model graphs and would need an analogous custom decoder layer; they are not supported out of the box.

License

Apache-2.0, inherited from the base model Qwen/Qwen3-14B.

Acknowledgements

Sincere thanks to Prof. Sai Praneeth Karimireddy (University of Southern California) for his invaluable advice and feedback throughout this work, and to Prof. Sunwoo Lee (Inha University) for his guidance and support. We are also grateful to Alibaba (the Qwen team) for openly releasing the Qwen3-14B base model that made this work possible.
