Atlas

E-AI Project

Qwen3-11B — 30% Compressed from Qwen3-14B (English · Chat)

This repository is part of the Efficient and Robust AI System (E-AI) Project by Vincent-Daniel Yun, which publicly releases compressed large language models. This model is a compressed edition of Qwen/Qwen3-14B with 12 of 40 transformer layers removed (28 layers remain, ≈11B parameters), then instruction-tuned for chat so it generates coherent responses at lower memory and latency.

🔗 Project: https://www.worldwidedaniel.com/eai-project 📅 Release date: 2026-06-22 · Version: V1

⚠️ Language support — English only. This model is tuned on English data and is English-focused. Other languages (e.g., Korean, Chinese, Japanese) are not officially supported and may produce degraded or broken output. For open-domain factual questions, use retrieval (RAG) — the compressed model can hallucinate specific facts.

About E-AI

Modern AI is powerful but heavy. State-of-the-art models are enormous and their inference is slow — still far from human intuition, and far too slow and unreliable to trust in urgent, high-stakes moments.

Two obstacles stand between today's models and AI we can trust in the field. Individually, each model is too large and too slow to run where it is actually needed. Collectively, when many models or agents work together, a single faulty or adversarial member can quietly derail the whole system. E-AI attacks both — making every model lightweight and fast, and keeping teams of agents reliable even when some of them fail.

I started the E-AI (Efficient-AI) project to build compact yet powerful AI that can assist people in disaster scenarios — responding to dangerous accidents quickly and reliably when every second counts.

Method

The pruning method and the recovery method used to build this model are proprietary, undisclosed methods created by Vincent-Daniel Yun and are not released. The compressed model is then instruction-tuned for chat (distilled from the base model). Only the resulting model is shared here.

Results (measured)

All numbers below were measured by us. PPL is on a 2048-token context (lower is better); downstream tasks and MMLU are 0-shot accuracy via lm-eval-harness (higher is better).

Metric Qwen3-14B (base) This model (30%)
PPL · WikiText2 ↓ 8.64 31.27
PPL · C4 ↓ 13.00 32.08
PPL · PTB ↓ 14.79 47.79
ARC-c ↑ 0.6024 0.3899
ARC-e ↑ 0.8279 0.5901
BoolQ ↑ 0.8933 0.6544
COPA ↑ 0.9000 0.7600
HellaSwag ↑ 0.7881 0.5824
OpenBookQA ↑ 0.4620 0.3500
RACE ↑ 0.4325 0.3655
RTE ↑ 0.7762 0.7184
WinoGrande ↑ 0.7317 0.6314
Avg. downstream (9) 0.7127 0.5602
MMLU 0.7729 0.6264

Model family — pick your size

All sizes in this release (click to open each model). Memory is measured peak inference (fp16 and 4-bit, batch 4 × 2048, single 48 GB GPU).

Model Layers Params MMLU ↑ Avg DS ↑ Mem fp16 Mem 4-bit
Qwen3-14B (base, uncompressed) 40 14.77B 0.773 0.713 33.5 GB 13.9 GB
Qwen3-12B-20pct-Compressed-14B-EN-V1 32 12.13B 0.722 0.639 27.9 GB 12.25 GB
Qwen3-11B-25pct-Compressed-14B-EN-V1 30 11.47B 0.680 0.598 26.5 GB 11.84 GB
➡ 30% (this model) 28 10.80B 0.626 0.560 25.1 GB 11.44 GB

Compressed + Quantization — GPU memory vs dense

How little GPU memory each option needs relative to the original dense fp16 model (lower is better; combine compression with 4-bit for the largest savings).

Configuration Peak GPU memory vs dense fp16
Qwen3-14B dense (fp16) 33.5 GB 100%
20% compressed (fp16) 27.9 GB 83%
25% compressed (fp16) 26.5 GB 79%
30% compressed (fp16) ⬅ 25.1 GB 75%
20% compressed + 4-bit 12.25 GB 37%
25% compressed + 4-bit 11.84 GB 35%
30% compressed + 4-bit ⬅ 11.44 GB 34%

Efficiency (measured, fp16, batch 4 × 2048, single 48 GB GPU)

Qwen3-14B (base) This model (30%)
Layers 40 28
Parameters 14.77B 10.80B
Peak inference memory (fp16) 33.5 GB 25.1 GB (−25%)
Peak inference memory (4-bit) 13.9 GB 11.44 GB (−66% vs dense fp16)
Forward latency (fp16) 2246 ms 1652 ms (−26%)

Quantization

4-bit (and other) quantization can be used with this model — it is a standard Qwen3 architecture, so bitsandbytes 4-bit / 8-bit loading and other PTQ methods apply on top of the compression. Verified: this model loads and generates correctly in 4-bit, with peak inference memory ~11.44 GB (vs 13.9 GB for the dense model in 4-bit, and 33.5 GB for the dense model in fp16).

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
m = AutoModelForCausalLM.from_pretrained(
    "daniel-eai/Qwen3-11B-30pct-Compressed-14B-EN-V1", trust_remote_code=True, device_map="cuda",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True))

Usage — Transformers

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

m = AutoModelForCausalLM.from_pretrained(
    "daniel-eai/Qwen3-11B-30pct-Compressed-14B-EN-V1", trust_remote_code=True, dtype=torch.float16, device_map="cuda")
tok = AutoTokenizer.from_pretrained("daniel-eai/Qwen3-11B-30pct-Compressed-14B-EN-V1", trust_remote_code=True)

ids = tok("The capital of France is", return_tensors="pt").to("cuda")
print(tok.decode(m.generate(**ids, max_new_tokens=20)[0]))

trust_remote_code=True is required: the model ships a small custom decoder layer in modeling_qwen3_recovered.py.

Usage — vLLM

vLLM uses its own model implementations, so the custom decoder layer is loaded via a tiny plugin (provided in this repo under vllm_plugin/). Install it once, then serve normally:

pip install ./vllm_plugin   # from a checkout of this repo's vllm_plugin/ folder
from vllm import LLM, SamplingParams
llm = LLM(model="daniel-eai/Qwen3-11B-30pct-Compressed-14B-EN-V1", trust_remote_code=True, dtype="float16")
print(llm.generate(["The capital of France is"], SamplingParams(max_tokens=20))[0].outputs[0].text)

Other backends: TGI / SGLang / llama.cpp each use their own model graphs and would need an analogous custom decoder layer; they are not supported out of the box.

License

Apache-2.0, inherited from the base model Qwen/Qwen3-14B.

Acknowledgements

Sincere thanks to Prof. Sai Praneeth Karimireddy (University of Southern California) for his invaluable advice and feedback throughout this work, and to Prof. Sunwoo Lee (Inha University) for his guidance and support. We are also grateful to Alibaba (the Qwen team) for openly releasing the Qwen3-14B base model that made this work possible.

Downloads last month
52
Safetensors
Model size
11B params
Tensor type
F16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for daniel-eai/Qwen3-11B-30pct-Compressed-14B-EN-V1

Finetuned
Qwen/Qwen3-14B
Finetuned
(279)
this model