Atlas

E-AI Project

Qwen3-3B — 30% Compressed from Qwen3-4B (English · Chat)

This repository is part of the Efficient and Robust AI System (E-AI) Project by Vincent-Daniel Yun, which publicly releases compressed large language models. This model is a compressed edition of Qwen/Qwen3-4B with 11 of 36 transformer layers removed (25 layers remain, ≈3B parameters), then instruction-tuned for chat so it generates coherent responses at lower memory and latency.

🔗 Project: https://www.worldwidedaniel.com/eai-project 📅 Release date: 2026-06-28 · Version: V1

⚠️ Language support — English only. Tuned on English data. Other languages may produce degraded or broken output. For open-domain factual questions, use retrieval (RAG) — the compressed model can hallucinate specific facts.

💡 Best used as a fast, low-cost discrimination / classification engine (text classification, safety / content moderation, reading comprehension, domain QA, preference scoring) rather than an open-ended long-form generator. Compressed models preserve discrimination far better than generation.

About E-AI

Modern AI is powerful but heavy. State-of-the-art models are enormous and their inference is slow — still far from human intuition, and far too slow and unreliable to trust in urgent, high-stakes moments.

Two obstacles stand between today's models and AI we can trust in the field. Individually, each model is too large and too slow to run where it is actually needed. Collectively, when many models or agents work together, a single faulty or adversarial member can quietly derail the whole system. E-AI attacks both — making every model lightweight and fast, and keeping teams of agents reliable even when some of them fail.

I started the E-AI (Efficient-AI) project to build compact yet powerful AI that can assist people in disaster scenarios — responding to dangerous accidents quickly and reliably when every second counts.

Method

The pruning method and the recovery method used to build this model are proprietary, undisclosed methods created by Vincent-Daniel Yun and are not released. The compressed model is then instruction-tuned for chat (distilled from the base model). Only the resulting model is shared here.

Results (measured)

All numbers below were measured by us. PPL is on a 2048-token context (lower is better); downstream tasks and MMLU are 0-shot accuracy via lm-eval-harness (higher is better).

Metric Qwen3-4B (base) This model (30%)
PPL · WikiText2 ↓ 13.64 99.61
PPL · C4 ↓ 18.14 53.18
PPL · PTB ↓ 24.49 111.78
ARC-c ↑ 0.5401 0.3294
ARC-e ↑ 0.7849 0.4983
BoolQ ↑ 0.8508 0.6352
COPA ↑ 0.8100 0.6600
HellaSwag ↑ 0.6845 0.4352
OpenBookQA ↑ 0.4000 0.3040
RACE ↑ 0.4182 0.3282
RTE ↑ 0.7581 0.6931
WinoGrande ↑ 0.6614 0.6093
Avg. downstream (9) 0.6564 0.4992
MMLU 0.6839 0.6083

Task-suitability evaluation (vs dense Qwen3-4B)

This 30% compressed model is best used as a fast, low-cost discrimination / classification engine. Full-test-set scores vs the dense Qwen3-4B base (higher is better):

Qwen3-4B compressed vs dense — discrimination retention

Task Dense Qwen3-4B This model (30%)
MMLU 0.684 0.608
Avg DS (9) 0.656 0.499
AG News (classif.) 0.852 0.838
SST-2 0.899 0.888
BoolQ 0.851 0.635
MRPC 0.762 0.684
CB (NLI) 0.679 0.518
WiC 0.594 0.500
MultiRC 0.158 0.572
MedQA 0.572 0.506
MedMCQA 0.532 0.491
PubMedQA 0.768 0.642
Belebele-en 0.897 0.774
TruthfulQA 0.548 0.519
LLM-judge 0.842 0.798
SafetyBench 0.771 0.829

Takeaway. At 30% compression the model retains most of the dense Qwen3-4B's discrimination ability (classification, reading comprehension, NLI, medical QA, judging, safety screening) at smaller size and lower latency, while open-ended generation/commonsense degrade more. Use it as a discrimination engine, not a free-form generator.

Model family — pick your size

Model Layers Params Mem (fp16)
Qwen3-4B (base) 36 4.02B 11.79 GB
Qwen3-3B-20pct-Compressed-4B-EN-V1 29 3.32B 10.15 GB
Qwen3-3B-25pct-Compressed-4B-EN-V1 27 3.11B 9.67 GB
Qwen3-3B-30pct-Compressed-4B-EN-V1 25 2.91B 9.2 GB ← this model

Efficiency (measured, fp16, single GPU)

Qwen3-4B (base) This model (30%)
Layers 36 25
Parameters 4.02B 2.91B
Peak inference memory (fp16) 11.79 GB 9.2 GB (−22%)
Forward latency (fp16) 793 ms 548 ms (−31%)

Quantization

This is a standard Qwen3 architecture, so bitsandbytes 4-bit / 8-bit loading and other PTQ methods apply on top of the compression for further memory savings.

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
m = AutoModelForCausalLM.from_pretrained(
    "Qwen3-3B-30pct-Compressed-4B-EN-V1", trust_remote_code=True, device_map="cuda",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True))

Usage — Transformers

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
m = AutoModelForCausalLM.from_pretrained(
    "Qwen3-3B-30pct-Compressed-4B-EN-V1", trust_remote_code=True, dtype=torch.float16, device_map="cuda")
tok = AutoTokenizer.from_pretrained("Qwen3-3B-30pct-Compressed-4B-EN-V1", trust_remote_code=True)
ids = tok("The capital of France is", return_tensors="pt").to("cuda")
print(tok.decode(m.generate(**ids, max_new_tokens=20)[0]))

trust_remote_code=True is required: the model ships a small custom decoder layer in modeling_qwen3_recovered.py.

Usage — vLLM

A tiny plugin (in vllm_plugin/) registers the custom decoder layer:

pip install ./vllm_plugin
from vllm import LLM, SamplingParams
llm = LLM(model="Qwen3-3B-30pct-Compressed-4B-EN-V1", trust_remote_code=True, dtype="float16")
print(llm.generate(["The capital of France is"], SamplingParams(max_tokens=20))[0].outputs[0].text)

License

Apache-2.0, inherited from the base model Qwen/Qwen3-4B.

Acknowledgements

Sincere thanks to Prof. Sai Praneeth Karimireddy (University of Southern California) for his invaluable advice and feedback throughout this work, and to Prof. Sunwoo Lee (Inha University) for his guidance and support. We are also grateful to Alibaba (the Qwen team) for openly releasing the Qwen3-4B base model that made this work possible.

Downloads last month
-
Safetensors
Model size
3B params
Tensor type
F16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for daniel-eai/Qwen3-3B-30pct-Compressed-4B-EN-V1

Finetuned
Qwen/Qwen3-4B
Finetuned
(744)
this model