Instructions to use daniel-eai/Qwen3-3B-25pct-Compressed-4B-EN-V1 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use daniel-eai/Qwen3-3B-25pct-Compressed-4B-EN-V1 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="daniel-eai/Qwen3-3B-25pct-Compressed-4B-EN-V1", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("daniel-eai/Qwen3-3B-25pct-Compressed-4B-EN-V1", trust_remote_code=True, dtype="auto")

vLLM

How to use daniel-eai/Qwen3-3B-25pct-Compressed-4B-EN-V1 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
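# Start the vLLM server (this model needs remote code enabled and the vllm_plugin/ package from the repository installed):
vllm serve "daniel-eai/Qwen3-3B-25pct-Compressed-4B-EN-V1" --trust-remote-code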
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "daniel-eai/Qwen3-3B-25pct-Compressed-4B-EN-V1",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
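
The same request can be sent from Python. The sketch below uses only the standard library and assumes the server started above is listening on localhost:8000. `build_chat_request` and `ask` are illustrative helper names, not part of vLLM.

```python
import json
import urllib.request

MODEL_ID = "daniel-eai/Qwen3-3B-25pct-Compressed-4B-EN-V1"


def build_chat_request(prompt, base_url="http://localhost:8000", model=MODEL_ID):
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, **kwargs):
    """Send the request and return the reply text (requires a running server)."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Example, once the server is up:
#   print(ask("What is the capital of France?"))
```

Passing `base_url="http://localhost:30000"` targets a server on the SGLang port used further down instead.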

Use Docker

docker model run hf.co/daniel-eai/Qwen3-3B-25pct-Compressed-4B-EN-V1

SGLang

How to use daniel-eai/Qwen3-3B-25pct-Compressed-4B-EN-V1 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "daniel-eai/Qwen3-3B-25pct-Compressed-4B-EN-V1" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "daniel-eai/Qwen3-3B-25pct-Compressed-4B-EN-V1",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "daniel-eai/Qwen3-3B-25pct-Compressed-4B-EN-V1" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "daniel-eai/Qwen3-3B-25pct-Compressed-4B-EN-V1",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use daniel-eai/Qwen3-3B-25pct-Compressed-4B-EN-V1 with Docker Model Runner:
```
docker model run hf.co/daniel-eai/Qwen3-3B-25pct-Compressed-4B-EN-V1
```

Qwen3-3B — 25% Compressed from Qwen3-4B (English · Chat)

This repository is part of the Efficient and Robust AI System (E-AI) Project by Vincent-Daniel Yun, which publicly releases compressed large language models. This model is a compressed version of Qwen/Qwen3-4B: 9 of its 36 transformer layers are removed (27 remain, about 3.11B parameters), and the result is instruction-tuned for chat so that it generates coherent responses with lower memory use and latency.

🔗 Project: https://www.worldwidedaniel.com/eai-project
📅 Release date: 2026-06-28 · Version: V1

⚠️ Language support — English only. Tuned on English data. Other languages may produce degraded or broken output. For open-domain factual questions, use retrieval (RAG) — the compressed model can hallucinate specific facts.

💡 Best used as a fast, low-cost discrimination / classification engine (text classification, safety / content moderation, reading comprehension, domain QA, preference scoring) rather than an open-ended long-form generator. Compressed models preserve discrimination far better than generation.
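
One common way to use a language model as a classifier is to score each candidate label as a continuation of the prompt and pick the highest-scoring one. The sketch below shows only that selection logic. `score_fn` is a hypothetical callable supplied by the caller (for example, one that returns the model's log-likelihood of a label given the prompt), and the prompt template is an assumption, not a format this model is known to be tuned on.

```python
import math


def build_prompt(text, labels):
    """Format a classification prompt; this template is illustrative only."""
    options = ", ".join(labels)
    return f"Classify the text into one of: {options}.\nText: {text}\nLabel:"


def classify(text, labels, score_fn):
    """Return (best_label, probabilities) from per-label log-likelihood scores.

    score_fn(prompt, label) -> float is a hypothetical scorer provided by the caller.
    """
    prompt = build_prompt(text, labels)
    scores = [score_fn(prompt, label) for label in labels]
    # Softmax over the candidate labels, shifted by the max for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    probs = {label: e / total for label, e in zip(labels, exps)}
    best = max(probs, key=probs.get)
    return best, probs
```

Restricting the decision to a fixed label set avoids free-form generation entirely, which matches the discrimination-style use recommended above.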

About E-AI

Modern AI is powerful but heavy. State-of-the-art models are enormous and their inference is slow — still far from human intuition, and far too slow and unreliable to trust in urgent, high-stakes moments.

Two obstacles stand between today's models and AI we can trust in the field. Individually, each model is too large and too slow to run where it is actually needed. Collectively, when many models or agents work together, a single faulty or adversarial member can quietly derail the whole system. E-AI attacks both — making every model lightweight and fast, and keeping teams of agents reliable even when some of them fail.

I started the E-AI (Efficient-AI) project to build compact yet powerful AI that can assist people in disaster scenarios — responding to dangerous accidents quickly and reliably when every second counts.

Method

The pruning and recovery methods used to build this model are proprietary methods created by Vincent-Daniel Yun and are not released. After compression, the model is instruction-tuned for chat by distillation from the base model. Only the resulting model is shared here.

Results (measured)

All numbers below were measured by us. PPL is on a 2048-token context (lower is better); downstream tasks and MMLU are 0-shot accuracy via lm-eval-harness (higher is better).
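
Perplexity is the exponential of the mean per-token negative log-likelihood. The sketch below shows one common way to evaluate it over fixed 2048-token windows. The non-overlapping windowing and the `nll_fn` callable are assumptions for illustration, since the exact evaluation script is not given here.

```python
import math


def split_windows(token_ids, context_len=2048):
    """Split a token sequence into non-overlapping windows of at most context_len tokens."""
    return [token_ids[i:i + context_len] for i in range(0, len(token_ids), context_len)]


def perplexity(token_ids, nll_fn, context_len=2048):
    """Compute exp(total NLL / number of scored tokens) over all windows.

    nll_fn(window) -> (summed negative log-likelihood in nats, tokens scored)
    is a hypothetical callable that would wrap a model forward pass.
    """
    total_nll, total_tokens = 0.0, 0
    for window in split_windows(token_ids, context_len):
        nll, n = nll_fn(window)
        total_nll += nll
        total_tokens += n
    return math.exp(total_nll / total_tokens)
```

A model that assigns every token probability 1/50 has a perplexity of 50 under this definition, which is a quick sanity check for any real `nll_fn`.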

Metric	Qwen3-4B (base)	This model (25%)
PPL · WikiText2 ↓	13.64	45.12
PPL · C4 ↓	18.14	40.59
PPL · PTB ↓	24.49	63.04
ARC-c ↑	0.5401	0.3430
ARC-e ↑	0.7849	0.5539
BoolQ ↑	0.8508	0.7917
COPA ↑	0.8100	0.7000
HellaSwag ↑	0.6845	0.4767
OpenBookQA ↑	0.4000	0.3120
RACE ↑	0.4182	0.3608
RTE ↑	0.7581	0.7653
WinoGrande ↑	0.6614	0.6338
Avg. downstream (9) ↑	0.6564	0.5486
MMLU ↑	0.6839	0.6540
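
The "Avg. downstream (9)" row can be reproduced from the nine task rows. The check below assumes it is an unweighted mean, which matches the reported values.

```python
# 0-shot accuracies copied from the table above: (Qwen3-4B base, this model).
DOWNSTREAM = {
    "ARC-c": (0.5401, 0.3430),
    "ARC-e": (0.7849, 0.5539),
    "BoolQ": (0.8508, 0.7917),
    "COPA": (0.8100, 0.7000),
    "HellaSwag": (0.6845, 0.4767),
    "OpenBookQA": (0.4000, 0.3120),
    "RACE": (0.4182, 0.3608),
    "RTE": (0.7581, 0.7653),
    "WinoGrande": (0.6614, 0.6338),
}


def mean_accuracy(column):
    """Unweighted mean over the nine tasks for column 0 (base) or 1 (this model)."""
    values = [scores[column] for scores in DOWNSTREAM.values()]
    return sum(values) / len(values)


base_avg = mean_accuracy(0)
compressed_avg = mean_accuracy(1)
print(f"base: {base_avg:.4f}, compressed: {compressed_avg:.4f}")
# prints "base: 0.6564, compressed: 0.5486"
```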

Task-suitability evaluation (vs dense Qwen3-4B)

This 25% compressed model is best used as a fast, low-cost discrimination / classification engine. Full-test-set scores vs the dense Qwen3-4B base (higher is better):

Task	Dense Qwen3-4B	This model (25%)
MMLU	0.684	0.654
Avg. downstream (9)	0.656	0.549
AG News (classif.)	0.852	0.854
SST-2	0.899	0.882
BoolQ	0.851	0.792
MRPC	0.762	0.757
CB (NLI)	0.679	0.661
WiC	0.594	0.505
MultiRC	0.158	0.572
MedQA	0.572	0.560
MedMCQA	0.532	0.522
PubMedQA	0.768	0.700
Belebele-en	0.897	0.870
TruthfulQA	0.548	0.529
LLM-judge	0.842	0.838
SafetyBench	0.771	0.771

Takeaway. At 25% compression the model retains most of the dense Qwen3-4B's discrimination ability (classification, reading comprehension, NLI, medical QA, judging, safety screening) at a smaller size and lower latency, while commonsense reasoning and open-ended generation degrade more (for example, HellaSwag falls from 0.6845 to 0.4767 and WikiText2 perplexity rises from 13.64 to 45.12). Use it as a discrimination engine, not a free-form generator.
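
To see which tasks hold up best, the per-task scores in the table above can be turned into a retention ratio (compressed score divided by dense score). This ratio is an illustrative summary introduced here, not a metric reported by the project, and the two aggregate rows are left out.

```python
# (dense Qwen3-4B, this model) scores copied from the task-suitability table.
SCORES = {
    "AG News": (0.852, 0.854),
    "SST-2": (0.899, 0.882),
    "BoolQ": (0.851, 0.792),
    "MRPC": (0.762, 0.757),
    "CB (NLI)": (0.679, 0.661),
    "WiC": (0.594, 0.505),
    "MultiRC": (0.158, 0.572),
    "MedQA": (0.572, 0.560),
    "MedMCQA": (0.532, 0.522),
    "PubMedQA": (0.768, 0.700),
    "Belebele-en": (0.897, 0.870),
    "TruthfulQA": (0.548, 0.529),
    "LLM-judge": (0.842, 0.838),
    "SafetyBench": (0.771, 0.771),
}


def retention(task):
    """Fraction of the dense model's score that the compressed model keeps."""
    dense, compressed = SCORES[task]
    return compressed / dense


# List tasks from lowest to highest retention.
for task in sorted(SCORES, key=retention):
    print(f"{task:<12} {retention(task):7.1%}")
```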

Model family — pick your size

Model	Layers	Params	Peak inference memory (fp16)
Qwen3-4B (base)	36	4.02B	11.79 GB
Qwen3-3B-20pct-Compressed-4B-EN-V1	29	3.32B	10.15 GB
Qwen3-3B-25pct-Compressed-4B-EN-V1	27	3.11B	9.67 GB ← this model
Qwen3-3B-30pct-Compressed-4B-EN-V1	25	2.91B	9.2 GB
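
A minimal sketch of how the table above could drive a size choice: pick the largest family member whose fp16 memory figure fits a given budget. The helper name `largest_fitting` is hypothetical, and the budget should leave headroom for anything the measured figures do not cover.

```python
# (model name, layers, params in billions, fp16 memory in GB) from the table above.
FAMILY = [
    ("Qwen3-4B (base)", 36, 4.02, 11.79),
    ("Qwen3-3B-20pct-Compressed-4B-EN-V1", 29, 3.32, 10.15),
    ("Qwen3-3B-25pct-Compressed-4B-EN-V1", 27, 3.11, 9.67),
    ("Qwen3-3B-30pct-Compressed-4B-EN-V1", 25, 2.91, 9.2),
]


def largest_fitting(budget_gb):
    """Return the largest model (by parameters) whose memory fits the budget, or None."""
    fitting = [row for row in FAMILY if row[3] <= budget_gb]
    return max(fitting, key=lambda row: row[2], default=None)


choice = largest_fitting(10.0)
print(choice[0] if choice else "no model fits")
# prints "Qwen3-3B-25pct-Compressed-4B-EN-V1"
```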

Efficiency (measured, fp16, single GPU)

Metric	Qwen3-4B (base)	This model (25%)
Layers	36	27
Parameters	4.02B	3.11B
Peak inference memory (fp16)	11.79 GB	9.67 GB (−18%)
Forward latency (fp16)	793 ms	602 ms (−24%)
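
The percentages in the table are relative changes against the base model. The short check below recomputes them from the listed values.

```python
def pct_change(base, new):
    """Relative change from base to new, in percent (negative means a reduction)."""
    return (new - base) / base * 100.0


# (Qwen3-4B base, this model) values copied from the efficiency table.
EFFICIENCY = {
    "layers": (36, 27),
    "parameters (B)": (4.02, 3.11),
    "peak memory (GB)": (11.79, 9.67),
    "forward latency (ms)": (793, 602),
}

for name, (base, new) in EFFICIENCY.items():
    print(f"{name}: {pct_change(base, new):+.1f}%")
# layers: -25.0%
# parameters (B): -22.6%
# peak memory (GB): -18.0%
# forward latency (ms): -24.1%
```

Memory falls less than the layer count because the removed transformer layers are only part of the footprint; the table does not break the remainder down further.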

Quantization

The model keeps the Qwen3 architecture apart from a small custom decoder layer, so bitsandbytes 4-bit / 8-bit loading and other post-training quantization (PTQ) methods can be applied on top of the compression for further memory savings.

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
# Load with 4-bit bitsandbytes quantization, using the full Hub repo ID.
m = AutoModelForCausalLM.from_pretrained(
    "daniel-eai/Qwen3-3B-25pct-Compressed-4B-EN-V1", trust_remote_code=True, device_map="cuda",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True))
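
As a rough guide to what quantization buys, weight storage scales with bits per weight. The estimate below covers weights only and assumes every parameter is stored at the stated width; real usage is higher because some tensors typically stay in higher precision and activations and the KV cache add on top.

```python
PARAMS_BILLION = 3.11  # parameter count reported for this model


def weight_memory_gb(params_billion, bits_per_weight):
    """Approximate weight storage in GB (1 GB = 1e9 bytes), ignoring all overhead."""
    return params_billion * bits_per_weight / 8


for name, bits in [("fp16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{name}: ~{weight_memory_gb(PARAMS_BILLION, bits):.2f} GB")
```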

Usage — Transformers

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "daniel-eai/Qwen3-3B-25pct-Compressed-4B-EN-V1"  # full Hub repo ID
# trust_remote_code=True loads the custom decoder layer shipped with the model.
m = AutoModelForCausalLM.from_pretrained(
    repo, trust_remote_code=True, dtype=torch.float16, device_map="cuda")
tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
ids = tok("The capital of France is", return_tensors="pt").to("cuda")
# Generate up to 20 new tokens and decode the full sequence.
print(tok.decode(m.generate(**ids, max_new_tokens=20)[0]))

trust_remote_code=True is required: the model ships a small custom decoder layer in modeling_qwen3_recovered.py.

Usage — vLLM

A tiny plugin (in vllm_plugin/) registers the custom decoder layer:

pip install ./vllm_plugin

from vllm import LLM, SamplingParams
# Use the full Hub repo ID; trust_remote_code=True is needed for the custom decoder layer.
llm = LLM(model="daniel-eai/Qwen3-3B-25pct-Compressed-4B-EN-V1", trust_remote_code=True, dtype="float16")
print(llm.generate(["The capital of France is"], SamplingParams(max_tokens=20))[0].outputs[0].text)

License

Apache-2.0, inherited from the base model Qwen/Qwen3-4B.

Acknowledgements

Sincere thanks to Prof. Sai Praneeth Karimireddy (University of Southern California) for his invaluable advice and feedback throughout this work, and to Prof. Sunwoo Lee (Inha University) for his guidance and support. We are also grateful to Alibaba (the Qwen team) for openly releasing the Qwen3-4B base model that made this work possible.
