Libraries

How to use ugonfor/gemma-4-E2B-it-en with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

# The chat message below carries an image, so use the image-text-to-text task
pipe = pipeline("image-text-to-text", model="ugonfor/gemma-4-E2B-it-en")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("ugonfor/gemma-4-E2B-it-en")
model = AutoModelForImageTextToText.from_pretrained("ugonfor/gemma-4-E2B-it-en")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use ugonfor/gemma-4-E2B-it-en with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "ugonfor/gemma-4-E2B-it-en"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ugonfor/gemma-4-E2B-it-en",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
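The same OpenAI-compatible endpoint can also be called from Python. The following is a minimal sketch using only the standard library; it assumes the server started above is listening on localhost:8000, and the helper names `build_chat_request` and `ask` are illustrative rather than part of vLLM.

```python
import json
import urllib.request

MODEL_ID = "ugonfor/gemma-4-E2B-it-en"


def build_chat_request(prompt, base_url="http://localhost:8000"):
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, base_url="http://localhost:8000"):
    """Send the request and return the reply text (requires a running server)."""
    with urllib.request.urlopen(build_chat_request(prompt, base_url)) as resp:
        body = json.load(resp)
    # Standard chat-completions response shape: first choice, message content
    return body["choices"][0]["message"]["content"]
```

With the server running, `ask("What is the capital of France?")` sends the same request as the curl command. Passing a different `base_url` (for example one ending in port 30000) should work for any other server that exposes the same route.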

Use Docker

docker model run hf.co/ugonfor/gemma-4-E2B-it-en

SGLang

How to use ugonfor/gemma-4-E2B-it-en with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "ugonfor/gemma-4-E2B-it-en" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ugonfor/gemma-4-E2B-it-en",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "ugonfor/gemma-4-E2B-it-en" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ugonfor/gemma-4-E2B-it-en",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use ugonfor/gemma-4-E2B-it-en with Docker Model Runner:
```
docker model run hf.co/ugonfor/gemma-4-E2B-it-en
```

gemma-4-E2B-it-en

An English-only vocabulary prune of google/gemma-4-E2B-it. Non-English token rows are removed from both the input embedding table (embed_tokens) and the Per-Layer Embedding table (embed_tokens_per_layer), shrinking the model from 5.10 B to 3.99 B parameters (−21.8%) with no fine-tuning.

⚠️ This is not an official Google release. Use the original google/gemma-4-E2B-it for multilingual deployments.

Why this exists

Gemma 4 E2B uses Per-Layer Embeddings (PLE): a [vocab_size, num_layers × hidden_size_per_layer_input] table that adds a small embedding to the residual stream at every decoder layer. Because PLE is indexed by token_id, vocabulary pruning saves parameters at every layer, not just once at the input, so removing about 40% of the vocabulary removes roughly that fraction of the largest block of parameters in the model.
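To make the per-layer effect concrete, here is a rough sketch of how both table sizes scale with the vocabulary. The three dimensions below are assumptions, not values stated in this card; they were picked because they reproduce the sizes in the table that follows, and only if the card's "GB" is read as binary gigabytes (2^30 bytes).

```python
# Assumed dimensions (NOT stated in this card); chosen because they
# reproduce the table's sizes when "GB" is read as 2**30 bytes.
HIDDEN_SIZE = 1536   # width of one embed_tokens row (assumption)
NUM_LAYERS = 35      # number of decoder layers (assumption)
PLE_DIM = 256        # hidden_size_per_layer_input (assumption)
BYTES_PER_PARAM = 2  # bf16
GIB = 2**30

ORIGINAL_VOCAB = 262_144
PRUNED_VOCAB = 156_160


def table_gib(vocab_size, row_width):
    """Size of a [vocab_size, row_width] bf16 table in binary gigabytes."""
    return vocab_size * row_width * BYTES_PER_PARAM / GIB


# Every vocabulary row costs one input-embedding row plus one PLE slice per layer.
params_per_token = HIDDEN_SIZE + NUM_LAYERS * PLE_DIM
params_saved = (ORIGINAL_VOCAB - PRUNED_VOCAB) * params_per_token

for name, vocab in [("original", ORIGINAL_VOCAB), ("pruned", PRUNED_VOCAB)]:
    embed = table_gib(vocab, HIDDEN_SIZE)
    ple = table_gib(vocab, NUM_LAYERS * PLE_DIM)
    print(f"{name}: embed_tokens {embed:.2f}, PLE {ple:.2f}")

print(f"parameters removed: {params_saved / 1e9:.2f} B")
```

Under these assumptions the PLE slice of each row is several times wider than the input-embedding slice, which is why dropping a vocabulary row pays off mostly in the PLE table.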

| component | original | pruned | saved |
| --- | --- | --- | --- |
| `embed_tokens` (tied with `lm_head`) | 0.75 GB | 0.45 GB | 0.30 GB |
| `embed_tokens_per_layer` (PLE) | 4.38 GB | 2.61 GB | 1.77 GB |
| total bf16 footprint | 9.51 GB | 7.44 GB | 2.07 GB |
| total parameters | 5.10 B | 3.99 B | −21.8% |
| vocab size | 262,144 | 156,160 | −40.4% |
| BPE merges | 514,906 | 388,702 | −24.5% |

On an 8 GB RTX 4060, the original model spills into shared system memory (9.5 GB needed > 8 GB physical) and decodes at **2.2 tok/s**; the pruned model fits resident and decodes at ~10 tok/s (4.4× speedup).
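The percentages and the memory argument above can be checked with a few lines of arithmetic. This is only a sanity check on the reported figures; the fit test compares weight size against VRAM and deliberately ignores the KV cache, activations, and framework overhead, which is a simplification.

```python
def pct_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100


VRAM_GB = 8  # RTX 4060 capacity quoted above

vocab_change = pct_change(262_144, 156_160)
param_change = pct_change(5.10, 3.99)
merge_change = pct_change(514_906, 388_702)
footprint_saved = 9.51 - 7.44

# Weights-only fit check; real headroom is smaller than this suggests.
original_fits = 9.51 <= VRAM_GB
pruned_fits = 7.44 <= VRAM_GB

print(f"vocab {vocab_change:.1f}%, params {param_change:.1f}%, merges {merge_change:.1f}%")
print(f"footprint saved: {footprint_saved:.2f} GB")
print(f"original fits: {original_fits}, pruned fits: {pruned_fits}")
```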

What was kept

| bucket | tokens |
| --- | --- |
| BOS / EOS / PAD / UNK / MASK + chat-template + multimodal sentinels | 24 |
| `<0xXX>` byte-fallback | 256 |
| ASCII + Latin-1 + Latin-Extended-A + curly quotes / em-dash / ellipsis / €£¥¢ / © ® ™ / NBSP | 152,601 |
| zero-padding to a multiple of 256 | 251 |
| total | 156,160 |
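A keep/drop rule along the lines of the third bucket might look like the sketch below. This is a guess at the classification logic, not the rule used to build this model: the function names are hypothetical, the code-point ranges are the Unicode blocks named in the table, and allowing U+2581 (the SentencePiece word-boundary marker) is an extra assumption.

```python
# Code points named in the table that fall outside U+0000..U+017F, plus
# U+2581, the SentencePiece word-boundary marker (an assumption).
EXTRA_KEPT = {
    "\u2018", "\u2019", "\u201c", "\u201d",  # curly single and double quotes
    "\u2014",  # em dash
    "\u2026",  # ellipsis
    "\u20ac",  # euro sign
    "\u2122",  # trademark sign
    "\u2581",  # SentencePiece space marker (assumed to be allowed)
}

# ASCII, Latin-1 Supplement and Latin Extended-A together span U+0000..U+017F.
# Latin-1 already covers NBSP, the pound, yen and cent signs, and (c) / (R).
LATIN_EXTENDED_A_END = 0x017F


def is_kept_char(ch):
    """True if a single character belongs to the kept repertoire."""
    return ord(ch) <= LATIN_EXTENDED_A_END or ch in EXTRA_KEPT


def is_kept_token(token):
    """True if every character of the token string is in the kept repertoire."""
    return all(is_kept_char(ch) for ch in token)


for sample in ["\u2581hello", "café", "€5", "привет", "日本語"]:
    print(repr(sample), is_kept_token(sample))
```

Special tokens and `<0xXX>` byte tokens would need to be whitelisted separately, since they are kept regardless of their characters.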

What was dropped: ~6,300 unused reserved slots (<unusedNNNN>), and ~103,000 tokens belonging to other scripts (CJK, Devanagari, Cyrillic, Bengali, Arabic, Hangul, Thai, Hiragana/Katakana, Greek, Hebrew, Tamil, emoji, etc.). The byte-fallback layer is intact, so the tokenizer can still encode arbitrary UTF-8 input — just inefficiently.
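To illustrate why byte-fallback is inefficient, the sketch below spells out a word in `<0xXX>` tokens, one per UTF-8 byte. It models the worst case, where no larger token matches; the real tokenizer may behave differently for some inputs, and `byte_fallback_tokens` is a hypothetical helper, not a tokenizer API.

```python
def byte_fallback_tokens(text):
    """Return the <0xXX> byte tokens that spell out `text` in UTF-8."""
    return [f"<0x{b:02X}>" for b in text.encode("utf-8")]


word = "Привет"  # Cyrillic, one of the dropped scripts
tokens = byte_fallback_tokens(word)

print(len(word), "characters ->", len(tokens), "byte tokens")
print(tokens[:4])
```

Each Cyrillic letter takes two bytes in UTF-8, so the token count doubles relative to the character count; CJK and Hangul characters take three bytes each.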

Quick start

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

tok = AutoTokenizer.from_pretrained("ugonfor/gemma-4-E2B-it-en")
model = AutoModelForCausalLM.from_pretrained(
    "ugonfor/gemma-4-E2B-it-en",
    dtype=torch.bfloat16,  # load the weights in bf16
    device_map="cuda",
)

# Build a single-turn chat prompt and move the tensors to the GPU
msgs = [{"role": "user", "content": "In one short paragraph, explain what model pruning is."}]
inputs = tok.apply_chat_template(
    msgs,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to("cuda")

# Greedy decoding (do_sample=False) keeps the output deterministic
out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt
print(tok.decode(out[0, inputs["input_ids"].shape[-1]:], skip_special_tokens=True))

Greedy generation on the same English prompt produces byte-identical tokens to the original google/gemma-4-E2B-it.

Limitations

- English only. Non-English text still tokenizes (via byte-fallback) but generates poorly, because the rows for those tokens have been removed from both embedding tables.
- No quality eval yet. Decoding matches the base model on the validation prompt above; a proper English perplexity / benchmark sweep is future work.
- Vision and audio encoders are unchanged, but the multimodal token IDs were renumbered after the prune. The provided config.json / generation_config.json / tokenizer_config.json already reflect the new IDs, so if you wire up the multimodal pipeline by hand, use those values.
- No fine-tuning was done to recover any quality loss. None was observed on simple greedy English prompts, but extensive evaluation has not been performed.

How it was built

See prune.py in the source repository (single script, ~200 lines) — it classifies the vocab, rebuilds tokenizer.json (filtered vocab + filtered BPE merges), index_selects the kept rows of both embedding tables, pads to a multiple of 256, and remaps every token-ID reference in config.json and generation_config.json.
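The row selection, padding, and ID remapping steps can be sketched on a toy table. This is not prune.py: it uses plain Python lists instead of tensors, all function names are hypothetical, and treating every key ending in `_token_id` as a token-ID reference is an assumption made for the example.

```python
def build_id_map(keep_ids, multiple=256):
    """Return (kept old IDs in order, old->new ID map, padded table size)."""
    kept = sorted(set(keep_ids))
    old_to_new = {old: new for new, old in enumerate(kept)}
    padded_size = -(-len(kept) // multiple) * multiple  # round up to a multiple
    return kept, old_to_new, padded_size


def prune_table(table, kept, padded_size):
    """Select the kept rows, then append zero rows up to padded_size."""
    width = len(table[0])
    rows = [list(table[i]) for i in kept]
    rows.extend([0.0] * width for _ in range(padded_size - len(rows)))
    return rows


def remap_config(config, old_to_new):
    """Rewrite entries whose key ends in _token_id (assumed naming) to new IDs."""
    return {
        key: old_to_new[value] if key.endswith("_token_id") else value
        for key, value in config.items()
    }


# Toy example: 8-token vocabulary, keep IDs 0, 2 and 5, pad to a multiple of 4.
table = [[float(i)] * 2 for i in range(8)]
kept, old_to_new, padded_size = build_id_map([0, 2, 5], multiple=4)
pruned = prune_table(table, kept, padded_size)
config = remap_config({"eos_token_id": 5, "hidden_size": 2}, old_to_new)

print(pruned)
print(config)
```

The same selection would be applied to both embedding tables with one shared ID map, so that a token's new ID indexes the matching row in each.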

License & attribution

This derivative is released under the same license as the base model: Gemma Terms of Use (Apache 2.0). All original Gemma 4 license terms — including use restrictions and the requirement to pass the license to downstream users — continue to apply.

Base model: google/gemma-4-E2B-it
Authors of base model: Google DeepMind
Derivative: vocabulary-only prune by @ugonfor
