gemma-4-E2B-it-en
An English-only vocabulary prune of google/gemma-4-E2B-it.
Non-English token rows are removed from both the input embedding table
(embed_tokens) and the Per-Layer Embedding table (embed_tokens_per_layer),
shrinking the model from 5.10 B → 3.99 B parameters (−21.8%) with no
fine-tuning.
⚠️ This is not an official Google release. Use the original
google/gemma-4-E2B-it for multilingual deployments.
Why this exists
Gemma 4 E2B uses Per-Layer Embeddings (PLE): a [vocab_size, num_layers × hidden_size_per_layer_input]
table that adds a small embedding to the residual stream at every decoder layer.
Because PLE is indexed by token_id, vocabulary pruning saves parameters
per layer, not just once at the input, so removing about 40% of the vocab removes
roughly that fraction of the dominant chunk of the model.
| component | original | pruned | saved |
|---|---|---|---|
| embed_tokens (tied with lm_head) | 0.75 GB | 0.45 GB | 0.30 GB |
| embed_tokens_per_layer (PLE) | 4.38 GB | 2.61 GB | 1.77 GB |
| total bf16 footprint | 9.51 GB | 7.44 GB | 2.07 GB |
| total parameters | 5.10 B | 3.99 B | −21.8% |
| vocab size | 262,144 | 156,160 | −40.4% |
| BPE merges | 514,906 | 388,702 | −24.5% |
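The embedding rows of the table can be reproduced with a little arithmetic. The sketch below assumes bf16 storage (2 bytes per parameter), that "GB" in the table means GiB (2^30 bytes), and per-token row widths of 1536 for embed_tokens and 8960 for embed_tokens_per_layer. Those two widths are not stated in this card; they are back-derived from the "original" column, so treat them as assumptions.

```python
GIB = 2**30
BYTES_PER_PARAM = 2  # bf16

# Assumed row widths, back-derived from the table (not stated in the card).
EMBED_WIDTH = 1536  # embed_tokens: parameters per token row
PLE_WIDTH = 8960    # embed_tokens_per_layer: num_layers * hidden_size_per_layer_input

OLD_VOCAB = 262_144
NEW_VOCAB = 156_160


def table_gib(vocab_size: int, width: int) -> float:
    """Size in GiB of a [vocab_size, width] bf16 embedding table."""
    return vocab_size * width * BYTES_PER_PARAM / GIB


def saved_params(old_vocab: int, new_vocab: int) -> int:
    """Parameters removed from both tables when the vocab shrinks."""
    return (old_vocab - new_vocab) * (EMBED_WIDTH + PLE_WIDTH)


if __name__ == "__main__":
    for name, width in [("embed_tokens", EMBED_WIDTH), ("embed_tokens_per_layer", PLE_WIDTH)]:
        before = table_gib(OLD_VOCAB, width)
        after = table_gib(NEW_VOCAB, width)
        print(f"{name}: {before:.2f} -> {after:.2f} GiB")
    print(f"parameters saved: {saved_params(OLD_VOCAB, NEW_VOCAB) / 1e9:.2f} B")
```

Under these assumptions the PLE table accounts for most of the saving, which is why a vocabulary-only prune has such a large effect on this architecture.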
On an 8 GB RTX 4060, the original model spills into shared system memory
(9.5 GB needed > 8 GB physical) and decodes at **2.2 tok/s**; the pruned
model fits resident and decodes at ~10 tok/s (4.4× speedup).
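For readers who want to take a comparable measurement on their own hardware, here is a minimal timing sketch. The helper names `tokens_per_second` and `speedup` are hypothetical and not part of this repository, and the card does not say how its figures were measured. The `generate` argument is any zero-argument callable that produces `n_new_tokens` tokens.

```python
import time
from typing import Callable


def tokens_per_second(generate: Callable[[], object], n_new_tokens: int) -> float:
    """Time one call to `generate` and return throughput in tokens per second."""
    start = time.perf_counter()
    generate()
    elapsed = time.perf_counter() - start
    return n_new_tokens / elapsed


def speedup(baseline_tps: float, pruned_tps: float) -> float:
    """Ratio of pruned throughput to baseline throughput."""
    return pruned_tps / baseline_tps
```

With a real model, wrap the call as something like `lambda: model.generate(**inputs, max_new_tokens=128)`. On a GPU, synchronizing the device before reading the clock and discarding a warm-up run usually gives steadier numbers.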
What was kept
| bucket | tokens |
|---|---|
| BOS / EOS / PAD / UNK / MASK + chat-template + multimodal sentinels | 24 |
| <0xXX> byte-fallback | 256 |
| ASCII + Latin-1 + Latin-Extended-A + curly quotes / em-dash / ellipsis / €£¥¢ / © ® ™ / NBSP | 152,601 |
| zero-padding to a multiple of 256 | 251 |
| total | 156,160 |
What was dropped: ~6,300 unused reserved slots (<unusedNNNN>), and
~103,000 tokens belonging to other scripts (CJK, Devanagari, Cyrillic, Bengali,
Arabic, Hangul, Thai, Hiragana/Katakana, Greek, Hebrew, Tamil, emoji, etc.).
The byte-fallback layer is intact, so the tokenizer can still encode arbitrary
UTF-8 input — just inefficiently.
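The keep/drop rule and the byte-fallback behavior described above can be illustrated with a short sketch. This is not the logic from prune.py, which may classify tokens differently; the function names are hypothetical, and treating U+2581 as a space marker is an assumption about SentencePiece-style vocabularies.

```python
# Characters kept outside the Latin ranges: curly quotes, em-dash, ellipsis, euro, trademark.
# The pound, yen, cent, copyright, registered and NBSP characters are already in Latin-1.
EXTRA_CHARS = set("\u2018\u2019\u201c\u201d\u2014\u2026\u20ac\u2122")

# Assumption: the vocabulary marks a leading space with U+2581.
SPACE_MARKER = "\u2581"


def is_kept_char(ch: str) -> bool:
    """True if the character falls in ASCII, Latin-1, Latin Extended-A, or the extras."""
    # U+0000..U+017F spans ASCII, Latin-1 Supplement and Latin Extended-A.
    return ord(ch) <= 0x017F or ch in EXTRA_CHARS or ch == SPACE_MARKER


def is_kept_token(token: str) -> bool:
    """A token is kept only if every character in it is in a kept range."""
    return all(is_kept_char(ch) for ch in token)


def byte_fallback(text: str) -> list[str]:
    """Spell text as <0xXX> byte tokens, one per UTF-8 byte."""
    return [f"<0x{b:02X}>" for b in text.encode("utf-8")]
```

A Cyrillic character such as "я" no longer has a token of its own, so it costs two byte tokens (`<0xD1>`, `<0x8F>`) instead of sharing one token with its neighbors. That is the inefficiency mentioned above.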
Quick start

    from transformers import AutoModelForCausalLM, AutoTokenizer
    import torch

    # Load the pruned tokenizer and model in bf16 on the GPU.
    tok = AutoTokenizer.from_pretrained("ugonfor/gemma-4-E2B-it-en")
    model = AutoModelForCausalLM.from_pretrained(
        "ugonfor/gemma-4-E2B-it-en",
        dtype=torch.bfloat16,
        device_map="cuda",
    )

    # Build a chat-formatted prompt and move the tensors to the GPU.
    msgs = [{"role": "user", "content": "In one short paragraph, explain what model pruning is."}]
    inputs = tok.apply_chat_template(msgs, add_generation_prompt=True, return_tensors="pt", return_dict=True).to("cuda")

    # Greedy decoding; print only the newly generated tokens.
    out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    print(tok.decode(out[0, inputs["input_ids"].shape[-1]:], skip_special_tokens=True))

Greedy generation on the prompt above produces the same token strings, and
therefore byte-identical text, as the original google/gemma-4-E2B-it.
Limitations
- English only. Non-English text still tokenizes (via byte-fallback) but generates poorly, because the dedicated rows for those tokens have been removed from both embedding tables.
- No quality eval yet. Decoding matches the base model on the validation prompt above; a proper English perplexity / benchmark sweep is future work.
- Vision and audio encoders are unchanged, but the multimodal token IDs were renumbered after the prune. The provided config.json / generation_config.json / tokenizer_config.json already reflect the new IDs; if you wire up the multimodal pipeline by hand, use those values.
- No fine-tuning was done to recover any quality loss. None was observed on simple greedy English prompts, but extensive evaluation has not been performed.
How it was built
See prune.py in the source repository (single script, ~200 lines) — it
classifies the vocab, rebuilds tokenizer.json (filtered vocab + filtered
BPE merges), index_selects the kept rows of both embedding tables, pads to a
multiple of 256, and remaps every token-ID reference in config.json and
generation_config.json.
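The steps listed above can be sketched on plain Python lists and dicts. This is an illustration under stated assumptions, not the contents of prune.py: the function names are hypothetical, the real script operates on tensors (the card mentions index_select), and the rule used here for filtering merges (keep a merge only if both parts and the merged result survive) is an assumption.

```python
def pad_to_multiple(n: int, multiple: int = 256) -> int:
    """Round n up to the next multiple."""
    return -(-n // multiple) * multiple


def prune_vocab(vocab: dict[str, int], keep) -> tuple[dict[str, int], dict[int, int]]:
    """Keep tokens accepted by `keep` and renumber them densely in old-ID order."""
    kept = sorted((old_id, tok) for tok, old_id in vocab.items() if keep(tok))
    old_to_new = {old_id: new_id for new_id, (old_id, _) in enumerate(kept)}
    new_vocab = {tok: old_to_new[old_id] for old_id, tok in kept}
    return new_vocab, old_to_new


def filter_merges(merges, new_vocab):
    """Assumed rule: keep a merge only if both parts and the result are still in the vocab."""
    return [(a, b) for a, b in merges
            if a in new_vocab and b in new_vocab and a + b in new_vocab]


def select_rows(table, old_to_new, multiple: int = 256):
    """Pick the kept rows in new-ID order, then zero-pad the row count to a multiple."""
    width = len(table[0])
    rows = [table[old_id] for old_id in sorted(old_to_new)]  # ascending old ID == ascending new ID
    n_pad = pad_to_multiple(len(rows), multiple) - len(rows)
    return rows + [[0.0] * width for _ in range(n_pad)]


def remap_ids(config: dict, old_to_new: dict[int, int], id_keys: set[str]) -> dict:
    """Rewrite the token-ID fields named in id_keys; leave every other field alone."""
    return {k: (old_to_new[v] if k in id_keys else v) for k, v in config.items()}
```

The same `old_to_new` mapping drives both embedding tables and the config remap, which keeps the tokenizer, the weights and the special-token IDs consistent with one another. Any config field name passed in `id_keys` is up to the caller; none is taken from the actual Gemma config here.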
License & attribution
This derivative is released under the same license as the base model: Gemma Terms of Use (Apache 2.0). All original Gemma 4 license terms — including use restrictions and the requirement to pass the license to downstream users — continue to apply.
- Base model: google/gemma-4-E2B-it
- Authors of base model: Google DeepMind
- Derivative: vocabulary-only prune by @ugonfor