gemma-4-E2B-it-en

An English-only vocabulary prune of google/gemma-4-E2B-it. Non-English token rows are removed from both the input embedding table (embed_tokens) and the Per-Layer Embedding table (embed_tokens_per_layer), shrinking the model from 5.10 B → 3.99 B parameters (−21.8%) with no fine-tuning.

⚠️ This is not an official Google release. Use the original google/gemma-4-E2B-it for multilingual deployments.

Why this exists

Gemma 4 E2B uses Per-Layer Embeddings (PLE): a [vocab_size, num_layers × hidden_size_per_layer_input] table that adds a small embedding to the residual stream at every decoder layer. Because PLE is indexed by token_id, vocabulary pruning saves parameters at every layer, not just once at the input, so removing about 40% of the vocabulary rows removes roughly that fraction of the largest single component of the model.

| component | original | pruned | saved |
|---|---|---|---|
| embed_tokens (tied with lm_head) | 0.75 GB | 0.45 GB | 0.30 GB |
| embed_tokens_per_layer (PLE) | 4.38 GB | 2.61 GB | 1.77 GB |
| total bf16 footprint | 9.51 GB | 7.44 GB | 2.07 GB |
| total parameters | 5.10 B | 3.99 B | −21.8% |
| vocab size | 262,144 | 156,160 | −40.4% |
| BPE merges | 514,906 | 388,702 | −24.5% |
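
The savings follow directly from the fraction of vocabulary rows removed, since both embedding tables have one row per token. A quick check in Python, using only the figures reported above (the variable names are illustrative):

```python
# Figures copied from the size comparison above.
VOCAB_ORIGINAL = 262_144
VOCAB_PRUNED = 156_160

# Fraction of embedding rows removed by the prune.
removed = 1 - VOCAB_PRUNED / VOCAB_ORIGINAL

# Both tables are indexed by token ID, so each shrinks by the same fraction.
embed_saved_gb = 0.75 * removed
ple_saved_gb = 4.38 * removed

# Parameter reduction from the reported totals (in billions).
param_drop = (5.10 - 3.99) / 5.10

print(f"rows removed:       {removed:.1%}")
print(f"embed_tokens saved: {embed_saved_gb:.2f} GB")
print(f"PLE saved:          {ple_saved_gb:.2f} GB")
print(f"parameters:         -{param_drop:.1%}")
```

The printed values agree with the reported figures to the precision shown.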

On an 8 GB RTX 4060, the original model spills into shared system memory (9.5 GB needed > 8 GB physical) and decodes at **2.2 tok/s**; the pruned model fits resident and decodes at ~10 tok/s (4.4× speedup).

What was kept

| bucket | tokens |
|---|---|
| BOS / EOS / PAD / UNK / MASK + chat-template + multimodal sentinels | 24 |
| <0xXX> byte-fallback | 256 |
| ASCII + Latin-1 + Latin-Extended-A + curly quotes / em-dash / ellipsis / €£¥¢ / © ® ™ / NBSP | 152,601 |
| zero-padding to a multiple of 256 | 251 |
| total | 156,160 |
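
As an illustration of the keep rule, a token can be classified by checking that every character falls in the listed ranges. This is a minimal sketch, not the logic in prune.py: the function name, the exact set of extra characters, and the inclusion of the SentencePiece-style word-boundary marker ▁ are all assumptions.

```python
# ASCII, Latin-1 Supplement and Latin Extended-A together span U+0000..U+017F.
# NBSP, the pound, yen and cent signs, and the copyright and registered signs
# already sit inside Latin-1.
LATIN_EXTENDED_A_END = 0x017F

# Kept characters that lie above that range.
EXTRA_ALLOWED = frozenset(
    "\u2018\u2019\u201c\u201d"  # curly single and double quotes
    "\u2014\u2026"              # em dash, ellipsis
    "\u20ac\u2122"              # euro sign, trademark sign
    "\u2581"                    # word-boundary marker (assumed, not from the source)
)


def is_kept(token: str) -> bool:
    """Return True if every character of the token is in the kept ranges."""
    return all(
        ord(ch) <= LATIN_EXTENDED_A_END or ch in EXTRA_ALLOWED for ch in token
    )


for tok in ["hello", "\u2581caf\u00e9", "\u201cquote\u201d", "привет", "日本"]:
    print(repr(tok), is_kept(tok))
```

Special tokens and <0xXX> byte tokens are plain ASCII strings, so a rule of this shape would keep them as well.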

What was dropped: ~6,300 unused reserved slots (<unusedNNNN>), and ~103,000 tokens belonging to other scripts (CJK, Devanagari, Cyrillic, Bengali, Arabic, Hangul, Thai, Hiragana/Katakana, Greek, Hebrew, Tamil, emoji, etc.). The byte-fallback layer is intact, so the tokenizer can still encode arbitrary UTF-8 input — just inefficiently.
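
To see why byte-fallback encoding is inefficient, consider a character whose dedicated token was dropped: in the worst case it is spelled as one <0xXX> token per UTF-8 byte. A small sketch (the helper name is hypothetical, and the real tokenizer would first try any BPE merges that remain in the vocabulary):

```python
def byte_fallback_tokens(text: str) -> list[str]:
    """Spell text as one <0xXX> token per UTF-8 byte."""
    return [f"<0x{b:02X}>" for b in text.encode("utf-8")]


print(byte_fallback_tokens("日"))               # ['<0xE6>', '<0x97>', '<0xA5>']
print(len(byte_fallback_tokens("こんにちは")))  # 15
```

Five Japanese characters become fifteen tokens, which costs both context length and decoding steps.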

Quick start

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "ugonfor/gemma-4-E2B-it-en"

tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    dtype=torch.bfloat16,
    device_map="cuda",
)

msgs = [{"role": "user", "content": "In one short paragraph, explain what model pruning is."}]
# Render the chat template and tokenize in one step, then move to the model's device.
inputs = tok.apply_chat_template(
    msgs, add_generation_prompt=True, return_tensors="pt", return_dict=True
).to(model.device)

# Greedy decoding, so the output is deterministic.
out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt.
print(tok.decode(out[0, inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```

Greedy generation on this prompt produces output that is byte-identical to that of the original google/gemma-4-E2B-it.

Limitations

  • English only. Non-English text still tokenizes (via byte-fallback) but generates poorly: the dedicated tokens for that text, and their rows in both embedding tables, have been removed.
  • No quality eval yet. Decoding matches the base model on the validation prompt above; a proper English perplexity / benchmark sweep is future work.
  • Vision and audio encoders are unchanged, but the multimodal token IDs were renumbered by the prune. The provided config.json / generation_config.json / tokenizer_config.json already reflect the new IDs; if you wire up the multimodal pipeline by hand, read the IDs from those files rather than reusing values from the base model.
  • No fine-tuning was done to recover any quality loss. None was observed on simple greedy English prompts, but extensive evaluation has not been performed.
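
When wiring the multimodal pipeline by hand, one way to avoid hard-coding stale IDs is to read them from the shipped config at run time. The sketch below walks a parsed config and collects every integer field whose key ends in _token_id. That naming pattern and the sample keys are assumptions for illustration, not a statement of the actual config schema.

```python
import json
from typing import Any


def collect_token_ids(node: Any, prefix: str = "") -> dict[str, int]:
    """Recursively gather integer fields whose key ends in '_token_id'."""
    found: dict[str, int] = {}
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else key
            if key.endswith("_token_id") and isinstance(value, int):
                found[path] = value
            else:
                found.update(collect_token_ids(value, path))
    elif isinstance(node, list):
        for i, item in enumerate(node):
            found.update(collect_token_ids(item, f"{prefix}[{i}]"))
    return found


# Synthetic stand-in for config.json; these keys and values are made up.
sample = json.loads(
    '{"bos_token_id": 2, "text_config": {"pad_token_id": 0}, "vocab_size": 156160}'
)
print(collect_token_ids(sample))
```

For the real model, the same function can be applied to the result of json.load on the downloaded config.json.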

How it was built

See prune.py in the source repository (single script, ~200 lines) — it classifies the vocab, rebuilds tokenizer.json (filtered vocab + filtered BPE merges), index_selects the kept rows of both embedding tables, pads to a multiple of 256, and remaps every token-ID reference in config.json and generation_config.json.
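
The steps above can be sketched without any framework. This is an illustrative reconstruction, not the contents of prune.py: all function names are hypothetical, embedding tables are plain lists of rows, and the merge rule (keep a merge only if both parts and their concatenation survive) is an assumption about how the merges were filtered. With PyTorch, the row gather would be a call to index_select along dimension 0.

```python
def filter_merges(merges, kept_vocab):
    """Keep a BPE merge only if both inputs and the merged token survive."""
    return [
        (a, b) for a, b in merges
        if a in kept_vocab and b in kept_vocab and (a + b) in kept_vocab
    ]


def select_and_pad(table, keep_ids, multiple=256):
    """Gather kept rows in order, then zero-pad the row count to a multiple."""
    width = len(table[0])
    rows = [list(table[i]) for i in keep_ids]
    pad = -len(rows) % multiple  # rows needed to reach the next multiple
    rows.extend([0.0] * width for _ in range(pad))
    return rows


def remap_ids(config, keep_ids):
    """Rewrite integer *_token_id fields from old IDs to new row positions.

    Raises KeyError if the config references a token that was dropped.
    """
    old_to_new = {old: new for new, old in enumerate(keep_ids)}
    return {
        k: old_to_new[v] if k.endswith("_token_id") and isinstance(v, int) else v
        for k, v in config.items()
    }


# Toy example: a 6-row table pruned to rows 0, 2 and 5, padded to a multiple of 4.
table = [[float(i)] * 2 for i in range(6)]
keep = [0, 2, 5]
pruned = select_and_pad(table, keep, multiple=4)
print(len(pruned), pruned[1])
print(remap_ids({"eos_token_id": 5, "vocab_size": 6}, keep))
print(filter_merges([("a", "b"), ("a", "c")], {"a", "b", "c", "ab"}))
```

Note that remap_ids leaves vocab_size untouched; in a real pipeline that field would be set to the padded row count separately.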

License & attribution

This derivative is released under the same license as the base model: Gemma Terms of Use (Apache 2.0). All original Gemma 4 license terms — including use restrictions and the requirement to pass the license to downstream users — continue to apply.
