mlx-community/translategemma-4b-it-4bit_immersive-translate
This repository is an MLX 4-bit build of google/translategemma-4b-it, derived from mlx-community/translategemma-4b-it-4bit and reconfigured for Immersive Translate when you call it through an OpenAI-compatible local server (for example mlx_lm.server). The goal is reliable on-device translation on Apple Silicon: correct stop tokens, and a chat template that understands Immersive Translate’s plain-text placeholders.
What was changed (configuration)
EOS / stopping (generation_config.json)
eos_token_id is set to [106, 1] so generation can end on either the Gemma end-of-turn id (106) or the <eos> token (1). That avoids runaway generation or awkward truncation when OpenAI-style APIs stream or strip tokens differently than a bare generate() loop.
Chat template (chat_template.jinja)
The template was reformatted for Immersive Translate while staying compatible with TranslateGemma-style messages:
- If the user message is a string and contains <<<source>>>, it is parsed into source language, target language, and body text, then expanded into the same “professional translator” instruction used in the official template.
- If the content is the official list-of-dicts shape (source_lang_code, target_lang_code, text), that path is unchanged.
- The generation prompt ends with <start_of_turn>model, so servers that always set add_generation_prompt=True still align with how Gemma expects the assistant turn to begin.
Together, these updates make the OpenAI-compatible path (chat completions with a single user string) match what Immersive Translate sends, without fighting the tokenizer or EOS handling.
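The string path of the chat template can be pictured with a small Python sketch. This is only an illustration of the parsing idea: the real logic lives in chat_template.jinja, and the names parse_placeholders and TranslationRequest, as well as the whitespace stripping, are assumptions made for this example.

```python
from typing import NamedTuple, Optional

SOURCE_TAG = "<<<source>>>"
TARGET_TAG = "<<<target>>>"
TEXT_TAG = "<<<text>>>"


class TranslationRequest(NamedTuple):
    """Hypothetical container for the three parsed fields."""
    source: str
    target: str
    text: str


def parse_placeholders(content: str) -> Optional[TranslationRequest]:
    """Split an Immersive Translate style string into its three parts.

    Returns None when the markers are missing, which stands in for
    "fall through to the other template paths" in this sketch.
    """
    if SOURCE_TAG not in content:
        return None
    # Drop anything before <<<source>>>, then cut at the next two markers.
    _, _, rest = content.partition(SOURCE_TAG)
    source, found_target, rest = rest.partition(TARGET_TAG)
    if not found_target:
        return None
    target, found_text, text = rest.partition(TEXT_TAG)
    if not found_text:
        return None
    # Stripping whitespace is an assumption, not confirmed template behavior.
    return TranslationRequest(source.strip(), target.strip(), text.strip())
```

Calling parse_placeholders on "<<<source>>>English<<<target>>>French<<<text>>>Hello world" yields the fields English, French, and Hello world, which the template would then place into the translator instruction.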
MLX OpenAI-compatible inference server
On macOS with Apple Silicon, use uv for a clean Python environment and run the built-in OpenAI-style HTTP server:
uv venv --python 3.12 --seed
source .venv/bin/activate
uv pip install mlx-lm
mlx_lm.server --model mlx-community/translategemma-4b-it-4bit_immersive-translate
Immersive Translate (browser extension)
Add a custom OpenAI-compatible provider that points at your local server: http://localhost:8080/v1/chat/completions (8080 is the default mlx_lm.server port).
Leave System prompt empty (blank).
Set Prompt and Multiple prompt to:
<<<source>>>{{from}}<<<target>>>{{to}}<<<text>>>{{text}}
The extension fills {{from}}, {{to}}, and {{text}}; the model’s chat template turns that into the proper TranslateGemma-style user turn.
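To check the server without the extension, a request of the same shape can be sent from Python. The sketch below uses only the standard library and assumes the server from the previous section is listening on localhost:8080; build_payload, translate, and the max_tokens value of 512 are illustrative choices, not settings taken from Immersive Translate.

```python
import json
import urllib.request

SERVER_URL = "http://localhost:8080/v1/chat/completions"
MODEL_ID = "mlx-community/translategemma-4b-it-4bit_immersive-translate"


def build_payload(source: str, target: str, text: str, max_tokens: int = 512) -> dict:
    """Build a chat-completions body with a single user string."""
    # Same layout as the Prompt setting, with the placeholders filled in.
    content = f"<<<source>>>{source}<<<target>>>{target}<<<text>>>{text}"
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": content}],
        "max_tokens": max_tokens,  # illustrative cap, adjust as needed
    }


def translate(source: str, target: str, text: str,
              url: str = SERVER_URL, timeout: float = 60.0) -> str:
    """POST the payload to the local server and return the reply text."""
    body = json.dumps(build_payload(source, target, text)).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        reply = json.load(response)
    # OpenAI-style responses carry the text under choices[0].message.content.
    return reply["choices"][0]["message"]["content"]
```

With the server running, translate("English", "French", "Hello world") should return the model's translation. No system message is included, mirroring the empty System prompt setting.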
Local Python (optional)
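from mlx_lm import load, generate

# Downloads the weights on first use, then loads the model and tokenizer.
model, tokenizer = load("mlx-community/translategemma-4b-it-4bit_immersive-translate")

# A single user string in the Immersive Translate placeholder format.
messages = [
    {
        "role": "user",
        "content": "<<<source>>>English<<<target>>>French<<<text>>>Hello world",
    }
]

# Render the chat template to text; the prompt ends with the model turn marker.
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
response = generate(model, tokenizer, prompt=prompt, verbose=True)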
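The official list-of-dicts path mentioned earlier can be exercised locally in the same way by swapping the messages value. The helper below is a sketch: the field names source_lang_code, target_lang_code, and text come from this README, while the "type": "text" key, the function name official_message, and the example language codes are assumptions that should be checked against the official TranslateGemma template.

```python
def official_message(source_lang_code: str, target_lang_code: str, text: str) -> list:
    """Build a messages list in the list-of-dicts content shape."""
    return [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",  # assumption: not confirmed by this README
                    "source_lang_code": source_lang_code,
                    "target_lang_code": target_lang_code,
                    "text": text,
                }
            ],
        }
    ]


# Example language codes are illustrative.
messages = official_message("en", "fr", "Hello world")
```

The resulting messages list would be passed to tokenizer.apply_chat_template in place of the plain-string version.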