mlx-community/translategemma-4b-it-4bit_immersive-translate

This repository is an MLX 4-bit build of google/translategemma-4b-it, derived from mlx-community/translategemma-4b-it-4bit and reconfigured for use with Immersive Translate through an OpenAI-compatible local server (for example mlx_lm.server). The goal is reliable on-device translation on Apple Silicon, which requires two things: correct stop tokens, and a chat template that understands Immersive Translate’s plain-text placeholders.

What was changed (configuration)

  • EOS / stopping (generation_config.json)
    eos_token_id is set to [106, 1] so generation can end on either the Gemma end-of-turn id (106) or the <eos> token (1). This avoids runaway generation or awkward truncation when OpenAI-style APIs stream or strip tokens differently than a bare generate() loop.

  • Chat template (chat_template.jinja)
    The template was reformatted for Immersive Translate while staying compatible with TranslateGemma-style messages:

    • If the user message is a string and contains <<<source>>>, it is parsed into source language, target language, and body text, then expanded into the same “professional translator” instruction used in the official template.
    • If the content is the official list-of-dicts shape (source_lang_code, target_lang_code, text), that path is unchanged.
    • The generation prompt ends with <start_of_turn>model so servers that always set add_generation_prompt=True still align with how Gemma expects the assistant turn to begin.
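
As a rough illustration of the string branch described above, the placeholder parsing can be sketched in Python. This is only an approximation: the real logic lives in chat_template.jinja and may differ in detail, and the function name parse_placeholder_message is hypothetical.

```python
import re

# Hypothetical Python equivalent of the template's string branch.
# The actual implementation is Jinja inside chat_template.jinja.
_PLACEHOLDER = re.compile(
    r"<<<source>>>(.*?)<<<target>>>(.*?)<<<text>>>(.*)", re.DOTALL
)


def parse_placeholder_message(content):
    """Split an Immersive Translate prompt into source, target, and text.

    Returns None when the string does not carry the placeholders, in which
    case a caller would treat it as an ordinary message.
    """
    match = _PLACEHOLDER.search(content)
    if match is None:
        return None
    source, target, text = match.groups()
    # Language fields are trimmed; the body text is kept exactly as sent.
    return {"source": source.strip(), "target": target.strip(), "text": text}
```

The three fields would then be substituted into the translator instruction, which is not reproduced here.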

Together, these updates make the OpenAI-compatible path (chat completions with a single user string) match what Immersive Translate sends, without fighting the tokenizer or EOS handling.
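
To make the stopping rule concrete, here is a minimal sketch of what "end on either id" means for a stream of token ids. It is not the mlx_lm generation loop. The helper name take_until_eos and the sample ids other than 106 and 1 are made up for illustration.

```python
# The two stop ids configured in this repository's generation_config.json.
EOS_TOKEN_IDS = frozenset({106, 1})


def take_until_eos(token_ids, eos_ids=EOS_TOKEN_IDS):
    """Return the tokens produced before the first stop id (hypothetical helper)."""
    kept = []
    for token_id in token_ids:
        if token_id in eos_ids:
            break  # stop on whichever end marker the model emits first
        kept.append(token_id)
    return kept
```

With only one of the two ids configured, a sequence that ends with the other one would keep going until the token limit is reached.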

MLX OpenAI-compatible inference server

On macOS with Apple Silicon, use uv to create a clean Python environment, then run the OpenAI-style HTTP server that ships with mlx-lm:

uv venv --python 3.12 --seed
source .venv/bin/activate
uv pip install mlx-lm
mlx_lm.server --model mlx-community/translategemma-4b-it-4bit_immersive-translate

Immersive Translate (browser extension)

Add a custom OpenAI-compatible provider that points at your local server: http://localhost:8080/v1/chat/completions (8080 is the default mlx_lm.server port).

  1. Leave System prompt empty (blank).

  2. Set Prompt and Multiple prompt to:

    <<<source>>>{{from}}<<<target>>>{{to}}<<<text>>>{{text}}

The extension fills {{from}}, {{to}}, and {{text}}; the model’s chat template turns that into the proper TranslateGemma-style user turn.
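
To check the server without the extension, the same request can be sketched with the Python standard library. This is an assumption-laden example: fill_prompt is a hypothetical stand-in for the extension's substitution step, and omitting the model field assumes the server falls back to the model passed with --model.

```python
import json
import urllib.request

SERVER_URL = "http://localhost:8080/v1/chat/completions"
PROMPT_TEMPLATE = "<<<source>>>{{from}}<<<target>>>{{to}}<<<text>>>{{text}}"


def fill_prompt(from_lang, to_lang, text, template=PROMPT_TEMPLATE):
    """Hypothetical stand-in for how the extension fills the placeholders."""
    return (
        template.replace("{{from}}", from_lang)
        .replace("{{to}}", to_lang)
        .replace("{{text}}", text)
    )


def build_payload(from_lang, to_lang, text):
    """Build a chat completions body with a single user string.

    No system message is sent, matching the empty System prompt setting.
    The "model" field is left out on the assumption that the server uses
    the model it was started with.
    """
    content = fill_prompt(from_lang, to_lang, text)
    return {"messages": [{"role": "user", "content": content}]}


def translate(from_lang, to_lang, text, url=SERVER_URL):
    """POST the payload to the local server and return the reply text."""
    body = json.dumps(build_payload(from_lang, to_lang, text)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        reply = json.load(response)
    # Standard OpenAI chat completions response shape.
    return reply["choices"][0]["message"]["content"]
```

With the server running, translate("English", "French", "Hello world") should return the model's translation as a string.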

Local Python (optional)

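from mlx_lm import load, generate

# Download (or reuse from the local cache) the quantized weights and tokenizer.
model, tokenizer = load("mlx-community/translategemma-4b-it-4bit_immersive-translate")

# A single user string in the Immersive Translate placeholder format.
messages = [
    {
        "role": "user",
        "content": "<<<source>>>English<<<target>>>French<<<text>>>Hello world",
    }
]
# Render the chat template to a prompt string that ends with the model turn.
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
# verbose=True prints the generated text as it is produced.
response = generate(model, tokenizer, prompt=prompt, verbose=True)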