Handling Unknown Source Languages in Translation Templates

#19
by sandeep1401 - opened

Hi All,

I am using this model for a translation task and observed some behavior related to the chat template and source_lang_code handling.

When the source language is known, I can specify it explicitly, and the model works correctly.

Example:

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "source_lang_code": "hi",   # Source language (Hindi)
                "target_lang_code": "en",   # Target language (English)
                "text": "नमस्ते, आप कैसे हैं?",
            }
        ],
    }
]

However, the issue occurs when the source language is unknown.

Case 1: Removing source_lang_code or using "auto"

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "source_lang_code": "auto",   # or remove this line
                "target_lang_code": "en",
                "text": "नमस्ते, आप कैसे हैं?",
            }
        ],
    }
]

In this case, I receive a dictionary/key-related error.
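One way to surface this failure earlier is to check the content item before it reaches the chat template. The sketch below is only an illustration: `find_template_problems`, `REQUIRED_KEYS`, and the `supported_codes` set are hypothetical names, and the real list of codes accepted by the template is not reproduced here.

```python
# Keys the examples in this thread put on each content item.
REQUIRED_KEYS = ("type", "source_lang_code", "target_lang_code", "text")


def find_template_problems(item, supported_codes):
    """Return a list of human-readable problems; an empty list means the item looks OK.

    `supported_codes` is an assumption: pass whatever set of language codes
    your copy of the chat template actually accepts.
    """
    problems = []
    for key in REQUIRED_KEYS:
        if key not in item:
            problems.append(f"missing key: {key}")
    for key in ("source_lang_code", "target_lang_code"):
        code = item.get(key)
        if code is not None and code not in supported_codes:
            problems.append(f"unsupported {key}: {code!r}")
    return problems


# Placeholder set for illustration only, not the template's real list.
supported = {"hi", "en"}
item = {
    "type": "text",
    "source_lang_code": "auto",
    "target_lang_code": "en",
    "text": "नमस्ते, आप कैसे हैं?",
}
print(find_template_problems(item, supported))
```

With the placeholder set above, both the "auto" variant and the variant with source_lang_code removed are reported before any template code runs.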

Case 2: Setting the source language to "en" by default

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "source_lang_code": "en",
                "target_lang_code": "en",
                "text": "नमस्ते, आप कैसे हैं?",
            }
        ],
    }
]

Interestingly, the model still generates a proper response even though the actual source language is Hindi.

My Observation

It seems that:

  • the model may already be capable of automatically detecting the source language internally,
  • but the template currently requires source_lang_code to be explicitly provided,
  • and "auto" is not yet supported.

Suggestion

It would be very useful if the template supported:

"source_lang_code": "auto"

or handled missing source_lang_code gracefully by automatically detecting the language.

This would improve usability for real-world multilingual translation pipelines where the source language may not always be known beforehand.

Please let me know if this behavior is expected, a bug, or if support for "auto" is planned in future updates.

Thank you.

Hi @sandeep1401
Thank you for the detailed feedback. This is expected behavior at the chat template layer. The KeyError you see in Case 1 happens because the current template parser strictly requires a supported language code to format the underlying prompt. It is not a bug in the model itself.

Could you also briefly explain Case 2?

Google org

To clarify what's happening in Case 2: the chat template validates that source_lang_code is a recognized ISO code. Because "en" is valid, it passes that check cleanly, which is why you don't see an error. As for why the translation still worked: the input text is unambiguously Hindi, and target_lang_code: "en" remains a clear directive. The model acts on what it can determine from the input and produces the correct output.
However, relying on this is not intended usage. The chat template is explicitly designed around correct, explicit language codes to ensure consistent and deterministic behavior. We recommend always passing the correct source_lang_code when known.
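Until the template accepts "auto", one possible workaround is to fill in source_lang_code on the client side before building the messages. The sketch below is not part of the model or its template: `guess_source_lang`, `build_messages`, and the `SCRIPT_TO_LANG` table are hypothetical helpers, and the guess is based only on the Unicode script of the text. A script is not a language (Devanagari is also used for Marathi and Nepali, for example), so a proper language-identification step would be a better fit in a real pipeline.

```python
import unicodedata
from collections import Counter

# Hypothetical mapping from a Unicode script name to one default language code.
# This is a crude assumption: several languages share each script.
SCRIPT_TO_LANG = {
    "DEVANAGARI": "hi",
    "LATIN": "en",
    "CYRILLIC": "ru",
    "ARABIC": "ar",
}


def guess_source_lang(text, default=None):
    """Guess a language code from the most common script among the letters."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, spaces, combining marks
        # Unicode character names start with the script, e.g. "DEVANAGARI LETTER NA".
        script = unicodedata.name(ch, "").split(" ", 1)[0]
        if script in SCRIPT_TO_LANG:
            counts[script] += 1
    if not counts:
        return default
    return SCRIPT_TO_LANG[counts.most_common(1)[0][0]]


def build_messages(text, target_lang_code, source_lang_code=None):
    """Build the messages list, guessing the source code when it is None or "auto"."""
    if source_lang_code in (None, "auto"):
        source_lang_code = guess_source_lang(text)
    if source_lang_code is None:
        raise ValueError("could not determine source_lang_code; pass it explicitly")
    return [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "source_lang_code": source_lang_code,
                    "target_lang_code": target_lang_code,
                    "text": text,
                }
            ],
        }
    ]


messages = build_messages("नमस्ते, आप कैसे हैं?", "en", "auto")
```

The template then always receives an explicit code, which keeps its behavior as intended while letting the caller decide how the code is obtained.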

Hi @sandeep1401, I wrote some code in my repo to add a new language. It is a work in progress, but you can use it. See https://github.com/MNIKIEMA/translategemma-finetune.
In my case, the loss is decreasing, but the translation quality is not good. Any tips are welcome, @sandeep1401.

Thanks for the suggestion
