Handling Unknown Source Languages in Translation Templates

#19

by sandeep1401 - opened 13 days ago

Hi All,

I am using this model for a translation task and observed some behavior related to the chat template and source_lang_code handling.

When the source language is known, I can specify it explicitly, and the model works correctly.

Example:

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "source_lang_code": "hi",   # Source language (Hindi)
                "target_lang_code": "en",   # Target language (English)
                "text": "नमस्ते, आप कैसे हैं?",
            }
        ],
    }
]

However, the issue occurs when the source language is unknown.

Case 1: Removing `source_lang_code` or using `"auto"`

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "source_lang_code": "auto",   # or remove this line
                "target_lang_code": "en",
                "text": "नमस्ते, आप कैसे हैं?",
            }
        ],
    }
]

In this case, I receive a dictionary/key-related error.

Case 2: Setting the source language to `"en"` by default

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "source_lang_code": "en",
                "target_lang_code": "en",
                "text": "नमस्ते, आप कैसे हैं?",
            }
        ],
    }
]

Interestingly, the model still generates a proper response even though the actual source language is Hindi.

My Observation

It seems that:

- the model may already be capable of automatically detecting the source language internally,
- but the template currently requires source_lang_code to be explicitly provided,
- and "auto" is not yet supported.

Suggestion

It would be very useful if the template supported:

"source_lang_code": "auto"

or handled missing source_lang_code gracefully by automatically detecting the language.

This would improve usability for real-world multilingual translation pipelines where the source language may not always be known beforehand.

Please let me know if this behavior is expected, a bug, or if support for "auto" is planned in future updates.

Thank you.

pannaga10

Google org 2 days ago

edited 2 days ago

Hi @sandeep1401
Thank you for the detailed feedback. This is expected behavior at the chat template layer. The KeyError you see in Case 1 happens because the current template parser strictly requires a supported language code to format the underlying prompt. It is not a bug in the model itself.
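
Until the template accepts anything other than an explicit code, one option is to validate the codes on the caller's side so the failure is readable rather than a KeyError raised from inside the template. The following is only a minimal sketch: `SUPPORTED_LANG_CODES` and `build_translation_messages` are hypothetical names, and the set of codes is a placeholder, not the list the template actually supports.

```python
# Placeholder subset for illustration only; the authoritative list of
# supported codes is defined by the model's chat template, not here.
SUPPORTED_LANG_CODES = {"en", "hi", "fr", "de", "es"}


def build_translation_messages(text, target_lang_code, source_lang_code=None):
    """Build the messages payload, failing early with a readable error."""
    # A missing value or "auto" is rejected up front, before templating.
    if source_lang_code is None or source_lang_code == "auto":
        raise ValueError(
            "source_lang_code must be an explicit language code; "
            "a missing value or 'auto' is not accepted"
        )
    # Check both codes against the (assumed) supported set.
    for name, code in (
        ("source_lang_code", source_lang_code),
        ("target_lang_code", target_lang_code),
    ):
        if code not in SUPPORTED_LANG_CODES:
            raise ValueError(f"{name}={code!r} is not a supported language code")
    # Same message structure as in the examples earlier in this thread.
    return [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "source_lang_code": source_lang_code,
                    "target_lang_code": target_lang_code,
                    "text": text,
                }
            ],
        }
    ]
```

Calling `build_translation_messages("नमस्ते, आप कैसे हैं?", "en", "hi")` returns the same structure as the working example, while passing `"auto"` or omitting the source code raises a `ValueError` before the template is ever applied.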

sandeep1401

2 days ago

Could you also briefly explain Case 2?

pannaga10

Google org 1 day ago

To clarify what's happening in Case 2: the chat template validates that source_lang_code is a recognized ISO code. Because "en" is valid, it passes that check cleanly, which is why you don't see an error. As for why the translation still worked: the input text is unambiguously Hindi, and target_lang_code: "en" remains a clear directive. The model acts on what it can determine from the input and produces the correct output.
However, relying on this is not intended usage. The chat template is explicitly designed around correct, explicit language codes to ensure consistent and deterministic behavior. We recommend always passing the correct source_lang_code when known.
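
When the source language really is unknown, the code has to be resolved before the message is built. As a rough illustration only, the sketch below guesses a code from the dominant Unicode script using the standard library. The function name `guess_source_lang_code` and the script-to-language mapping are assumptions made for this example; a script does not uniquely identify a language (Devanagari is also used for Marathi and Nepali, for instance), so a dedicated language-identification tool would be the more reliable choice in a real pipeline.

```python
import unicodedata
from collections import Counter

# Hypothetical defaults: one language code per script. This is a coarse
# assumption, since several languages can share the same script.
SCRIPT_TO_DEFAULT_LANG = {
    "DEVANAGARI": "hi",
    "LATIN": "en",
    "CYRILLIC": "ru",
    "ARABIC": "ar",
}


def guess_source_lang_code(text, fallback=None):
    """Guess a language code from the most frequent known script in `text`."""
    counts = Counter()
    for ch in text:
        # Skip digits, punctuation, whitespace, and combining marks.
        if not ch.isalpha():
            continue
        # Unicode names start with the script, e.g. "DEVANAGARI LETTER NA".
        script = unicodedata.name(ch, "").split(" ", 1)[0]
        if script in SCRIPT_TO_DEFAULT_LANG:
            counts[script] += 1
    if not counts:
        # Nothing recognizable: let the caller decide what to do.
        return fallback
    script, _ = counts.most_common(1)[0]
    return SCRIPT_TO_DEFAULT_LANG[script]


source_lang_code = guess_source_lang_code("नमस्ते, आप कैसे हैं?")
```

The guessed value can then be passed as `source_lang_code` in the message. If the function returns the fallback, it is probably safer to ask the user or skip the item than to send a default such as "en".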

madoss

1 day ago

Hi @sandeep1401, I wrote some code in my repo to add a new language. It is a work in progress, but you are welcome to use it: https://github.com/MNIKIEMA/translategemma-finetune.
In my case, the loss is decreasing, but the translation quality is still poor. Any tips are welcome, @sandeep1401.

sandeep1401

about 17 hours ago

Thanks for the suggestion
