Repetition collapse when continuing an assistant turn (`continue_final_message=true`)

#114
by CalinAstroBeam - opened

Affected models: google/gemma-4-26B-A4B-it and google/gemma-4-31B-it

When continuing a partial assistant message with continue_final_message=true /
add_generation_prompt=false, the majority of completions degenerate into a
single repeated token or short phrase.

Reproduces on two independent inference stacks:

  • self-hosted vLLM
  • 3rd party Gemma4 hosting provider

Example request

{
    "model": "google/gemma-4-26B-A4B-it",
    "messages": [
        {"role": "user", "content": "user: Hello?"},
        {"role": "assistant", "content": "jack: Hello, I'm Jack! What product are you selling?"},
        {"role": "user", "content": "user: I am selling meatballs"},
        {"role": "assistant", "content": "jack: Meatballs, you say? Gotta love"}
    ],
    "max_tokens": 512,
    "temperature": 1.0,
    "chat_template_kwargs": {"enable_thinking": false},
    "skip_special_tokens": false,
    "continue_final_message": true,
    "add_generation_prompt": false
}

Observed Response

 meatballs. Now, let's get these same owners owne owners owners owners owners owners owners owners owners owners owners owners owners owners owners owners owners owners owners owners owners owners owners owners owners owners owners owners owners owners owners owners owners owners owners owners owners owners owners owners owners owners owners owners owners owners owners owners owners owners owners owners owners owners owners owners owners owners owners owners owners owners owners owners owners owners owners owners owners owners owners owners owners owners owners owners owners owners owners owners owners

(repeats until max_tokens)

This makes continue_final_message unusable for production assistant-prefill /
"continue the turn" flows.

Sign up or log in to comment