When using beam search, the model's output may end prematurely.

#9
by syGOAT - opened

Hello! I sincerely apologize for having to disturb you so close to the Spring Festival. I am currently facing some challenges deploying ALMA-R-13B with vLLM.
Starting the vLLM server:

python -m vllm.entrypoints.openai.api_server \
    --model "./translate/ALMA-13B-R" \
    --served-model-name "ALMA" --port 8000 --tensor-parallel-size 2

When I use the following request body to call the service:

data = {
    "model": "ALMA",
    "prompt": "Translate this from English to Chinese:\nEnglish: " + english_text + "\nChinese:",
    "max_tokens": 512,
    "top_p": 1,
    "temperature": 0,
    "use_beam_search": True,
    "best_of": 5,
}
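
For completeness, this is roughly how the request is sent (a minimal sketch continuing from the data body above; the localhost URL is just my assumption based on the --port 8000 setting, and reading choices[0].text assumes the usual OpenAI-style completion response):

import requests

# Send the request body above to vLLM's OpenAI-compatible completions endpoint
response = requests.post("http://localhost:8000/v1/completions", json=data)
response.raise_for_status()
print(response.json()["choices"][0]["text"])  # the model's translation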

For some english_text inputs, something goes wrong. Here is one example:

('Dynamic contact angle of water on the emulsion films was determined by a Krüss K-12 contact-angle analysis instrument (Krüss Co., Germany) at 25 °C. Cut emulsion membrane into rectangle spline samples (2 × 1 cm) and placed them into the clip set of the instrument. The wetting circumference was first measured in n-hexane and the contact angle was then measured with water as the test liquid',
  '水在乳液膜上的动态接触角由 Krüss K-12 接触角分析仪(德国 Krüss 公司)在 25 °C 下测定。将乳液膜切成矩形花键样品(2 × 1厘米),并将它们放入仪器的夹子组。首先在正己烷中测量润湿周长,然后用水作为测试液体测量接触角',
  ' 水在乳液薄膜上的动态接触角的测定,采用由克鲁斯公司(德国)生产的K-12接触角分析仪,在25℃温度下进行。将乳液')

The first line is the original English text, the second is the DeepL translation, and the third is the model's output. As can be seen, the model's translation stops partway through, even though I set max_tokens to a sufficiently large value (512).
Whenever I translate a large batch of texts, a few of them always show this behavior.
Considering that the model rendered this part as "将乳液" rather than "切乳液", it must at least have recognized "Cut emulsion membrane into rectangle spline samples", so its translation reasoning seems fine. Perhaps it prematurely generated an end-of-sequence token?
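One way to check this hypothesis is to look at finish_reason in the response (a sketch, assuming the usual OpenAI-style fields):

# "length" would mean max_tokens was exhausted; "stop" suggests the model emitted EOS on its own
print(response.json()["choices"][0]["finish_reason"])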
However, if I set use_beam_search to False (and best_of to 1), the translation comes out complete.
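Concretely, the only change on my side is the following (a sketch reusing the data body above):

data["use_beam_search"] = False  # disable beam search
data["best_of"] = 1              # single candidate; with temperature 0 this amounts to greedy decoding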
Could you please check what is going on with this? Sincere thanks!

Hi, sorry about the delayed response! I have just come back online after the Spring Festival. This is very similar to this issue: https://github.com/fe1ixxu/ALMA/issues/24. It looks like a bug on the vLLM side.
