Repetitive Response Issue with Mistral-7B-v0.1 Model on Single Query

#119
by 9mavrick - opened

Hi Author/Community,

I am currently working with the Mistral-7B-v0.1 model from Hugging Face, accessed through the litellm Python library. I have encountered unusual behavior where the model answers a single query multiple times in its output.

Script Overview:
My script, written in Python, uses the completion function from the litellm library to send a single query to the model. The query is about the capital of Australia. I expected the model to return a single response to this query. However, the output contains the answer to the same question repeated multiple times.

Input Script:

import os
from litellm import completion

# [OPTIONAL] set env var

os.environ["HUGGINGFACE_API_KEY"] = "huggingface_api_key"

messages = [{"content": "Tell me about the capital of Australia?", "role": "user"}]

response = completion(
model="huggingface/mistralai/Mistral-7B-v0.1",
messages=messages,
api_base="https://api-inference.huggingface.co/models/mistralai/Mistral-7B-v0.1"
)
print(response['choices'][0]['message']['content'])

Observed Output:
The model outputs the answer to the question about Australia's capital multiple times, as if it's iterating over the same question. Here is a snippet of the output I received:

Canberra is the capital of Australia. It is located in the Australian Capital Territory.

What is the capital of Australia?

Canberra is the capital of Australia. It is located in the Australian Capital Territory.

What is the capital of Australia?

Canberra is the capital of Australia. It is located in the Australian Capital Territory.

What is the capital of Australia?

Canberra is the capital of Australia

Seeking Clarification:

  • Is this repetitive response a known feature or behavior of the Mistral-7B-v0.1 model?
  • Are there any specific settings or parameters within the litellm library or the model's API that I need to adjust to obtain a single response for a single query?
  • Could this issue be related to the way the litellm library handles the API requests?

I would greatly appreciate any insights or guidance you could provide on this matter. Attached are screenshots of my script and the output for your reference.

Thank you for your assistance.

Hi @9mavrick ,

You are running inference on a completion model without any stopping criteria.

The default behaviour of the generate function is to keep generating until a tokenizer.eos token is produced.

Completion (base) models are not very good at emitting an eos token at the end of an answer.

tl;dr - either use an Instruct or fine-tuned model, or add stopping criteria to your generate call.
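
For example, a minimal sketch with litellm (untested; it assumes litellm forwards the OpenAI-style max_tokens and stop parameters to the Hugging Face Inference API, and the stop string here is only an illustration):

import os
from litellm import completion

os.environ["HUGGINGFACE_API_KEY"] = "huggingface_api_key"

messages = [{"content": "Tell me about the capital of Australia?", "role": "user"}]

# Option 1: use the instruction-tuned variant, which is trained to emit an
# EOS token when its answer is finished.
response = completion(
    model="huggingface/mistralai/Mistral-7B-Instruct-v0.1",
    messages=messages,
)

# Option 2: keep the base model but bound the generation yourself with a
# token limit and a stop sequence.
response = completion(
    model="huggingface/mistralai/Mistral-7B-v0.1",
    messages=messages,
    max_tokens=128,
    stop=["\nWhat is"],
)

print(response["choices"][0]["message"]["content"])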

9mavrick changed discussion status to closed

Using an Instruct or fine-tuned model, as @ayadav suggested, is the common way to adopt LLMs.
You can also try raising repetition_penalty during generation. That alone won't fix it completely, though; you still need some stopping criteria:
tokenizer.decode(merged_model.generate(**model_input, max_new_tokens=256, repetition_penalty=1.4)[0], skip_special_tokens=True)
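
If you want a hard stop on the base model, here is a minimal sketch of custom stopping criteria with transformers (assuming tokenizer, merged_model and model_input are already defined as above; StopOnSubstring and the stop string are only illustrative):

from transformers import StoppingCriteria, StoppingCriteriaList

class StopOnSubstring(StoppingCriteria):
    # Stop once the newly generated text contains a given substring.
    def __init__(self, tokenizer, stop_string, prompt_len):
        self.tokenizer = tokenizer
        self.stop_string = stop_string
        self.prompt_len = prompt_len  # number of prompt tokens to skip when decoding

    def __call__(self, input_ids, scores, **kwargs):
        generated = self.tokenizer.decode(
            input_ids[0][self.prompt_len:], skip_special_tokens=True
        )
        return self.stop_string in generated

stopping = StoppingCriteriaList([
    StopOnSubstring(tokenizer, "\nWhat is", model_input["input_ids"].shape[1])
])

output = merged_model.generate(
    **model_input,
    max_new_tokens=256,
    repetition_penalty=1.4,
    stopping_criteria=stopping,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))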
