Generated responses too long/repetitive/redundant?

#7
by benjw - opened

When I run the examples provided on the model card page, the generated output does follow the instructions; however, instead of stopping after a single answer to the given query, the model keeps generating more example cases and responses. Is that expected? How can I make sure that I only get a single answer for a given query?
For instance, the example "Given a news article, classify its topic." includes some example articles and their classification results as part of the prompt (shown in black). The generated answer (shown in blue) for the last article, which is given without its answer in the prompt, does include the classification result, but then goes on to produce more example articles along with their classifications.

Together org

Hi, Ben. You can use either a length limit or a stop sequence to control when the model stops generating. If you're using Hugging Face transformers, please see the max_length and max_new_tokens parameters and the StoppingCriteria documentation.
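
For reference, here is a minimal sketch of the length-limit approach. The checkpoint name is a stand-in for whichever model you are actually loading, and 16 new tokens is an arbitrary budget, not a recommended value:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Stand-in checkpoint; substitute the model from this repository.
model_name = "EleutherAI/pythia-1.4b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Q: What is the capital of France?\nA:"
inputs = tokenizer(prompt, return_tensors="pt")

# max_new_tokens caps only the generated part (the prompt is not counted),
# so generation stops after at most 16 new tokens no matter what.
outputs = model.generate(
    **inputs,
    max_new_tokens=16,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```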

I have the same behavior with the "Q: The capital of France is?\nA:"

The output is:

Setting pad_token_id to eos_token_id:0 for open-end generation.
" Paris

Question: What is the name of the river that runs through the capital of France?
A: Seine

Question: What is the capital of France?
A: Paris

Question: What is the capital of France"

While "max_length", "max_new_length" parameters and the StoppingCriteria can help with this,
I obviously don't know what is the max_new_length and the stopping criteria I should use to cover all my cases.
Any suggestions how to make it work like in your example?
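
One option that does not require knowing the answer length in advance is a stop sequence: set a generous max_new_tokens as a safety net and stop as soon as the model starts fabricating a new question. The sketch below uses a custom StoppingCriteria; the checkpoint name is again a stand-in, and the "\nQ:" / "\nQuestion:" delimiters are guesses based on the prompt format above:

```python
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    StoppingCriteria,
    StoppingCriteriaList,
)

class StopOnStrings(StoppingCriteria):
    """Stop as soon as any stop string appears in the newly generated text."""

    def __init__(self, stop_strings, tokenizer, prompt_length):
        self.stop_strings = stop_strings
        self.tokenizer = tokenizer
        self.prompt_length = prompt_length  # number of prompt tokens to skip

    def __call__(self, input_ids, scores, **kwargs):
        generated = self.tokenizer.decode(input_ids[0][self.prompt_length:])
        return any(s in generated for s in self.stop_strings)

model_name = "EleutherAI/pythia-1.4b"  # stand-in checkpoint, not the model discussed here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Q: The capital of France is?\nA:"
inputs = tokenizer(prompt, return_tensors="pt")

stopping = StoppingCriteriaList(
    [StopOnStrings(["\nQ:", "\nQuestion:"], tokenizer, inputs["input_ids"].shape[1])]
)
outputs = model.generate(
    **inputs,
    max_new_tokens=64,             # generous safety net; the stop strings usually hit first
    stopping_criteria=stopping,
    pad_token_id=tokenizer.eos_token_id,
)
answer = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(answer)  # may still contain the stop string itself; trim it if needed
```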

Indeed, specifying a 'max_length' would mean making assumptions about the generated answer in advance, which misses the point: the model itself should generate the correct answer (and stop after that).
If I already knew the precise answer (and hence its length), there would be no point in querying the model. If I roughly limit the number of generated tokens to cover a variety of questions and answers, I would still have to figure out exactly where to split the actual answer from the additionally generated bogus tokens -- just as I do now, when I don't impose such limits.
Ideally, the model itself would yield a "stop/EOS token" directly after producing the answer "Paris". Is there perhaps another syntax/structure/prompt template we should use (instead of "Q: ...?\nA:") to let the model know we are instructing it, i.e., the same structure that was used in the training data for instruction tuning?
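
In the meantime, a pragmatic workaround for the splitting problem is to over-generate and then cut the decoded continuation at the first sign of a new fabricated example. A minimal sketch follows; the delimiters are guesses derived from the Q:/A: format above, not anything documented for this model:

```python
def first_answer(continuation, delimiters=("\nQ:", "\nQuestion:", "\n\n")):
    """Cut an over-generated continuation at the earliest delimiter.

    `continuation` is the decoded text after the prompt; the delimiters are
    guesses based on the Q:/A: few-shot format, not a documented convention.
    """
    cut = len(continuation)
    for d in delimiters:
        idx = continuation.find(d)
        if idx != -1:
            cut = min(cut, idx)
    return continuation[:cut].strip()

raw = (" Paris\n\nQuestion: What is the name of the river that runs through "
       "the capital of France?\nA: Seine")
print(first_answer(raw))  # -> Paris
```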
