Generated responses too long/repetitive/redundant?

#7
by benjw - opened

When I run the examples provided on the model card page, the generated output does follow the instructions; however, instead of stopping after a single answer to the given query, the model keeps generating more example cases and responses. Is that expected? How can I make sure that I only get a single answer for a given query?
For instance, the example "Given a news article, classify its topic." includes some example articles and their classification results as part of the prompt (shown in black). The generated answer (shown in blue) for the last article, which is given without its answer in the prompt, does include the classification result, but then goes on to produce more example articles along with their classifications.

Together org

Hi, Ben. You can use either a length limit or a stop sequence to control when the model stops generating. If you're using Hugging Face transformers, please see the max_length and max_new_tokens parameters and the StoppingCriteria documentation.
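
For reference, here is a minimal sketch of the length-limit approach. The checkpoint name is a stand-in for whichever model you are actually loading, and 16 new tokens is an arbitrary budget, not a recommended value:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Stand-in checkpoint; substitute the model from this repository.
model_name = "EleutherAI/pythia-1.4b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Q: What is the capital of France?\nA:"
inputs = tokenizer(prompt, return_tensors="pt")

# max_new_tokens caps only the generated part (the prompt is not counted),
# so generation stops after at most 16 new tokens no matter what.
outputs = model.generate(
    **inputs,
    max_new_tokens=16,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```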

I have the same behavior with the "Q: The capital of France is?\nA:"

The output is:

Setting pad_token_id to eos_token_id:0 for open-end generation.
" Paris

Question: What is the name of the river that runs through the capital of France?
A: Seine

Question: What is the capital of France?
A: Paris

Question: What is the capital of France"

While "max_length", "max_new_length" parameters and the StoppingCriteria can help with this,
I obviously don't know what is the max_new_length and the stopping criteria I should use to cover all my cases.
Any suggestions how to make it work like in your example?
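
One option that does not require knowing the answer length in advance is a stop sequence: set a generous max_new_tokens as a safety net and stop as soon as the model starts fabricating a new question. The sketch below uses a custom StoppingCriteria; the checkpoint name is again a stand-in, and the "\nQ:" / "\nQuestion:" delimiters are guesses based on the prompt format above:

```python
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    StoppingCriteria,
    StoppingCriteriaList,
)

class StopOnStrings(StoppingCriteria):
    """Stop as soon as any stop string appears in the newly generated text."""

    def __init__(self, stop_strings, tokenizer, prompt_length):
        self.stop_strings = stop_strings
        self.tokenizer = tokenizer
        self.prompt_length = prompt_length  # number of prompt tokens to skip

    def __call__(self, input_ids, scores, **kwargs):
        generated = self.tokenizer.decode(input_ids[0][self.prompt_length:])
        return any(s in generated for s in self.stop_strings)

model_name = "EleutherAI/pythia-1.4b"  # stand-in checkpoint, not the model discussed here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Q: The capital of France is?\nA:"
inputs = tokenizer(prompt, return_tensors="pt")

stopping = StoppingCriteriaList(
    [StopOnStrings(["\nQ:", "\nQuestion:"], tokenizer, inputs["input_ids"].shape[1])]
)
outputs = model.generate(
    **inputs,
    max_new_tokens=64,             # generous safety net; the stop strings usually hit first
    stopping_criteria=stopping,
    pad_token_id=tokenizer.eos_token_id,
)
answer = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(answer)  # may still contain the stop string itself; trim it if needed
```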

Indeed, specifying a 'max_length' would mean making assumptions about the generated answer in advance, which misses the point: the model itself should generate the correct answer (and stop after that).
If I already knew the precise answer (and hence its length), there would be no point in querying the model. If I roughly limit the number of generated tokens to cover a variety of questions and answers, I would still have to figure out exactly where to split the actual answer from the additionally generated bogus tokens -- just as I do now, when I don't impose such limits.
Ideally, the model itself would yield a "stop/EOS token" directly after producing the answer "Paris". Is there perhaps another syntax/structure/prompt template we should use (instead of "Q: ...?\nA:") to let the model know we are instructing it, i.e., the same structure that was used in the training data for instruction tuning?
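
In the meantime, a pragmatic workaround for the splitting problem is to over-generate and then cut the decoded continuation at the first sign of a new fabricated example. A minimal sketch follows; the delimiters are guesses derived from the Q:/A: format above, not anything documented for this model:

```python
def first_answer(continuation, delimiters=("\nQ:", "\nQuestion:", "\n\n")):
    """Cut an over-generated continuation at the earliest delimiter.

    `continuation` is the decoded text after the prompt; the delimiters are
    guesses based on the Q:/A: few-shot format, not a documented convention.
    """
    cut = len(continuation)
    for d in delimiters:
        idx = continuation.find(d)
        if idx != -1:
            cut = min(cut, idx)
    return continuation[:cut].strip()

raw = (" Paris\n\nQuestion: What is the name of the river that runs through "
       "the capital of France?\nA: Seine")
print(first_answer(raw))  # -> Paris
```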
