Incomplete Output even with max_new_tokens

#107
by Pradeep1995 - opened

The output of my model ends abruptly, and I ideally want it to complete the paragraph/sentence/code it was in the middle of.
I have set max_new_tokens=300 and also asked in the prompt to limit the answer to 300 words.

The response is always long and ends abruptly. Is there any way I can ask for a complete output within the desired number of output tokens?

I think you need to set the max token output higher than the max word count... for example, 350 or 380.

@sniffski this is my current configuration:

from transformers import GenerationConfig

generation_config = GenerationConfig(
    do_sample=True,
    top_k=10,                             # sample only from the 10 most likely tokens
    temperature=0.01,                     # near-greedy sampling
    pad_token_id=tokenizer.eos_token_id,  # use EOS as the padding token
    early_stopping=True,                  # only has an effect with beam search
    max_new_tokens=300,                   # hard cap on newly generated tokens
    return_full_text=False,               # (this is actually a pipeline argument, not a GenerationConfig field)
)
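
and I apply it roughly like this (just a sketch of my setup; model, tokenizer, and prompt come from the rest of my script):

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, generation_config=generation_config)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))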

So what change are you proposing here?

Well, the first thing you should know is that one word is not one token... I think the rule of thumb is that one token is about 0.75 words... so if you are requesting in the prompt an answer of not more than 300 words, you need to set max_new_tokens=400, because 300 / 0.75 = 400.
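
To check the actual ratio for your own text, you can count tokens directly. A quick sketch (I'm assuming the Mistral tokenizer here; the exact ratio varies by tokenizer and text):

from transformers import AutoTokenizer

# Assuming the Mistral tokenizer; any tokenizer gives a rough estimate.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")

text = "Trees are one of the most important elements of nature."
n_tokens = len(tokenizer.encode(text, add_special_tokens=False))
n_words = len(text.split())
print(f"{n_words} words -> {n_tokens} tokens ({n_words / n_tokens:.2f} words per token)")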

I tried with max_new_tokens=400, but the response still ends abruptly. Generation suddenly stops as soon as it reaches the specified max_new_tokens, without checking whether the sentence is complete or not.

Can you copy the output into a temp file like wc-test.txt, then run wc wc-test.txt in a shell to see how many words there are (the second number in the output of wc)? If there are more than 300, then the model isn't obeying your request in the prompt for a maximum word count, and the issue is not the max token limit... I guess you would need to find a better prompt... Try something like starting with "Your task is to respond with 300 words or less..."
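
(If you'd rather stay in Python, the same check is trivial — output_text here just stands in for whatever string the model returned:)

output_text = "...paste the model's response here..."
print(len(output_text.split()))  # same number wc prints in its second column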

I have a similar case. I am using the Inference API to run this model; my output is incomplete and usually about the same length (65-80 words), and most of the time it doesn't even end correctly. See below for an example of input and output, followed by part of the code.

Input: "Write a detailed essay about trees"

Output: "Trees are one of the most important elements of nature. They provide us with oxygen, clean the air, and provide shade from the sun. They also provide us with a variety of other benefits, such as providing food and shelter for animals, and helping to regulate the climate. Trees are also a source of beauty and inspiration, and can be used to create a sense of calm and peace in our lives. In this essay, I will explore the many benefits of trees, as well as their"

import requests

API_URL = "https://api-inference.huggingface.co/models/mistralai/Mistral-7B-Instruct-v0.1"
HEADERS = {"Authorization": "Bearer xxxxxx"}

def query(payload):
    try:
        # POST the payload to the hosted endpoint and return the parsed JSON
        response = requests.post(API_URL, headers=HEADERS, json=payload)
        response.raise_for_status()  # raise if the API returned an HTTP error
        return response.json()
    ..............
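
If you're on the hosted Inference API, you can also pass generation parameters in the request body instead of relying on the defaults. A sketch (parameter names are from the hosted text-generation task; the 500 is just an example cap):

payload = {
    "inputs": "Write a detailed essay about trees",
    "parameters": {
        "max_new_tokens": 500,      # raise the cap so the answer has room to finish
        "return_full_text": False,  # return only the newly generated text
    },
}
print(query(payload))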

@SumanVakare how did you solve this issue?

Not solved yet; I am looking for a solution too.

Has anyone found a solution for this abrupt ending issue?

Running into this issue as well.

Hey guys, anyone found a solution?

There's no "solution" to what you're asking. The model's output is limited by the number of tokens you allow it to generate. If the expected response is longer, either increase max_new_tokens so it has room to write whatever it wanted to, or use the continue button/function in your Web UI (or whatever other UI you use) to ask the model to carry on with whatever it was writing (see the sketch after the notes below).

Notes:

  • tokens are more like 2-4 characters than whole words
  • If you're using this model in particular, it's not instruction-tuned, so it may continue to output BS forever (well, until it reaches the max token limit). What you want is likely the instruction-tuned version of Mistral.
  • It's useless to specify in your prompt the number of words you expect: LLMs can't do real math, and are even less capable of counting their own word quota as they produce output. At best, you can give indications like "one short paragraph" or "2 long paragraphs", and that generally gives better results (with instruction-tuned models, at least; with this base model I'm not so sure).
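
As a concrete illustration of that "continue" idea outside a Web UI, here is a minimal sketch against the hosted Inference API, reusing the query() helper from the code above (the round limit and the empty-output stop condition are my own heuristics, not an official API feature):

def generate_complete(prompt, max_new_tokens=300, max_rounds=5):
    # Repeatedly feed the text generated so far back in as the prompt,
    # so the model can pick up where the token cap cut it off.
    text = prompt
    for _ in range(max_rounds):
        result = query({
            "inputs": text,
            "parameters": {"max_new_tokens": max_new_tokens, "return_full_text": False},
        })
        new_text = result[0]["generated_text"]
        if not new_text.strip():   # model had nothing left to add
            break
        text += new_text
    return text[len(prompt):]      # drop the prompt, keep only the generation

print(generate_complete("Write a detailed essay about trees"))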
