
Request: fix the inference script to control text output generation

#1
by g30rv17ys - opened

@michaelfeil,

Thank you for the model that can be loaded in 8-bit.
When I try to run inference, it keeps on generating text, whereas I just want an answer to my question.

Could you edit the inference script so that we only get precise answers and not uncontrolled generation?
I've attached a screenshot below of the output I am getting; I hope you can suggest a fix for this.
Screenshot from 2023-05-30 23-24-31.png

There is no easy fix for this; it is a typical problem in any library.
As in https://github.com/michaelfeil/hf-hub-ctranslate2/blob/e236f006593fb00633f9874fe414c87bd9735813/hf_hub_ctranslate2/translate.py#LL309C1-L334C70,
you might want to set end_token=["User", "user"] or so.
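
For reference, a minimal sketch of what that could look like (assuming the model object from the README example quoted below; the exact stop strings are guesses):

outputs = model.generate(
    text=["User: what is the largest mountain on mars? Bot:"],
    # stop once the model starts writing the next "User" turn
    end_token=["User", "user"],
    max_length=256,
)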

Thanks for the reply, @michaelfeil.
I noticed that you can get perfect answers in the mpt-7b-chat space: https://huggingface.co/spaces/mosaicml/mpt-7b-chat
Could you have a look at the app.py file?

Maybe it can suggest a fix to the issue.

How would you rewrite the inference example you mentioned in the README to at least partly resolve the issue? Could you paste the updated inference script in your next reply?

Say, for example, in the script below:

from hf_hub_ctranslate2 import TranslatorCT2fromHfHub, GeneratorCT2fromHfHub
from transformers import AutoTokenizer

model_name = "michaelfeil/ct2fast-mpt-7b-chat"
model = GeneratorCT2fromHfHub(
    # load in int8 on CUDA
    model_name_or_path=model_name,
    device="cuda",
    compute_type="int8_float16",
)
outputs = model.generate(
    text=["User: what is the largest mountain on mars? Bot:"],
    max_length=256,
)
print(outputs)

It would be great if the output were:
Bot: The largest mountain on Mars is Olympus Mons.

It actually gives the right answer, but then it continues with some other text. I want it to stop at Olympus Mons. @michaelfeil

Also, you can check this notebook, for example, @michaelfeil:
https://colab.research.google.com/drive/1s2beg55SEV8ICHufhwcflYtUNskTX8kO

It ends the sentence perfectly!

You might set end_token=["User"] as a keyword to model.generate.

@michaelfeil, I updated the code as follows:

from hf_hub_ctranslate2 import TranslatorCT2fromHfHub, GeneratorCT2fromHfHub
from transformers import AutoTokenizer

model_name = "michaelfeil/ct2fast-mpt-7b-chat"
model = GeneratorCT2fromHfHub(
    # load in int8 on CUDA
    model_name_or_path=model_name,
    device="cuda",
    compute_type="int8_float16",
)
outputs = model.generate(
    text=["User: what is the largest mountain on mars? Bot:"],
    end_token=["User"],
    max_length=256,
)

print(outputs)

but I am still getting the same continuous output:

Screenshot from 2023-05-31 00-04-50.png

User and Bot are not special tokens either; they are just dummies I used for a bunch of other models.
I'm not sure why the stop tokens do not work in this case.

I can't support this further; if you find a good solution, feel free to share it here.

Okay, the stop tokens are ["<|im_end|>", "<|endoftext|>"],
and messages should be formatted in the style:

    f"<|im_start|>user\n{item[0]}<|im_end|>",
    f"<|im_start|>assistant\n{item[1]}<|im_end|>",

See the app.py in mpt chat space.

@michaelfeil, could you paste the whole script here if possible?

Sorry, can’t provide further help.
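
For anyone landing on this thread later: a rough, unverified sketch of how the two hints above (ChatML-style prompt formatting plus the <|im_end|>/<|endoftext|> stop tokens) could be combined. This is not taken from the app.py of the mpt-7b-chat space; the single-turn prompt wrapping below is an assumption.

from hf_hub_ctranslate2 import GeneratorCT2fromHfHub

model = GeneratorCT2fromHfHub(
    # same checkpoint as above, loaded in int8 on CUDA
    model_name_or_path="michaelfeil/ct2fast-mpt-7b-chat",
    device="cuda",
    compute_type="int8_float16",
)

# Wrap the question in the ChatML-style turns the chat model expects
question = "what is the largest mountain on mars?"
prompt = (
    f"<|im_start|>user\n{question}<|im_end|>\n"
    "<|im_start|>assistant\n"
)

outputs = model.generate(
    text=[prompt],
    # stop as soon as an end-of-turn or end-of-text marker is produced
    end_token=["<|im_end|>", "<|endoftext|>"],
    max_length=256,
)
print(outputs)

Whether end_token is honoured this way depends on the hf-hub-ctranslate2 version in use, so treat this as a starting point rather than a confirmed fix.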

michaelfeil changed discussion status to closed
