[QUESTION] model translates only a part of the text
Hello Community,
I was exploring NLLB, but unfortunately I ran into some issues during translation. The model description says the input should not exceed 512 tokens, because the model was not trained on longer sequences. However, when I translated a text whose tokenized length is well below 512 tokens, the model translated only part of it. Splitting the same text into sentences worked. Has anyone encountered something similar, and if so, how did you solve it?
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")
tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")
translator = pipeline('translation', model=model, tokenizer=tokenizer, src_lang="ces_Latn", tgt_lang='eng_Latn')
translator("Zuzka bydlí v paneláku na 9. podlaží. Anička bydlí o 3 podlaží výše. Na kterém podlaží bydlí Anička?")
# [{'translation_text': 'Zuzka lives in a boarding house on the ninth floor, Anička lives three floors up.'}]
translator(["Zuzka bydlí v paneláku na 9. podlaží. Anička bydlí o 3 podlaží výše.", "Na kterém podlaží bydlí Anička?"])
# [{'translation_text': 'Zuzka lives in a boarding house on the ninth floor, and Anička lives three floors up.'}, {'translation_text': 'What floor does Anicka live on?'}]
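For reference, here is a quick way to double-check the token count with the same tokenizer (a minimal sketch; src_lang here only affects the prepended language token):
from transformers import AutoTokenizer
# Count tokens for the full input; the result is well below the 512-token limit mentioned above.
tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M", src_lang="ces_Latn")
text = "Zuzka bydlí v paneláku na 9. podlaží. Anička bydlí o 3 podlaží výše. Na kterém podlaží bydlí Anička?"
print(len(tokenizer(text)["input_ids"]))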
Hey @hkad98!
This isn't a bug: the generate method only outputs a certain number of tokens by default. You can specify the min_length and max_length parameters to get more or fewer tokens out.
Here, for example, I ask for a minimum of 30 generated tokens:
>>> translator("Zuzka bydlí v paneláku na 9. podlaží. Anička bydlí o 3 podlaží výše. Na kterém podlaží bydlí Anička?", min_length=30)
Out[8]: [{'translation_text': 'Zuzka lives in a boarding house on the ninth floor, Anička lives three floors up. What floor does Anička live on?'}]
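The same call accepts max_length as well; a hedged sketch reusing the translator object from the question (512 is just an upper bound matching the stated training limit, not a required value):
>>> translator("Zuzka bydlí v paneláku na 9. podlaží. Anička bydlí o 3 podlaží výše. Na kterém podlaží bydlí Anička?", min_length=30, max_length=512)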
Yes! Here is the documentation specific to text-generation: Text Generation.
You will be interested in the generate method in particular.
We're currently reworking this page, so any feedback is welcome! cc @patrickvonplaten
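For anyone who wants to skip the pipeline, here is a minimal sketch of calling generate directly with the same knobs (the forced_bos_token_id lookup follows the NLLB model card; on newer transformers versions you may need tokenizer.convert_tokens_to_ids("eng_Latn") instead):
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")
tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M", src_lang="ces_Latn")
inputs = tokenizer("Zuzka bydlí v paneláku na 9. podlaží. Anička bydlí o 3 podlaží výše. Na kterém podlaží bydlí Anička?", return_tensors="pt")
generated = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.lang_code_to_id["eng_Latn"],  # force English as the target language
    min_length=30,   # same effect as the pipeline example above
    max_length=512,  # upper bound on generated tokens
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))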