[QUESTION] model translates only a part of the text

#6
by hkad98 - opened

Hello Community,
I was exploring NLLB and ran into an issue during translation. The model description says that inputs should not exceed 512 tokens because the model was not trained on longer sequences. I tried to translate a text that stays under that limit once tokenized, but the model translated only part of it. Splitting the text into sentences and translating them separately worked. Has anyone encountered something similar? If so, how did you solve it?

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline

model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")
tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")

translator = pipeline('translation', model=model, tokenizer=tokenizer, src_lang="ces_Latn", tgt_lang='eng_Latn')
translator("Zuzka bydlí v paneláku na 9. podlaží. Anička bydlí o 3 podlaží výše. Na kterém podlaží bydlí Anička?")
# [{'translation_text': 'Zuzka lives in a boarding house on the ninth floor, Anička lives three floors up.'}]
translator(["Zuzka bydlí v paneláku na 9. podlaží. Anička bydlí o 3 podlaží výše.", "Na kterém podlaží bydlí Anička?"])
# [{'translation_text': 'Zuzka lives in a boarding house on the ninth floor, and Anička lives three floors up.'}, {'translation_text': 'What floor does Anicka live on?'}]
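
For reference, the manual split in the second call could be automated with something like the sketch below. The names split_sentences and translate_by_sentence are hypothetical helpers, not part of transformers. The regex-based splitter is a rough assumption: it treats ".", "!" or "?" followed by whitespace and an uppercase letter as a sentence boundary, so that "9. podlaží" stays in one piece. translate_fn stands for any callable that maps one string to its translation.

```python
import re
from typing import Callable, List

# Candidate boundaries: whitespace that follows ".", "!" or "?".
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text: str) -> List[str]:
    """Naive sentence splitter (an assumption, not a library function)."""
    sentences: List[str] = []
    for piece in _BOUNDARY.split(text.strip()):
        if not piece:
            continue
        # A piece that does not start with an uppercase letter is treated as
        # a continuation (e.g. after an ordinal such as "9.") and merged back.
        if sentences and not piece[0].isupper():
            sentences[-1] += " " + piece
        else:
            sentences.append(piece)
    return sentences


def translate_by_sentence(text: str, translate_fn: Callable[[str], str]) -> str:
    """Translate each sentence on its own and join the results with spaces."""
    return " ".join(translate_fn(sentence) for sentence in split_sentences(text))


# Possible usage with the pipeline defined above:
# translate_by_sentence(text, lambda s: translator(s)[0]["translation_text"])
```

A splitter this simple will misfire on abbreviations followed by capitalized words, so a proper sentence segmenter is likely a better choice for real data.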

Hey @hkad98 !

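This isn't a bug: the generate method stops once the model emits an end-of-sequence token or the output reaches a length limit, so the translation can end before the whole input is covered. You can pass the min_length and max_length parameters to get more or fewer tokens out.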

For example, here is the same call with a minimum of 30 output tokens:

>>> translator("Zuzka bydlí v paneláku na 9. podlaží. Anička bydlí o 3 podlaží výše. Na kterém podlaží bydlí Anička?", min_length=30)
Out[8]: [{'translation_text': 'Zuzka lives in a boarding house on the ninth floor, Anička lives three floors up. What floor does Anička live on?'}]
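
Since a fixed value such as 30 only suits inputs of roughly this size, one option is to derive both bounds from the number of input tokens. The sketch below is only an illustration: length_bounds is a hypothetical helper, and the ratios and the 512 cap are assumptions, not values recommended by the library or the model card.

```python
import math
from typing import Dict


def length_bounds(
    n_input_tokens: int,
    min_ratio: float = 0.5,   # assumed lower bound relative to input length
    max_ratio: float = 2.0,   # assumed upper bound relative to input length
    hard_cap: int = 512,      # assumed ceiling on the output length
) -> Dict[str, int]:
    """Return min_length/max_length keyword arguments scaled to the input."""
    if n_input_tokens < 0:
        raise ValueError("n_input_tokens must be non-negative")
    max_length = min(hard_cap, max(1, math.ceil(n_input_tokens * max_ratio)))
    # Never ask for a minimum that exceeds the maximum.
    min_length = min(max_length, math.floor(n_input_tokens * min_ratio))
    return {"min_length": min_length, "max_length": max_length}


# Possible usage with the tokenizer and pipeline from the question:
# n = len(tokenizer(text)["input_ids"])
# translator(text, **length_bounds(n))
```

Forcing a high minimum can make the model pad the output with repeated or invented text, so the ratio is worth checking against a few known translations.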

Hi @lysandre

Thank you for the clarification! Could you please point me to the documentation for the other parameters that can be passed to the generate method (I only found this)? I am new to Hugging Face Transformers.

hkad98 changed discussion title from [QUESTION, BUG?] model translates only a part of the text to [QUESTION] model translates only a part of the text

Yes! Here is the documentation specific to text-generation: Text Generation.

You will be interested in the generate method in particular.

We're currently reworking this page, so any feedback is welcome! cc @patrickvonplaten

@lysandre Wow! Thank you so much. It is really useful.

hkad98 changed discussion status to closed
