model's output is abnormal

#7
by ZeroneBo - opened

The output of the model is strange, as described at the following link. I'd appreciate any suggestions that may be helpful.
https://huggingface.co/google/mt5-base/discussions/3#657b1f6c69a46ce96bd2c607

Google org

hi @ZeroneBo
Thanks for the issue. I think this is expected, as the model is a pre-trained model that has only been trained on a text de-noising objective. You need to use fine-tuned models such as flan-t5 in order to prompt them out of the box.
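For reference, a rough sketch of prompting an instruction-tuned checkpoint out of the box with transformers (flan-t5-base here; the prompt text is just an illustration):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "google/flan-t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Instruction-tuned checkpoints can be prompted directly, no fine-tuning needed
prompt = "Translate English to German: How old are you?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```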

Thanks for your explanation @ybelkada . I am new to the T5 family.
I have tried flan-t5-base, but it seems it doesn't support Chinese, because it encodes Chinese characters as unk.
Is there a recommended prompt template for fine-tuning mT5 on a translation task? And are there recommended prompt templates for other tasks?
I don't know whether I need to manually add task labels before the input for mT5, such as "<nli> [inputs]" for an NLI task and "<translate> [inputs]" for a translation task, or whether I should just instruction-fine-tune it with something like "Translate from Chinese to English.\nChinese: [zh_input]\nEnglish:" as the input and "[en_input]" as the label?
Thanks again!

Google org

Hi @ZeroneBo
Thanks very much for getting back!
Indeed, Chinese is not supported by the flan family at all :/ From what I recall, the prompt template should be a natural prompt such as Translate from Chinese to English.\nChinese: [zh_input]\nEnglish:
However, for translation I would suggest using the NLLB family of models, which should in theory support Chinese as well: https://huggingface.co/facebook/nllb-200-distilled-600M / https://huggingface.co/docs/transformers/model_doc/nllb
If you have enough GPU RAM, you can also run the NLLB-MoE model https://huggingface.co/facebook/nllb-moe-54b - let me know how it goes!
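For reference, a rough sketch of translating with the distilled NLLB checkpoint (language codes follow the FLORES-200 convention, e.g. zho_Hans / eng_Latn; the example sentence is only an illustration):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="zho_Hans")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

inputs = tokenizer("今天天气很好。", return_tensors="pt")
# Force the decoder to start with the target language token
generated = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("eng_Latn"),
    max_new_tokens=64,
)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```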

Thank you @ybelkada
I looked up and read some relevant documentation for T5 and mT5. The pre-trained mT5 can indeed only predict the sentinel tokens, and it should be fine-tuned with natural prompts before being used.
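For example, the kind of natural-prompt training pair I have in mind would look roughly like this (the prompt wording, max_length, and example sentences are only placeholders):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/mt5-base")

def build_example(zh_input, en_input):
    # Natural-language prompt as the encoder input, plain target text as the label
    prompt = f"Translate from Chinese to English.\nChinese: {zh_input}\nEnglish:"
    model_inputs = tokenizer(prompt, truncation=True, max_length=256)
    # text_target= requires a reasonably recent transformers version
    labels = tokenizer(text_target=en_input, truncation=True, max_length=256)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

example = build_example("今天天气很好。", "The weather is nice today.")
```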
Another small technical question: to make the training phase faster, when creating a batch of examples, should each batch be dynamically padded to the length of its longest sequence, or should all sequences in all batches be padded to a larger fixed length? In my opinion, the dynamic method can save some GPU RAM and allow a bigger batch size, but a fixed length may also be faster; maybe the two approaches take a similar time and the difference can be ignored. Do you have any suggestions based on your experience? My device is 2~3 * 32G V100 or 1~2 * 40G A100, and I have installed the accelerate and deepspeed Python libraries.
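For context, the dynamic option I am comparing is transformers' DataCollatorForSeq2Seq, roughly like this (dataset and trainer wiring are omitted):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, DataCollatorForSeq2Seq

model_name = "google/mt5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

collator = DataCollatorForSeq2Seq(
    tokenizer,
    model=model,            # lets the collator prepare decoder_input_ids from labels
    padding="longest",      # dynamic: pad each batch to its own longest sequence
    pad_to_multiple_of=8,   # keeps shapes tensor-core friendly on V100/A100
    label_pad_token_id=-100,
)
# Pass data_collator=collator to Seq2SeqTrainer, or use it as collate_fn in a DataLoader.
# For the fixed-length alternative, tokenize with padding="max_length" and a chosen max_length.
```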

ZeroneBo changed discussion status to closed
