Model's output is abnormal
The output of the model is strange, as described in the following link. I'd appreciate any suggestions that may be helpful.
https://huggingface.co/google/mt5-base/discussions/3#657b1f6c69a46ce96bd2c607
Thanks for your explanation
@ybelkada
. I am new to the t5 family.
I have tried flan-t5-base, but it seems it doesn't support Chinese, because it encodes Chinese characters as unk.
Is there a recommended prompt template for finetuning mt5 on a translation task? And are there recommended prompt templates for other tasks?
I don't know whether I need to add task prefixes manually before the input for mt5, such as "<nli> [inputs]" for an NLI task and "<translate> [inputs]" for a translation task, or whether I should just instruction-finetune it with a prompt such as "Translate from Chinese to English.\nChinese: [zh_input]\nEnglish:" and label "[en_input]"?
Thanks again!
Hi
@ZeroneBo
Thanks very much for getting back!
Indeed, Chinese is not supported by the flan family at all :/ From what I recall, the prompt template should be a natural prompt such as Translate from Chinese to English.\nChinese: [zh_input]\nEnglish:
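To make the natural-prompt idea concrete, here is a minimal sketch of how one might build (input, target) pairs for mt5 finetuning with that template. The function name and the example sentence pair are just illustrations, not part of any official recipe:

```python
def build_translation_example(zh_input: str, en_input: str) -> dict:
    """Format one zh->en pair as a natural prompt plus target for finetuning."""
    prompt = f"Translate from Chinese to English.\nChinese: {zh_input}\nEnglish:"
    return {"input_text": prompt, "target_text": en_input}

# Illustrative pair; in practice these come from your parallel corpus.
example = build_translation_example("你好，世界。", "Hello, world.")
print(example["input_text"])
print(example["target_text"])
```

The same pattern extends to other tasks by swapping in a different instruction sentence, rather than relying on special task tokens like `<nli>`.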
However, for translation I would suggest using the NLLB family of models, which should in theory support Chinese as well: https://huggingface.co/facebook/nllb-200-distilled-600M / https://huggingface.co/docs/transformers/model_doc/nllb
If you have enough GPU RAM, you can also run the NLLB-MoE model https://huggingface.co/facebook/nllb-moe-54b - let me know how it goes!
Thank you
@ybelkada
I looked through some relevant documentation for t5 and mt5 and read it; the pretrained mt5 can indeed only predict the sentinel tokens, and it should be finetuned with natural prompts before use.
Another small technical question: to make the training phase faster, when creating a batch of examples, should sequences be dynamically padded to the length of the longest sequence in each batch, or should all sequences in all batches be padded to a larger fixed length? In my opinion, the dynamic method can save some GPU RAM and allow a bigger batch_size, but a fixed length may be faster too; maybe the two approaches cost a similar amount of time and the difference can be ignored. Do you have any suggestions based on your experience? My devices are 2~3 * 32G V100 or 1~2 * 40G A100, and I have installed the accelerate and deepspeed Python libraries.