bug? example of usage?

by vasilee

for the prompt
Translate the following English text into French: "The sun rises in the east and sets in the west."
it says
<0x0A> is [eod] © 👉 yourself [eod]

UMT5, like the other mT5 models, is not meant to be used directly out of the box, since it is not pre-trained with any supervised training. However, the model is trained to have some level of language understanding, so the suggested way to use it is to fine-tune it on your downstream task to achieve good results.

Do you know how to use it?
https://huggingface.co/docs/transformers/v4.31.0/en/model_doc/umt5#sample-usage
For the example in the docs, "A <extra_id_0> walks into a bar and orders a <extra_id_1> with <extra_id_2> pinch of <extra_id_3>.", it says ".<0x0A>Then hemargarita1...." instead of what is shown in the example.

My expectation would be to use it the same way as T5 or FLAN-T5, with the only difference being that it's multilingual.
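
For reference, here is roughly what I'm running, paraphrased from the linked sample usage (my paraphrase, not a verbatim copy of the docs snippet):

    from transformers import AutoTokenizer, UMT5ForConditionalGeneration

    model = UMT5ForConditionalGeneration.from_pretrained("google/umt5-small")
    tokenizer = AutoTokenizer.from_pretrained("google/umt5-small", use_fast=False, legacy=False)

    # Span-filling prompt: the <extra_id_N> sentinels mark the blanks to fill.
    inputs = tokenizer(
        "A <extra_id_0> walks into a bar and orders a <extra_id_1> with <extra_id_2> pinch of <extra_id_3>.",
        return_tensors="pt",
    )
    outputs = model.generate(inputs.input_ids, max_new_tokens=40)
    print(tokenizer.batch_decode(outputs))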

The example shown just demonstrates how to invoke the model for inference, and yes, even I'm unable to reproduce the exact outputs that are shown in the image; that could be because of the sampling that's applied when you call the generate() function.
However, as I mentioned before, the model is not usable by default, as it will not produce any meaningful outputs. See the Note section here: https://huggingface.co/google/umt5-base

The only way to use this model is to fine-tune it on your own downstream task using your own data; only then will you be able to generate meaningful outputs.

I've planned to fine-tune it on some public data a few days from now and will share the approach here once I finish it.
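
For a sense of the overall shape, here is a minimal sketch of what such a fine-tuning setup could look like with the Trainer API (the dataset, split size, column names, and hyperparameters below are illustrative placeholders, not the final approach):

    from datasets import load_dataset
    from transformers import (
        AutoTokenizer,
        DataCollatorForSeq2Seq,
        Seq2SeqTrainer,
        Seq2SeqTrainingArguments,
        UMT5ForConditionalGeneration,
    )

    tokenizer = AutoTokenizer.from_pretrained("google/umt5-small")
    model = UMT5ForConditionalGeneration.from_pretrained("google/umt5-small")

    # Illustrative parallel corpus; any source/target pairs work the same way.
    dataset = load_dataset("opus_books", "en-fr", split="train[:1%]")

    def preprocess(batch):
        sources = [pair["en"] for pair in batch["translation"]]
        targets = [pair["fr"] for pair in batch["translation"]]
        model_inputs = tokenizer(sources, max_length=128, truncation=True)
        labels = tokenizer(text_target=targets, max_length=128, truncation=True)
        model_inputs["labels"] = labels["input_ids"]
        return model_inputs

    tokenized = dataset.map(preprocess, batched=True, remove_columns=dataset.column_names)

    trainer = Seq2SeqTrainer(
        model=model,
        args=Seq2SeqTrainingArguments(
            output_dir="umt5-small-en-fr",  # placeholder output path
            per_device_train_batch_size=8,
            learning_rate=1e-4,
            num_train_epochs=1,
        ),
        train_dataset=tokenized,
        data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    )
    trainer.train()

After fine-tuning along these lines, a translation prompt like the one in the first post should start producing sensible text instead of filler tokens.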

Google org

As @DeathReaper0965 mentions, the model is a pretrained artifact that should be fine-tuned. However, here is a test showing how to get somewhat OK predictions (taken from the transformers test suite):

    def test_small_integration_test(self):
        """
        For comparison, run the Kaggle notebook available here: https://www.kaggle.com/arthurzucker/umt5-inference
        """

        model = UMT5ForConditionalGeneration.from_pretrained("google/umt5-small", return_dict=True).to(torch_device)
        tokenizer = AutoTokenizer.from_pretrained("google/umt5-small", use_fast=False, legacy=False)
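        # Prompts in several languages, with <extra_id_N> sentinels marking the spans
        # the model is asked to fill in.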
        input_text = [
            "Bonjour monsieur <extra_id_0> bien <extra_id_1>.",
            "No se como puedo <extra_id_0>.",
            "This is the reason why we <extra_id_0> them.",
            "The <extra_id_0> walks in <extra_id_1>, seats",
            "A <extra_id_0> walks into a bar and orders a <extra_id_1> with <extra_id_2> pinch of <extra_id_3>.",
        ]
        input_ids = tokenizer(input_text, return_tensors="pt", padding=True).input_ids
        # fmt: off
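        # Expected token ids from the tokenizer above: the sentinels <extra_id_0>,
        # <extra_id_1>, ... map to ids 256299, 256298, ... counting down;
        # 1 is </s> and the trailing 0s are padding.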
        EXPECTED_IDS = torch.tensor(
            [
                [38530, 210703, 256299, 1410, 256298, 274, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                [826, 321, 671, 25922, 256299, 274, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                [1460, 339, 312, 19014, 10620, 758, 256299, 2355, 274, 1, 0, 0, 0, 0, 0, 0, 0, 0],
                [517, 256299, 14869, 281, 301, 256298, 275, 119983, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                [320, 256299, 14869, 281, 2234, 289, 2275, 333, 61391, 289, 256298, 543, 256297, 168714, 329, 256296, 274, 1],
            ]
        )
        # fmt: on
        torch.testing.assert_allclose(input_ids, EXPECTED_IDS)

        # Greedy decoding by default (no sampling args), so this is deterministic;
        # the completions are still mostly filler because the checkpoint is not fine-tuned.
        generated_ids = model.generate(input_ids.to(torch_device))
        EXPECTED_FILLING = [
            "<pad><extra_id_0> et<extra_id_1> [eod] <extra_id_2><extra_id_55>.. [eod] πŸ’ πŸ’ πŸ’ πŸ’ πŸ’ πŸ’ πŸ’ πŸ’ πŸ’ πŸ’ πŸ’ <extra_id_56>ajΕ‘ietosto<extra_id_56>lleux<extra_id_19><extra_id_6>ajΕ‘ie</s>",
            "<pad><extra_id_0>.<extra_id_1>.,<0x0A>...spech <0x0A><extra_id_20> <extra_id_21></s><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>",
            "<pad><extra_id_0> are not going to be a part of the world. We are not going to be a part of<extra_id_1> and<extra_id_2><0x0A><extra_id_48>.<extra_id_48></s><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>",
            "<pad><extra_id_0> door<extra_id_1>, the door<extra_id_2> ν”Όν•΄[/</s><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>",
            "<pad><extra_id_0>nyone who<extra_id_1> drink<extra_id_2> a<extra_id_3> alcohol<extra_id_4> A<extra_id_5> A. This<extra_id_6> I<extra_id_7><extra_id_52><extra_id_53></s><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>",
        ]
        filling = tokenizer.batch_decode(generated_ids)
        self.assertEqual(filling, EXPECTED_FILLING)
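
Note that torch_device and the self.assert* helpers come from the transformers test harness; to run the same check as a standalone script, you can substitute roughly the following (an illustrative stand-in, not part of the original test):

    import torch

    # Stand-in for the test harness's torch_device.
    torch_device = "cuda" if torch.cuda.is_available() else "cpu"

    # Then drop the `self` parameter and replace the assertions with plain asserts,
    # e.g.: assert filling == EXPECTED_FILLING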
