Problems with the MASK token.

#3
by dchaplinsky - opened

One of datasets I'm using for finetuning the translator, contains an example like this:

{"text": "[INST] A <MASK> prefix should be used for masks and a <LIST> prefix should be used for lists (entire parameter must be enclosed in quotes in this case). Lists must be semicolon separated. See examples. [/INST] Для масок повинен використовуватися префікс <MASK>, для списків - префікс <LIST> (весь параметр повинен братися в лапки). Список повинен розді
 ятися крапкою з комою. Див. приклади."}

As you can see, the text itself contains a <MASK> token. The tokenizer runs without errors, but it generates the following sequence of tokens:

0, 29958, 10944, 881, 367, 1304, 363, 8857, 313, 296, 533, 3443, 1818, 367, 427, 15603, 297, 11839, 297, 445, 1206, 467, 2391, 29879, 1818, 367, 3031, 5283, 265, 13055, 29889, 2823, 6455, 29889, 518, 29914, 25580, 29962, 22418, 2394, 15689, 469, 1928, 9931, 26720, 10957, 1198, 2102, 9005, 4399, 29871, 32003, 29892, 3807, 531, 8993, 6104, 448, 2102, 9005, 4399, 529, 24360, 29958, 313, 1521, 1210, 19714, 12274, 29927, 469, 1928, 9931, 12224, 29600, 490, 11046, 29964, 717, 467, 22497, 469, 1928, 9931, 5724, 3429, 1225, 29600, 6697, 29964, 17941, 754, 1046, 1630, 30005, 29889, 23939, 29889, 1695, 4865, 956, 29889, 2]

and the <MASK> token receives id 32003, which is outside the model's 32000-entry embedding table and makes it choke:

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 4096)
File "/home/dima/Projects/finetune-experiments/venv/lib/python3.10/site-packages/torch/nn/functional.py", line 2233, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)

I can provide full traceback, if you wish.
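For reference, this is the kind of check that reproduces it without running a full training step (a minimal sketch; the model id is a placeholder for whichever Tower checkpoint is being finetuned):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Unbabel/TowerInstruct-7B-v0.1"  # placeholder: whichever Tower checkpoint you finetune
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

text = "... Для масок повинен використовуватися префікс <MASK> ..."
ids = tokenizer(text)["input_ids"]

vocab_rows = model.get_input_embeddings().num_embeddings  # 32000 in the printout above
print([i for i in ids if i >= vocab_rows])  # any id listed here will crash torch.embedding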

Unbabel org

Hi,
Thanks for raising this. This is related to the special tokens added after training; we were not aware it broke the model.
Could you provide the code that caused this error so we can reproduce it on our end?
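In the meantime, the usual workaround on the finetuning side when a tokenizer carries extra special tokens is to resize the model's embedding matrix before training. A minimal sketch, assuming a standard transformers setup and reusing the tokenizer and model objects from the snippet above:

# Standard transformers-side workaround (a general sketch, not Tower-specific):
# if the tokenizer ships extra special tokens, grow the embedding table so their ids stay in range.
if len(tokenizer) > model.get_input_embeddings().num_embeddings:
    model.resize_token_embeddings(len(tokenizer))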

Yes, sure:
https://github.com/dchaplinsky/finetune_experiments/blob/main/finetune.py

Let me know if you need a dataset too (it's around 3m sentence pairs).

Unbabel org

I believe the issue is fixed now; I removed the special tokens from the tokenizer.

Could you download the model again and report back, please?
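A quick way to verify after re-downloading, sketched with a placeholder model id: force a fresh download of the tokenizer and check that the example no longer produces out-of-range ids.

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Unbabel/TowerInstruct-7B-v0.1", force_download=True)  # placeholder id
ids = tok("Для масок повинен використовуватися префікс <MASK>")["input_ids"]
print(max(ids) < 32000)                # True once the fix is in place
print(tok.convert_ids_to_tokens(ids))  # <MASK> should now split into ordinary subword pieces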

Thanks. I'll test it once I have a GPU available; right now they are all busy finetuning on the dataset without the flawed example :).

Did you also update the rest of the models, like the 13b one?

Unbabel org

Got it, thanks.
The 13b model should be fixed as well.

Started another experiment with the updated tokenizer and model. I'll know the result in a few hours.

On a related note, I have a rather big Ukrainian corpus compiled from three different sources, as well as a parallel en-uk corpus (a cleaned version of ParaCrawl). Is there any chance of including some of those in the 0.2 model?

Unbabel org
edited Feb 14

Alright, thank you.
We are not planning to add new languages to Tower right now. However, we do encourage people to try this kind of thing; it would be cool to have a Tower-based model that supports Ukrainian. You may also want to assess whether it (and TowerInstruct) performs well in a 0- or 5-shot fashion.
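A minimal sketch of what a 5-shot evaluation prompt for the base model could look like; the prompt format and example pairs are illustrative assumptions, not the official Tower recipe, and TowerInstruct would use its own chat template instead:

# Illustrative 5-shot prompt builder; `shots` would come from a held-out en-uk dev set.
shots = [
    ("The weather is nice today.", "Сьогодні гарна погода."),
    # ... four more English/Ukrainian pairs ...
]
source_sentence = "See the examples below."  # the sentence to translate (placeholder)
prompt = "".join(f"English: {en}\nUkrainian: {uk}\n\n" for en, uk in shots)
prompt += f"English: {source_sentence}\nUkrainian:"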

Please let me know when you release the paper on Tower.

dchaplinsky changed discussion status to closed
