Problems with the MASK token.

#3
by dchaplinsky - opened

One of datasets I'm using for finetuning the translator, contains an example like this:

{"text": "[INST] A <MASK> prefix should be used for masks and a <LIST> prefix should be used for lists (entire parameter must be enclosed in quotes in this case). Lists must be semicolon separated. See examples. [/INST] Для масок повинен використовуватися префікс <MASK>, для списків - префікс <LIST> (весь параметр повинен братися в лапки). Список повинен розді
 ятися крапкою з комою. Див. приклади."}

As you can see, the text itself contains a <MASK> token. The tokenizer runs without errors, but it generates the following sequence of tokens:

0, 29958, 10944, 881, 367, 1304, 363, 8857, 313, 296, 533, 3443, 1818, 367, 427, 15603, 297, 11839, 297, 445, 1206, 467, 2391, 29879, 1818, 367, 3031, 5283, 265, 13055, 29889, 2823, 6455, 29889, 518, 29914, 25580, 29962, 22418, 2394, 15689, 469, 1928, 9931, 26720, 10957, 1198, 2102, 9005, 4399, 29871, 32003, 29892, 3807, 531, 8993, 6104, 448, 2102, 9005, 4399, 529, 24360, 29958, 313, 1521, 1210, 19714, 12274, 29927, 469, 1928, 9931, 12224, 29600, 490, 11046, 29964, 717, 467, 22497, 469, 1928, 9931, 5724, 3429, 1225, 29600, 6697, 29964, 17941, 754, 1046, 1630, 30005, 29889, 23939, 29889, 1695, 4865, 956, 29889, 2]

and the <MASK> token receives id 32003, which is outside the model's 32000-entry embedding table and makes it choke:

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 4096)
File "/home/dima/Projects/finetune-experiments/venv/lib/python3.10/site-packages/torch/nn/functional.py", line 2233, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)

I can provide full traceback, if you wish.
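For reference, this is the kind of check that reproduces it without running a full training step (a minimal sketch; the model id is a placeholder for whichever Tower checkpoint is being finetuned):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Unbabel/TowerInstruct-7B-v0.1"  # placeholder: whichever Tower checkpoint you finetune
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

text = "... Для масок повинен використовуватися префікс <MASK> ..."
ids = tokenizer(text)["input_ids"]

vocab_rows = model.get_input_embeddings().num_embeddings  # 32000 in the printout above
print([i for i in ids if i >= vocab_rows])  # any id listed here will crash torch.embedding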

Unbabel org

Hi,
Thanks for raising this. This is related to the special tokens added after training; we were not aware it broke the model.
Could you provide the code that caused this error so we can reproduce it on our end?
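In the meantime, the usual workaround on the finetuning side when a tokenizer carries extra special tokens is to resize the model's embedding matrix before training. A minimal sketch, assuming a standard transformers setup and reusing the tokenizer and model objects from the snippet above:

# Standard transformers-side workaround (a general sketch, not Tower-specific):
# if the tokenizer ships extra special tokens, grow the embedding table so their ids stay in range.
if len(tokenizer) > model.get_input_embeddings().num_embeddings:
    model.resize_token_embeddings(len(tokenizer))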

Yes, sure:
https://github.com/dchaplinsky/finetune_experiments/blob/main/finetune.py

Let me know if you need a dataset too (it's around 3m sentence pairs).

Unbabel org

I believe the issue is fixed now; I removed the special tokens from the tokenizer.

Could you download the model again and report back, please?
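A quick way to verify after re-downloading, sketched with a placeholder model id: force a fresh download of the tokenizer and check that the example no longer produces out-of-range ids.

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Unbabel/TowerInstruct-7B-v0.1", force_download=True)  # placeholder id
ids = tok("Для масок повинен використовуватися префікс <MASK>")["input_ids"]
print(max(ids) < 32000)                # True once the fix is in place
print(tok.convert_ids_to_tokens(ids))  # <MASK> should now split into ordinary subword pieces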

Thanks. I'll test it once I have a GPU available; right now they are all busy finetuning on the dataset without the flawed example :).

Did you also update the rest of the models, like the 13b one?

Unbabel org

Got it, thanks.
The 13b model should be fixed as well.

Started another experiment with the updated tokenizer and model. I'll know the result in a few hours.

On a related note, I have a rather big Ukrainian corpus compiled from three different sources, as well as a parallel en-uk corpus (a cleaned version of ParaCrawl). Is there any chance of including some of those in the 0.2 model?

Unbabel org
edited Feb 14

Alright, thank you.
We are not planning to add new languages to Tower right now. However, we do encourage people to try this kind of thing; it would be cool to have a Tower-based model that supports Ukrainian. You may also want to assess whether it (and TowerInstruct) performs well in a 0- or 5-shot fashion.
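A minimal sketch of what a 5-shot evaluation prompt for the base model could look like; the prompt format and example pairs are illustrative assumptions, not the official Tower recipe, and TowerInstruct would use its own chat template instead:

# Illustrative 5-shot prompt builder; `shots` would come from a held-out en-uk dev set.
shots = [
    ("The weather is nice today.", "Сьогодні гарна погода."),
    # ... four more English/Ukrainian pairs ...
]
source_sentence = "See the examples below."  # the sentence to translate (placeholder)
prompt = "".join(f"English: {en}\nUkrainian: {uk}\n\n" for en, uk in shots)
prompt += f"English: {source_sentence}\nUkrainian:"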

Please let me know when you release the paper on Tower.

dchaplinsky changed discussion status to closed
