Arabic CT: Need for Vocabulary Extension?

#5
by Ali-C137 - opened

Hello there, I'm working on a project for Arabic, and we are considering 5 SOTA models (Mistral-7B, Llama2-7B, Falcon-7B, Zephyr, and Qwen1.5-7B) for our phase 1 training, where we will compare their performance on Arabic.

My question is: given that Qwen1.5 is multilingual by nature, I won't need to extend its tokenizer with Arabic vocab, right? I'm asking since I haven't yet had time to go through the technical report and see the dataset description/distribution you used.

Best
3ali
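
One quick way to answer this empirically is to measure tokenizer fertility (tokens per word) on Arabic text for each candidate model. A minimal sketch, assuming the public Hub checkpoints for three of the candidates (the sample sentence is arbitrary; swap in your own corpus for a real measurement):

```python
# Compare how many tokens each candidate tokenizer needs for the same
# Arabic text. Lower tokens-per-word ("fertility") suggests the vocab
# already covers Arabic reasonably well and extension is less urgent.
from transformers import AutoTokenizer

sample = "اللغة العربية هي إحدى أكثر اللغات انتشارا في العالم"
n_words = len(sample.split())

for model_id in ["Qwen/Qwen1.5-7B", "mistralai/Mistral-7B-v0.1", "tiiuae/falcon-7b"]:
    tok = AutoTokenizer.from_pretrained(model_id)
    n_tokens = len(tok(sample)["input_ids"])
    print(f"{model_id}: {n_tokens} tokens, {n_tokens / n_words:.2f} tokens/word")
```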


I'm not from their team, but I don't recommend extending the vocab on the Qwen family; its vocab size is already large. You would have a hard time fine-tuning it even on 8xA100s.
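
For a sense of scale: the public Qwen1.5-7B config lists a vocab size of 151,936 (vs. 32,000 for Llama-2) and a hidden size of 4,096, so the embedding matrices are already a large fraction of the model. A back-of-the-envelope sketch, assuming untied input/output embeddings and the common ~16 bytes/param rule of thumb for fp16 weights plus Adam optimizer states:

```python
# Rough estimate of the embedding cost that vocab extension would grow further.
vocab_size = 151_936    # from the public Qwen1.5-7B config
hidden_size = 4_096

embed_params = vocab_size * hidden_size    # input embedding matrix
lm_head_params = vocab_size * hidden_size  # output projection (assumed untied)
total = embed_params + lm_head_params

print(f"embedding + lm_head params: {total / 1e9:.2f} B")        # ~1.24 B
print(f"approx. training memory:    {total * 16 / 1e9:.1f} GB")  # weights + grads + Adam states
```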

Qwen org

Thanks, Quan, for the explanation. There is no need for vocab extension; you can use it directly for continued pretraining.
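
Continued pretraining with the stock tokenizer is then just standard causal-LM training. A minimal sketch using the Hugging Face Trainer; the corpus file and all hyperparameters below are placeholders, not a recommended recipe:

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_id = "Qwen/Qwen1.5-7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Any plain-text Arabic corpus works; "arabic_corpus.txt" is a placeholder.
raw = load_dataset("text", data_files={"train": "arabic_corpus.txt"})["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

train_ds = raw.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="qwen1.5-7b-ar-cpt",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        learning_rate=1e-5,
        num_train_epochs=1,
        bf16=True,
    ),
    train_dataset=train_ds,
    # mlm=False gives standard next-token (causal LM) labels.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```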
