Arabic tokenizer

#1
by abdoelsayed - opened

Hello,
I think there is a problem: the LLaMA tokenizer does not handle Arabic. If you try it, you will find that it tokenizes each word as individual characters.
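
A minimal sketch that reproduces what I am seeing; the checkpoint id here is only an example, substitute whichever LLaMA tokenizer you have access to:

```python
# Sketch: show how the LLaMA tokenizer splits an Arabic word.
# The repo id below is illustrative, not the only option.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")

word = "مرحبا"  # Arabic for "hello"
print(tokenizer.tokenize(word))
# Expect a long list of single-character (or byte-fallback) pieces
# rather than one or two subword tokens.
```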

I can confirm this. Has there been a solution for that?

You have two options. One is easy: keep the model working at the character level, as it does now. The other is to extend the tokenizer with Arabic tokens, but in that case you will have to fine-tune the model to update the embedding weights.

You can increase the tokenizer's vocabulary using tokenizer.add_tokens(), right?
Also, by your last statement, do you mean that you first need a fine-tuning round for the model to update the embedding weights, and then another round to fine-tune on a specific task?

Yes, if you add the new tokens you should fine-tune the model on an Arabic dataset so that the embedding weights for the Arabic tokens get updated.
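
Something like this minimal sketch, assuming a LLaMA checkpoint you can load; the repo id and the token list below are placeholders, not a real Arabic vocabulary:

```python
# Sketch of extending the vocabulary with new Arabic tokens.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
model = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b")

new_tokens = ["مرحبا", "كتاب"]  # placeholder Arabic word tokens
num_added = tokenizer.add_tokens(new_tokens)
print(f"added {num_added} tokens")

# Grow the embedding matrix so the new token ids have rows. The new
# rows are randomly initialized, which is why a fine-tuning pass on
# Arabic text is needed before they carry useful meaning.
model.resize_token_embeddings(len(tokenizer))
```

After this, fine-tuning on Arabic text (for example with a standard causal language-modeling objective) is what actually trains those new embedding rows.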

Thanks a lot!

@abdoelsayed
Regarding "Yes, if you add the new tokens you should fine-tune the model on an Arabic dataset so that the embedding weights for the Arabic tokens get updated":

Will fine-tuning be sufficient, or will we need to continue pretraining?
