How to fine-tune the model for a non-English language
Hi,
An 8K context length is very useful for embeddings; we have been stuck for too long with the 512-token context limit of all existing BERT-based text embedding models.
But how can we fine-tune this model (small or base) on other languages (non-English or non-Latin) without harming the 8K context length?
Thanks Jina for this great leap
We are actually in the process of training models for more languages, so stay tuned!
In any case, for non-Latin languages you will probably want to switch to a tokenizer that was trained on non-English text. You can either reuse an existing multilingual tokenizer (e.g. from mBERT or XLM-R) or train your own tokenizer, as in the sketch below.
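For example, the tokenizer step could look roughly like this with Hugging Face `transformers`. This is only a sketch: the corpus file `corpus.txt`, the `corpus_iterator` helper, and the vocabulary size of 30000 are illustrative assumptions, not part of our recipe.

```python
from transformers import AutoTokenizer

# Option A: reuse an existing multilingual tokenizer (e.g. XLM-R's).
xlmr_tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

# Option B: train a new tokenizer on your own corpus, reusing XLM-R's
# tokenization algorithm and special tokens. corpus.txt is a hypothetical
# plain-text file in your target language, one document per line.
def corpus_iterator(path="corpus.txt", batch_size=1000):
    with open(path, encoding="utf-8") as f:
        batch = []
        for line in f:
            batch.append(line.strip())
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:
            yield batch

new_tokenizer = xlmr_tokenizer.train_new_from_iterator(corpus_iterator(), vocab_size=30000)
new_tokenizer.save_pretrained("my-language-tokenizer")
```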
You can then freeze the encoder and train only the word embeddings, with a procedure similar to Step 2 of this paper (see the sketch below).
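Concretely, the freezing part might look like the following. This is a minimal sketch, assuming the released `jinaai/jina-embeddings-v2-base-en` checkpoint, the tokenizer saved in the previous sketch, and the standard Hugging Face resize/freeze interface; the actual procedure in the paper involves more than this.

```python
from transformers import AutoModel, AutoTokenizer

# Load the released embedding model (its custom code requires trust_remote_code)
# and the tokenizer trained in the previous sketch.
model = AutoModel.from_pretrained("jinaai/jina-embeddings-v2-base-en", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("my-language-tokenizer")

# Resize the word-embedding matrix to the new vocabulary; new rows are randomly initialized.
model.resize_token_embeddings(len(tokenizer))

# Freeze everything, then unfreeze only the input word embeddings.
for param in model.parameters():
    param.requires_grad = False
model.get_input_embeddings().weight.requires_grad = True

# Sanity check: only the word-embedding weight should remain trainable.
print([name for name, p in model.named_parameters() if p.requires_grad])
```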
At the moment, you can only do this on embedding tasks, which usually have less training data available (they require pairs or triplets). If you had the pretrained model with the masked language modeling (MLM) head, you could do this with the MLM objective instead, which only requires a raw corpus in your target language. Unfortunately, our pretrained model is not currently public.
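To illustrate the pair-based route, here is a hedged sketch using `sentence-transformers` with an in-batch-negatives loss. The training pairs are placeholders, and it assumes a recent `sentence-transformers` version that accepts `trust_remote_code`.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("jinaai/jina-embeddings-v2-base-en", trust_remote_code=True)

# Optionally combine this with the freezing step above, e.g.:
# for p in model[0].auto_model.parameters(): p.requires_grad = False
# model[0].auto_model.get_input_embeddings().weight.requires_grad = True

# Placeholder (anchor, positive) pairs in your target language.
train_examples = [
    InputExample(texts=["a query in your language", "a passage that answers it"]),
    InputExample(texts=["another query", "another matching passage"]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.MultipleNegativesRankingLoss(model)  # in-batch negatives

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
```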