Vocab extension - inquiry

#1
by Ali-C137 - opened

Hello there, I am interested to know how you managed to extend the original model's tokenizer vocabulary to include Korean tokens!?
This would be really helpful for me 🤗
Thanks in advance

Owner

Hi. First of all, this model's vocabulary is not extended with Korean tokens; it uses the original Mixtral tokenizer.

But there are some things I can explain about extending the vocabulary.

Since most LLMs on Hugging Face are not optimized for Korean, many models use around 10 tokens per word. So the Korean LLM community has dug into extending vocabularies.
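(A quick way to see this effect for yourself, as a rough sketch: load any non-Korean-optimized tokenizer and count tokens per word for a Korean sentence. The model ID and sentence below are just illustrative.)

```python
from transformers import AutoTokenizer

# Illustrative only: any tokenizer not optimized for Korean shows the same effect.
# The Mixtral repo may require accepting its license / authentication.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-v0.1")

text = "오늘 날씨가 정말 좋네요"  # "The weather is really nice today"
tokens = tokenizer.tokenize(text)
words = text.split()

print(f"{len(tokens)} tokens for {len(words)} words "
      f"(~{len(tokens) / len(words):.1f} tokens per word)")
```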

First, we started by simply adding vocabulary (tokens) that represent Korean words well (e.g. https://huggingface.co/beomi/llama-2-ko-7b). As you know, after adding new tokens to a model, its text generation is broken, so in the example above the author trained the model on over 40B tokens to fit the Korean vocabulary well. But training on 40B tokens is not an easy job; it takes a long time and is really expensive.
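For readers following along, here is a minimal sketch of that first "just add tokens" step using the standard transformers API (the base model ID and token list are placeholders, not what llama-2-ko actually used):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "meta-llama/Llama-2-7b-hf"  # placeholder base model; repo is gated
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id)

# Placeholder Korean tokens; a real extension would add thousands,
# typically learned from a Korean corpus with a BPE/unigram trainer.
new_tokens = ["안녕하세요", "감사합니다", "대한민국"]
num_added = tokenizer.add_tokens(new_tokens)

# Appends freshly initialized rows to embed_tokens and lm_head.
# Until those rows are trained, generation quality is broken.
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} tokens, new vocab size: {len(tokenizer)}")
```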

So the Korean LLM community started looking for a solution. While the methodology for efficient vocabulary expansion is not yet well established, it roughly follows the approach below.

(Source - https://huggingface.co/yanolja/KoSOLAR-10.7B-v0.2)

Our strategy involved a selective freeze of model parameters. Specifically, we kept most parameters of the base model unchanged while focusing on enhancing the Korean language capabilities. Through our experiments, we discovered:

1. Freezing the embed_tokens layer for existing tokens is crucial to maintain overall performance.
2. Unfreezing the lm_head layer for existing tokens actually boosts performance.
As a result, we froze the internal layers and the first 32,000 embed_tokens, directing our training efforts on a rich mix of Korean and multi-lingual corpora. This balanced approach has notably improved the model’s proficiency in Korean, without compromising its original language capabilities.
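Roughly, that partial freeze could look like the sketch below in plain PyTorch (this is my own reading of the quoted description, not the actual KoSOLAR training code; it assumes a Llama/Mixtral-style `*ForCausalLM` with untied embeddings):

```python
NUM_ORIGINAL_TOKENS = 32_000  # base vocabulary size before extension

def freeze_for_vocab_extension(model):
    # Freeze everything by default (internal decoder layers stay frozen).
    for param in model.parameters():
        param.requires_grad = False

    embed = model.get_input_embeddings()     # embed_tokens
    lm_head = model.get_output_embeddings()  # lm_head

    # lm_head is trained for all tokens (the quoted report found this helps).
    lm_head.weight.requires_grad = True

    # embed_tokens is trainable, but gradients for the original 32,000 rows
    # are zeroed out, so only the newly added token embeddings get updated.
    embed.weight.requires_grad = True

    def mask_original_rows(grad):
        grad = grad.clone()
        grad[:NUM_ORIGINAL_TOKENS] = 0
        return grad

    embed.weight.register_hook(mask_original_rows)
    return model

# Usage (after loading the model and resizing its embeddings):
# model = freeze_for_vocab_extension(model)
```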

We keep searching for better solutions. I hope this has been helpful to you.

Owner

https://huggingface.co/maywell/Mistral-ko-7B-v0.1

This is my attempt at extending the vocabulary; due to budget issues it is undertrained.

Thanks @maywell
I will definitely get back to this discussion later with a more rested head.

Anyway, what we are trying to do is extend the vocab and then train the model on 30B tokens of Arabic text (not full training, but using LoRA), so I believe extending the vocab would definitely be suitable for my case!?
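(For context on the LoRA part: if the vocab is extended, the new embedding and lm_head rows still have to be trained as full weights; with PEFT this is typically done via `modules_to_save`. A rough sketch, with placeholder model ID, tokens, and hyperparameters:)

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-v0.1"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

tokenizer.add_tokens(["مرحبا", "شكرا"])  # placeholder Arabic tokens
model.resize_token_embeddings(len(tokenizer))

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    # Train the (resized) embedding and output head as full weights,
    # otherwise the newly added tokens never learn useful representations.
    modules_to_save=["embed_tokens", "lm_head"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```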

Owner

Of course it is. But I recommend doing full training with the method above.
