TPU continued pretraining, questions

#1
by RASMUS - opened

Hi,

First of all, great work on the Korean models trained with TPUs. We at Finnish-NLP (https://huggingface.co/Finnish-NLP) are inspired by your work and are planning to do something similar, e.g. continued pretraining from Mistral (as in https://www.reddit.com/r/LocalLLaMA/comments/174i0vh/em_german_mistral_continous_pretraining/) or from some other model.
We would be very pleased if you could share how you are doing continued pretraining on TPUs (through the TRC program, as we are too): which framework you are using, how you modify the tokenizer, etc.

I asked this kind of question on the EleutherAI Discord and was advised to look into your work.

Sincerely,
Rasmus T,
Finnish-NLP

Owner

Hello Rasmus,

It's a pleasure to connect with a fellow researcher who shares my interest in multilingual projects.

To start with, I've been using EasyLM (https://github.com/young-geng/EasyLM), the framework used to train OpenLLaMA (https://github.com/openlm-research/open_llama). However, since EasyLM doesn't natively support GQA (grouped-query attention), I implemented it myself.
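To give a sense of what that involves: the core of GQA is sharing each key/value head across a group of query heads, which in practice means repeating the KV heads before the attention products. The JAX sketch below is just an illustration of that idea, not my actual EasyLM patch:

```python
import jax.numpy as jnp

def repeat_kv(x, n_rep):
    """Repeat key/value heads so they line up with the query heads.

    x: (batch, seq, n_kv_heads, head_dim) -> (batch, seq, n_kv_heads * n_rep, head_dim)
    """
    if n_rep == 1:
        return x
    b, s, kv_heads, d = x.shape
    # Each KV head is duplicated n_rep times: every group of n_rep query
    # heads ends up attending with a shared key/value head.
    x = jnp.broadcast_to(x[:, :, :, None, :], (b, s, kv_heads, n_rep, d))
    return x.reshape(b, s, kv_heads * n_rep, d)

def gqa_attention_logits(q, k):
    """q: (batch, q_len, n_q_heads, d), k: (batch, k_len, n_kv_heads, d)."""
    n_rep = q.shape[2] // k.shape[2]
    k = repeat_kv(k, n_rep)
    # Result: (batch, n_q_heads, q_len, k_len), scaled by sqrt(head_dim).
    return jnp.einsum("bqhd,bkhd->bhqk", q, k) / jnp.sqrt(q.shape[-1])
```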

Regarding tokenization, I've been working with corpora in both Korean (like Korean Wiki) and English (such as the Pile). I used Google's SentencePiece library to train a new tokenizer from scratch on a mixed sample (about 1%) of the Korean and English texts, which produced a scored vocabulary and merge list. I then focused on the Korean vocabulary, using a regex to filter and keep the top ~15,000 Korean tokens; that seemed sufficient to cover the majority of Korean vocabulary. Since my aim was to improve Korean tokenization efficiency, I integrated these tokens into the tokenizer of the base model, SOLAR, and of other models like Llama 2.

I also moved away from the SPM (.model) tokenizer file typically used with the SentencePiece library, because it is cumbersome to update. My advice would be to prefer the tokenizer.json format, which is compatible with Hugging Face Tokenizers; it streamlines the workflow and avoids unnecessary delays.
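To make those steps concrete, here is a rough sketch of the pipeline (the file names, vocab size, and model ID are placeholders, and the add_tokens() call is a simpler stand-in for merging the scored vocab and merges into tokenizer.json by hand):

```python
import re
import sentencepiece as spm
from transformers import AutoTokenizer

# 1) Train a fresh SentencePiece BPE model on the small mixed Korean/English
#    sample ("mixed_ko_en_sample.txt" is a placeholder file name).
spm.SentencePieceTrainer.train(
    input="mixed_ko_en_sample.txt",
    model_prefix="ko_en_spm",
    vocab_size=32000,
    model_type="bpe",
)

# 2) Keep only pieces containing Hangul, ranked by SentencePiece score,
#    and cap the list at roughly 15k tokens.
hangul = re.compile(r"[\uac00-\ud7a3]")
scored = []
with open("ko_en_spm.vocab", encoding="utf-8") as f:
    for line in f:
        piece, score = line.rstrip("\n").split("\t")
        piece = piece.replace("\u2581", "")  # drop SPM's word-boundary marker
        if piece and hangul.search(piece):
            scored.append((float(score), piece))
scored.sort(reverse=True)
korean_tokens = list(dict.fromkeys(piece for _, piece in scored))[:15000]

# 3) Extend the base tokenizer and save it in the fast tokenizer.json format.
base = AutoTokenizer.from_pretrained("upstage/SOLAR-10.7B-v1.0")
num_added = base.add_tokens(korean_tokens)
base.save_pretrained("solar-ko-tokenizer")  # writes tokenizer.json
print(f"added {num_added} Korean tokens")
```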

In terms of model expansion, I worked on enlarging the base model's embeddings and lm_head through the Hugging Face Transformers Library. There are multiple ways to initialize new vectors, but I recommend using the mean vector approach. For instance, if you have a new token "pajama," and the original tokenizer splits it into {"_pa", "ja", "ma"}, then the new "_pajama" token should be initialized with the average of these component vectors. This approach tends to stabilize training during the warm-up phase, reducing the likelihood of significant forgetting in the base model.
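A simplified sketch of that initialization, with placeholder paths and model IDs rather than my exact script:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model_id = "upstage/SOLAR-10.7B-v1.0"                            # placeholder base model
old_tokenizer = AutoTokenizer.from_pretrained(base_model_id)
new_tokenizer = AutoTokenizer.from_pretrained("solar-ko-tokenizer")   # extended tokenizer

model = AutoModelForCausalLM.from_pretrained(base_model_id, torch_dtype=torch.bfloat16)
old_vocab_size = model.get_input_embeddings().weight.shape[0]

# Grow both the input embeddings and lm_head to the new vocabulary size.
model.resize_token_embeddings(len(new_tokenizer))
embeddings = model.get_input_embeddings().weight.data
lm_head = model.get_output_embeddings().weight.data

with torch.no_grad():
    for new_id in range(old_vocab_size, len(new_tokenizer)):
        token = new_tokenizer.convert_ids_to_tokens(new_id)
        # Tokenize the new token's surface form with the *old* tokenizer and
        # average the embeddings of its pieces, e.g. "_pajama" from
        # {"_pa", "ja", "ma"}.
        piece_ids = old_tokenizer(token.replace("\u2581", " "),
                                  add_special_tokens=False).input_ids
        if piece_ids:
            embeddings[new_id] = embeddings[piece_ids].mean(dim=0)
            lm_head[new_id] = lm_head[piece_ids].mean(dim=0)

model.save_pretrained("solar-ko-initialized")
```

Only the newly added rows are touched here, so the base model's behavior on existing tokens is unchanged before training starts.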

The rest of the process mirrors what you're already familiar with: training the model on your own corpus.

I sincerely hope you find this information useful for your work.

Best regards,
Junbum Lee

beomi changed discussion status to closed
