How to train Dolly 2.0 with a brand-new raw data set (i.e. replace Pythia and use a new language)?

#80
by deepthoughts - opened

All of the training code that I looked at appears to expect an existing model. How can I train a brand-new model (say, for a new language) entirely from scratch, so as not to pollute the model if it's going to be optimized for a certain language, for example?

My first thought is to go and explore how Pythia was trained and what code was used to train it. However, I thought I'd ask here first in case there is already a way to use Dolly 2.0's training code.

Great job 🤗!

Databricks org

You should be able to easily find the Pythia code and paper: https://github.com/EleutherAI/pythia
But you'll see that the base Pythia 12B model's training took 72,300 A100 hours (https://arxiv.org/pdf/2304.01373.pdf), which would cost you hundreds of thousands of dollars to reproduce.
That work goes into developing a base understanding of language that you want to reuse, not throw away and reproduce.
It also saw about 200B tokens, and you would need a corpus of that order of magnitude on hand for this to make sense.

It's rather the point here that outside of big orgs, this just isn't feasible; fine-tuning very much is.
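
For illustration, here is a minimal sketch of what that fine-tuning could look like with plain Hugging Face transformers. This is not the dolly repo's exact training script; the Pythia size, prompt format, and hyperparameters are illustrative assumptions:

```python
# Minimal sketch, assuming you fine-tune an existing Pythia checkpoint with
# plain Hugging Face transformers. This is NOT the dolly repo's actual
# training script; the model size, prompt format, and hyperparameters below
# are illustrative assumptions.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "EleutherAI/pythia-2.8b"  # any Pythia size; pick what fits your GPU
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-NeoX tokenizers have no pad token
model = AutoModelForCausalLM.from_pretrained(model_name)

# databricks/databricks-dolly-15k is the instruction dataset Dolly 2.0 used;
# swap in your own (e.g. translated or native-language) instruction data here.
dataset = load_dataset("databricks/databricks-dolly-15k", split="train")

def format_and_tokenize(example):
    # Naive instruction/response concatenation; dolly uses a richer prompt format.
    text = f"{example['instruction']}\n\n{example['response']}"
    return tokenizer(text, truncation=True, max_length=1024)

tokenized = dataset.map(format_and_tokenize, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="dolly-finetune",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        num_train_epochs=1,
        fp16=True,
    ),
    train_dataset=tokenized,
    # mlm=False gives standard causal-LM labels (labels = input_ids)
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```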

But yes, its training set is mostly English. Not entirely, and to a modest degree some language learning is language-agnostic.
You'd be much better served looking for another base model that covers the language of interest (what language?).

That's very informative. Thank you.

My thinking was that the model I create would have significantly fewer parameters (millions), in the hope that once the language is learned, fine-tuning to specific domains in that language will outperform large LLMs trained on an entirely different language.

I'm looking to train a model on an Eastern European language, for example Ukrainian, Bulgarian, etc.

Databricks org

Maybe; you may find that you just don't get enough language capability with a much smaller model. Another big factor is whether you have enough Bulgarian (etc.) text to support the training.
You can also just see how well off-the-shelf Pythia (or other) models understand these languages; it could be surprising.
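
For example, a probe as simple as this (the checkpoint size and the Bulgarian prompt, "The capital of Bulgaria is", are arbitrary choices) can give a first impression:

```python
# Quick capability probe of an off-the-shelf checkpoint in the target language.
from transformers import pipeline

generator = pipeline("text-generation", model="EleutherAI/pythia-1.4b")
print(generator("Столицата на България е", max_new_tokens=30)[0]["generated_text"])
```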

What do you think about the benefit of taking the Pythia model and dumping a bunch of language-specific raw data at it to improve the quality of that particular language? I tried the model with Ukrainian and Bulgarian and it's subpar; it starts conflating the languages a bit.
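
Concretely, I'm imagining something like the following (a rough sketch only; the corpus file, model size, and hyperparameters are placeholders):

```python
# Rough sketch of the idea: continued causal-LM pretraining of a Pythia
# checkpoint on raw target-language text. The corpus path and hyperparameters
# are placeholders, and the base tokenizer's Cyrillic coverage may be poor.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "EleutherAI/pythia-410m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# One document per line of raw Ukrainian/Bulgarian text (placeholder file).
raw = load_dataset("text", data_files={"train": "uk_bg_corpus.txt"})["train"]
tokenized = raw.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=1024),
    remove_columns=["text"],
)

Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="pythia-continued-pretrain",
        per_device_train_batch_size=2,
        num_train_epochs=1,
        fp16=True,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```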

srowen changed discussion status to closed
