Could you share details about the pre-trained model?

by ryota39 - opened

Hi, thanks for sharing this great work!
What kind of pre-trained model did you use?
I would like to know whether you used a Japanese corpus during the pre-training stage.
If you did, I would be glad if you could let me know how many tokens you trained on.

Here's the model I pre-trained: NilanE/tinyllama-relora-merge

However, it's not particularly good. You'd likely get similar results if you did an SFT run on top of base tinyllama.

It was trained for about 6 hours total on an A5000 using relora and axolotl, which is a miserably small amount.

The dataset is ~400 MB (not sure about the token count) of English and Japanese fanfiction (I added the English fanfiction to avoid catastrophic forgetting; it makes up about 1/8th of the total).
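For anyone curious what that mix looks like in practice, here is a rough sketch (not my exact script) of interleaving the two corpora at roughly a 7:1 Japanese-to-English ratio and estimating the token count from a small sample. The dataset names are the real ones; the text column name, sample size, and tokenizer choice are just assumptions for illustration.

```python
# Sketch: mix Japanese and English data at ~7:1 and estimate total tokens.
# Column name ("text"), sample size, and tokenizer are illustrative assumptions.
from datasets import load_dataset, interleave_datasets
from transformers import AutoTokenizer

ja = load_dataset("RyokoAI/Syosetu711K", split="train", streaming=True)
en = load_dataset("RyokoAI/ScribbleHub17K", split="train", streaming=True)

# ~1/8th English overall -> sample Japanese vs. English at 0.875 : 0.125
mixed = interleave_datasets([ja, en], probabilities=[0.875, 0.125], seed=42)

tokenizer = AutoTokenizer.from_pretrained("NilanE/tinyllama-relora-merge")

# Measure tokens-per-byte on a sample, then extrapolate to the full ~400 MB.
sample_tokens, sample_bytes, n = 0, 0, 1000
for i, row in enumerate(mixed):
    if i >= n:
        break
    text = row["text"]  # assumed column name
    sample_tokens += len(tokenizer(text)["input_ids"])
    sample_bytes += len(text.encode("utf-8"))

tokens_per_byte = sample_tokens / sample_bytes
print(f"~{tokens_per_byte * 400e6 / 1e6:.0f}M tokens estimated for 400 MB")
```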

The dataset is based on RyokoAI/Syosetu711K for the Japanese portion and RyokoAI/ScribbleHub17K for the English. I did some quality filtering and regex stuff, but not very much overall.
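If it helps, the filtering was roughly along these lines. This is a simplified sketch rather than the exact rules; the patterns and thresholds below are placeholders, not what actually went into the dataset.

```python
import re

# Illustrative cleaning/filtering helpers; patterns and thresholds are placeholders.
HTML_TAG = re.compile(r"<[^>]+>")
RUBY_MARKUP = re.compile(r"[|｜]([^《]+)《[^》]+》")   # strip furigana ruby annotations
REPEATED_PUNCT = re.compile(r"([!?！？。])\1{3,}")     # runs of repeated punctuation

def clean(text: str) -> str:
    text = HTML_TAG.sub("", text)
    text = RUBY_MARKUP.sub(r"\1", text)          # keep the base text, drop the reading
    text = REPEATED_PUNCT.sub(r"\1\1\1", text)   # cap punctuation runs at three
    return text.strip()

def keep(text: str) -> bool:
    if len(text) < 500:                          # drop very short chapters
        return False
    ja_chars = sum("\u3040" <= c <= "\u30ff" or "\u4e00" <= c <= "\u9fff" for c in text)
    # keep documents that are either clearly Japanese or purely English
    return ja_chars / len(text) > 0.3 or ja_chars == 0

if __name__ == "__main__":
    sample = "｜薄暗《うすぐら》い部屋で、彼女は<br>笑った！！！！！"
    print(clean(sample))
```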

The pre-training is the weakest link in the chain though, and I believe it's holding the final model back by a lot. If I had the funds to do it again, I'd use a lot more data and add in a lot of English literature to teach the model creative writing, to help with the too-literal translations it makes, among other things.

Also, check out NilanE/tinyllama-en_ja-translation-v3. It's massively improved over v2 in every way (still uses the same base model though).
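If you want to try it quickly, a minimal transformers snippet like the one below should work. The prompt template and example sentence here are only illustrative; check the model card for the exact expected format.

```python
# Quick generation sketch; the prompt template is an assumption, see the model card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "NilanE/tinyllama-en_ja-translation-v3"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompt = (
    "Translate this from Japanese to English:\n"
    "### JAPANESE:\n昨日、図書館で面白い本を見つけた。\n"
    "### ENGLISH:\n"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```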

Thanks for sharing the training details! That clears everything up.
In the same situation as yours, with the same GPU resources and corpus, I would probably take the same approach.
I found your v3 model after leaving this message. Great work!

NilanE changed discussion status to closed
