Train on a new unsupported language, e.g. German?

#24
by kmdanikowski - opened

Has anyone trained the model for a language not currently supported? I have a few (German, Russian, Polish, Ukrainian, Greek, Korean) I'd like to train it for (the 3B model) and would like to know what it would take.

I was thinking:
Option 1: continued pretraining on a large corpus in the new language, then finetuning on translated xP3 prompts
Option 2: finetuning on translated xP3 prompts only

I'd prefer to just do Option 2, but that might not be sufficient. I'd love any insights others have had (a rough sketch of the kind of data I mean for Option 2 is at the end of this post).

PS: The xP3 paper (https://arxiv.org/pdf/2211.01786v1.pdf) specifically notes that a small amount of unspecified languages made it into the ROOTS dataset BLOOM was trained on. I have observed that it knows some German (though not very well).
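To make Option 2 concrete, here's a rough sketch of the kind of (inputs, targets) pairs I have in mind, using MLSUM's German split and a hand-translated prompt purely as placeholders (not what xP3 actually uses):

```python
# Rough sketch of Option 2's data: turn an existing German dataset into
# xP3-style (inputs, targets) pairs using a translated prompt.
# The dataset and prompt wording are placeholders.
from datasets import load_dataset

raw = load_dataset("mlsum", "de", split="train[:1%]")  # German article/summary pairs

PROMPT = "Fasse den folgenden Artikel zusammen:\n\n{text}\n\nZusammenfassung:"

def to_xp3_style(example):
    return {
        "inputs": PROMPT.format(text=example["text"]),
        "targets": example["summary"],
    }

prompted = raw.map(to_xp3_style, remove_columns=raw.column_names)
print(prompted[0]["inputs"][:200])
```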

BigScience Workshop org

Hey, we recently tried adding a language to BLOOMZ in this paper: https://arxiv.org/abs/2212.09535
The best way to go would be to construct an xP3 corpus for the languages you would like (similar to xP3ru from the paper) and then finetune BLOOM on it, or continue finetuning BLOOMZ.
cc @yongzx
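
A minimal sketch of that continued finetuning, assuming you already have prompted (inputs, targets) data; the model choice, file name, and hyperparameters below are purely illustrative, and note that the actual xP3 finetuning computes the loss only on the targets rather than on the full sequence:

```python
# Minimal sketch: continue finetuning BLOOMZ (or finetune BLOOM) on xP3-style
# (inputs, targets) pairs for a new language. Hyperparameters are illustrative;
# "my_xp3_german.jsonl" stands in for whatever prompted data you build.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "bigscience/bloomz-3b"  # or "bigscience/bloom-3b" to start from BLOOM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompted = load_dataset("json", data_files="my_xp3_german.jsonl", split="train")

def tokenize(example):
    # Plain causal-LM finetuning on prompt + target. The real xP3 training
    # masks the loss on the prompt; this simpler version trains on everything.
    text = example["inputs"] + " " + example["targets"] + tokenizer.eos_token
    return tokenizer(text, truncation=True, max_length=1024)

tokenized = prompted.map(tokenize, remove_columns=prompted.column_names)

args = TrainingArguments(
    output_dir="bloomz-3b-newlang",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=2e-5,
    num_train_epochs=1,
    bf16=True,
    logging_steps=50,
    save_steps=500,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```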

Hello @Muennighoff, this is excellent, I love this paper! I can't seem to find the models or weights used in the experiments. Would you happen to have those available anywhere?

Based on the paper, I think xP3 would be a great solution for my use with the 3B model. For the 560M one, I'm trying to build a multilingual summarizer that would work in a few more untrained languages (a main language of interest here is Romanian). Would xP3 work for this too? The paper seems to suggest pretraining would be the way to go.

BigScience Workshop org

Sure, the models are all on the hub under either bigscience or bs-la, e.g. https://huggingface.co/bs-la/bloomz-7b1-4b-xp3ru
Let me know if you don't find a specific one.

xP3 with the languages you want should work well. I would recommend adding your languages to the existing xP3 mixture (i.e. keeping the other languages even if you don't need them), as this worked better than e.g. single-language xP3 with only Russian, as shown in the paper.
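
Building that mixture could look roughly like this; the per-language config names and the new-language file below are illustrative, so check the dataset card for the exact layout:

```python
# Sketch: add a new language to the existing xP3 mixture instead of
# finetuning on that language alone. Config names are illustrative;
# see https://huggingface.co/datasets/bigscience/xP3 for the exact layout.
# xP3 is large, so consider streaming or subsampling for experiments.
from datasets import load_dataset, concatenate_datasets

xp3_en = load_dataset("bigscience/xP3", "en", split="train")
xp3_fr = load_dataset("bigscience/xP3", "fr", split="train")
# ...load whichever other xP3 languages you want to keep...

# Your newly built prompted data for the unsupported language, with the same
# "inputs"/"targets" columns as xP3 ("my_xp3_german.jsonl" is a placeholder).
new_lang = load_dataset("json", data_files="my_xp3_german.jsonl", split="train")

mixture = concatenate_datasets([xp3_en, xp3_fr, new_lang]).shuffle(seed=42)
```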

BigScience Workshop org

Hi @kmdanikowski, I think @Muennighoff has answered all your questions. Models can be found at https://huggingface.co/bs-la/, and we are still uploading the rest of them. Our codebase can be found at https://github.com/bigscience-workshop/multilingual-modeling.

Let us know if you have any other questions!
