Pipeline of additional languages to be supported by BLOOM

#141
by anthonyrathe - opened

Hi,

On the official BLOOM blog, it is mentioned that the BLOOM community is working on adding support for additional languages. Would anyone know which languages are included in this pipeline? More specifically, I was hoping Dutch would be included, but I couldn't find any confirmation of that.

Many thanks in advance,
Anthony

anthonyrathe changed discussion title from Pipeline of additional languages supported by BLOOM to Pipeline of additional languages to be supported by BLOOM
BigScience Workshop org
edited Nov 16, 2022

Hi Anthony, the current pipeline includes German, Russian, and Thai. Some models can be found at https://huggingface.co/bs-la (mostly German and Russian), and we are still running experiments on Guarani, Amharic, and Afrikaans. We haven't included Dutch so far because, according to "The State and Fate of Linguistic Diversity and Inclusion in the NLP World" (Joshi et al., 2020), it is considered as high-resource as Russian, and, like German, it uses the Latin script, which is already seen in the pretraining corpus. Hence we are currently focusing on other, lower-resource languages.

Het wordt hoog tijd dat het Nederlands wordt gebruikt voor het testen van AI. Dat kan alleen wanneer we bij al onze communicatie Nederlands spreken. Hoewel andere talen door meer mensen worden gesproken, wordt er juist in Nederland naar verhouding veel meer onderzoek gedaan naar gebruik van Artificial Intelligence. Helaas vindt publicatie meestal in het Engels plaats, terwijl dat totaal onnodig is omdat de vertaalmachine s inmiddels zo goed zijn, dat ook wetenschappelijke literatuur goed kan worden gelezen, wanneer die literatuur in het Nederlands wordt geschreven.

BigScience Workshop org
edited Dec 3, 2022

Okay, I have a bit of a weird flex related to the previous message. Not being a Dutch speaker, I tried using BLOOM to translate it, with the following prompt:

Dutch: Het wordt hoog tijd dat het Nederlands wordt gebruikt voor het testen van AI. Dat kan alleen wanneer we bij al onze communicatie Nederlands spreken. Hoewel andere talen door meer mensen worden gesproken, wordt er juist in Nederland naar verhouding veel meer onderzoek gedaan naar gebruik van Artificial Intelligence. Helaas vindt publicatie meestal in het Engels plaats, terwijl dat totaal onnodig is omdat de vertaalmachine s inmiddels zo goed zijn, dat ook wetenschappelijke literatuur goed kan worden gelezen, wanneer die literatuur in het Nederlands wordt geschreven.
English:

It started generating (in greedy mode):

It is high time that the Dutch language is used for testing AI. This can only be done when we speak Dutch in all our communication. Although other languages are spoken by more people, there is just more research in the Netherlands on the use of Artificial Intelligence. Unfortunately, publications are usually in English, although this is totally unnecessary because the translation machine is now so good that scientific literature can also be read well when it is written in Dutch.

I double-checked using Google Translate:

It is high time that Dutch is used for testing AI. This is only possible if we speak Dutch in all our communication. Although other languages are spoken by more people, relatively more research is being done into the use of Artificial Intelligence in the Netherlands. Unfortunately, publication usually takes place in English, while that is totally unnecessary because the translation machines are now so good that scientific literature can also be read properly when that literature is written in Dutch.

And using DeepL:

It is high time that Dutch is used for testing AI. This is only possible if we speak Dutch in all our communications. Although other languages are spoken by more people, it is precisely in the Netherlands that much more research is being done into the use of Artificial Intelligence. Unfortunately, publication usually takes place in English, while that is totally unnecessary because translation machine s are now so good, that even scientific literature can be read well, when written in Dutch.

It's not perfect, but it's quite incredible, since the model wasn't "trained" to do machine translation specifically, nor did it "officially" have any Dutch in its training data. There's probably some language leakage somewhere.
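
For anyone who wants to try the same experiment, here is a minimal sketch of the prompt format and greedy decoding described above, using the `transformers` library. The helper names (`build_translation_prompt`, `translate_greedy`) are made up for illustration, and the small `bigscience/bloom-560m` checkpoint is an assumption chosen so it can run on a laptop; the output quoted above presumably came from the full BLOOM model, so results from the small checkpoint will likely be worse. Cutting the completion at the first newline is also just a convenience, not part of the original experiment.

```python
def build_translation_prompt(text, source_lang="Dutch", target_lang="English"):
    """Build the two-line prompt used above: '<source>: <text>' then '<target>:'."""
    return f"{source_lang}: {text}\n{target_lang}:"


def translate_greedy(text, model_name="bigscience/bloom-560m", max_new_tokens=128):
    """Let a BLOOM checkpoint continue the prompt with greedy decoding.

    model_name is an assumption (a small checkpoint for local testing).
    """
    # Imported here so the prompt helper works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    prompt = build_translation_prompt(text)
    inputs = tokenizer(prompt, return_tensors="pt")

    # do_sample=False with the default single beam is greedy decoding.
    output_ids = model.generate(
        **inputs, do_sample=False, max_new_tokens=max_new_tokens
    )

    # Drop the prompt tokens and keep only the newly generated continuation.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    completion = tokenizer.decode(new_tokens, skip_special_tokens=True)

    # Assumption: the translation is the first line of the continuation.
    return completion.split("\n")[0].strip()
```

Calling `translate_greedy("Het wordt hoog tijd dat ...")` downloads the checkpoint on first use and returns the model's continuation after `English:`.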

So I agree: maybe one day we'll move away from English-centric scientific publications, and that day might come sooner rather than later.

@yongzx thanks a lot for your reply! (And apologies for not replying sooner myself)
I suppose it will be up to the broader Dutch-speaking AI community to fine-tune BLOOM on a Dutch corpus then :)

anthonyrathe changed discussion status to closed
