---
license: cc-by-nc-4.0
tags:
- mms
---

# Massively Multilingual Speech (MMS) - Common Crawl Language Models

This repository contains n-gram language models trained on Common Crawl data ([Conneau et al. 2020b](https://aclanthology.org/2020.acl-main.747/), [NLLB Team et al. 2022](https://arxiv.org/abs/2207.04672)) using the [KenLM library](https://github.com/kpu/kenlm).

## Table of Contents

- [Example](#example)
- [Supported Languages](#supported-languages)
- [Model details](#model-details)
- [Additional links](#additional-links)

## Example

```py
TODO
```
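
Until an official example is added, the following is a minimal sketch of how an n-gram language model is typically combined with acoustic scores to pick between speech recognition hypotheses. It is not the authors' decoding setup: `ToyBigramLM`, `rescore`, `lm_weight`, `word_bonus`, the training sentences, and the acoustic scores are all hypothetical and exist only for illustration. The toy model stands in for a KenLM model and, like KenLM, returns log10 probabilities.

```py
import math
from collections import Counter


class ToyBigramLM:
    """Tiny add-one-smoothed bigram model; a stand-in for a real KenLM model."""

    def __init__(self, sentences):
        self.context_counts = Counter()
        self.bigram_counts = Counter()
        self.vocab = set()
        for sentence in sentences:
            tokens = ["<s>"] + sentence.split() + ["</s>"]
            self.vocab.update(tokens[1:])
            for prev, word in zip(tokens, tokens[1:]):
                self.context_counts[prev] += 1
                self.bigram_counts[(prev, word)] += 1

    def score(self, sentence):
        """Return log10 P(sentence), including the end-of-sentence token."""
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        vocab_size = len(self.vocab) + 1  # one extra slot for unseen words
        total = 0.0
        for prev, word in zip(tokens, tokens[1:]):
            numerator = self.bigram_counts[(prev, word)] + 1
            denominator = self.context_counts[prev] + vocab_size
            total += math.log10(numerator / denominator)
        return total


def rescore(hypotheses, lm, lm_weight=0.5, word_bonus=0.0):
    """Return the (text, acoustic_score) pair with the best combined score."""
    def combined(item):
        text, acoustic_score = item
        return (acoustic_score
                + lm_weight * lm.score(text)
                + word_bonus * len(text.split()))
    return max(hypotheses, key=combined)


# Hypothetical training text and acoustic scores (higher is better).
lm = ToyBigramLM(["the cat sat on the mat", "the dog sat on the rug"])
hypotheses = [
    ("the cat sat on the mat", -4.1),
    ("the cat sad on the mat", -4.0),
]

best_text, _ = rescore(hypotheses, lm)
print(best_text)
```

With a real model from this repository, `ToyBigramLM` could be replaced by `kenlm.Model(path)` from the KenLM Python module, whose `score(sentence)` method also returns a log10 probability. The path to the model file is left to the reader, and the weights would need tuning on held-out data.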

## Supported Languages

We provide language models for 102 languages. Click the following to toggle the full list of supported languages, given as [ISO 639-3 codes](https://en.wikipedia.org/wiki/ISO_639-3).
You can find more details about the languages and their ISO 639-3 codes in the [MMS Language Coverage Overview](https://dl.fbaipublicfiles.com/mms/misc/language_coverage_mms.html).

<details>
<summary>Click to toggle</summary>

- afr
- amh
- ara
- asm
- ast
- azj
- bel
- ben
- bos
- bul
- cat
- ceb
- ces
- ckb
- cmn
- cym
- dan
- deu
- ell
- eng
- est
- fas
- fin
- fra
- ful
- gle
- glg
- guj
- hau
- heb
- hin
- hrv
- hun
- hye
- ibo
- ind
- isl
- ita
- jav
- jpn
- kam
- kan
- kat
- kaz
- kea
- khm
- kir
- kor
- lao
- lav
- lin
- lit
- ltz
- lug
- luo
- mal
- mar
- mkd
- mlt
- mon
- mri
- mya
- nld
- nob
- npi
- nso
- nya
- oci
- orm
- ory
- pan
- pol
- por
- pus
- ron
- rus
- slk
- slv
- sna
- snd
- som
- spa
- srp
- swe
- swh
- tam
- tel
- tgk
- tgl
- tha
- tur
- ukr
- umb
- urd
- uzb
- vie
- wol
- xho
- yor
- yue
- zlm
- zul
</details>

## Model details

- **Developed by:** Vineel Pratap et al.
- **Model type:** n-gram language models for multilingual Automatic Speech Recognition
- **Language(s):** 102 languages, see [supported languages](#supported-languages)
- **License:** CC-BY-NC 4.0
- **Num parameters:** 1 billion
- **Audio sampling rate:** 16,000 Hz (16 kHz)
- **Cite as:**

      @article{pratap2023mms,
        title={Scaling Speech Technology to 1,000+ Languages},
        author={Vineel Pratap and Andros Tjandra and Bowen Shi and Paden Tomasello and Arun Babu and Sayani Kundu and Ali Elkahky and Zhaoheng Ni and Apoorv Vyas and Maryam Fazel-Zarandi and Alexei Baevski and Yossi Adi and Xiaohui Zhang and Wei-Ning Hsu and Alexis Conneau and Michael Auli},
        journal={arXiv},
        year={2023}
      }

## Additional Links

- [Blog post](https://ai.facebook.com/blog/multilingual-model-speech-recognition/)
- [Transformers documentation](https://huggingface.co/docs/transformers/main/en/model_doc/mms)
- [Paper](https://arxiv.org/abs/2305.13516)
- [GitHub Repository](https://github.com/facebookresearch/fairseq/tree/main/examples/mms#asr)
- [Other **MMS** checkpoints](https://huggingface.co/models?other=mms)
- MMS base checkpoints:
  - [facebook/mms-1b](https://huggingface.co/facebook/mms-1b)
  - [facebook/mms-300m](https://huggingface.co/facebook/mms-300m)
- [Official Space](https://huggingface.co/spaces/facebook/MMS)