
NeMo Megatron-mT5 3B


Model Description

NeMo Megatron-mT5 3B is a multilingual transformer-based masked language model. mT5 [1] is a class of encoder-decoder models trained with a span-based masked language modeling objective on a dataset comprising documents from many different languages. We follow the T5v1.1 approach of pre-training using only the masked language modeling objective. The checkpoint was trained with Tensor Parallelism (TP) of 2 and Pipeline Parallelism (PP) of 1; it should fit on a single NVIDIA GPU for inference and on 2 A100 80G GPUs for fine-tuning.
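To make the span-based masking objective concrete, here is a minimal sketch of how one training pair could be built. It is an illustration only, not the NeMo Megatron data pipeline: the helper `span_corrupt` is hypothetical, the spans are chosen by hand rather than sampled, whitespace splitting stands in for subword tokenization, and the closing sentinel that the T5 paper appends to the target is omitted. The `<extra_id_N>` sentinel naming follows the `masked_input` field shown in the inference example.

```python
from typing import List, Tuple


def span_corrupt(
    tokens: List[str], spans: List[Tuple[int, int]]
) -> Tuple[List[str], List[str]]:
    """Mask the given [start, end) spans (hypothetical helper).

    Each masked span is replaced in the input by a single sentinel token.
    The target lists every sentinel followed by the tokens it hides.
    """
    inputs: List[str] = []
    targets: List[str] = []
    cursor = 0
    for sentinel_id, (start, end) in enumerate(sorted(spans)):
        if start < cursor or end <= start or end > len(tokens):
            raise ValueError("spans must be non-empty, in range, and non-overlapping")
        sentinel = f"<extra_id_{sentinel_id}>"
        inputs.extend(tokens[cursor:start])  # keep the unmasked prefix
        inputs.append(sentinel)              # one sentinel stands in for the span
        targets.append(sentinel)
        targets.extend(tokens[start:end])    # the model must reproduce the span
        cursor = end
    inputs.extend(tokens[cursor:])           # keep whatever follows the last span
    return inputs, targets


# Whitespace tokens are a simplification; the real model works on subwords.
tokens = "La capitale de la France est Paris".split()
model_input, model_target = span_corrupt(tokens, [(1, 2), (6, 7)])
print(model_input)
print(model_target)
```

The encoder sees `model_input` and the decoder is trained to emit `model_target`, which is why a prompt ending in a single mask can be completed at inference time.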

This model was trained with NeMo Megatron.

NOTE: Weights are distributed in bfloat16.
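As a rough sanity check on the single-GPU inference claim, the memory taken by the weights alone can be estimated from the parameter count and the storage type. The sketch below assumes the nominal "3B" means exactly 3 billion parameters (the exact count is not stated here) and ignores activations, decoding state, and framework overhead, so actual usage will be higher.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for the weights alone, in GiB (hypothetical helper)."""
    return num_params * bytes_per_param / 1024**3


# Assumption: "3B" is taken as exactly 3e9 parameters.
NOMINAL_PARAMS = 3e9

# bfloat16 stores each parameter in 2 bytes; float32 would need 4.
print(f"bfloat16 weights: ~{weight_memory_gib(NOMINAL_PARAMS):.1f} GiB")
print(f"float32 weights:  ~{weight_memory_gib(NOMINAL_PARAMS, 4):.1f} GiB")
```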

List of Languages

We pre-trained our mT5 model on the following languages from the mC4 dataset.

  1. Japanese
  2. English
  3. Italian
  4. Latvian
  5. Russian
  6. Hungarian
  7. Chinese
  8. Polish
  9. Greek
  10. German
  11. Czech
  12. Korean
  13. Hindi
  14. Norwegian
  15. Danish
  16. Slovak
  17. French
  18. Portuguese
  19. Lithuanian
  20. Spanish
  21. Dutch
  22. Swedish
  23. Romanian
  24. Finnish

NOTE: The English data used to train our model is the smaller "clean" version (C4) used in the T5 paper and not the larger one distributed as part of mC4.

Getting started

Step 1: Install NeMo and dependencies

You will need to install NVIDIA Apex and NeMo.

git clone https://github.com/ericharper/apex.git
cd apex
git checkout nm_v1.11.0
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" --global-option="--fast_layer_norm" --global-option="--distributed_adam" --global-option="--deprecated_fused_adam" ./
pip install nemo_toolkit['nlp']==1.12.0

Alternatively, you can use the NeMo Megatron training Docker container, which has all dependencies pre-installed: https://developer.nvidia.com/nemo-megatron-open-beta?nvid=nv-int-tblg-249896

Step 2: Run inference

Note: The model was trained with Tensor Parallelism (TP) of 2 and Pipeline Parallelism (PP) of 1, but it should be possible to run inference with a tensor parallel size of 1 on most NVIDIA GPUs.

git clone https://github.com/NVIDIA/NeMo.git 
cd NeMo/examples/nlp/language_modeling
git checkout r1.12.0
python megatron_t5_eval.py \
    --model_file nemo_megatron_mt5_3b_bf16_tp2.nemo \
    --prompt "La capitale de la France est <mask>" \
    --tensor_model_parallel_size 2

The script will automatically replace all <mask> tokens with the appropriate sentinel tokens used while pre-training and attempt to fill them in autoregressively with greedy decoding.
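The mask-to-sentinel substitution described above can be sketched in a few lines. This is not the code in `megatron_t5_eval.py`, which operates on tokenized input; the function `replace_masks` is a hypothetical string-level stand-in that numbers each `<mask>` in order of appearance, using the `<extra_id_N>` naming seen in the expected response.

```python
import itertools
import re


def replace_masks(prompt: str, mask: str = "<mask>") -> str:
    """Replace each mask with the next sentinel, left to right (hypothetical helper)."""
    ids = itertools.count()  # yields 0, 1, 2, ... one per mask found
    return re.sub(
        re.escape(mask),
        lambda _match: f"<extra_id_{next(ids)}>",
        prompt,
    )


print(replace_masks("La capitale de la France est <mask>"))
print(replace_masks("<mask> est la capitale de <mask>"))
```

Each mask gets a distinct sentinel so the decoder can signal which gap it is filling.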

Expected Response:

  'prompt': 'La capitale de la France est <mask>',
  'completion': {
    'text': 'Paris',
    'tokens': [(4586, '▁Paris', 0.0)]},
    'masked_input': '▁La ▁capital e ▁de ▁la ▁France ▁est ▁<extra_id_0>'
  • prompt: The provided raw prompt as input
  • completion:
    • text: The final text generated by the model, including any special/sentinel tokens other than </s>.
    • tokens: Each generated subword along with its log-probability.
  • masked_input: The original raw prompt with <mask> replaced by the appropriate sentinel tokens.
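The fields above can be consumed as follows. This sketch assumes the response is available as a Python dict shaped like the expected response, and that each entry in `tokens` is a tuple whose first element is the token id, second the subword, and third the log-probability; the function `summarize` is hypothetical.

```python
import math

# Sample response copied from the expected output shown above.
response = {
    "prompt": "La capitale de la France est <mask>",
    "completion": {
        "text": "Paris",
        "tokens": [(4586, "▁Paris", 0.0)],
    },
    "masked_input": "▁La ▁capital e ▁de ▁la ▁France ▁est ▁<extra_id_0>",
}


def summarize(resp: dict) -> dict:
    """Collect the generated text, subwords, and overall probability (hypothetical helper)."""
    tokens = resp["completion"]["tokens"]
    # Log-probabilities add up across tokens; exp() turns the sum into a probability.
    total_logprob = sum(logprob for _token_id, _piece, logprob in tokens)
    return {
        "text": resp["completion"]["text"],
        "pieces": [piece for _token_id, piece, _logprob in tokens],
        "sequence_probability": math.exp(total_logprob),
    }


print(summarize(response))
```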

Training Data

The model was trained on the mC4 dataset made available by AI2 and hosted on Hugging Face.

Evaluation results

Zero-shot language transfer performance on the XNLI dataset [4] for a model fine-tuned on MNLI.

English  Spanish  German  French  Chinese
89.4     86.4     84.5    85.8    79.9


The model was trained on data originally crawled from the Internet. This data contains toxic language and societal biases. Therefore, the model may amplify those biases and return toxic responses, especially when prompted with toxic prompts.


[1] mT5: A massively multilingual pre-trained text-to-text transformer

[2] Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

[3] NVIDIA NeMo Toolkit

[4] XNLI: Evaluating Cross-lingual Sentence Representations


The license to use this model is covered by CC-BY-4.0. By downloading the public and release version of the model, you accept the terms and conditions of the CC-BY-4.0 license.
