Faton Rekathati committed on
Commit
0fd282e
1 Parent(s): e8619ad

UL2 conversion instructions

README.md ADDED
@@ -0,0 +1,64 @@
1
+ ## Checkpoints and conversion scripts for Nemo ckpt files to Huggingface
2
+
3
+ This repo contains two checkpoints (`.ckpt` files) for UL2 models we have started pretraining with Nemo. The checkpoints are found in `nemo_checkpoints/`. The Nemo config files used to train these models can be found in `nemo_config/ul2-base-nl36`.
4
+
5
+ `megatron_ul2--val_loss=2.54-step=7000-consumed_samples=14557920.0.ckpt` was trained with `megatron_legacy: False` in the config, whereas the other checkpoint was trained with `megatron_legacy: True`.
6
+
7
+ Nvidia have created a conversion script that converts T5, T5v1.1 and UL2 models on Huggingface Hub to Nemo format. The script can be found [here](https://github.com/NVIDIA/NeMo/blob/main/scripts/nlp_language_modeling/hf_t5-v1_1_to_nemo.py). It is also included in this repo.
8
+
9
+ We thought that adapting a T5/UL2 model trained with Nemo to Huggingface format would simply be a matter of reversing the conversion performed by the script above. Our conversion script does work as long as we operate directly on the `pt` state dict weight files produced by running the above Nvidia script, i.e. it works when going `Huggingface -> Nemo -> Huggingface`. However, it does not work when going directly `Nemo -> Huggingface`: a UL2 model that was initialized with Nemo Megatron and pretrained with Nemo does not produce the same output after being converted to Huggingface format.
10
+
11
+ ### Dependencies
12
+
13
+ We use Nemo docker containers (tag `23.02`) via Singularity when running the code in this repo. We have included a definition file to build the container.
14
+
15
+ To build the container:
16
+
17
+ ```bash
18
+ sudo singularity build nemo2302.sif nemo_singularity.def
19
+ ```
20
+
21
+ We provide bash scripts to execute with Singularity. However, for easier debugging you can also run Singularity in interactive mode via:
22
+
23
+ ```bash
24
+ singularity shell --nv nemo2302.sif
25
+ ```
26
+
27
+ ### Converting Nemo checkpoints to Huggingface
28
+
29
+ We have included our conversion script in this repo. It can be found in `convert_nemo_ul2_checkpoint.py`.
30
+
31
+ We manually created a Huggingface config file for UL2 that, to the best of our knowledge, matches the settings used when we trained with Nemo (see `config_ul2_base_nl36.json`).
32
+
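+ As a quick sanity check (a minimal sketch, not part of our conversion pipeline), the config can be loaded into a randomly initialized Huggingface model and its parameter count compared against the count that `compare_weights_hf_nemo` in `convert_nemo_ul2_checkpoint.py` prints for the Nemo model:
+
+ ```python
+ from transformers import T5Config, T5ForConditionalGeneration
+
+ # Randomly initialized model, no weights loaded; only the architecture matters here.
+ config = T5Config.from_json_file("config_ul2_base_nl36.json")
+ model = T5ForConditionalGeneration(config)
+ print(f"Parameters: {sum(p.numel() for p in model.parameters())}")
+ ```
+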
33
+ To replicate our weights conversion, simply run:
34
+
35
+ ```bash
36
+ singularity exec --nv nemo2302.sif bash convert_nemo_to_hf.sh
37
+ ```
38
+
39
+ The resulting Huggingface model will be saved to `ul2-base-nl36-swedish/`.
40
+
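+ For reference, here is a minimal sketch of how the converted folder can be loaded and queried; the Swedish prompt and the `<extra_id_0>` sentinel are illustrative assumptions about the tokenizer and are not part of our scripts:
+
+ ```python
+ from transformers import T5ForConditionalGeneration, AutoTokenizer
+
+ model = T5ForConditionalGeneration.from_pretrained("ul2-base-nl36-swedish")
+ tokenizer = AutoTokenizer.from_pretrained("ul2-base-nl36-swedish")
+
+ # Hypothetical span-infilling prompt; adjust to the sentinel/mode tokens your tokenizer uses.
+ inputs = tokenizer("Huvudstaden i Sverige är <extra_id_0>.", return_tensors="pt")
+ outputs = model.generate(**inputs, max_new_tokens=10)
+ print(tokenizer.decode(outputs[0], skip_special_tokens=False))
+ ```
+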
41
+ We are aware that [Megatron-LM uses a different ordering of QKV](https://github.com/NVIDIA/Megatron-LM/blob/42c1cf4279acea5a554500dcb552211f44cbec45/megatron/checkpointing.py#L209-L237) in the attention layers depending on the Megatron-LM version used. We are also aware of an existing conversion script that Huggingface created for converting Megatron-BERT to Huggingface, where they adapt the ordering of QKV in Megatron to [match the ordering used in Huggingface](https://github.com/NVIDIA/Megatron-LM/blob/42c1cf4279acea5a554500dcb552211f44cbec45/megatron/checkpointing.py#L209-L237). As such, we have an optional `--fix_qkv` parameter in our conversion script that applies the same reordering of QKV as Huggingface does. See the lines that are commented out in `convert_nemo_to_hf.sh` for an example of how to use this parameter and set the `checkpoint_version`.
42
+
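+ For illustration, a minimal sketch of what the `checkpoint_version >= 2.0` branch of `fix_query_key_value_ordering` in `convert_nemo_ul2_checkpoint.py` does to the fused QKV weight; the tensor below is random and the sizes are those of our base config (12 heads, 64-dim heads):
+
+ ```python
+ import torch
+
+ num_heads, num_splits, kv_dim = 12, 3, 64  # num_splits=3 for fused Q, K, V
+ hidden_size = num_heads * kv_dim
+ qkv = torch.randn(num_splits * num_heads * kv_dim, hidden_size)
+
+ # checkpoint_version >= 2.0 stores rows as [num_heads, num_splits, kv_dim];
+ # permute to [num_splits, num_heads, kv_dim] so that slicing in blocks of
+ # hidden_size yields Q, then K, then V (as done in convert_nemo_to_hf).
+ reordered = (
+     qkv.view(num_heads, num_splits, kv_dim, hidden_size)
+     .transpose(0, 1)
+     .contiguous()
+     .view(num_splits * num_heads * kv_dim, hidden_size)
+ )
+ q_weights, k_weights, v_weights = reordered.chunk(3, dim=0)
+ ```
+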
43
+ Unfortunately, none of the above solves the issue we have with the conversion script.
44
+
45
+ We have a test script that runs predictions with both the original Nemo model and the converted Huggingface model. Unfortunately, the output is not the same, even though we used the identical tokenizer for both models. To run it:
46
+
47
+ ```bash
48
+ singularity exec --nv nemo2302.sif python test_ul2_hf.py
49
+ ```
50
+
51
+ Or explore in interactive mode with `singularity shell --nv nemo2302.sif`.
52
+
53
+ ### Confirming the conversion script can reverse Nvidia's conversion script
54
+
55
+ To confirm that our conversion script is at least able to reverse Nvidia's conversion script, we include instructions here to convert a UL2 model from Huggingface to Nemo via Nvidia's conversion script, and then back to Huggingface via our conversion script.
56
+
57
+ Instructions:
58
+
59
+ 1. Run `singularity exec --nv nemo2302.sif bash convert_hf_to_nemo.sh` to convert the existing [Finnish-NLP/ul2-base-nl36-finnish](https://huggingface.co/Finnish-NLP/ul2-base-nl36-finnish) model from Huggingface to Nemo format via Nvidia's conversion script. The resulting model weights will be saved to the folder `ul2-base-nl36-finnish/`.
60
+ 2. To perform the reverse conversion and check whether the re-converted weights are identical, run `python convert_finnish_ul2_model.py`, or via Singularity: `singularity exec --nv nemo2302.sif python convert_finnish_ul2_model.py`.
61
+
62
+ The resulting model, re-converted to Huggingface, will be found in `ul2-base-nl36-finnish/hf_t5_ul2`.
63
+
64
+ This conversion produces a model that is identical to the original model.
config_ul2_base_nl36.json ADDED
@@ -0,0 +1,32 @@
1
+ {
2
+ "_name_or_path": "./",
3
+ "architectures": [
4
+ "T5ForConditionalGeneration"
5
+ ],
6
+ "d_ff": 2048,
7
+ "d_kv": 64,
8
+ "d_model": 768,
9
+ "decoder_start_token_id": 0,
10
+ "dense_act_fn": "silu",
11
+ "dropout_rate": 0.1,
12
+ "eos_token_id": 1,
13
+ "feed_forward_proj": "gated-silu",
14
+ "initializer_factor": 1.0,
15
+ "is_encoder_decoder": true,
16
+ "is_gated_act": true,
17
+ "layer_norm_epsilon": 1e-06,
18
+ "model_type": "t5",
19
+ "n_positions": 512,
20
+ "num_decoder_layers": 36,
21
+ "num_heads": 12,
22
+ "num_layers": 36,
23
+ "output_past": true,
24
+ "pad_token_id": 0,
25
+ "relative_attention_max_distance": 128,
26
+ "relative_attention_num_buckets": 32,
27
+ "tie_word_embeddings": true,
28
+ "torch_dtype": "float16",
29
+ "transformers_version": "4.26.1",
30
+ "use_cache": true,
31
+ "vocab_size": 64384
32
+ }
config_ul2_finnish.json ADDED
@@ -0,0 +1,32 @@
1
+ {
2
+ "_name_or_path": "./",
3
+ "architectures": [
4
+ "T5ForConditionalGeneration"
5
+ ],
6
+ "d_ff": 3072,
7
+ "d_kv": 64,
8
+ "d_model": 768,
9
+ "decoder_start_token_id": 0,
10
+ "dense_act_fn": "gelu_new",
11
+ "dropout_rate": 0.1,
12
+ "eos_token_id": 1,
13
+ "feed_forward_proj": "gated-gelu",
14
+ "initializer_factor": 1.0,
15
+ "is_encoder_decoder": true,
16
+ "is_gated_act": true,
17
+ "layer_norm_epsilon": 1e-06,
18
+ "model_type": "t5",
19
+ "n_positions": 512,
20
+ "num_decoder_layers": 36,
21
+ "num_heads": 12,
22
+ "num_layers": 36,
23
+ "output_past": true,
24
+ "pad_token_id": 0,
25
+ "relative_attention_max_distance": 128,
26
+ "relative_attention_num_buckets": 32,
27
+ "tie_word_embeddings": false,
28
+ "torch_dtype": "float32",
29
+ "transformers_version": "4.22.1",
30
+ "use_cache": true,
31
+ "vocab_size": 32128
32
+ }
convert_finnish_ul2_model.py ADDED
@@ -0,0 +1,35 @@
1
+ import os
2
+ import torch
3
+ from convert_nemo_ul2_checkpoint import convert_nemo_to_hf
4
+ from transformers import T5ForConditionalGeneration, AutoTokenizer
5
+
6
+ #### Step 1: Convert the original HF model (previously converted to Nemo) back to HF weights
7
+ nemo_weights = torch.load("ul2-base-nl36-finnish/nemo_state_dict.pt")
8
+ hf_weights = convert_nemo_to_hf(nemo_weights)
9
+
10
+ #### Step 2: Load original HF model and save its config/tokenizer in local folder
11
+ hf_model = T5ForConditionalGeneration.from_pretrained("Finnish-NLP/ul2-base-nl36-finnish")
12
+ tokenizer = AutoTokenizer.from_pretrained("Finnish-NLP/ul2-base-nl36-finnish")
13
+
14
+ # Save tokenizer in ul2-base-nl36-finnish
15
+ tokenizer.save_pretrained("ul2-base-nl36-finnish/hf_t5_ul2")
16
+
17
+ # Save config in ul2-base-nl36-finnish
18
+ hf_model.config.save_pretrained("ul2-base-nl36-finnish/hf_t5_ul2")
19
+
20
+ #### Step 3: Save our converted weights to the local folder
21
+ # Save converted model weights in ul2-base-nl36-finnish
22
+ torch.save(hf_weights, os.path.join("ul2-base-nl36-finnish/hf_t5_ul2", "pytorch_model.bin"))
23
+
24
+
25
+ #### Step 4: Load the converted model from the local folder and check whether the weights are the same
26
+ converted_model = T5ForConditionalGeneration.from_pretrained("ul2-base-nl36-finnish/hf_t5_ul2")
27
+
28
+ equal = []
29
+ for key in hf_model.state_dict().keys():
30
+ print(key)
31
+ print(torch.allclose(hf_model.state_dict()[key], converted_model.state_dict()[key]))
32
+ equal.append(torch.allclose(hf_model.state_dict()[key], converted_model.state_dict()[key]))
33
+
34
+
35
+ print(f"All weights are equal: {all(equal)}")
convert_hf_to_nemo.sh ADDED
@@ -0,0 +1,9 @@
1
+ # singularity exec --nv nemo2302 bash convert_hf_to_nemo.sh
2
+
3
+ OUTPUT_FOLDER=ul2-base-nl36-finnish
4
+ mkdir -p $OUTPUT_FOLDER
5
+
6
+ python hf_t5_v1_1_to_nemo.py \
7
+ --hf_model_name Finnish-NLP/ul2-base-nl36-finnish \
8
+ --nemo_state_dict $OUTPUT_FOLDER/nemo_state_dict.pt \
9
+ --nemo_file_path $OUTPUT_FOLDER/nemo_file.nemo
convert_nemo_to_hf.sh ADDED
@@ -0,0 +1,14 @@
1
+ # singularity exec --nv nemo2302 bash convert_nemo_to_hf.sh
2
+
3
+
4
+ #### Convert a model pretrained from scratch in Nemo Megatron to Huggingface format
5
+ python convert_nemo_ul2_checkpoint.py \
6
+ --nemo_model_path=nemo_checkpoints/megatron_ul2--val_loss=2.54-step=7000-consumed_samples=14557920.0.ckpt \
7
+ --hf_config_path=config_ul2_base_nl36.json \
8
+ --output_path=ul2-base-nl36-swedish \
9
+ --hidden_size=768 \
10
+ # --num_heads=12 \
11
+ # --kv_dim=64 \
12
+ # --checkpoint_version=2.0 \
13
+ # --fix_qkv \
14
+ # --hf_model_path=ul2_base_nl36 \
convert_nemo_ul2_checkpoint.py ADDED
@@ -0,0 +1,550 @@
1
+ # Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
2
+ # Copyright (c) 2023, KBLab at the National Library of Sweden. All rights reserved.
3
+ #
4
+ # Licensed under the Apache License, Version 2.0 (the "License");
5
+ # you may not use this file except in compliance with the License.
6
+ # You may obtain a copy of the License at
7
+ #
8
+ # http://www.apache.org/licenses/LICENSE-2.0
9
+ #
10
+ # Unless required by applicable law or agreed to in writing, software
11
+ # distributed under the License is distributed on an "AS IS" BASIS,
12
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13
+ # See the License for the specific language governing permissions and
14
+ # limitations under the License.
15
+
16
+
17
+ """
18
+ Script to convert NeMo Megatron T5/UL2 model to Huggingface T5 model.
19
+ Based off of NVIDIA's conversion script at: https://github.com/NVIDIA/NeMo/blob/main/scripts/nlp_language_modeling/hf_t5-v1_1_to_nemo.py .
20
+ We reverse their conversion process.
21
+
22
+ NOTE: You may want to double check the conversion if you are using a custom config with share_decoder_tokens_head_embeddings=False.
23
+ """
24
+
25
+ import argparse
26
+ import os
27
+ import collections
28
+ import sys
29
+
30
+ import torch
31
+ from nemo.collections.nlp.models.language_modeling.megatron_t5_model import MegatronT5Model
32
+ from omegaconf.omegaconf import OmegaConf
33
+ from pytorch_lightning.trainer.trainer import Trainer
34
+ from transformers import AutoTokenizer, T5Config, T5ForConditionalGeneration
35
+
36
+ # Make hidden_size, num_heads, kv_dim configurable as args with argparse
37
+
38
+
39
+ def load_nemo_megatron_model(checkpoint_path, devices=1, num_nodes=1, accelerator="gpu"):
40
+ trainer = Trainer(devices=devices, num_nodes=num_nodes, accelerator=accelerator)
41
+ model = MegatronT5Model.load_from_checkpoint(checkpoint_path, trainer=trainer)
42
+
43
+ return model
44
+
45
+
46
+ def load_huggingface_t5_model(model_config_path):
47
+ """
48
+ # You need to configure config yourself based on your hparams during training
49
+ # See examples of UL2 Huggingface configs:
50
+ # https://huggingface.co/google/flan-ul2/blob/main/config.json
51
+ # https://huggingface.co/Finnish-NLP/ul2-base-nl36-finnish/blob/main/config.json
52
+ """
53
+ t5_config = T5Config.from_pretrained(model_config_path)
54
+ t5_model = T5ForConditionalGeneration(t5_config)
55
+
56
+ return t5_model
57
+
58
+
59
+ def _get_model_type_block_layer_hf(k):
60
+ """
61
+ Get info from Huggingface model block and layer names
62
+
63
+ Returns model_type, block number, layer number.
64
+ """
65
+ if k.startswith("encoder"):
66
+ model_type = "encoder"
67
+ elif k.startswith("decoder"):
68
+ model_type = "decoder"
69
+ else:
70
+ raise ValueError(f"Unknown model type for {k}")
71
+ return model_type, int(k.split(".")[2]), int(k.split(".")[4])
72
+
73
+
74
+ def _get_model_type_layer_nemo(k):
75
+ """
76
+ Get info from NeMo layer names.
77
+
78
+ Returns model_type, layer number.
79
+ 5th element in the split is the layer number.
80
+ """
81
+ print(k)
82
+ if "encoder" in k:
83
+ model_type = "encoder"
84
+ elif "decoder" in k:
85
+ model_type = "decoder"
86
+ else:
87
+ raise ValueError(f"Unknown model type for {k}")
88
+ return model_type, int(k.split(".")[5])
89
+
90
+
91
+ def fix_query_key_value_ordering(param, checkpoint_version, num_splits, num_heads, hidden_size):
92
+ # Permutes layout of param tensor to [num_splits * num_heads * hidden_size, :]
93
+ # for compatibility with later versions of NVIDIA Megatron-LM.
94
+ # The inverse operation is performed inside Megatron-LM to read checkpoints:
95
+ # https://github.com/NVIDIA/Megatron-LM/blob/v2.4/megatron/checkpointing.py#L209
96
+ # If param is the weight tensor of the self-attention block, the returned tensor
97
+ # will have to be transposed one more time to be read by HuggingFace BERT.
98
+ input_shape = param.size()
99
+ if checkpoint_version == 1.0:
100
+ # version 1.0 stores [num_heads * hidden_size * num_splits, :]
101
+ saved_shape = (num_heads, hidden_size, num_splits) + input_shape[1:]
102
+ param = param.view(*saved_shape)
103
+ param = param.transpose(0, 2)
104
+ param = param.transpose(1, 2).contiguous()
105
+ elif checkpoint_version >= 2.0:
106
+ # other versions store [num_heads * num_splits * hidden_size, :]
107
+ saved_shape = (num_heads, num_splits, hidden_size) + input_shape[1:]
108
+ param = param.view(*saved_shape)
109
+ param = param.transpose(0, 1).contiguous()
110
+ param = param.view(*input_shape)
111
+ return param
112
+
113
+
114
+ def convert_nemo_to_hf(
115
+ nemo_weights, fix_qkv_ordering=False, hidden_size=768, num_heads=12, kv_dim=64, checkpoint_version=2.0
116
+ ):
117
+ """
118
+ Convert NeMo Megatron T5/UL2 model to Huggingface T5 model.
119
+
120
+ Args:
121
+ nemo_weights (dict): NeMo model weights (state dict).
122
+ fix_qkv_ordering (bool): Whether to fix the query, key, value ordering in the self-attention blocks.
123
+ hidden_size (int): Hidden size of the model.
124
+ num_heads (int): Number of attention heads.
125
+ kv_dim (int): Projection weights dimension in multi-head attention. Generally: hidden_size // num_heads.
126
+ checkpoint_version (float): Megatron checkpoint version (No idea how to get this from the checkpoint itself).
127
+
128
+ Returns:
129
+ hf_weights (dict): Huggingface model weights (state dict).
130
+ """
131
+ print(f"Found {len(nemo_weights.keys())} keys in the NeMo checkpoint")
132
+
133
+ hf_weights = collections.OrderedDict()
134
+
135
+ for k, v in nemo_weights.items():
136
+ #################################################
137
+ ###### Enc-Dec Embeddings and Output Layer ######
138
+ #################################################
139
+ # Tied decoder embedding and decoder output layer.
140
+ if k == "enc_dec_model.decoder_embedding.word_embeddings.weight":
141
+ # shared.weight, lm_head.weight, decoder.embed_tokens.weight and encoder.embed_tokens.weight
142
+ # are the same in HF when tied_word_embeddings=True in T5Config.
143
+ # Corresponding setting in NeMo config: share_decoder_tokens_head_embeddings=True (share decoder vocab embeddings and decoder LM Head)
144
+ # and share_token_embeddings=True (share encoder/decoder vocab embeddings).
145
+ # Shared decoder embeddings and LM head yield best result according to: https://aclanthology.org/2021.emnlp-main.465.pdf#page=7 .
146
+
147
+ # Check if encoder and decoder token embeddings are the same.
148
+ is_shared_encdec = torch.allclose(
149
+ v, nemo_weights["enc_dec_model.encoder_embedding.word_embeddings.weight"]
150
+ )
151
+ if is_shared_encdec:
152
+ print("Found shared encoder and decoder embeddings")
153
+ hf_weights["shared.weight"] = v
154
+ else:
155
+ raise ValueError(
156
+ (
157
+ f"Found separate encoder and decoder embeddings in NeMo checkpoint. \n"
158
+ f"Not supported in T5 HF implementation. \n"
159
+ f"You should probably set 'share_token_embeddings' to True in your NeMo config. \n"
160
+ )
161
+ )
162
+
163
+ if k == "enc_dec_model.tokens_head.weight":
164
+ # This weight doesn't seem to exist in Nemo when share_decoder_tokens_head_embeddings=True.
165
+ # Don't worry though. If you set tie_word_embeddings=True in HF, this weight will be
166
+ # created automatically when loading the model in HF and tied to
167
+ # shared.weight / decoder.embed_tokens.weight.
168
+ hf_weights["lm_head.weight"] = v
169
+ print(f"Mapped {k} to lm_head.weight")
170
+
171
+ elif k == "enc_dec_model.tokens_head.bias":
172
+ # HF doesn't have a bias for lm_head.weight
173
+ raise ValueError(
174
+ (
175
+ f"Found bias for lm_head.weight in NeMo checkpoint. This is not supported in HF T5 implementation. \n"
176
+ f"You should probably set 'tokens_head_bias' to False in your NeMo config. \n"
177
+ f"If your checkpoint is from older version of Megatron, you may also need to set 'share_decoder_tokens_head_embeddings' to False in NeMo config. \n"
178
+ f"See: https://github.com/NVIDIA/NeMo/blob/557c4b7ae766faf050374e6b9a862e2e67385b10/nemo/collections/nlp/models/language_modeling/megatron_lm_encoder_decoder_model.py#L231-L236"
179
+ )
180
+ )
181
+ # hf_weights["lm_head.bias"] = v
182
+ # print(f"Mapped {k} to lm_head.bias")
183
+
184
+ # Decoder embeddings
185
+ elif k == "enc_dec_model.decoder_embedding.word_embeddings.weight":
186
+ hf_weights["decoder.embed_tokens.weight"] = v
187
+
188
+ elif k == "enc_dec_model.encoder_embedding.word_embeddings.weight":
189
+ hf_weights["encoder.embed_tokens.weight"] = v
190
+ print(f"Mapped {k} to encoder.embed_tokens.weight")
191
+
192
+ #################################################
193
+ ################# RPE Weights ###################
194
+ #################################################
195
+
196
+ elif k == "enc_dec_model.encoder_relative_position_embedding.relative_position_embedding.weight":
197
+ hf_weights["encoder.block.0.layer.0.SelfAttention.relative_attention_bias.weight"] = v
198
+ print(f"Mapped {k} to encoder.block.0.layer.0.SelfAttention.relative_attention_bias.weight")
199
+ elif k == "enc_dec_model.decoder_relative_position_embedding.relative_position_embedding.weight":
200
+ hf_weights["decoder.block.0.layer.0.SelfAttention.relative_attention_bias.weight"] = v
201
+ print(f"Mapped {k} to decoder.block.0.layer.0.SelfAttention.relative_attention_bias.weight")
202
+
203
+ #################################################
204
+ #################$ LayerNorm ####################
205
+ #################################################
206
+
207
+ # Block in HF corresponds to layer in NeMo.
208
+ # Layer in HF does not correspond to anything in NeMo.
209
+ # In Huggingface: Layer 0 is input layer norm, layer 1 is layer norm on self attn output,
210
+ # layer 2 is layer norm for cross attn output in decoder.
211
+
212
+ # In NeMo, some layernorm layers (final layernorms) don't have layer number in the name.
213
+ # We take care of these early so _get_model_type_layer_nemo function doesn't fail.
214
+
215
+ elif "layernorm" in k:
216
+ if "final" in k:
217
+ model_type = "encoder" if "encoder" in k else "decoder"
218
+
219
+ # Layer 2 in HF is always FFN + LayerNorm
220
+ hf_weights[f"{model_type}.final_layer_norm.weight"] = v
221
+ print(f"Mapped {k} to {model_type}.final_layer_norm.weight")
222
+
223
+ # if "bias" in k:
224
+ # hf_weights[f"{model_type}.block.final_layer_norm.bias"] = v
225
+ # print(f"Mapped {k} to {model_type}.block.final_layer_norm.bias")
226
+
227
+ else:
228
+ model_type, layer_number = _get_model_type_layer_nemo(k)
229
+
230
+ if "input_layernorm" in k and model_type == "encoder":
231
+ # Input layer norm is always layer 0 in HF
232
+ hf_weights[f"encoder.block.{layer_number}.layer.0.layer_norm.weight"] = v
233
+ print(f"Mapped {k} to encoder.block.{layer_number}.layer.0.layer_norm.weight")
234
+
235
+ # if "bias" in k:
236
+ # hf_weights[f"encoder.block.{layer_number}.layer.0.layer_norm.bias"] = v
237
+ # print(f"Mapped {k} to encoder.block.{layer_number}.layer.0.layer_norm.bias")
238
+
239
+ elif "post_attention_layernorm" in k and model_type == "encoder":
240
+ # Layer 1 in HF is layer norm for self attn output
241
+ hf_weights[f"{model_type}.block.{layer_number}.layer.1.layer_norm.weight"] = v
242
+ print(f"Mapped {k} to {model_type}.block.{layer_number}.layer.1.layer_norm.weight")
243
+
244
+ # if "bias" in k:
245
+ # hf_weights[f"{model_type}.block.{layer_number}.layer.1.layer_norm.bias"] = v
246
+ # print(f"Mapped {k} to {model_type}.block.{layer_number}.layer.1.layer_norm.bias")
247
+
248
+ elif "input_layernorm" in k and model_type == "decoder":
249
+ # Input layer norm is always layer 0 in HF
250
+ hf_weights[f"decoder.block.{layer_number}.layer.0.layer_norm.weight"] = v
251
+ print(f"Mapped {k} to decoder.block.{layer_number}.layer.0.layer_norm.weight")
252
+
253
+ # if "bias" in k:
254
+ # hf_weights[f"decoder.block.{layer_number}.layer.0.layer_norm.bias"] = v
255
+ # print(f"Mapped {k} to decoder.block.{layer_number}.layer.0.layer_norm.bias")
256
+
257
+ elif "post_attention_layernorm" in k and model_type == "decoder":
258
+ # Layer 1 in HF is layer norm for self attn output
259
+ hf_weights[f"{model_type}.block.{layer_number}.layer.1.layer_norm.weight"] = v
260
+ print(f"Mapped {k} to {model_type}.block.{layer_number}.layer.1.layer_norm.weight")
261
+
262
+ # if "bias" in k:
263
+ # hf_weights[f"{model_type}.block.{layer_number}.layer.1.layer_norm.bias"] = v
264
+ # print(f"Mapped {k} to {model_type}.block.{layer_number}.layer.1.layer_norm.bias")
265
+
266
+ elif "post_inter_attention_layernorm" in k and model_type == "decoder":
267
+ # Layer 2 in HF is layer norm for cross attn output
268
+ hf_weights[f"{model_type}.block.{layer_number}.layer.2.layer_norm.weight"] = v
269
+ print(f"Mapped {k} to {model_type}.block.{layer_number}.layer.2.layer_norm.weight")
270
+
271
+ # if "bias" in k:
272
+ # hf_weights[f"{model_type}.block.{layer_number}.layer.2.layer_norm.bias"] = v
273
+ # print(f"Mapped {k} to {model_type}.block.{layer_number}.layer.2.layer_norm.bias")
274
+ else:
275
+ raise ValueError("Unknown layer_norm key: {}".format(k))
276
+
277
+ #################################################
278
+ ############### Attention Layers ################
279
+ #################################################
280
+
281
+ # Self-Attention
282
+
283
+ # Q, K, V in NeMo-Megatron are bundled into a single matrix.
284
+ elif "self_attention.query_key_value.weight" in k:
285
+ # Example naming in HF:
286
+ # encoder.block.0.layer.0.SelfAttention.q.weight
287
+ # decoder.block.0.layer.0.SelfAttention.q.weight
288
+
289
+ # Model type is either "encoder" or "decoder"
290
+ model_type, layer_number = _get_model_type_layer_nemo(k)
291
+
292
+ if fix_qkv_ordering:
293
+ out_val = fix_query_key_value_ordering(
294
+ v, checkpoint_version=checkpoint_version, num_splits=3, num_heads=num_heads, hidden_size=kv_dim
295
+ )
296
+ else:
297
+ out_val = v
298
+
299
+ q_weights = out_val[0 * hidden_size : 1 * hidden_size, :]
300
+ k_weights = out_val[1 * hidden_size : 2 * hidden_size, :]
301
+ v_weights = out_val[2 * hidden_size : 3 * hidden_size, :]
302
+
303
+ # Layer 0 in HF is always self attn
304
+ hf_weights[f"{model_type}.block.{layer_number}.layer.0.SelfAttention.q.weight"] = q_weights
305
+ hf_weights[f"{model_type}.block.{layer_number}.layer.0.SelfAttention.k.weight"] = k_weights
306
+ hf_weights[f"{model_type}.block.{layer_number}.layer.0.SelfAttention.v.weight"] = v_weights
307
+
308
+ print(
309
+ (
310
+ f"Mapped {k} to: \n",
311
+ f"{model_type}.block.{layer_number}.layer.0.SelfAttention.q.weight \n",
312
+ f"{model_type}.block.{layer_number}.layer.0.SelfAttention.k.weight \n",
313
+ f"{model_type}.block.{layer_number}.layer.0.SelfAttention.v.weight \n",
314
+ )
315
+ )
316
+
317
+ # If we trained with bias=True in NeMo we will have bias terms for all weight matrices.
318
+ # Huggingface doesn't support optional bias terms in their T5 implementation.
319
+ elif "self_attention.query_key_value.bias" in k:
320
+ raise ValueError(
321
+ "Bias terms for most weights are not supported in Huggingface T5. Train with bias=False in NeMo config."
322
+ )
323
+
324
+ # Output self-attn matrix.
325
+ elif "self_attention.dense.weight" in k:
326
+ model_type, layer_number = _get_model_type_layer_nemo(k)
327
+ # Layer 0 in HF still always self attn
328
+ hf_weights[f"{model_type}.block.{layer_number}.layer.0.SelfAttention.o.weight"] = v
329
+ print(f"Mapped {k} to {model_type}.block.{layer_number}.layer.0.SelfAttention.o.weight")
330
+
331
+ # Cross-Attention projection matrices are merged into K, V matrices in NeMo-Megatron.
332
+ # Need to split them into K, V matrices in HF.
333
+ elif "inter_attention.key_value.weight" in k:
334
+ model_type, layer_number = _get_model_type_layer_nemo(k)
335
+
336
+ if fix_qkv_ordering:
337
+ out_val = fix_query_key_value_ordering(
338
+ v, checkpoint_version=checkpoint_version, num_splits=2, num_heads=num_heads, hidden_size=kv_dim
339
+ )
340
+ else:
341
+ out_val = v
342
+
343
+ # Layer 1 in HF is always cross attn
344
+ k_weights = out_val[0 * hidden_size : 1 * hidden_size, :]
345
+ v_weights = out_val[1 * hidden_size : 2 * hidden_size, :]
346
+ hf_weights[f"decoder.block.{layer_number}.layer.1.EncDecAttention.k.weight"] = k_weights
347
+ hf_weights[f"decoder.block.{layer_number}.layer.1.EncDecAttention.v.weight"] = v_weights
348
+ print(
349
+ (
350
+ f"Mapped {k} to: \n",
351
+ f"decoder.block.{layer_number}.layer.1.EncDecAttention.k.weight \n",
352
+ f"decoder.block.{layer_number}.layer.1.EncDecAttention.v.weight \n",
353
+ )
354
+ )
355
+
356
+ # Cross-Attention Q matrix is separate in NeMo-Megatron and HF.
357
+ elif "inter_attention.query.weight" in k:
358
+ model_type, layer_number = _get_model_type_layer_nemo(k)
359
+ # Layer 1 in HF is always cross attn
360
+ hf_weights[f"decoder.block.{layer_number}.layer.1.EncDecAttention.q.weight"] = v
361
+ print(f"Mapped {k} to decoder.block.{layer_number}.layer.1.EncDecAttention.q.weight")
362
+
363
+ # Output cross-attention matrix.
364
+ elif "inter_attention.dense.weight" in k:
365
+ model_type, layer_number = _get_model_type_layer_nemo(k)
366
+ # Layer 1 in HF is always cross attn
367
+ hf_weights[f"decoder.block.{layer_number}.layer.1.EncDecAttention.o.weight"] = v
368
+ print(f"Mapped {k} to decoder.block.{layer_number}.layer.1.EncDecAttention.o.weight")
369
+
370
+ #################################################
371
+ #################$ FFN Layers ###################
372
+ #################################################
373
+
374
+ elif "mlp.dense_h_to_4h.weight" in k:
375
+ model_type, layer_number = _get_model_type_layer_nemo(k)
376
+
377
+ if model_type == "encoder":
378
+ # FFN + LayerNorm is always layer 1 in HF encoder attention blocks.
379
+ hf_weights[f"{model_type}.block.{layer_number}.layer.1.DenseReluDense.wi_0.weight"] = v
380
+ print(f"Mapped {k} to {model_type}.block.{layer_number}.layer.1.DenseReluDense.wi_0.weight")
381
+ elif model_type == "decoder":
382
+ # FFN + LayerNorm is always layer 2 in HF decoder attention blocks.
383
+ hf_weights[f"{model_type}.block.{layer_number}.layer.2.DenseReluDense.wi_0.weight"] = v
384
+ print(f"Mapped {k} to {model_type}.block.{layer_number}.layer.2.DenseReluDense.wi_0.weight")
385
+
386
+ elif "mlp.dense_h_to_4h_2.weight" in k:
387
+ model_type, layer_number = _get_model_type_layer_nemo(k)
388
+
389
+ if model_type == "encoder":
390
+ # FFN + LayerNorm is always layer 1 in HF encoder attention blocks.
391
+ hf_weights[f"{model_type}.block.{layer_number}.layer.1.DenseReluDense.wi_1.weight"] = v
392
+ print(f"Mapped {k} to {model_type}.block.{layer_number}.layer.1.DenseReluDense.wi_1.weight")
393
+ elif model_type == "decoder":
394
+ # FFN + LayerNorm is always layer 2 in HF decoder attention blocks.
395
+ hf_weights[f"{model_type}.block.{layer_number}.layer.2.DenseReluDense.wi_1.weight"] = v
396
+ print(f"Mapped {k} to {model_type}.block.{layer_number}.layer.2.DenseReluDense.wi_1.weight")
397
+
398
+ elif "mlp.dense_4h_to_h.weight" in k:
399
+ model_type, layer_number = _get_model_type_layer_nemo(k)
400
+ # Layer 2 in HF is always FFN + LayerNorm
401
+ if model_type == "encoder":
402
+ # FFN + LayerNorm is always layer 1 in HF encoder attention blocks.
403
+ hf_weights[f"{model_type}.block.{layer_number}.layer.1.DenseReluDense.wo.weight"] = v
404
+ print(f"Mapped {k} to {model_type}.block.{layer_number}.layer.1.DenseReluDense.wo.weight")
405
+ elif model_type == "decoder":
406
+ # FFN + LayerNorm is always layer 2 in HF decoder attention blocks.
407
+ hf_weights[f"{model_type}.block.{layer_number}.layer.2.DenseReluDense.wo.weight"] = v
408
+ print(f"Mapped {k} to {model_type}.block.{layer_number}.layer.2.DenseReluDense.wo.weight")
409
+
410
+ else:
411
+ raise ValueError(f"Unknown key: {k}")
412
+
413
+ print("Done mapping weights. \n")
414
+ print(f"Total keys in converted Huggingface weight mapping: {len(hf_weights.keys())} \n")
415
+ return hf_weights
416
+
417
+
418
+ # singularity shell --nv data/nemo2302
419
+
420
+
421
+ def compare_weights_hf_nemo(model, hf_weights, hf_config_path, hf_model_path=None):
422
+ """
423
+ Compares the weights of a Huggingface initialized model against Nemo model converted to HF.
424
+ Prints if there are any missing keys that were expected but not mapped.
425
+ Also compares parameter count of HF initialized model against original unconverted Nemo model.
426
+
427
+ Args:
428
+ model: NeMo model
429
+ hf_weights: Dictionary of Huggingface weights
430
+ hf_config_path: Path to Huggingface config file to initialize model from.
431
+ hf_model_path: Path to Huggingface Hub or local HF model folder, if you alternatively want to
432
+ load/initialize from an existing model on HF Hub or disk (optional)
433
+ """
434
+
435
+ if hf_model_path:
436
+ # If user supplies a HF hub model path, or local converted model, we load the model from there.
437
+ hf_model = T5ForConditionalGeneration.from_pretrained(hf_model_path)
438
+ else:
439
+ # Otherwise, we load the model from the config.
440
+ hf_model = load_huggingface_t5_model(hf_config_path)
441
+
442
+ print(f"Total keys in converted Huggingface weight mapping: {len(hf_weights.keys())} \n")
443
+ print(f"Total keys in Huggingface model initialized from config or HF Hub: {len(hf_model.state_dict().keys())} \n")
444
+
445
+ # Count the number of parameters in the model
446
+ print(
447
+ f"Number of parameters in HF model initialized from config or HF hub: {sum(p.numel() for p in hf_model.parameters() if p.requires_grad)}"
448
+ )
449
+ # Number of parameters in Nemo model
450
+ print(f"Number of parameters in Nemo model: {sum(p.numel() for p in model.parameters() if p.requires_grad)} \n")
451
+
452
+ # Check the set difference between the two sets of model keys (model loaded from config and converted model)
453
+ print(
454
+ (
455
+ f"Keys in converted HF weight mapping but missing in HF model initialized from config.json: \n"
456
+ f"{set(hf_weights.keys()) - set(hf_model.state_dict().keys())} \n"
457
+ )
458
+ )
459
+ print(
460
+ (
461
+ f"Keys in HF model initialized from config.json but missing in converted HF weight mapping: \n"
462
+ f"{set(hf_model.state_dict().keys()) - set(hf_weights.keys())} \n"
463
+ )
464
+ )
465
+
466
+ print(
467
+ (
468
+ f"It is expected that lm_head.weight is missing from converted HF weight mapping \n"
469
+ f"if you have set share_decoder_tokens_head_embeddings=True in your Nemo config. \n"
470
+ f"This weight doesn't exist in Nemo, as it is shared with the decoder token embeddings. \n \n"
471
+ f"In Huggingface, weights for lm_head.weight and decoder token embeddings are generally duplicated \n"
472
+ f"in the state_dict. When missing, the lm_head.weight is automatically initialized from shared decoder \n"
473
+ f"token embeddings weights if your HF config.json has tie_word_embeddings=True."
474
+ )
475
+ )
476
+
477
+
478
+ if __name__ == "__main__":
479
+ parser = argparse.ArgumentParser(description="Convert Nemo T5/UL2 model to Huggingface T5/UL2 model")
480
+ parser.add_argument(
481
+ "--nemo_model_path",
482
+ type=str,
483
+ required=True,
484
+ help="Path to Nemo T5/UL2 model .ckpt file",
485
+ )
486
+ parser.add_argument(
487
+ "--hf_config_path",
488
+ type=str,
489
+ required=True,
490
+ help="Path to Huggingface T5 config.json",
491
+ )
492
+ parser.add_argument(
493
+ "--hf_model_path",
494
+ type=str,
495
+ required=False,
496
+ help="Path to Huggingface T5 model, local folder or HF hub model",
497
+ )
498
+ parser.add_argument(
499
+ "--output_path",
500
+ type=str,
501
+ required=True,
502
+ help="Folder to save converted Huggingface T5/UL2 model in",
503
+ )
504
+
505
+ parser.add_argument("--hidden_size", type=int, default=768, help="Hidden size of Nemo model")
506
+ parser.add_argument("--num_heads", type=int, default=12, help="Number of attention heads in Nemo model")
507
+ # Default False if --fix_qkv not specified
508
+ parser.add_argument("--fix_qkv", action="store_true", help="Fix QKV weights in converted HF model")
509
+ parser.add_argument("--checkpoint_version", type=float, default=2.0, help="Checkpoint version of Nemo model")
510
+ parser.add_argument(
511
+ "--kv_dim", type=int, default=64, help="Key/Value dimension of Nemo model. Typically hidden_size // num_heads"
512
+ )
513
+
514
+ args = parser.parse_args()
515
+
516
+ #### Convert Nemo T5/UL2 model to Huggingface T5/UL2 model
517
+ model = load_nemo_megatron_model(checkpoint_path=args.nemo_model_path)
518
+ nemo_weights = model.state_dict()
519
+
520
+ hf_weights = convert_nemo_to_hf(
521
+ nemo_weights=nemo_weights,
522
+ fix_qkv_ordering=args.fix_qkv,
523
+ hidden_size=args.hidden_size,
524
+ num_heads=args.num_heads,
525
+ kv_dim=args.kv_dim,
526
+ checkpoint_version=args.checkpoint_version,
527
+ )
528
+
529
+ # We trained with a HF tokenizer, we grab it from the Nemo model.
530
+ tokenizer = model.tokenizer.__dict__["tokenizer"]
531
+
532
+ # We manually create HF config.json that matches architecture of the nemo model
533
+ # (or grab one from existing model on HF Hub and modify where necessary).
534
+ # See example config.json
535
+ config = T5Config.from_json_file(args.hf_config_path)
536
+
537
+ # Save config
538
+ config.save_pretrained(args.output_path)
539
+ print(f"Saved config to {os.path.join(args.output_path, 'config.json')}")
540
+
541
+ # Save tokenizer
542
+ tokenizer.save_pretrained(args.output_path)
543
+ print(f"Saved tokenizer to {os.path.join(args.output_path, 'tokenizer.json')}")
544
+
545
+ # Save the converted weights to a file
546
+ torch.save(hf_weights, os.path.join(args.output_path, "pytorch_model.bin"))
547
+ print(f"Saved converted weights to {os.path.join(args.output_path, 'pytorch_model.bin')}")
548
+
549
+ # Sanity check
550
+ compare_weights_hf_nemo(model, hf_weights, hf_config_path=args.hf_config_path, hf_model_path=args.hf_model_path)
hf_t5_v1_1_to_nemo.py ADDED
@@ -0,0 +1,387 @@
1
+ # Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
2
+ #
3
+ # Licensed under the Apache License, Version 2.0 (the "License");
4
+ # you may not use this file except in compliance with the License.
5
+ # You may obtain a copy of the License at
6
+ #
7
+ # http://www.apache.org/licenses/LICENSE-2.0
8
+ #
9
+ # Unless required by applicable law or agreed to in writing, software
10
+ # distributed under the License is distributed on an "AS IS" BASIS,
11
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
+ # See the License for the specific language governing permissions and
13
+ # limitations under the License.
14
+
15
+ """
16
+ This script generates a NeMo-Megatron compatible `.nemo` file for a Huggingface T5-v1_1 model.
17
+ List of Huggingface models that this script can convert:
18
+ 1. google/t5-v1_1-small
19
+ 2. google/t5-v1_1-base
20
+ 3. google/t5-v1_1-large
21
+ 4. google/t5-v1_1-xl
22
+ 5. google/t5-v1_1-xxl
23
+ 6. google/mt5-small
24
+ 7. google/mt5-base
25
+ 8. google/mt5-large
26
+ 9. google/mt5-xl
27
+ 10. google/mt5-xxl
28
+ 11. google/ul2
29
+ 13. bigscience/T0pp
30
+ 14. google/t5-small-lm-adapt
31
+ 15. google/t5-base-lm-adapt
32
+ 16. google/t5-large-lm-adapt
33
+ 17. google/t5-xl-lm-adapt
34
+ 18. google/t5-xxl-lm-adapt
35
+ 19. google/flan-t5-small
36
+ 20. google/flan-t5-base
37
+ 21. google/flan-t5-large
38
+ 22. google/flan-t5-xl
39
+ 23. google/flan-t5-xxl
40
+ Use instructions:
41
+ python hf_t5-v1_1_to_nemo.py \
42
+ --hf_model_name bigscience/T0pp \
43
+ --nemo_state_dict /path/to/nemo_state_dict.pt \
44
+ --nemo_file_path /path/to/nemo_file.nemo
45
+ """
46
+ import collections
47
+ import os
48
+ import tempfile
49
+ from argparse import ArgumentParser
50
+
51
+ import torch
52
+ from omegaconf.omegaconf import OmegaConf, open_dict
53
+ from pytorch_lightning import Trainer
54
+ from transformers import AutoTokenizer, T5ForConditionalGeneration
55
+
56
+ from nemo.collections.nlp.models.language_modeling.megatron_t5_model import MegatronT5Model
57
+ from nemo.collections.nlp.parts.nlp_overrides import NLPDDPStrategy, NLPSaveRestoreConnector
58
+
59
+ try:
60
+ import accelerate
61
+ except ImportError:
62
+ raise ImportError("Please install accelerate package via `pip install accelerate` to use this script.")
63
+
64
+
65
+ def convert_weights(hf_model, nemo_state_dict_path):
66
+ if hf_model == "google/ul2":
67
+ torch_dtype = torch.bfloat16
68
+ else:
69
+ torch_dtype = torch.float32
70
+ hf_model = T5ForConditionalGeneration.from_pretrained(hf_model, low_cpu_mem_usage=True, torch_dtype=torch_dtype)
71
+ hf_model_config = hf_model.config
72
+ with tempfile.TemporaryDirectory() as tmp:
73
+ torch.save(hf_model.state_dict(), os.path.join(tmp, "model.pt"))
74
+ hf_weights = torch.load(os.path.join(tmp, "model.pt"))
75
+
76
+ nemo_weights = collections.OrderedDict()
77
+
78
+ print(f"Found {len(hf_weights.keys())} keys in the checkpoint")
79
+
80
+ def _get_model_type_block_layer(k):
81
+ if k.startswith("encoder"):
82
+ model_type = "encoder"
83
+ elif k.startswith("decoder"):
84
+ model_type = "decoder"
85
+ else:
86
+ raise ValueError(f"Unknown model type for {k}")
87
+
88
+ return model_type, int(k.split(".")[2]), int(k.split(".")[4])
89
+
90
+ for k, v in hf_weights.items():
91
+ #################################################
92
+ ###### Enc-Dec Embeddings and Output Layer ######
93
+ #################################################
94
+ # Tied decoder embedding and decoder output layer.
95
+ if k == "shared.weight":
96
+ pass
97
+
98
+ elif k == "lm_head.weight":
99
+ nemo_weights["enc_dec_model.tokens_head.weight"] = v
100
+ print(
101
+ f"Mapped {k} to enc_dec_model.decoder_embedding.word_embeddings.weight and enc_dec_model.tokens_head.weight"
102
+ )
103
+
104
+ # Decoder embeddings
105
+ elif k == "decoder.embed_tokens.weight":
106
+ nemo_weights["enc_dec_model.decoder_embedding.word_embeddings.weight"] = v
107
+
108
+ elif k == "encoder.embed_tokens.weight":
109
+ nemo_weights["enc_dec_model.encoder_embedding.word_embeddings.weight"] = v
110
+ print(f"Mapped {k} to enc_dec_model.encoder_embedding.word_embeddings.weight")
111
+
112
+ #################################################
113
+ ################# RPE Weights ###################
114
+ #################################################
115
+
116
+ elif k == "encoder.block.0.layer.0.SelfAttention.relative_attention_bias.weight":
117
+ nemo_weights["enc_dec_model.encoder_relative_position_embedding.relative_position_embedding.weight"] = v
118
+ print(
119
+ f"Mapped {k} to enc_dec_model.encoder_relative_position_embedding.relative_position_embedding.weight"
120
+ )
121
+
122
+ elif k == "decoder.block.0.layer.0.SelfAttention.relative_attention_bias.weight":
123
+ nemo_weights["enc_dec_model.decoder_relative_position_embedding.relative_position_embedding.weight"] = v
124
+ print(
125
+ f"Mapped {k} to enc_dec_model.decoder_relative_position_embedding.relative_position_embedding.weight"
126
+ )
127
+
128
+ # Block in HF corresponds to layer in NeMo.
129
+ # Layer in HF does not correspond to anything in NeMo. Layer 0 is self attn, layer 1 is cross-attn.
130
+
131
+ #################################################
132
+ ############### Attention Layers ################
133
+ #################################################
134
+
135
+ # Self-Attention
136
+
137
+ # Q, K, V in NeMo-Megatron are bundled into a single matrix.
138
+ elif "SelfAttention.q.weight" in k:
139
+ model_type, block_number, layer_number = _get_model_type_block_layer(k)
140
+ k_weight = hf_weights[k.replace("q.weight", "k.weight")]
141
+ v_weight = hf_weights[k.replace("q.weight", "v.weight")]
142
+ concat_weights = torch.cat([v, k_weight, v_weight], dim=0)
143
+ nemo_weights[
144
+ f"enc_dec_model.enc_dec_model.{model_type}.model.layers.{block_number}.self_attention.query_key_value.weight"
145
+ ] = concat_weights
146
+ print(
147
+ f"Mapped {k} to enc_dec_model.enc_dec_model.{model_type}.model.layers.{block_number}.self_attention.query_key_value.weight"
148
+ )
149
+
150
+ # We can skip processing of k, v weights since we already concat them into qkv above.
151
+ elif "SelfAttention.k.weight" in k or "SelfAttention.v.weight" in k:
152
+ pass
153
+
154
+ # Output self-attn matrix.
155
+ elif "SelfAttention.o.weight" in k:
156
+ model_type, block_number, layer_number = _get_model_type_block_layer(k)
157
+ block_number = int(k.split(".")[2]) # Block in HF corresponds to layer in NeMo.
158
+ layer_number = int(
159
+ k.split(".")[4]
160
+ ) # Layer in HF does not correspond to anything in NeMo. Layer 0 is self attn, layer 1 is cross-attn.
161
+ nemo_weights[
162
+ f"enc_dec_model.enc_dec_model.{model_type}.model.layers.{block_number}.self_attention.dense.weight"
163
+ ] = v
164
+ print(
165
+ f"Mapped {k} to enc_dec_model.enc_dec_model.{model_type}.model.layers.{block_number}.self_attention.dense.weight"
166
+ )
167
+
168
+ # Cross-Attention projection matrices are merged into K, V matrices in NeMo-Megatron
169
+ elif "EncDecAttention.k.weight" in k:
170
+ model_type, block_number, layer_number = _get_model_type_block_layer(k)
171
+ v_weight = hf_weights[k.replace("k.weight", "v.weight")]
172
+ concat_weights = torch.cat([v, v_weight], dim=0)
173
+ nemo_weights[
174
+ f"enc_dec_model.enc_dec_model.decoder.model.layers.{block_number}.inter_attention.key_value.weight"
175
+ ] = concat_weights
176
+ print(
177
+ f"Mapped {k} to enc_dec_model.enc_dec_model.decoder.model.layers.{block_number}.inter_attention.key_value.weight"
178
+ )
179
+
180
+ # We can skip processing of v weights since we already concat them with k above.
181
+ elif "EncDecAttention.v.weight" in k:
182
+ pass
183
+
184
+ # Cross-Attention Q matrix is separate in NeMo-Megatron
185
+ elif "EncDecAttention.q.weight" in k:
186
+ model_type, block_number, layer_number = _get_model_type_block_layer(k)
187
+ nemo_weights[
188
+ f"enc_dec_model.enc_dec_model.decoder.model.layers.{block_number}.inter_attention.query.weight"
189
+ ] = v
190
+ print(
191
+ f"Mapped {k} to enc_dec_model.enc_dec_model.decoder.model.layers.{block_number}.inter_attention.query.weight"
192
+ )
193
+
194
+ # Cross-Attention Q matrix is separate in NeMo-Megatron
195
+ elif "EncDecAttention.o.weight" in k:
196
+ model_type, block_number, layer_number = _get_model_type_block_layer(k)
197
+ nemo_weights[
198
+ f"enc_dec_model.enc_dec_model.decoder.model.layers.{block_number}.inter_attention.dense.weight"
199
+ ] = v
200
+ print(
201
+ f"Mapped {k} to enc_dec_model.enc_dec_model.decoder.model.layers.{block_number}.inter_attention.dense.weight"
202
+ )
203
+
204
+ #################################################
205
+ #################$ FFN Layers ###################
206
+ #################################################
207
+
208
+ elif "DenseReluDense.wi_0.weight" in k:
209
+ model_type, block_number, layer_number = _get_model_type_block_layer(k)
210
+ nemo_weights[
211
+ f"enc_dec_model.enc_dec_model.{model_type}.model.layers.{block_number}.mlp.dense_h_to_4h.weight"
212
+ ] = v
213
+ print(
214
+ f"Mapped {k} to enc_dec_model.enc_dec_model.{model_type}.model.layers.{block_number}.mlp.dense_h_to_4h.weight"
215
+ )
216
+
217
+ elif "DenseReluDense.wi_1.weight" in k:
218
+ model_type, block_number, layer_number = _get_model_type_block_layer(k)
219
+ nemo_weights[
220
+ f"enc_dec_model.enc_dec_model.{model_type}.model.layers.{block_number}.mlp.dense_h_to_4h_2.weight"
221
+ ] = v
222
+ print(
223
+ f"Mapped {k} to enc_dec_model.enc_dec_model.{model_type}.model.layers.{block_number}.mlp.dense_h_to_4h_2.weight"
224
+ )
225
+
226
+ elif "DenseReluDense.wo.weight" in k:
227
+ model_type, block_number, layer_number = _get_model_type_block_layer(k)
228
+ nemo_weights[
229
+ f"enc_dec_model.enc_dec_model.{model_type}.model.layers.{block_number}.mlp.dense_4h_to_h.weight"
230
+ ] = v
231
+ print(
232
+ f"Mapped {k} to enc_dec_model.enc_dec_model.{model_type}.model.layers.{block_number}.mlp.dense_4h_to_h.weight"
233
+ )
234
+
235
+ #################################################
236
+ #################$ LayerNorm ####################
237
+ #################################################
238
+
239
+ elif "layer_norm" in k:
240
+ if "final" in k:
241
+ model_type = "encoder" if k.startswith("encoder") else "decoder"
242
+ nemo_weights[f"enc_dec_model.enc_dec_model.{model_type}.model.final_layernorm.weight"] = v
243
+ print(f"Mapped {k} to enc_dec_model.enc_dec_model.{model_type}.model.final_layernorm.weight")
244
+ else:
245
+ model_type, block_number, layer_number = _get_model_type_block_layer(k)
246
+ if layer_number == 0 and model_type == "encoder":
247
+ nemo_weights[
248
+ f"enc_dec_model.enc_dec_model.{model_type}.model.layers.{block_number}.input_layernorm.weight"
249
+ ] = v
250
+ print(
251
+ f"Mapped {k} to enc_dec_model.enc_dec_model.{model_type}.model.layers.{block_number}.input_layernorm.weight"
252
+ )
253
+ elif layer_number == 1 and model_type == "encoder":
254
+ nemo_weights[
255
+ f"enc_dec_model.enc_dec_model.{model_type}.model.layers.{block_number}.post_attention_layernorm.weight"
256
+ ] = v
257
+ print(
258
+ f"Mapped {k} to enc_dec_model.enc_dec_model.{model_type}.model.layers.{block_number}.post_attention_layernorm.weight"
259
+ )
260
+ elif layer_number == 0 and model_type == "decoder":
261
+ nemo_weights[
262
+ f"enc_dec_model.enc_dec_model.{model_type}.model.layers.{block_number}.input_layernorm.weight"
263
+ ] = v
264
+ print(
265
+ f"Mapped {k} to enc_dec_model.enc_dec_model.{model_type}.model.layers.{block_number}.input_layernorm.weight"
266
+ )
267
+ elif layer_number == 1 and model_type == "decoder":
268
+ nemo_weights[
269
+ f"enc_dec_model.enc_dec_model.{model_type}.model.layers.{block_number}.post_attention_layernorm.weight"
270
+ ] = v
271
+ print(
272
+ f"Mapped {k} to enc_dec_model.enc_dec_model.{model_type}.model.layers.{block_number}.post_attention_layernorm.weight"
273
+ )
274
+ elif layer_number == 2 and model_type == "decoder":
275
+ nemo_weights[
276
+ f"enc_dec_model.enc_dec_model.{model_type}.model.layers.{block_number}.post_inter_attention_layernorm.weight"
277
+ ] = v
278
+ print(
279
+ f"Mapped {k} to enc_dec_model.enc_dec_model.{model_type}.model.layers.{block_number}.post_inter_attention_layernorm.weight"
280
+ )
281
+ else:
282
+ raise ValueError("Unknown layer_norm key: {}".format(k))
283
+ else:
284
+ raise ValueError(f"Unknown key: {k}")
285
+
286
+ torch.save(nemo_weights, nemo_state_dict_path)
287
+ print("Saved weights to {}".format(nemo_state_dict_path))
288
+ return hf_model_config
289
+
290
+
291
+ def package_into_nemo_file(
292
+ state_dict_path, base_yaml_config, hf_model_config, nemo_file_path, hf_model_name, megatron_amp_O2
293
+ ):
294
+ """
295
+ Packages the state dict, config file and tokenizer into a `.nemo` file.
296
+ """
297
+ trainer = Trainer(devices=1, strategy=NLPDDPStrategy(), accelerator="cpu", precision=32)
298
+ base_cfg = OmegaConf.load(base_yaml_config)
299
+ if hf_model_config.dense_act_fn == "silu":
300
+ act_fn = "swiglu"
301
+ elif hf_model_config.dense_act_fn == "gelu_new":
302
+ act_fn = "geglu"
303
+ # FLAN-T5 models have things configured this way.
304
+ elif hf_model_config.dense_act_fn == "gelu" and hf_model_config.is_gated_act:
305
+ act_fn = "geglu"
306
+ else:
307
+ raise ValueError(f"Unknown dense_act_fn: {hf_model_config.dense_act_fn}")
308
+
309
+ with open_dict(base_cfg):
310
+ base_cfg.encoder.num_layers = hf_model_config.num_layers
311
+ base_cfg.encoder.hidden_size = hf_model_config.d_model
312
+ base_cfg.encoder.ffn_hidden_size = hf_model_config.d_ff
313
+ base_cfg.encoder.kv_channels = hf_model_config.d_kv
314
+ base_cfg.encoder.num_attention_heads = hf_model_config.num_heads
315
+ base_cfg.encoder.activation = act_fn
316
+ base_cfg.encoder.relative_attention_num_buckets = hf_model_config.relative_attention_num_buckets
317
+
318
+ base_cfg.decoder.num_layers = hf_model_config.num_decoder_layers
319
+ base_cfg.decoder.hidden_size = hf_model_config.d_model
320
+ base_cfg.decoder.ffn_hidden_size = hf_model_config.d_ff
321
+ base_cfg.decoder.kv_channels = hf_model_config.d_kv
322
+ base_cfg.decoder.num_attention_heads = hf_model_config.num_heads
323
+ base_cfg.decoder.activation = act_fn
324
+ base_cfg.decoder.relative_attention_num_buckets = hf_model_config.relative_attention_num_buckets
325
+
326
+ base_cfg.megatron_amp_O2 = megatron_amp_O2
327
+
328
+ with tempfile.TemporaryDirectory() as tmp:
329
+ tokenizer = AutoTokenizer.from_pretrained(hf_model_name)
330
+ tokenizer_path = tokenizer.save_vocabulary(tmp)[0]
331
+ base_cfg.tokenizer.model = tokenizer_path
332
+ model = MegatronT5Model(base_cfg, trainer).to("cpu")
333
+ model._save_restore_connector = NLPSaveRestoreConnector()
334
+ state_dict = torch.load(state_dict_path)
335
+ if megatron_amp_O2:
336
+ new_state_dict = {}
337
+ for key in state_dict.keys():
338
+ new_key = key.replace("model.", "model.module.", 1)
339
+ new_state_dict[new_key] = state_dict[key]
340
+ state_dict = new_state_dict
341
+ model.load_state_dict(state_dict)
342
+ model.save_to(nemo_file_path)
343
+
344
+
345
+ if __name__ == "__main__":
346
+ parser = ArgumentParser()
347
+ parser.add_argument(
348
+ "--hf_model_name",
349
+ type=str,
350
+ required=True,
351
+ help="Valid Huggingface T5v1_1 model name ex: google/t5-v1_1-large or google/ul2. Example something that can be loaded with T5ForConditionalGeneration.from_pretrained()",
352
+ )
353
+ parser.add_argument(
354
+ "--nemo_state_dict_path",
355
+ type=str,
356
+ required=True,
357
+ help="Path to write the intermediate nemo state dict file ex: /path/to/nemo_state_dict.pt",
358
+ )
359
+ parser.add_argument(
360
+ "--nemo_file_path",
361
+ type=str,
362
+ required=True,
363
+ help="Path to write the converted .nemo file ex: /path/to/t5_base_converted_to_nemo.nemo",
364
+ )
365
+ parser.add_argument(
366
+ "--base_yaml_config",
367
+ type=str,
368
+ default="hf_t5v1_1_base_config.yaml",
369
+ help="Path to a base yaml config that we edit based on the provided model.",
370
+ )
371
+ parser.add_argument(
372
+ "--megatron_amp_O2",
373
+ action="store_true",
374
+ help="Whether to store O2 weights. This may be useful for models like ul2 where only pre-trained half precision weights were released.",
375
+ )
376
+ args = parser.parse_args()
377
+ if not os.path.exists(args.base_yaml_config):
378
+ raise FileNotFoundError(f"Base yaml config file {args.base_yaml_config} does not exist.")
379
+ hf_model_config = convert_weights(args.hf_model_name, args.nemo_state_dict_path)
380
+ package_into_nemo_file(
381
+ state_dict_path=args.nemo_state_dict_path,
382
+ base_yaml_config=args.base_yaml_config,
383
+ hf_model_config=hf_model_config,
384
+ nemo_file_path=args.nemo_file_path,
385
+ hf_model_name=args.hf_model_name,
386
+ megatron_amp_O2=args.megatron_amp_O2,
387
+ )
hf_t5v1_1_base_config.yaml ADDED
@@ -0,0 +1,143 @@
+ encoder:
+   num_layers: 8
+   hidden_size: 512
+   ffn_hidden_size: 1024
+   num_attention_heads: 6
+   init_method_std: 0.02
+   hidden_dropout: 0.0
+   attention_dropout: 0.0
+   ffn_dropout: 0.0
+   position_embedding_type: relative
+   relative_attention_num_buckets: 32
+   relative_attention_max_distance: 128
+   relative_position_bias_self_attention_only: true
+   kv_channels: 64
+   apply_query_key_layer_scaling: false
+   layernorm_epsilon: 1.0e-06
+   persist_layer_norm: true
+   bias_activation_fusion: false
+   grad_div_ar_fusion: true
+   masked_softmax_fusion: false
+   bias_dropout_add_fusion: false
+   bias: false
+   normalization: rmsnorm
+   arch: transformer
+   activation: geglu
+   headscale: false
+   transformer_block_type: pre_ln
+   hidden_steps: 32
+   num_self_attention_per_cross_attention: 1
+   openai_gelu: true
+   onnx_safe: false
+   fp32_residual_connection: false
+   activations_checkpoint_method: null
+   activations_checkpoint_num_layers: 1
+   megatron_legacy: true
+   normalize_attention_scores: false
+ decoder:
+   num_layers: 8
+   hidden_size: 512
+   ffn_hidden_size: 1024
+   num_attention_heads: 6
+   init_method_std: 0.02
+   hidden_dropout: 0.0
+   attention_dropout: 0.0
+   ffn_dropout: 0.0
+   position_embedding_type: relative
+   relative_attention_num_buckets: 32
+   relative_attention_max_distance: 128
+   relative_position_bias_self_attention_only: true
+   kv_channels: 64
+   apply_query_key_layer_scaling: false
+   layernorm_epsilon: 1.0e-06
+   persist_layer_norm: true
+   bias_activation_fusion: false
+   grad_div_ar_fusion: true
+   masked_softmax_fusion: false
+   bias_dropout_add_fusion: false
+   bias: false
+   normalization: rmsnorm
+   arch: transformer
+   activation: geglu
+   headscale: false
+   transformer_block_type: pre_ln
+   hidden_steps: 32
+   num_self_attention_per_cross_attention: 1
+   openai_gelu: true
+   onnx_safe: false
+   fp32_residual_connection: false
+   activations_checkpoint_method: null
+   activations_checkpoint_num_layers: 1
+   megatron_legacy: true
+   normalize_attention_scores: false
+ micro_batch_size: 4
+ global_batch_size: 8
+ tensor_model_parallel_size: 1
+ pipeline_model_parallel_size: 1
+ resume_from_checkpoint: null
+ pipeline_model_parallel_split_rank: 0
+ make_vocab_size_divisible_by: 128
+ megatron_amp_O2: false
+ grad_allreduce_chunk_size_mb: 125
+ grad_div_ar_fusion: true
+ gradient_as_bucket_view: true
+ seq_length: 512
+ max_position_embeddings: 512
+ tokenizer:
+   library: sentencepiece
+   type: null
+   model: nemo:ce65b6d8f4fb4975955e935db699cba3_t5_small_tokenizer.model
+   vocab_file: null
+   merge_file: null
+   num_sentinel_tokens: 100
+   sentencepiece_legacy: true
+   add_sentinel_tokens_in_reverse_order: true
+   add_sentinel_tokens_first: true
+ embedding_init_method_std: 0.02
+ embedding_dropout: 0.1
+ share_token_embeddings: true
+ share_decoder_tokens_head_embeddings: false
+ tokens_head_bias: false
+ native_amp_init_scale: 4294967296
+ native_amp_growth_interval: 1000
+ fp16_lm_cross_entropy: false
+ seed: 1234
+ use_cpu_initialization: false
+ apex_transformer_log_level: 30
+ data:
+   data_prefix: null
+   index_mapping_dir: null
+   data_impl: mmap
+   splits_string: 949,45,5
+   seq_length: 512
+   seq_length_dec: 128
+   skip_warmup: true
+   num_workers: 0
+   dataloader_type: single
+   masked_lm_prob: 0.15
+   dataset_type: t5
+   short_seq_prob: 0.0
+   max_ngram_size: 10
+   mean_ngram_size: null
+   geometric_dist: true
+   permutation: false
+   whole_word_masking: false
+   favor_longer_ngrams: false
+   respect_document_boundaries: true
+ optim:
+   name: fused_adam
+   lr: 0.0001
+   betas:
+     - 0.9
+     - 0.999
+   eps: 1.0e-08
+   weight_decay: 0.01
+   sched:
+     name: WarmupAnnealing
+     min_lr: 1.0e-05
+     last_epoch: -1
+     warmup_ratio: 0.01
+ precision: bf16
+ target: nemo.collections.nlp.models.language_modeling.megatron_t5_model.MegatronT5Model
+ nemo_version: 1.11.0rc0
+ library: huggingface-t5v1_1 # options ['huggingface-t5v1_1', 'nemo-megatron']
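
For reference, a small sketch of loading the base yaml above with OmegaConf (the config library Nemo uses) and inspecting a few of the fields the conversion script edits; the field names are taken directly from the file above, and the edited value is only illustrative:

```python
from omegaconf import OmegaConf

# Load the base config shipped in this repo and inspect a few fields.
cfg = OmegaConf.load("hf_t5v1_1_base_config.yaml")
print(cfg.encoder.num_layers, cfg.encoder.megatron_legacy, cfg.precision)

# Illustrative edit before packaging into a .nemo file.
cfg.encoder.num_layers = 36
print(OmegaConf.to_yaml(cfg.encoder))
```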
nemo_checkpoints/megatron_ul2--val_loss=2.54-step=7000-consumed_samples=14557920.0.ckpt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:cbce1edc1a23b6f3db975f8bf876ca9d32a3d86a0018a594fd96bfaffbcbf261
+ size 7730365530
nemo_checkpoints/megatron_ul2--val_loss=6.59-step=150-consumed_samples=309920.0-last.ckpt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:fbeabc36b95c4a017afbfae0baeefbca0a5bf3445a24bd6706c9ae7b22327df0
+ size 7730362572
nemo_config/ul2-base-nl36/megatron.ul2-base-nl36.unigram-64k-pretok-small_data.all-clean.config.yaml ADDED
@@ -0,0 +1,195 @@
+ defaults:
+   - .@model.encoder: megatron_model_ul2base_config
+   - .@model.decoder: megatron_model_ul2base_config
+
+ name: megatron_ul2
+ restore_from_path: null # used when starting from a .nemo file
+
+ trainer:
+   devices: 1
+   num_nodes: 1
+   accelerator: gpu
+   precision: 16
+   logger: False # logger provided by exp_manager
+   enable_checkpointing: False
+   replace_sampler_ddp: False
+   max_epochs: -1 # PTL default. In practice, max_steps will be reached first.
+   max_steps: 524288 # consumed_samples = global_step * micro_batch_size * data_parallel_size * accumulate_grad_batches
+   log_every_n_steps: 100
+   val_check_interval: 1000
+   limit_val_batches: 30
+   limit_test_batches: 500
+   accumulate_grad_batches: 1
+   gradient_clip_val: 1.0
+
+ exp_manager:
+   explicit_log_dir: null
+   exp_dir: /project/scratch/p200097/nemo_experiments/
+   name: megatron.ul2-base-nl36.unigram-64k-pretok-small_data.all-clean
+   create_wandb_logger: False
+   wandb_logger_kwargs:
+     project: null
+     name: null
+   resume_if_exists: True
+   resume_ignore_no_checkpoint: True
+   create_checkpoint_callback: True
+   checkpoint_callback_params:
+     monitor: val_loss
+     save_top_k: 10
+     mode: min
+     always_save_nemo: False # saves nemo file during validation, not implemented for model parallel
+     filename: '${name}--{val_loss:.2f}-{step}-{consumed_samples}'
+     model_parallel_size: ${multiply:${model.tensor_model_parallel_size}, ${model.pipeline_model_parallel_size}}
+
+ model:
+   # model parallelism
+   micro_batch_size: 10
+   # 4 GPUS * 24 nodes = 96 GPUS
+   # 96 GPUS * 7 micro_batch_size = 672 batch_size
+   # 672 * 3 = 2016 global_batch_size
+   global_batch_size: 2080 # will use more micro batches to reach global batch size
+   tensor_model_parallel_size: 1
+   pipeline_model_parallel_size: 1
+   resume_from_checkpoint: null # manually set the checkpoint file to load from
+   pipeline_model_parallel_split_rank: 0 # rank at which decoder starts.
+
+   # model architecture
+   make_vocab_size_divisible_by: 128 # Pad the vocab size to be divisible by this value for computation efficiency.
+
+   megatron_amp_O2: False # use AMP with O2 style mixed precision instead of native amp on-the-fly weight autocasting.
+   grad_allreduce_chunk_size_mb: 125
+   grad_div_ar_fusion: True # Fuse grad division into torch.distributed.all_reduce
+   gradient_as_bucket_view: True # Allocate gradients in a contiguous bucket to save memory (less fragmentation and buffer memory)
+
+   seq_length: 512
+   max_position_embeddings: ${.seq_length}
+
+
+   tokenizer:
+     library: 'huggingface'
+     type: 'KBLab/unigram-64k-pretok-small_data-tokenizer'
+     model: null
+     vocab_file: null
+     merge_file: null
+     num_sentinel_tokens: 256
+     sentencepiece_legacy: True # Legacy=True allows you to add special tokens to sentencepiece tokenizers.
+
+   # tokenizer:
+   #   library: 'megatron'
+   #   type: 'BertWordPieceCase'
+   #   model: null
+   #   vocab_file: null
+   #   merge_file: null
+   #   num_sentinel_tokens: 100
+   #   sentencepiece_legacy: True # Legacy=True allows you to add special tokens to sentencepiece tokenizers.
+
+   # weight init
+   embedding_init_method_std: 0.02 # Standard deviation of the zero mean normal distribution used for weight initialization.
+
+   # embedding dropout
+   embedding_dropout: 0.1
+
+   # embedding sharing
+   share_token_embeddings: True # If True share encoder/decoder embeddings
+   share_decoder_tokens_head_embeddings: True # If True share decoder embeddings and decoder projection to logits
+
+   # token head
+   tokens_head_bias: False
+
+   # precision
+   native_amp_init_scale: 4294967296 # 2 ** 32
+   native_amp_growth_interval: 1000
+   fp16_lm_cross_entropy: False # Move the cross entropy unreduced loss calculation for lm head to fp16
+
+   # miscellaneous
+   seed: 1234
+   use_cpu_initialization: False # Init weights on the CPU (slow for large models)
+   apex_transformer_log_level: 30 # Python logging level displays logs with severity greater than or equal to this
+
+   data:
+     # Path to data must be specified by the user.
+     # Can be overridden from the CLI: "model.data.data_prefix=[.5,/raid/data/pile/my-t5_00_text_document,.5,/raid/data/pile/my-t5_01_text_document]",
+     # Or see example below:
+     # data_prefix:
+     #   - .5
+     #   - /raid/data/pile/my-t5_00_text_document
+     #   - .5
+     #   - /raid/data/pile/my-t5_01_text_document
+     data_prefix:
+       - 0.005
+       - /project/scratch/p200097/data/unigram-64k-pretok-small_data/wikipedia-unigram-64k-pretok-small_data_text_sentence
+       - 0.035
+       - /project/scratch/p200097/data/unigram-64k-pretok-small_data/edepos_html-unigram-64k-pretok-small_data_text_sentence
+       - 0.030
+       - /project/scratch/p200097/data/unigram-64k-pretok-small_data/oscar-unigram-64k-pretok-small_data_text_sentence
+       - 0.105
+       - /project/scratch/p200097/data/unigram-64k-pretok-small_data/kw3-2017-unigram-64k-pretok-small_data_text_sentence
+       - 0.177
+       - /project/scratch/p200097/data/unigram-64k-pretok-small_data/issues-unigram-64k-pretok-small_data_text_sentence
+       - 0.648
+       - /project/scratch/p200097/data/unigram-64k-pretok-small_data/mc4-unigram-64k-pretok-small_data_text_sentence
+     index_mapping_dir: /project/scratch/p200097/data/unigram-64k-pretok-small_data/npy_files_ul2/ # path to save index mapping .npy files, by default will save in the same location as data_prefix
+     data_impl: mmap
+     # data_impl_kwargs: # currently used only for text_mmap, csv_mmap (should be data_impl dependent)
+     #   # defaults for text_memmap
+     #   newline_int: 10 # byte-value of newline (Use ord('\n') to get value)
+     #   header_lines: 0 # skip first N header lines
+     #   workers: null # number of workers when creating missing index files (null defaults to cpu_num // 2)
+     #   sort_dataset_paths: False # if True datasets will be sorted by name
+     #   # defaults for csv_memmap
+     #   newline_int: 10 # byte-value of newline
+     #   header_lines: 1 # skip first N header lines
+     #   workers: null # number of workers when creating missing index files (null defaults to cpu_num // 2)
+     #   sort_dataset_paths: False # if True datasets will be sorted by name
+     #   data_col: 1 # column to use for data
+     #   data_sep: ',' # string to split text into columns
+     splits_string: 996,2,2
+     seq_length: ${model.seq_length}
+     seq_length_dec: ${model.seq_length}
+     skip_warmup: True
+     num_workers: 32
+     dataloader_type: single # cyclic
+     masked_lm_prob: 0.15
+     extreme_masked_lm_prob: 0.5
+     dataset_type: 'ul2'
+     short_seq_prob: 0.0
+     max_ngram_size: 10
+     extreme_max_ngram_size: 128
+     extreme_min_ngram_size: 32
+     extreme_mean_ngram_size: 64
+     ngram_span_length_distribution: 'geometric'
+     extreme_ngram_span_length_distribution: 'truncated_normal'
+     prefix_lm_pivot_mean: 0.25
+     mean_ngram_size: 3
+     permutation: False
+     whole_word_masking: True
+     favor_longer_ngrams: False
+     respect_document_boundaries: True # If true, a single training example cannot cross document boundaries, increasing the fraction of <pad> tokens within a batch.
+
+   optim:
+     name: fused_adam
+     lr: 0.001
+     weight_decay: 0.01
+     betas:
+       - 0.9
+       - 0.999
+     eps: 1e-8
+     sched:
+       name: CosineAnnealing
+       warmup_steps: 1600
+       constant_steps: 30000 #40000
+       min_lr: 5e-6
+
+   # optim:
+   #   name: fused_adam
+   #   lr: 0.0001
+   #   betas:
+   #     - 0.9
+   #     - 0.999
+   #   eps: 1e-8
+   #   weight_decay: 0.01
+   #   sched:
+   #     name: WarmupAnnealing
+   #     min_lr: 0.00001
+   #     last_epoch: -1
+   #     warmup_ratio: 0.005
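
As a rough sanity check, the consumed sample counts in the checkpoint filenames line up with the global batch size configured above (see the comment on `max_steps`):

```python
# consumed_samples ~= global_step * global_batch_size; data parallelism and
# gradient accumulation are already folded into global_batch_size here.
global_batch_size = 2080
print(14557920 / global_batch_size)  # 6999.0 -> the step=7000 checkpoint
print(309920 / global_batch_size)    # 149.0  -> the step=150 checkpoint
```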
nemo_config/ul2-base-nl36/megatron_model_ul2base_config.yaml ADDED
@@ -0,0 +1,40 @@
+ num_layers: 36 # For perceiver models, this is the number of cross-attention blocks. Each layer has 1 cross-attention and "num_self_attention_per_cross_attention" self-attention layers.
+ hidden_size: 768
+ ffn_hidden_size: 2048 # Transformer FFN hidden size. Usually 4 * hidden_size. Since we use Swiglu, which uses an extra projection weight matrix, we use 2/3 * 4 * hidden_size (see https://arxiv.org/abs/2002.05202)
+ num_attention_heads: 12
+ init_method_std: 0.02 # Standard deviation of the zero mean normal distribution used for weight initialization.
+ hidden_dropout: 0.0 # Dropout probability for hidden state transformer. "Dropout is set to 0 during pretraining" - UL2 paper
+ attention_dropout: 0.0 # Dropout probability in the attention layer. "Dropout is set to 0 during pretraining" - UL2 paper
+ ffn_dropout: 0.0 # Dropout probability in the feed-forward layer. "Dropout is set to 0 during pretraining" - UL2 paper
+ position_embedding_type: 'relative' # Position embedding type. Options ['learned_absolute', 'relative', 'alibi']
+ relative_attention_num_buckets: 32 # Relative position number of buckets for computing the bias
+ relative_attention_max_distance: 128 # max_distance to keep relative distance in the attention_num_buckets.
+ relative_position_bias_self_attention_only: True # whether to use relative position bias for self-attention only.
+ kv_channels: null # Projection weights dimension in multi-head attention. Set to hidden_size // num_attention_heads if null
+ apply_query_key_layer_scaling: True # scale Q * K^T by 1 / layer-number.
+ layernorm_epsilon: 1e-5
+ persist_layer_norm: True # Use of persistent fused layer norm kernel.
+ bias_activation_fusion: False # Use a kernel that fuses the bias addition from weight matrices with the subsequent activation function.
+ grad_div_ar_fusion: True # Fuse grad division into torch.distributed.all_reduce
+ masked_softmax_fusion: True # Use a kernel that fuses the attention softmax with its mask.
+ bias_dropout_add_fusion: False # Use a kernel that fuses the bias addition, dropout and residual connection addition.
+ bias: False # Whether to use bias terms in all weight matrices.
+ normalization: 'rmsnorm' # Normalization layer to use. Options are 'layernorm', 'rmsnorm'
+ arch: 'transformer' # Options: ['transformer', 'perceiver']
+ activation: 'swiglu' # Options ['gelu', 'geglu', 'swiglu', 'reglu', 'squared-relu', 'fast-geglu', 'fast-swiglu', 'fast-reglu']
+ headscale: False # Whether to learn extra parameters that scale the output of each self-attention head.
+ transformer_block_type: 'pre_ln' # Options ['pre_ln', 'post_ln', 'normformer']
+ hidden_steps: 32 # Number of latent vectors to use for perceiver encoders
+ num_self_attention_per_cross_attention: 1 # Number of self-attention layers for every cross-attention layer.
+ openai_gelu: False # Use OpenAI's GELU instead of the default GeLU
+ onnx_safe: False # Use work-arounds for known problems with Torch ONNX exporter.
+ fp32_residual_connection: False # Use FP32 for residual connections.
+ activations_checkpoint_method: null # 'uniform', 'block'
+ activations_checkpoint_num_layers: 1
+ activations_checkpoint_granularity: null # SELECTIVE: https://github.com/NVIDIA/NeMo/pull/4380
+ megatron_legacy: False # Whether to use the legacy Megatron model. This affects the way q,k,v is partitioned from the mixed q,k,v layer in ParallelAttention. This needs to be True for models converted from HF.
+ normalize_attention_scores: True # Whether to scale the output Q * K^T by 1 / sqrt(hidden_size_per_head). This arg is provided as a configuration option mostly for compatibility with models that have been weight-converted from HF. You almost always want to set this to True.
+ num_moe_experts: 1 # When >1, FFNs are changed to MoE layers
+ moe_frequency: 1 # every Nth ffn layer will be made MoE
+ moe_dropout: 0.0 # Dropout value for MoE layers
+ # https://github.com/NVIDIA/NeMo/blob/main/scripts/nlp_language_modeling/hf_t5v1_1_base_config.yaml
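
The `ffn_hidden_size` comment above can be checked directly: with a gated activation such as SwiGLU, the usual 4 * hidden_size feed-forward width is scaled by 2/3 to keep the parameter count comparable, and `kv_channels: null` falls back to hidden_size // num_attention_heads:

```python
# Quick check of the derived sizes in the config above.
hidden_size = 768
num_attention_heads = 12
print(int(2 / 3 * 4 * hidden_size))        # 2048 -> ffn_hidden_size
print(hidden_size // num_attention_heads)  # 64   -> kv_channels when left as null
```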
nemo_singularity.def ADDED
@@ -0,0 +1,11 @@
+ BootStrap: docker
+ From: nvcr.io/nvidia/nemo:23.02
+
+ %environment
+     export LC_ALL=C
+
+ %post
+     cd /usr/local/lib/python3.8/dist-packages/nemo/collections/nlp/data/language_modeling/megatron
+     make
+     pip install accelerate
+
test_ul2_hf.py ADDED
@@ -0,0 +1,62 @@
+ import torch
+ from transformers import AutoTokenizer, T5ForConditionalGeneration, T5Tokenizer
+ from nemo.collections.nlp.models.language_modeling.megatron_t5_model import MegatronT5Model
+ from nemo.collections.nlp.data.language_modeling.megatron.ul2_dataset import UL2Dataset
+ from pytorch_lightning.trainer.trainer import Trainer
+
+
+ def load_nemo_megatron_model(checkpoint_path, devices=1, num_nodes=1, accelerator="gpu"):
+     trainer = Trainer(devices=devices, num_nodes=num_nodes, accelerator=accelerator)
+     model = MegatronT5Model.load_from_checkpoint(checkpoint_path, trainer=trainer)
+
+     return model
+
+
+ #### Huggingface ####
+ tokenizer = AutoTokenizer.from_pretrained("ul2-base-nl36-swedish")
+ model = T5ForConditionalGeneration.from_pretrained("ul2-base-nl36-swedish")
+
+ # "Hunden bet mannen i" means "The dog bit the man in".
+ input_ids = tokenizer(
+     "<extra_id_r> Hunden bet mannen i <extra_id_0>", return_tensors="pt", return_token_type_ids=False
+ )
+ # Predict with HF
+ with torch.no_grad():
+     outputs_hf = model(
+         input_ids=input_ids.input_ids,
+         attention_mask=input_ids.attention_mask,
+         decoder_input_ids=input_ids.input_ids,
+         decoder_attention_mask=input_ids.attention_mask,
+     )
+
+
+ # Argmax to get the most probable token id
+ output_tokens_hf = outputs_hf[0].argmax(dim=-1)
+
+ #### Nemo ####
+ model_nemo = load_nemo_megatron_model("nemo_checkpoints/megatron_ul2--val_loss=2.54-step=7000-consumed_samples=14557920.0.ckpt")
+ model_nemo.eval()
+
+ tokenizer_nemo = model_nemo.tokenizer.tokenizer
+ input_ids_nemo = tokenizer_nemo("<extra_id_r> Hunden bet mannen i <extra_id_0>", return_tensors="pt").to("cuda")
+
+ # Predict with Nemo
+ with torch.no_grad():
+     outputs_nemo = model_nemo(
+         encoder_input_ids=input_ids_nemo.input_ids,
+         decoder_input_ids=input_ids_nemo.input_ids,
+         encoder_attn_mask=input_ids_nemo.attention_mask,
+         decoder_attn_mask=input_ids_nemo.attention_mask,
+     )
+ # Argmax to get the most probable token
+ output_tokens = outputs_nemo.argmax(dim=-1)
+
+
+ #### Compare both outputs ####
+ print(f"Nemo logits: {outputs_nemo[0]}")
+ print(f"Huggingface logits: {outputs_hf[0]}")
+ print(f"Are logits equal: {torch.allclose(outputs_nemo[0], outputs_hf[0].to('cuda'))}")
+
+ # Decode tokens
+ print(f"Huggingface output: {tokenizer.batch_decode(output_tokens_hf)}")
+ print(f"Nemo output: {tokenizer_nemo.batch_decode(output_tokens)}")  # Reasonable output for undertrained model