Faton Rekathati committed on
Commit
0fd282e
1 Parent(s): e8619ad

UL2 conversion instructions

README.md ADDED
@@ -0,0 +1,64 @@
1
+ ## Checkpoints and conversion scripts for Nemo ckpt files to Huggingface
2
+
3
+ This repo contains two checkpoints (`.ckpt` files) for UL2 models we have started pretraining with Nemo. The checkpoints are found in `nemo_checkpoints/`. The Nemo config files used to train these models can be found in `nemo_config/ul2-base-nl36`.
4
+
5
+ `megatron_ul2--val_loss=2.54-step=7000-consumed_samples=14557920.0.ckpt` was trained with `megatron_legacy: False` in the config, whereas the other checkpoint was trained with `megatron_legacy: True`.
6
+
7
+ Nvidia have created a conversion script that converts T5, T5v1.1 and UL2 models on Huggingface Hub to Nemo format. The script can be found [here](https://github.com/NVIDIA/NeMo/blob/main/scripts/nlp_language_modeling/hf_t5-v1_1_to_nemo.py). It is also included in this repo.
8
+
9
+ We thought that adapting a T5/UL2 model trained with Nemo to Huggingface format would simply be a matter of reversing the conversion performed by the script above. Our conversion script does work as long as we operate directly on the `pt` state dict weight files produced by running the above Nvidia script, i.e. it works when going `Huggingface -> Nemo -> Huggingface`. However, it does not work when going directly `Nemo -> Huggingface`: a UL2 model that was initialized with Nemo Megatron and pretrained with Nemo does not produce the same output after being converted to Huggingface format.
10
+
11
+ ### Dependencies
12
+
13
+ We use Nemo docker containers (tag `23.02`) via Singularity when running the code in this repo. We have included a definition file to build the container.
14
+
15
+ To build the container:
16
+
17
+ ```bash
18
+ sudo singularity build nemo2302.sif nemo_singularity.def
19
+ ```
20
+
21
+ We provide bash scripts to execute with Singularity. However, for easier debugging you can also run Singularity in interactive mode via:
22
+
23
+ ```bash
24
+ singularity shell --nv nemo2302.sif
25
+ ```
26
+
27
+ ### Converting Nemo checkpoints to Huggingface
28
+
29
+ We have included our conversion script in this repo. It can be found in `convert_nemo_ul2_checkpoint.py`.
30
+
31
+ We manually created a Huggingface config file for UL2 that, to the best of our knowledge, matches the settings used when we trained with Nemo (see `config_ul2_base_nl36.json`).
32
+
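+ As a quick sanity check (a minimal sketch, not part of our conversion pipeline), the config can be loaded into a randomly initialized Huggingface model and its parameter count compared against the count that `compare_weights_hf_nemo` in `convert_nemo_ul2_checkpoint.py` prints for the Nemo model:
+
+ ```python
+ from transformers import T5Config, T5ForConditionalGeneration
+
+ # Randomly initialized model, no weights loaded; only the architecture matters here.
+ config = T5Config.from_json_file("config_ul2_base_nl36.json")
+ model = T5ForConditionalGeneration(config)
+ print(f"Parameters: {sum(p.numel() for p in model.parameters())}")
+ ```
+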
33
+ To replicate our weights conversion, simply run:
34
+
35
+ ```bash
36
+ singularity exec --nv nemo2302.sif bash convert_nemo_to_hf.sh
37
+ ```
38
+
39
+ The resulting Huggingface model will be saved to `ul2-base-nl36-swedish/`.
40
+
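+ For reference, here is a minimal sketch of how the converted folder can be loaded and queried; the Swedish prompt and the `<extra_id_0>` sentinel are illustrative assumptions about the tokenizer and are not part of our scripts:
+
+ ```python
+ from transformers import T5ForConditionalGeneration, AutoTokenizer
+
+ model = T5ForConditionalGeneration.from_pretrained("ul2-base-nl36-swedish")
+ tokenizer = AutoTokenizer.from_pretrained("ul2-base-nl36-swedish")
+
+ # Hypothetical span-infilling prompt; adjust to the sentinel/mode tokens your tokenizer uses.
+ inputs = tokenizer("Huvudstaden i Sverige är <extra_id_0>.", return_tensors="pt")
+ outputs = model.generate(**inputs, max_new_tokens=10)
+ print(tokenizer.decode(outputs[0], skip_special_tokens=False))
+ ```
+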
41
+ We are aware that [Megatron-LM uses a different ordering of QKV](https://github.com/NVIDIA/Megatron-LM/blob/42c1cf4279acea5a554500dcb552211f44cbec45/megatron/checkpointing.py#L209-L237) in the attention layers depending on the Megatron-LM version used. We are also aware of an existing conversion script that Huggingface created for converting Megatron-BERT to Huggingface, where they adapt the ordering of QKV in Megatron to [match the ordering used in Huggingface](https://github.com/NVIDIA/Megatron-LM/blob/42c1cf4279acea5a554500dcb552211f44cbec45/megatron/checkpointing.py#L209-L237). As such, we have an optional `--fix_qkv` parameter in our conversion script that applies the same reordering of QKV as Huggingface does. See the lines that are commented out in `convert_nemo_to_hf.sh` for an example of how to use this parameter and set the `checkpoint_version`.
42
+
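+ For illustration, a minimal sketch of what the `checkpoint_version >= 2.0` branch of `fix_query_key_value_ordering` in `convert_nemo_ul2_checkpoint.py` does to the fused QKV weight; the tensor below is random and the sizes are those of our base config (12 heads, 64-dim heads):
+
+ ```python
+ import torch
+
+ num_heads, num_splits, kv_dim = 12, 3, 64  # num_splits=3 for fused Q, K, V
+ hidden_size = num_heads * kv_dim
+ qkv = torch.randn(num_splits * num_heads * kv_dim, hidden_size)
+
+ # checkpoint_version >= 2.0 stores rows as [num_heads, num_splits, kv_dim];
+ # permute to [num_splits, num_heads, kv_dim] so that slicing in blocks of
+ # hidden_size yields Q, then K, then V (as done in convert_nemo_to_hf).
+ reordered = (
+     qkv.view(num_heads, num_splits, kv_dim, hidden_size)
+     .transpose(0, 1)
+     .contiguous()
+     .view(num_splits * num_heads * kv_dim, hidden_size)
+ )
+ q_weights, k_weights, v_weights = reordered.chunk(3, dim=0)
+ ```
+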
43
+ Unfortunately, none of the above solves the issue we have with the conversion script.
44
+
45
+ We have a test script that runs predictions with both the original Nemo model and the converted Huggingface model. Unfortunately, the output is not the same, even though we used the identical tokenizer for both models. To run it:
46
+
47
+ ```bash
48
+ singularity exec --nv nemo2302.sif python test_ul2_hf.py
49
+ ```
50
+
51
+ Or explore in interactive mode with `singularity shell --nv nemo2302.sif`.
52
+
53
+ ### Confirming the conversion script can reverse Nvidia's conversion script
54
+
55
+ To confirm that our conversion script is at least able to reverse Nvidia's conversion script, we include instructions here to convert a UL2 model from Huggingface to Nemo via Nvidia's conversion script, and then back to Huggingface via our conversion script.
56
+
57
+ Instructions:
58
+
59
+ 1. Run `singularity exec --nv nemo2302.sif bash convert_hf_to_nemo.sh` to convert the existing [Finnish-NLP/ul2-base-nl36-finnish](https://huggingface.co/Finnish-NLP/ul2-base-nl36-finnish) model from Huggingface to Nemo format via Nvidia's conversion script. The resulting model weights will be saved to the folder `ul2-base-nl36-finnish/`.
60
+ 2. To perform the reverse conversion and check whether the re-converted weights are identical, run `python convert_finnish_ul2_model.py`, or via Singularity: `singularity exec --nv nemo2302.sif python convert_finnish_ul2_model.py`.
61
+
62
+ The resulting model, re-converted to Huggingface, will be found in `ul2-base-nl36-finnish/hf_t5_ul2`.
63
+
64
+ This conversion produces a model that is identical to the original model.
config_ul2_base_nl36.json ADDED
@@ -0,0 +1,32 @@
1
+ {
2
+ "_name_or_path": "./",
3
+ "architectures": [
4
+ "T5ForConditionalGeneration"
5
+ ],
6
+ "d_ff": 2048,
7
+ "d_kv": 64,
8
+ "d_model": 768,
9
+ "decoder_start_token_id": 0,
10
+ "dense_act_fn": "silu",
11
+ "dropout_rate": 0.1,
12
+ "eos_token_id": 1,
13
+ "feed_forward_proj": "gated-silu",
14
+ "initializer_factor": 1.0,
15
+ "is_encoder_decoder": true,
16
+ "is_gated_act": true,
17
+ "layer_norm_epsilon": 1e-06,
18
+ "model_type": "t5",
19
+ "n_positions": 512,
20
+ "num_decoder_layers": 36,
21
+ "num_heads": 12,
22
+ "num_layers": 36,
23
+ "output_past": true,
24
+ "pad_token_id": 0,
25
+ "relative_attention_max_distance": 128,
26
+ "relative_attention_num_buckets": 32,
27
+ "tie_word_embeddings": true,
28
+ "torch_dtype": "float16",
29
+ "transformers_version": "4.26.1",
30
+ "use_cache": true,
31
+ "vocab_size": 64384
32
+ }
config_ul2_finnish.json ADDED
@@ -0,0 +1,32 @@
1
+ {
2
+ "_name_or_path": "./",
3
+ "architectures": [
4
+ "T5ForConditionalGeneration"
5
+ ],
6
+ "d_ff": 3072,
7
+ "d_kv": 64,
8
+ "d_model": 768,
9
+ "decoder_start_token_id": 0,
10
+ "dense_act_fn": "gelu_new",
11
+ "dropout_rate": 0.1,
12
+ "eos_token_id": 1,
13
+ "feed_forward_proj": "gated-gelu",
14
+ "initializer_factor": 1.0,
15
+ "is_encoder_decoder": true,
16
+ "is_gated_act": true,
17
+ "layer_norm_epsilon": 1e-06,
18
+ "model_type": "t5",
19
+ "n_positions": 512,
20
+ "num_decoder_layers": 36,
21
+ "num_heads": 12,
22
+ "num_layers": 36,
23
+ "output_past": true,
24
+ "pad_token_id": 0,
25
+ "relative_attention_max_distance": 128,
26
+ "relative_attention_num_buckets": 32,
27
+ "tie_word_embeddings": false,
28
+ "torch_dtype": "float32",
29
+ "transformers_version": "4.22.1",
30
+ "use_cache": true,
31
+ "vocab_size": 32128
32
+ }
convert_finnish_ul2_model.py ADDED
@@ -0,0 +1,35 @@
1
+ import os
2
+ import torch
3
+ from convert_nemo_ul2_checkpoint import convert_nemo_to_hf
4
+ from transformers import T5ForConditionalGeneration, AutoTokenizer
5
+
6
+ #### Step 1: Convert the original HF model (previously converted to Nemo) back to HF weights
7
+ nemo_weights = torch.load("ul2-base-nl36-finnish/nemo_state_dict.pt")
8
+ hf_weights = convert_nemo_to_hf(nemo_weights)
9
+
10
+ #### Step 2: Load original HF model and save its config/tokenizer in local folder
11
+ hf_model = T5ForConditionalGeneration.from_pretrained("Finnish-NLP/ul2-base-nl36-finnish")
12
+ tokenizer = AutoTokenizer.from_pretrained("Finnish-NLP/ul2-base-nl36-finnish")
13
+
14
+ # Save tokenizer in ul2-base-nl36-finnish
15
+ tokenizer.save_pretrained("ul2-base-nl36-finnish/hf_t5_ul2")
16
+
17
+ # Save config in ul2-base-nl36-finnish
18
+ hf_model.config.save_pretrained("ul2-base-nl36-finnish/hf_t5_ul2")
19
+
20
+ #### Step 3: Save our converted weights to the local folder
21
+ # Save converted model weights in ul2-base-nl36-finnish
22
+ torch.save(hf_weights, os.path.join("ul2-base-nl36-finnish/hf_t5_ul2", "pytorch_model.bin"))
23
+
24
+
25
+ #### Step 4: Load the converted model from the local folder and check whether the weights are the same
26
+ converted_model = T5ForConditionalGeneration.from_pretrained("ul2-base-nl36-finnish/hf_t5_ul2")
27
+
28
+ equal = []
29
+ for key in hf_model.state_dict().keys():
30
+ print(key)
31
+ print(torch.allclose(hf_model.state_dict()[key], converted_model.state_dict()[key]))
32
+ equal.append(torch.allclose(hf_model.state_dict()[key], converted_model.state_dict()[key]))
33
+
34
+
35
+ print(f"All weights are equal: {all(equal)}")
convert_hf_to_nemo.sh ADDED
@@ -0,0 +1,9 @@
1
+ # singularity exec --nv nemo2302 bash convert_hf_to_nemo.sh
2
+
3
+ OUTPUT_FOLDER=ul2-base-nl36-finnish
4
+ mkdir -p $OUTPUT_FOLDER
5
+
6
+ python hf_t5_v1_1_to_nemo.py \
7
+ --hf_model_name Finnish-NLP/ul2-base-nl36-finnish \
8
+ --nemo_state_dict $OUTPUT_FOLDER/nemo_state_dict.pt \
9
+ --nemo_file_path $OUTPUT_FOLDER/nemo_file.nemo
convert_nemo_to_hf.sh ADDED
@@ -0,0 +1,14 @@
1
+ # singularity exec --nv nemo2302 bash convert_nemo_to_hf.sh
2
+
3
+
4
+ #### Convert a model pretrained from scratch in Nemo Megatron to Huggingface format
5
+ python convert_nemo_ul2_checkpoint.py \
6
+ --nemo_model_path=nemo_checkpoints/megatron_ul2--val_loss=2.54-step=7000-consumed_samples=14557920.0.ckpt \
7
+ --hf_config_path=config_ul2_base_nl36.json \
8
+ --output_path=ul2-base-nl36-swedish \
9
+ --hidden_size=768 \
10
+ # --num_heads=12 \
11
+ # --kv_dim=64 \
12
+ # --checkpoint_version=2.0 \
13
+ # --fix_qkv \
14
+ # --hf_model_path=ul2_base_nl36 \
convert_nemo_ul2_checkpoint.py ADDED
@@ -0,0 +1,550 @@
1
+ # Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
2
+ # Copyright (c) 2023, KBLab at the National Library of Sweden. All rights reserved.
3
+ #
4
+ # Licensed under the Apache License, Version 2.0 (the "License");
5
+ # you may not use this file except in compliance with the License.
6
+ # You may obtain a copy of the License at
7
+ #
8
+ # http://www.apache.org/licenses/LICENSE-2.0
9
+ #
10
+ # Unless required by applicable law or agreed to in writing, software
11
+ # distributed under the License is distributed on an "AS IS" BASIS,
12
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13
+ # See the License for the specific language governing permissions and
14
+ # limitations under the License.
15
+
16
+
17
+ """
18
+ Script to convert NeMo Megatron T5/UL2 model to Huggingface T5 model.
19
+ Based off of NVIDIA's conversion script at: https://github.com/NVIDIA/NeMo/blob/main/scripts/nlp_language_modeling/hf_t5-v1_1_to_nemo.py .
20
+ We reverse their conversion process.
21
+
22
+ NOTE: You may want to double check the conversion if you are using a custom config with share_decoder_tokens_head_embeddings=False.
23
+ """
24
+
25
+ import argparse
26
+ import os
27
+ import collections
28
+ import sys
29
+
30
+ import torch
31
+ from nemo.collections.nlp.models.language_modeling.megatron_t5_model import MegatronT5Model
32
+ from omegaconf.omegaconf import OmegaConf
33
+ from pytorch_lightning.trainer.trainer import Trainer
34
+ from transformers import AutoTokenizer, T5Config, T5ForConditionalGeneration
35
+
36
+ # Make hidden_size, num_heads, kv_dim configurable as args with argparse
37
+
38
+
39
+ def load_nemo_megatron_model(checkpoint_path, devices=1, num_nodes=1, accelerator="gpu"):
40
+ trainer = Trainer(devices=devices, num_nodes=num_nodes, accelerator=accelerator)
41
+ model = MegatronT5Model.load_from_checkpoint(checkpoint_path, trainer=trainer)
42
+
43
+ return model
44
+
45
+
46
+ def load_huggingface_t5_model(model_config_path):
47
+ """
48
+ # You need to configure config yourself based on your hparams during training
49
+ # See examples of UL2 Huggingface configs:
50
+ # https://huggingface.co/google/flan-ul2/blob/main/config.json
51
+ # https://huggingface.co/Finnish-NLP/ul2-base-nl36-finnish/blob/main/config.json
52
+ """
53
+ t5_config = T5Config.from_pretrained(model_config_path)
54
+ t5_model = T5ForConditionalGeneration(t5_config)
55
+
56
+ return t5_model
57
+
58
+
59
+ def _get_model_type_block_layer_hf(k):
60
+ """
61
+ Get info from Huggingface model block and layer names
62
+
63
+ Returns model_type, block number, layer number.
64
+ """
65
+ if k.startswith("encoder"):
66
+ model_type = "encoder"
67
+ elif k.startswith("decoder"):
68
+ model_type = "decoder"
69
+ else:
70
+ raise ValueError(f"Unknown model type for {k}")
71
+ return model_type, int(k.split(".")[2]), int(k.split(".")[4])
72
+
73
+
74
+ def _get_model_type_layer_nemo(k):
75
+ """
76
+ Get info from NeMo layer names.
77
+
78
+ Returns model_type, layer number.
79
+ 5th element in the split is the layer number.
80
+ """
81
+ print(k)
82
+ if "encoder" in k:
83
+ model_type = "encoder"
84
+ elif "decoder" in k:
85
+ model_type = "decoder"
86
+ else:
87
+ raise ValueError(f"Unknown model type for {k}")
88
+ return model_type, int(k.split(".")[5])
89
+
90
+
91
+ def fix_query_key_value_ordering(param, checkpoint_version, num_splits, num_heads, hidden_size):
92
+ # Permutes layout of param tensor to [num_splits * num_heads * hidden_size, :]
93
+ # for compatibility with later versions of NVIDIA Megatron-LM.
94
+ # The inverse operation is performed inside Megatron-LM to read checkpoints:
95
+ # https://github.com/NVIDIA/Megatron-LM/blob/v2.4/megatron/checkpointing.py#L209
96
+ # If param is the weight tensor of the self-attention block, the returned tensor
97
+ # will have to be transposed one more time to be read by HuggingFace BERT.
98
+ input_shape = param.size()
99
+ if checkpoint_version == 1.0:
100
+ # version 1.0 stores [num_heads * hidden_size * num_splits, :]
101
+ saved_shape = (num_heads, hidden_size, num_splits) + input_shape[1:]
102
+ param = param.view(*saved_shape)
103
+ param = param.transpose(0, 2)
104
+ param = param.transpose(1, 2).contiguous()
105
+ elif checkpoint_version >= 2.0:
106
+ # other versions store [num_heads * num_splits * hidden_size, :]
107
+ saved_shape = (num_heads, num_splits, hidden_size) + input_shape[1:]
108
+ param = param.view(*saved_shape)
109
+ param = param.transpose(0, 1).contiguous()
110
+ param = param.view(*input_shape)
111
+ return param
112
+
113
+
114
+ def convert_nemo_to_hf(
115
+ nemo_weights, fix_qkv_ordering=False, hidden_size=768, num_heads=12, kv_dim=64, checkpoint_version=2.0
116
+ ):
117
+ """
118
+ Convert NeMo Megatron T5/UL2 model to Huggingface T5 model.
119
+
120
+ Args:
121
+ nemo_weights (dict): NeMo model weights (state dict).
122
+ fix_qkv_ordering (bool): Whether to fix the query, key, value ordering in the self-attention blocks.
123
+ hidden_size (int): Hidden size of the model.
124
+ num_heads (int): Number of attention heads.
125
+ kv_dim (int): Projection weights dimension in multi-head attention. Generally: hidden_size // num_heads.
126
+ checkpoint_version (float): Megatron checkpoint version (No idea how to get this from the checkpoint itself).
127
+
128
+ Returns:
129
+ hf_weights (dict): Huggingface model weights (state dict).
130
+ """
131
+ print(f"Found {len(nemo_weights.keys())} keys in the NeMo checkpoint")
132
+
133
+ hf_weights = collections.OrderedDict()
134
+
135
+ for k, v in nemo_weights.items():
136
+ #################################################
137
+ ###### Enc-Dec Embeddings and Output Layer ######
138
+ #################################################
139
+ # Tied decoder embedding and decoder output layer.
140
+ if k == "enc_dec_model.decoder_embedding.word_embeddings.weight":
141
+ # shared.weight, lm_head.weight, decoder.embed_tokens.weight and encoder.embed_tokens.weight
142
+ # are the same in HF when tied_word_embeddings=True in T5Config.
143
+ # Corresponding setting in NeMo config: share_decoder_tokens_head_embeddings=True (share decoder vocab embeddings and decoder LM Head)
144
+ # and share_token_embeddings=True (share encoder/decoder vocab embeddings).
145
+ # Shared decoder embeddings and LM head yield best result according to: https://aclanthology.org/2021.emnlp-main.465.pdf#page=7 .
146
+
147
+ # Check if encoder and decoder token embeddings are the same.
148
+ is_shared_encdec = torch.allclose(
149
+ v, nemo_weights["enc_dec_model.encoder_embedding.word_embeddings.weight"]
150
+ )
151
+ if is_shared_encdec:
152
+ print("Found shared encoder and decoder embeddings")
153
+ hf_weights["shared.weight"] = v
154
+ else:
155
+ raise ValueError(
156
+ (
157
+ f"Found separate encoder and decoder embeddings in NeMo checkpoint. \n"
158
+ f"Not supported in T5 HF implementation. \n"
159
+ f"You should probably set 'share_token_embeddings' to True in your NeMo config. \n"
160
+ )
161
+ )
162
+
163
+ if k == "enc_dec_model.tokens_head.weight":
164
+ # This weight doesn't seem to exist in Nemo when share_decoder_tokens_head_embeddings=True.
165
+ # Don't worry though. If you set tie_word_embeddings=True in HF, this weight will be
166
+ # created automatically when loading the model in HF and tied to
167
+ # shared.weight / decoder.embed_tokens.weight.
168
+ hf_weights["lm_head.weight"] = v
169
+ print(f"Mapped {k} to lm_head.weight")
170
+
171
+ elif k == "enc_dec_model.tokens_head.bias":
172
+ # HF doesn't have a bias for lm_head.weight
173
+ raise ValueError(
174
+ (
175
+ f"Found bias for lm_head.weight in NeMo checkpoint. This is not supported in HF T5 implementation. \n"
176
+ f"You should probably set 'tokens_head_bias' to False in your NeMo config. \n"
177
+ f"If your checkpoint is from older version of Megatron, you may also need to set 'share_decoder_tokens_head_embeddings' to False in NeMo config. \n"
178
+ f"See: https://github.com/NVIDIA/NeMo/blob/557c4b7ae766faf050374e6b9a862e2e67385b10/nemo/collections/nlp/models/language_modeling/megatron_lm_encoder_decoder_model.py#L231-L236"
179
+ )
180
+ )
181
+ # hf_weights["lm_head.bias"] = v
182
+ # print(f"Mapped {k} to lm_head.bias")
183
+
184
+ # Decoder embeddings
185
+ elif k == "enc_dec_model.decoder_embedding.word_embeddings.weight":
186
+ hf_weights["decoder.embed_tokens.weight"] = v
187
+
188
+ elif k == "enc_dec_model.encoder_embedding.word_embeddings.weight":
189
+ hf_weights["encoder.embed_tokens.weight"] = v
190
+ print(f"Mapped {k} to encoder.embed_tokens.weight")
191
+
192
+ #################################################
193
+ ################# RPE Weights ###################
194
+ #################################################
195
+
196
+ elif k == "enc_dec_model.encoder_relative_position_embedding.relative_position_embedding.weight":
197
+ hf_weights["encoder.block.0.layer.0.SelfAttention.relative_attention_bias.weight"] = v
198
+ print(f"Mapped {k} to encoder.block.0.layer.0.SelfAttention.relative_attention_bias.weight")
199
+ elif k == "enc_dec_model.decoder_relative_position_embedding.relative_position_embedding.weight":
200
+ hf_weights["decoder.block.0.layer.0.SelfAttention.relative_attention_bias.weight"] = v
201
+ print(f"Mapped {k} to decoder.block.0.layer.0.SelfAttention.relative_attention_bias.weight")
202
+
203
+ #################################################
204
+ #################$ LayerNorm ####################
205
+ #################################################
206
+
207
+ # Block in HF corresponds to layer in NeMo.
208
+ # Layer in HF does not correspond to anything in NeMo.
209
+ # In Huggingface: Layer 0 is input layer norm, layer 1 is layer norm on self attn output,
210
+ # layer 2 is layer norm for cross attn output in decoder.
211
+
212
+ # In NeMo, some layernorm layers (final layernorms) don't have layer number in the name.
213
+ # We take care of these early so _get_model_type_layer_nemo function doesn't fail.
214
+
215
+ elif "layernorm" in k:
216
+ if "final" in k:
217
+ model_type = "encoder" if "encoder" in k else "decoder"
218
+
219
+ # Layer 2 in HF is always FFN + LayerNorm
220
+ hf_weights[f"{model_type}.final_layer_norm.weight"] = v
221
+ print(f"Mapped {k} to {model_type}.final_layer_norm.weight")
222
+
223
+ # if "bias" in k:
224
+ # hf_weights[f"{model_type}.block.final_layer_norm.bias"] = v
225
+ # print(f"Mapped {k} to {model_type}.block.final_layer_norm.bias")
226
+
227
+ else:
228
+ model_type, layer_number = _get_model_type_layer_nemo(k)
229
+
230
+ if "input_layernorm" in k and model_type == "encoder":
231
+ # Input layer norm is always layer 0 in HF
232
+ hf_weights[f"encoder.block.{layer_number}.layer.0.layer_norm.weight"] = v
233
+ print(f"Mapped {k} to encoder.block.{layer_number}.layer.0.layer_norm.weight")
234
+
235
+ # if "bias" in k:
236
+ # hf_weights[f"encoder.block.{layer_number}.layer.0.layer_norm.bias"] = v
237
+ # print(f"Mapped {k} to encoder.block.{layer_number}.layer.0.layer_norm.bias")
238
+
239
+ elif "post_attention_layernorm" in k and model_type == "encoder":
240
+ # Layer 1 in HF is layer norm for self attn output
241
+ hf_weights[f"{model_type}.block.{layer_number}.layer.1.layer_norm.weight"] = v
242
+ print(f"Mapped {k} to {model_type}.block.{layer_number}.layer.1.layer_norm.weight")
243
+
244
+ # if "bias" in k:
245
+ # hf_weights[f"{model_type}.block.{layer_number}.layer.1.layer_norm.bias"] = v
246
+ # print(f"Mapped {k} to {model_type}.block.{layer_number}.layer.1.layer_norm.bias")
247
+
248
+ elif "input_layernorm" in k and model_type == "decoder":
249
+ # Input layer norm is always layer 0 in HF
250
+ hf_weights[f"decoder.block.{layer_number}.layer.0.layer_norm.weight"] = v
251
+ print(f"Mapped {k} to decoder.block.{layer_number}.layer.0.layer_norm.weight")
252
+
253
+ # if "bias" in k:
254
+ # hf_weights[f"decoder.block.{layer_number}.layer.0.layer_norm.bias"] = v
255
+ # print(f"Mapped {k} to decoder.block.{layer_number}.layer.0.layer_norm.bias")
256
+
257
+ elif "post_attention_layernorm" in k and model_type == "decoder":
258
+ # Layer 1 in HF is layer norm for self attn output
259
+ hf_weights[f"{model_type}.block.{layer_number}.layer.1.layer_norm.weight"] = v
260
+ print(f"Mapped {k} to {model_type}.block.{layer_number}.layer.1.layer_norm.weight")
261
+
262
+ # if "bias" in k:
263
+ # hf_weights[f"{model_type}.block.{layer_number}.layer.1.layer_norm.bias"] = v
264
+ # print(f"Mapped {k} to {model_type}.block.{layer_number}.layer.1.layer_norm.bias")
265
+
266
+ elif "post_inter_attention_layernorm" in k and model_type == "decoder":
267
+ # Layer 2 in HF is layer norm for cross attn output
268
+ hf_weights[f"{model_type}.block.{layer_number}.layer.2.layer_norm.weight"] = v
269
+ print(f"Mapped {k} to {model_type}.block.{layer_number}.layer.2.layer_norm.weight")
270
+
271
+ # if "bias" in k:
272
+ # hf_weights[f"{model_type}.block.{layer_number}.layer.2.layer_norm.bias"] = v
273
+ # print(f"Mapped {k} to {model_type}.block.{layer_number}.layer.2.layer_norm.bias")
274
+ else:
275
+ raise ValueError("Unknown layer_norm key: {}".format(k))
276
+
277
+ #################################################
278
+ ############### Attention Layers ################
279
+ #################################################
280
+
281
+ # Self-Attention
282
+
283
+ # Q, K, V in NeMo-Megatron are bundled into a single matrix.
284
+ elif "self_attention.query_key_value.weight" in k:
285
+ # Example naming in HF:
286
+ # encoder.block.0.layer.0.SelfAttention.q.weight
287
+ # decoder.block.0.layer.0.SelfAttention.q.weight
288
+
289
+ # Model type is either "encoder" or "decoder"
290
+ model_type, layer_number = _get_model_type_layer_nemo(k)
291
+
292
+ if fix_qkv_ordering:
293
+ out_val = fix_query_key_value_ordering(
294
+ v, checkpoint_version=checkpoint_version, num_splits=3, num_heads=num_heads, hidden_size=kv_dim
295
+ )
296
+ else:
297
+ out_val = v
298
+
299
+ q_weights = out_val[0 * hidden_size : 1 * hidden_size, :]
300
+ k_weights = out_val[1 * hidden_size : 2 * hidden_size, :]
301
+ v_weights = out_val[2 * hidden_size : 3 * hidden_size, :]
302
+
303
+ # Layer 0 in HF is always self attn
304
+ hf_weights[f"{model_type}.block.{layer_number}.layer.0.SelfAttention.q.weight"] = q_weights
305
+ hf_weights[f"{model_type}.block.{layer_number}.layer.0.SelfAttention.k.weight"] = k_weights
306
+ hf_weights[f"{model_type}.block.{layer_number}.layer.0.SelfAttention.v.weight"] = v_weights
307
+
308
+ print(
309
+ (
310
+ f"Mapped {k} to: \n",
311
+ f"{model_type}.block.{layer_number}.layer.0.SelfAttention.q.weight \n",
312
+ f"{model_type}.block.{layer_number}.layer.0.SelfAttention.k.weight \n",
313
+ f"{model_type}.block.{layer_number}.layer.0.SelfAttention.v.weight \n",
314
+ )
315
+ )
316
+
317
+ # If we trained with bias=True in NeMo we will have bias terms for all weight matrices.
318
+ # Huggingface doesn't support optional bias terms in their T5 implementation.
319
+ elif "self_attention.query_key_value.bias" in k:
320
+ raise ValueError(
321
+ "Bias terms for most weights are not supported in Huggingface T5. Train with bias=False in NeMo config."
322
+ )
323
+
324
+ # Output self-attn matrix.
325
+ elif "self_attention.dense.weight" in k:
326
+ model_type, layer_number = _get_model_type_layer_nemo(k)
327
+ # Layer 0 in HF still always self attn
328
+ hf_weights[f"{model_type}.block.{layer_number}.layer.0.SelfAttention.o.weight"] = v
329
+ print(f"Mapped {k} to {model_type}.block.{layer_number}.layer.0.SelfAttention.o.weight")
330
+
331
+ # Cross-Attention projection matrices are merged into K, V matrices in NeMo-Megatron.
332
+ # Need to split them into K, V matrices in HF.
333
+ elif "inter_attention.key_value.weight" in k:
334
+ model_type, layer_number = _get_model_type_layer_nemo(k)
335
+
336
+ if fix_qkv_ordering:
337
+ out_val = fix_query_key_value_ordering(
338
+ v, checkpoint_version=checkpoint_version, num_splits=2, num_heads=num_heads, hidden_size=kv_dim
339
+ )
340
+ else:
341
+ out_val = v
342
+
343
+ # Layer 1 in HF is always cross attn
344
+ k_weights = out_val[0 * hidden_size : 1 * hidden_size, :]
345
+ v_weights = out_val[1 * hidden_size : 2 * hidden_size, :]
346
+ hf_weights[f"decoder.block.{layer_number}.layer.1.EncDecAttention.k.weight"] = k_weights
347
+ hf_weights[f"decoder.block.{layer_number}.layer.1.EncDecAttention.v.weight"] = v_weights
348
+ print(
349
+ (
350
+ f"Mapped {k} to: \n",
351
+ f"decoder.block.{layer_number}.layer.1.EncDecAttention.k.weight \n",
352
+ f"decoder.block.{layer_number}.layer.1.EncDecAttention.v.weight \n",
353
+ )
354
+ )
355
+
356
+ # Cross-Attention Q matrix is separate in NeMo-Megatron and HF.
357
+ elif "inter_attention.query.weight" in k:
358
+ model_type, layer_number = _get_model_type_layer_nemo(k)
359
+ # Layer 1 in HF is always cross attn
360
+ hf_weights[f"decoder.block.{layer_number}.layer.1.EncDecAttention.q.weight"] = v
361
+ print(f"Mapped {k} to decoder.block.{layer_number}.layer.1.EncDecAttention.q.weight")
362
+
363
+ # Output cross-attention matrix.
364
+ elif "inter_attention.dense.weight" in k:
365
+ model_type, layer_number = _get_model_type_layer_nemo(k)
366
+ # Layer 1 in HF is always cross attn
367
+ hf_weights[f"decoder.block.{layer_number}.layer.1.EncDecAttention.o.weight"] = v
368
+ print(f"Mapped {k} to decoder.block.{layer_number}.layer.1.EncDecAttention.o.weight")
369
+
370
+ #################################################
371
+ #################$ FFN Layers ###################
372
+ #################################################
373
+
374
+ elif "mlp.dense_h_to_4h.weight" in k:
375
+ model_type, layer_number = _get_model_type_layer_nemo(k)
376
+
377
+ if model_type == "encoder":
378
+ # FFN + LayerNorm is always layer 1 in HF encoder attention blocks.
379
+ hf_weights[f"{model_type}.block.{layer_number}.layer.1.DenseReluDense.wi_0.weight"] = v
380
+ print(f"Mapped {k} to {model_type}.block.{layer_number}.layer.1.DenseReluDense.wi_0.weight")
381
+ elif model_type == "decoder":
382
+ # FFN + LayerNorm is always layer 2 in HF decoder attention blocks.
383
+ hf_weights[f"{model_type}.block.{layer_number}.layer.2.DenseReluDense.wi_0.weight"] = v
384
+ print(f"Mapped {k} to {model_type}.block.{layer_number}.layer.2.DenseReluDense.wi_0.weight")
385
+
386
+ elif "mlp.dense_h_to_4h_2.weight" in k:
387
+ model_type, layer_number = _get_model_type_layer_nemo(k)
388
+
389
+ if model_type == "encoder":
390
+ # FFN + LayerNorm is always layer 1 in HF encoder attention blocks.
391
+ hf_weights[f"{model_type}.block.{layer_number}.layer.1.DenseReluDense.wi_1.weight"] = v
392
+ print(f"Mapped {k} to {model_type}.block.{layer_number}.layer.1.DenseReluDense.wi_1.weight")
393
+ elif model_type == "decoder":
394
+ # FFN + LayerNorm is always layer 2 in HF decoder attention blocks.
395
+ hf_weights[f"{model_type}.block.{layer_number}.layer.2.DenseReluDense.wi_1.weight"] = v
396
+ print(f"Mapped {k} to {model_type}.block.{layer_number}.layer.2.DenseReluDense.wi_1.weight")
397
+
398
+ elif "mlp.dense_4h_to_h.weight" in k:
399
+ model_type, layer_number = _get_model_type_layer_nemo(k)
400
+ # Layer 2 in HF is always FFN + LayerNorm
401
+ if model_type == "encoder":
402
+ # FFN + LayerNorm is always layer 1 in HF encoder attention blocks.
403
+ hf_weights[f"{model_type}.block.{layer_number}.layer.1.DenseReluDense.wo.weight"] = v
404
+ print(f"Mapped {k} to {model_type}.block.{layer_number}.layer.1.DenseReluDense.wo.weight")
405
+ elif model_type == "decoder":
406
+ # FFN + LayerNorm is always layer 2 in HF decoder attention blocks.
407
+ hf_weights[f"{model_type}.block.{layer_number}.layer.2.DenseReluDense.wo.weight"] = v
408
+ print(f"Mapped {k} to {model_type}.block.{layer_number}.layer.2.DenseReluDense.wo.weight")
409
+
410
+ else:
411
+ raise ValueError(f"Unknown key: {k}")
412
+
413
+ print("Done mapping weights. \n")
414
+ print(f"Total keys in converted Huggingface weight mapping: {len(hf_weights.keys())} \n")
415
+ return hf_weights
416
+
417
+
418
+ # singularity shell --nv data/nemo2302
419
+
420
+
421
+ def compare_weights_hf_nemo(model, hf_weights, hf_config_path, hf_model_path=None):
422
+ """
423
+ Compares the weights of a Huggingface initialized model against Nemo model converted to HF.
424
+ Prints if there are any missing keys that were expected but not mapped.
425
+ Also compares parameter count of HF initialized model against original unconverted Nemo model.
426
+
427
+ Args:
428
+ model: NeMo model
429
+ hf_weights: Dictionary of Huggingface weights
430
+ hf_config_path: Path to Huggingface config file to initialize model from.
431
+ hf_model_path: Path to Huggingface Hub or local HF model folder, if you alternatively want to
432
+ load/initialize from an existing model on HF Hub or disk (optional)
433
+ """
434
+
435
+ if hf_model_path:
436
+ # If user supplies a HF hub model path, or local converted model, we load the model from there.
437
+ hf_model = T5ForConditionalGeneration.from_pretrained(hf_model_path)
438
+ else:
439
+ # Otherwise, we load the model from the config.
440
+ hf_model = load_huggingface_t5_model(hf_config_path)
441
+
442
+ print(f"Total keys in converted Huggingface weight mapping: {len(hf_weights.keys())} \n")
443
+ print(f"Total keys in Huggingface model initialized from config or HF Hub: {len(hf_model.state_dict().keys())} \n")
444
+
445
+ # Count the number of parameters in the model
446
+ print(
447
+ f"Number of parameters in HF model initialized from config or HF hub: {sum(p.numel() for p in hf_model.parameters() if p.requires_grad)}"
448
+ )
449
+ # Number of parameters in Nemo model
450
+ print(f"Number of parameters in Nemo model: {sum(p.numel() for p in model.parameters() if p.requires_grad)} \n")
451
+
452
+ # Check the set difference between the two sets of model keys (model loaded from config and converted model)
453
+ print(
454
+ (
455
+ f"Keys in converted HF weight mapping but missing in HF model initialized from config.json: \n"
456
+ f"{set(hf_weights.keys()) - set(hf_model.state_dict().keys())} \n"
457
+ )
458
+ )
459
+ print(
460
+ (
461
+ f"Keys in HF model initialized from config.json but missing in converted HF weight mapping: \n"
462
+ f"{set(hf_model.state_dict().keys()) - set(hf_weights.keys())} \n"
463
+ )
464
+ )
465
+
466
+ print(
467
+ (
468
+ f"It is expected that lm_head.weight is missing from converted HF weight mapping \n"
469
+ f"if you have set share_decoder_tokens_head_embeddings=True in your Nemo config. \n"
470
+ f"This weight doesn't exist in Nemo, as it is shared with the decoder token embeddings. \n \n"
471
+ f"In Huggingface, weights for lm_head.weight and decoder token embeddings are generally duplicated \n"
472
+ f"in the state_dict. When missing, the lm_head.weight is automatically initialized from shared decoder \n"
473
+ f"token embeddings weights if your HF config.json has tie_word_embeddings=True."
474
+ )
475
+ )
476
+
477
+
478
+ if __name__ == "__main__":
479
+ parser = argparse.ArgumentParser(description="Convert Nemo T5/UL2 model to Huggingface T5/UL2 model")
480
+ parser.add_argument(
481
+ "--nemo_model_path",
482
+ type=str,
483
+ required=True,
484
+ help="Path to Nemo T5/UL2 model .ckpt file",
485
+ )
486
+ parser.add_argument(
487
+ "--hf_config_path",
488
+ type=str,
489
+ required=True,
490
+ help="Path to Huggingface T5 config.json",
491
+ )
492
+ parser.add_argument(
493
+ "--hf_model_path",
494
+ type=str,
495
+ required=False,
496
+ help="Path to Huggingface T5 model, local folder or HF hub model",
497
+ )
498
+ parser.add_argument(
499
+ "--output_path",
500
+ type=str,
501
+ required=True,
502
+ help="Folder to save converted Huggingface T5/UL2 model in",
503
+ )
504
+
505
+ parser.add_argument("--hidden_size", type=int, default=768, help="Hidden size of Nemo model")
506
+ parser.add_argument("--num_heads", type=int, default=12, help="Number of attention heads in Nemo model")
507
+ # Default False if --fix_qkv not specified
508
+ parser.add_argument("--fix_qkv", action="store_true", help="Fix QKV weights in converted HF model")
509
+ parser.add_argument("--checkpoint_version", type=float, default=2.0, help="Checkpoint version of Nemo model")
510
+ parser.add_argument(
511
+ "--kv_dim", type=int, default=64, help="Key/Value dimension of Nemo model. Typically hidden_size // num_heads"
512
+ )
513
+
514
+ args = parser.parse_args()
515
+
516
+ #### Convert Nemo T5/UL2 model to Huggingface T5/UL2 model
517
+ model = load_nemo_megatron_model(checkpoint_path=args.nemo_model_path)
518
+ nemo_weights = model.state_dict()
519
+
520
+ hf_weights = convert_nemo_to_hf(
521
+ nemo_weights=nemo_weights,
522
+ fix_qkv_ordering=args.fix_qkv,
523
+ hidden_size=args.hidden_size,
524
+ num_heads=args.num_heads,
525
+ kv_dim=args.kv_dim,
526
+ checkpoint_version=args.checkpoint_version,
527
+ )
528
+
529
+ # We trained with a HF tokenizer, we grab it from the Nemo model.
530
+ tokenizer = model.tokenizer.__dict__["tokenizer"]
531
+
532
+ # We manually create HF config.json that matches architecture of the nemo model
533
+ # (or grab one from existing model on HF Hub and modify where necessary).
534
+ # See example config.json
535
+ config = T5Config.from_json_file(args.hf_config_path)
536
+
537
+ # Save config
538
+ config.save_pretrained(args.output_path)
539
+ print(f"Saved config to {os.path.join(args.output_path, 'config.json')}")
540
+
541
+ # Save tokenizer
542
+ tokenizer.save_pretrained(args.output_path)
543
+ print(f"Saved tokenizer to {os.path.join(args.output_path, 'tokenizer.json')}")
544
+
545
+ # Save the converted weights to a file
546
+ torch.save(hf_weights, os.path.join(args.output_path, "pytorch_model.bin"))
547
+ print(f"Saved converted weights to {os.path.join(args.output_path, 'pytorch_model.bin')}")
548
+
549
+ # Sanity check
550
+ compare_weights_hf_nemo(model, hf_weights, hf_config_path=args.hf_config_path, hf_model_path=args.hf_model_path)
hf_t5_v1_1_to_nemo.py ADDED
@@ -0,0 +1,387 @@
1
+ # Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
2
+ #
3
+ # Licensed under the Apache License, Version 2.0 (the "License");
4
+ # you may not use this file except in compliance with the License.
5
+ # You may obtain a copy of the License at
6
+ #
7
+ # http://www.apache.org/licenses/LICENSE-2.0
8
+ #
9
+ # Unless required by applicable law or agreed to in writing, software
10
+ # distributed under the License is distributed on an "AS IS" BASIS,
11
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
+ # See the License for the specific language governing permissions and
13
+ # limitations under the License.
14
+
15
+ """
16
+ This script generates a NeMo-Megatron compatible `.nemo` file for a Huggingface T5-v1_1 model.
17
+ List of Huggingface models that this script can convert:
18
+ 1. google/t5-v1_1-small
19
+ 2. google/t5-v1_1-base
20
+ 3. google/t5-v1_1-large
21
+ 4. google/t5-v1_1-xl
22
+ 5. google/t5-v1_1-xxl
23
+ 6. google/mt5-small
24
+ 7. google/mt5-base
25
+ 8. google/mt5-large
26
+ 9. google/mt5-xl
27
+ 10. google/mt5-xxl
28
+ 11. google/ul2
29
+ 13. bigscience/T0pp
30
+ 14. google/t5-small-lm-adapt
31
+ 15. google/t5-base-lm-adapt
32
+ 16. google/t5-large-lm-adapt
33
+ 17. google/t5-xl-lm-adapt
34
+ 18. google/t5-xxl-lm-adapt
35
+ 19. google/flan-t5-small
36
+ 20. google/flan-t5-base
37
+ 21. google/flan-t5-large
38
+ 22. google/flan-t5-xl
39
+ 23. google/flan-t5-xxl
40
+ Use instructions:
41
+ python hf_t5-v1_1_to_nemo.py \
42
+ --hf_model_name bigscience/T0pp \
43
+ --nemo_state_dict /path/to/nemo_state_dict.pt \
44
+ --nemo_file_path /path/to/nemo_file.nemo
45
+ """
46
+ import collections
47
+ import os
48
+ import tempfile
49
+ from argparse import ArgumentParser
50
+
51
+ import torch
52
+ from omegaconf.omegaconf import OmegaConf, open_dict
53
+ from pytorch_lightning import Trainer
54
+ from transformers import AutoTokenizer, T5ForConditionalGeneration
55
+
56
+ from nemo.collections.nlp.models.language_modeling.megatron_t5_model import MegatronT5Model
57
+ from nemo.collections.nlp.parts.nlp_overrides import NLPDDPStrategy, NLPSaveRestoreConnector
58
+
59
+ try:
60
+ import accelerate
61
+ except ImportError:
62
+ raise ImportError("Please install accelerate package via `pip install accelerate` to use this script.")
63
+
64
+
65
+ def convert_weights(hf_model, nemo_state_dict_path):
66
+ if hf_model == "google/ul2":
67
+ torch_dtype = torch.bfloat16
68
+ else:
69
+ torch_dtype = torch.float32
70
+ hf_model = T5ForConditionalGeneration.from_pretrained(hf_model, low_cpu_mem_usage=True, torch_dtype=torch_dtype)
71
+ hf_model_config = hf_model.config
72
+ with tempfile.TemporaryDirectory() as tmp:
73
+ torch.save(hf_model.state_dict(), os.path.join(tmp, "model.pt"))
74
+ hf_weights = torch.load(os.path.join(tmp, "model.pt"))
75
+
76
+ nemo_weights = collections.OrderedDict()
77
+
78
+ print(f"Found {len(hf_weights.keys())} keys in the checkpoint")
79
+
80
+ def _get_model_type_block_layer(k):
81
+ if k.startswith("encoder"):
82
+ model_type = "encoder"
83
+ elif k.startswith("decoder"):
84
+ model_type = "decoder"
85
+ else:
86
+ raise ValueError(f"Unknown model type for {k}")
87
+
88
+ return model_type, int(k.split(".")[2]), int(k.split(".")[4])
89
+
90
+ for k, v in hf_weights.items():
91
+ #################################################
92
+ ###### Enc-Dec Embeddings and Output Layer ######
93
+ #################################################
94
+ # Tied decoder embedding and decoder output layer.
95
+ if k == "shared.weight":
96
+ pass
97
+
98
+ elif k == "lm_head.weight":
99
+ nemo_weights["enc_dec_model.tokens_head.weight"] = v
100
+ print(
101
+ f"Mapped {k} to enc_dec_model.decoder_embedding.word_embeddings.weight and enc_dec_model.tokens_head.weight"
102
+ )
103
+
104
+ # Decoder embeddings
105
+ elif k == "decoder.embed_tokens.weight":
106
+ nemo_weights["enc_dec_model.decoder_embedding.word_embeddings.weight"] = v
107
+
108
+ elif k == "encoder.embed_tokens.weight":
109
+ nemo_weights["enc_dec_model.encoder_embedding.word_embeddings.weight"] = v
110
+ print(f"Mapped {k} to enc_dec_model.encoder_embedding.word_embeddings.weight")
111
+
112
+ #################################################
113
+ ################# RPE Weights ###################
114
+ #################################################
115
+
116
+ elif k == "encoder.block.0.layer.0.SelfAttention.relative_attention_bias.weight":
117
+ nemo_weights["enc_dec_model.encoder_relative_position_embedding.relative_position_embedding.weight"] = v
118
+ print(
119
+ f"Mapped {k} to enc_dec_model.encoder_relative_position_embedding.relative_position_embedding.weight"
120
+ )
121
+
122
+ elif k == "decoder.block.0.layer.0.SelfAttention.relative_attention_bias.weight":
123
+ nemo_weights["enc_dec_model.decoder_relative_position_embedding.relative_position_embedding.weight"] = v
124
+ print(
125
+ f"Mapped {k} to enc_dec_model.decoder_relative_position_embedding.relative_position_embedding.weight"
126
+ )
127
+
128
+ # Block in HF corresponds to layer in NeMo.
129
+ # Layer in HF does not correspond to anything in NeMo. Layer 0 is self attn, layer 1 is cross-attn.
130
+
131
+ #################################################
132
+ ############### Attention Layers ################
133
+ #################################################
134
+
135
+ # Self-Attention
136
+
137
+ # Q, K, V in NeMo-Megatron are bundled into a single matrix.
138
+ elif "SelfAttention.q.weight" in k:
139
+ model_type, block_number, layer_number = _get_model_type_block_layer(k)
140
+ k_weight = hf_weights[k.replace("q.weight", "k.weight")]
141
+ v_weight = hf_weights[k.replace("q.weight", "v.weight")]
142
+ concat_weights = torch.cat([v, k_weight, v_weight], dim=0)
143
+ nemo_weights[
144
+ f"enc_dec_model.enc_dec_model.{model_type}.model.layers.{block_number}.self_attention.query_key_value.weight"
145
+ ] = concat_weights
146
+ print(
147
+ f"Mapped {k} to enc_dec_model.enc_dec_model.{model_type}.model.layers.{block_number}.self_attention.query_key_value.weight"
148
+ )
149
+
150
+ # We can skip processing of k, v weights since we already concat them into qkv above.
151
+ elif "SelfAttention.k.weight" in k or "SelfAttention.v.weight" in k:
152
+ pass
153
+
154
+ # Output self-attn matrix.
155
+ elif "SelfAttention.o.weight" in k:
156
+ model_type, block_number, layer_number = _get_model_type_block_layer(k)
157
+ block_number = int(k.split(".")[2]) # Block in HF corresponds to layer in NeMo.
158
+ layer_number = int(
159
+ k.split(".")[4]
160
+ ) # Layer in HF does not correspond to anything in NeMo. Layer 0 is self attn, layer 1 is cross-attn.
161
+ nemo_weights[
162
+ f"enc_dec_model.enc_dec_model.{model_type}.model.layers.{block_number}.self_attention.dense.weight"
163
+ ] = v
164
+ print(
165
+ f"Mapped {k} to enc_dec_model.enc_dec_model.{model_type}.model.layers.{block_number}.self_attention.dense.weight"
166
+ )
167
+
168
+ # Cross-Attention projection matrices are merged into K, V matrices in NeMo-Megatron
169
+ elif "EncDecAttention.k.weight" in k:
170
+ model_type, block_number, layer_number = _get_model_type_block_layer(k)
171
+ v_weight = hf_weights[k.replace("k.weight", "v.weight")]
172
+ concat_weights = torch.cat([v, v_weight], dim=0)
173
+ nemo_weights[
174
+ f"enc_dec_model.enc_dec_model.decoder.model.layers.{block_number}.inter_attention.key_value.weight"
175
+ ] = concat_weights
176
+ print(
177
+ f"Mapped {k} to enc_dec_model.enc_dec_model.decoder.model.layers.{block_number}.inter_attention.key_value.weight"
178
+ )
179
+
180
+ # We can skip processing of v weights since we already concat them with k above.
181
+ elif "EncDecAttention.v.weight" in k:
182
+ pass
183
+
184
+ # Cross-Attention Q matrix is separate in NeMo-Megatron
185
+ elif "EncDecAttention.q.weight" in k:
186
+ model_type, block_number, layer_number = _get_model_type_block_layer(k)
187
+ nemo_weights[
188
+ f"enc_dec_model.enc_dec_model.decoder.model.layers.{block_number}.inter_attention.query.weight"
189
+ ] = v
190
+ print(
191
+ f"Mapped {k} to enc_dec_model.enc_dec_model.decoder.model.layers.{block_number}.inter_attention.query.weight"
192
+ )
193
+
194
+ # Cross-Attention Q matrix is separate in NeMo-Megatron
195
+ elif "EncDecAttention.o.weight" in k:
196
+ model_type, block_number, layer_number = _get_model_type_block_layer(k)
197
+ nemo_weights[
198
+ f"enc_dec_model.enc_dec_model.decoder.model.layers.{block_number}.inter_attention.dense.weight"
199
+ ] = v
200
+ print(
201
+ f"Mapped {k} to enc_dec_model.enc_dec_model.decoder.model.layers.{block_number}.inter_attention.dense.weight"
202
+ )
203
+
204
+ #################################################
205
+ #################$ FFN Layers ###################
206
+ #################################################
207
+
208
+ elif "DenseReluDense.wi_0.weight" in k:
209
+ model_type, block_number, layer_number = _get_model_type_block_layer(k)
210
+ nemo_weights[
211
+ f"enc_dec_model.enc_dec_model.{model_type}.model.layers.{block_number}.mlp.dense_h_to_4h.weight"
212
+ ] = v
213
+ print(
214
+ f"Mapped {k} to enc_dec_model.enc_dec_model.{model_type}.model.layers.{block_number}.mlp.dense_h_to_4h.weight"
215
+ )
216
+
217
+ elif "DenseReluDense.wi_1.weight" in k:
218
+ model_type, block_number, layer_number = _get_model_type_block_layer(k)
219
+ nemo_weights[
220
+ f"enc_dec_model.enc_dec_model.{model_type}.model.layers.{block_number}.mlp.dense_h_to_4h_2.weight"
221
+ ] = v
222
+ print(
223
+ f"Mapped {k} to enc_dec_model.enc_dec_model.{model_type}.model.layers.{block_number}.mlp.dense_h_to_4h_2.weight"
224
+ )
225
+
226
+ elif "DenseReluDense.wo.weight" in k:
227
+ model_type, block_number, layer_number = _get_model_type_block_layer(k)
228
+ nemo_weights[
229
+ f"enc_dec_model.enc_dec_model.{model_type}.model.layers.{block_number}.mlp.dense_4h_to_h.weight"
230
+ ] = v
231
+ print(
232
+ f"Mapped {k} to enc_dec_model.enc_dec_model.{model_type}.model.layers.{block_number}.mlp.dense_4h_to_h.weight"
233
+ )
234
+
235
+ #################################################
236
+ #################$ LayerNorm ####################
237
+ #################################################
238
+
239
+ elif "layer_norm" in k:
240
+ if "final" in k:
241
+ model_type = "encoder" if k.startswith("encoder") else "decoder"
242
+ nemo_weights[f"enc_dec_model.enc_dec_model.{model_type}.model.final_layernorm.weight"] = v
243
+ print(f"Mapped {k} to enc_dec_model.enc_dec_model.{model_type}.model.final_layernorm.weight")
244
+ else:
245
+ model_type, block_number, layer_number = _get_model_type_block_layer(k)
246
+ if layer_number == 0 and model_type == "encoder":
247
+ nemo_weights[
248
+ f"enc_dec_model.enc_dec_model.{model_type}.model.layers.{block_number}.input_layernorm.weight"
249
+ ] = v
250
+ print(
251
+ f"Mapped {k} to enc_dec_model.enc_dec_model.{model_type}.model.layers.{block_number}.input_layernorm.weight"
252
+ )
253
+ elif layer_number == 1 and model_type == "encoder":
254
+ nemo_weights[
255
+ f"enc_dec_model.enc_dec_model.{model_type}.model.layers.{block_number}.post_attention_layernorm.weight"
256
+ ] = v
257
+ print(
258
+ f"Mapped {k} to enc_dec_model.enc_dec_model.{model_type}.model.layers.{block_number}.post_attention_layernorm.weight"
259
+ )
260
+ elif layer_number == 0 and model_type == "decoder":
261
+ nemo_weights[
262
+ f"enc_dec_model.enc_dec_model.{model_type}.model.layers.{block_number}.input_layernorm.weight"
263
+ ] = v
264
+ print(
265
+ f"Mapped {k} to enc_dec_model.enc_dec_model.{model_type}.model.layers.{block_number}.input_layernorm.weight"
266
+ )
267
+ elif layer_number == 1 and model_type == "decoder":
268
+ nemo_weights[
269
+ f"enc_dec_model.enc_dec_model.{model_type}.model.layers.{block_number}.post_attention_layernorm.weight"
270
+ ] = v
271
+ print(
272
+ f"Mapped {k} to enc_dec_model.enc_dec_model.{model_type}.model.layers.{block_number}.post_attention_layernorm.weight"
273
+ )
274
+ elif layer_number == 2 and model_type == "decoder":
275
+ nemo_weights[
276
+ f"enc_dec_model.enc_dec_model.{model_type}.model.layers.{block_number}.post_inter_attention_layernorm.weight"
277
+ ] = v
278
+ print(
279
+ f"Mapped {k} to enc_dec_model.enc_dec_model.{model_type}.model.layers.{block_number}.post_inter_attention_layernorm.weight"
280
+ )
281
+ else:
282
+ raise ValueError("Unknown layer_norm key: {}".format(k))
283
+ else:
284
+ raise ValueError(f"Unknown key: {k}")
285
+
286
+ torch.save(nemo_weights, nemo_state_dict_path)
287
+ print("Saved weights to {}".format(nemo_state_dict_path))
288
+ return hf_model_config
289
+
290
+
291
+ def package_into_nemo_file(
292
+ state_dict_path, base_yaml_config, hf_model_config, nemo_file_path, hf_model_name, megatron_amp_O2
293
+ ):
294
+ """
295
+ Packages the state dict, config file and tokenizer into a `.nemo` file.
296
+ """
297
+ trainer = Trainer(devices=1, strategy=NLPDDPStrategy(), accelerator="cpu", precision=32)
298
+ base_cfg = OmegaConf.load(base_yaml_config)
299
+ if hf_model_config.dense_act_fn == "silu":
300
+ act_fn = "swiglu"
301
+ elif hf_model_config.dense_act_fn == "gelu_new":
302
+ act_fn = "geglu"
303
+ # FLAN-T5 models have things configured this way.
304
+ elif hf_model_config.dense_act_fn == "gelu" and hf_model_config.is_gated_act:
305
+ act_fn = "geglu"
306
+ else:
307
+ raise ValueError(f"Unknown dense_act_fn: {hf_model_config.dense_act_fn}")
308
+
309
+ with open_dict(base_cfg):
310
+ base_cfg.encoder.num_layers = hf_model_config.num_layers
311
+ base_cfg.encoder.hidden_size = hf_model_config.d_model
312
+ base_cfg.encoder.ffn_hidden_size = hf_model_config.d_ff
313
+ base_cfg.encoder.kv_channels = hf_model_config.d_kv
314
+ base_cfg.encoder.num_attention_heads = hf_model_config.num_heads
315
+ base_cfg.encoder.activation = act_fn
316
+ base_cfg.encoder.relative_attention_num_buckets = hf_model_config.relative_attention_num_buckets
317
+
318
+ base_cfg.decoder.num_layers = hf_model_config.num_decoder_layers
319
+ base_cfg.decoder.hidden_size = hf_model_config.d_model
320
+ base_cfg.decoder.ffn_hidden_size = hf_model_config.d_ff
321
+ base_cfg.decoder.kv_channels = hf_model_config.d_kv
322
+ base_cfg.decoder.num_attention_heads = hf_model_config.num_heads
323
+ base_cfg.decoder.activation = act_fn
324
+ base_cfg.decoder.relative_attention_num_buckets = hf_model_config.relative_attention_num_buckets
325
+
326
+ base_cfg.megatron_amp_O2 = megatron_amp_O2
327
+
328
+ with tempfile.TemporaryDirectory() as tmp:
329
+ tokenizer = AutoTokenizer.from_pretrained(hf_model_name)
330
+ tokenizer_path = tokenizer.save_vocabulary(tmp)[0]
331
+ base_cfg.tokenizer.model = tokenizer_path
332
+ model = MegatronT5Model(base_cfg, trainer).to("cpu")
333
+ model._save_restore_connector = NLPSaveRestoreConnector()
334
+ state_dict = torch.load(state_dict_path)
335
+ if megatron_amp_O2:
336
+ new_state_dict = {}
337
+ for key in state_dict.keys():
338
+ new_key = key.replace("model.", "model.module.", 1)
339
+ new_state_dict[new_key] = state_dict[key]
340
+ state_dict = new_state_dict
341
+ model.load_state_dict(state_dict)
342
+ model.save_to(nemo_file_path)
343
+
344
+
345
+ if __name__ == "__main__":
346
+ parser = ArgumentParser()
347
+ parser.add_argument(
348
+ "--hf_model_name",
349
+ type=str,
350
+ required=True,
351
+ help="Valid Huggingface T5v1_1 model name ex: google/t5-v1_1-large or google/ul2. Example something that can be loaded with T5ForConditionalGeneration.from_pretrained()",
352
+ )
353
+ parser.add_argument(
354
+ "--nemo_state_dict_path",
355
+ type=str,
356
+ required=True,
357
+ help="Path to write the intermediate nemo state dict file ex: /path/to/nemo_state_dict.pt",
358
+ )
359
+ parser.add_argument(
360
+ "--nemo_file_path",
361
+ type=str,
362
+ required=True,
363
+ help="Path to write the converted .nemo file ex: /path/to/t5_base_converted_to_nemo.nemo",
364
+ )
365
+ parser.add_argument(
366
+ "--base_yaml_config",
367
+ type=str,
368
+ default="hf_t5v1_1_base_config.yaml",
369
+ help="Path to a base yaml config that we edit based on the provided model.",
370
+ )
371
+ parser.add_argument(
372
+ "--megatron_amp_O2",
373
+ action="store_true",
374
+ help="Whether to store O2 weights. This may be useful for models like ul2 where only pre-trained half precision weights were released.",
375
+ )
376
+ args = parser.parse_args()
377
+ if not os.path.exists(args.base_yaml_config):
378
+ raise FileNotFoundError(f"Base yaml config file {args.base_yaml_config} does not exist.")
379
+ hf_model_config = convert_weights(args.hf_model_name, args.nemo_state_dict_path)
380
+ package_into_nemo_file(
381
+ state_dict_path=args.nemo_state_dict_path,
382
+ base_yaml_config=args.base_yaml_config,
383
+ hf_model_config=hf_model_config,
384
+ nemo_file_path=args.nemo_file_path,
385
+ hf_model_name=args.hf_model_name,
386
+ megatron_amp_O2=args.megatron_amp_O2,
387
+ )
hf_t5v1_1_base_config.yaml ADDED
@@ -0,0 +1,143 @@
+ encoder:
+   num_layers: 8
+   hidden_size: 512
+   ffn_hidden_size: 1024
+   num_attention_heads: 6
+   init_method_std: 0.02
+   hidden_dropout: 0.0
+   attention_dropout: 0.0
+   ffn_dropout: 0.0
+   position_embedding_type: relative
+   relative_attention_num_buckets: 32
+   relative_attention_max_distance: 128
+   relative_position_bias_self_attention_only: true
+   kv_channels: 64
+   apply_query_key_layer_scaling: false
+   layernorm_epsilon: 1.0e-06
+   persist_layer_norm: true
+   bias_activation_fusion: false
+   grad_div_ar_fusion: true
+   masked_softmax_fusion: false
+   bias_dropout_add_fusion: false
+   bias: false
+   normalization: rmsnorm
+   arch: transformer
+   activation: geglu
+   headscale: false
+   transformer_block_type: pre_ln
+   hidden_steps: 32
+   num_self_attention_per_cross_attention: 1
+   openai_gelu: true
+   onnx_safe: false
+   fp32_residual_connection: false
+   activations_checkpoint_method: null
+   activations_checkpoint_num_layers: 1
+   megatron_legacy: true
+   normalize_attention_scores: false
+ decoder:
+   num_layers: 8
+   hidden_size: 512
+   ffn_hidden_size: 1024
+   num_attention_heads: 6
+   init_method_std: 0.02
+   hidden_dropout: 0.0
+   attention_dropout: 0.0
+   ffn_dropout: 0.0
+   position_embedding_type: relative
+   relative_attention_num_buckets: 32
+   relative_attention_max_distance: 128
+   relative_position_bias_self_attention_only: true
+   kv_channels: 64
+   apply_query_key_layer_scaling: false
+   layernorm_epsilon: 1.0e-06
+   persist_layer_norm: true
+   bias_activation_fusion: false
+   grad_div_ar_fusion: true
+   masked_softmax_fusion: false
+   bias_dropout_add_fusion: false
+   bias: false
+   normalization: rmsnorm
+   arch: transformer
+   activation: geglu
+   headscale: false
+   transformer_block_type: pre_ln
+   hidden_steps: 32
+   num_self_attention_per_cross_attention: 1
+   openai_gelu: true
+   onnx_safe: false
+   fp32_residual_connection: false
+   activations_checkpoint_method: null
+   activations_checkpoint_num_layers: 1
+   megatron_legacy: true
+   normalize_attention_scores: false
+ micro_batch_size: 4
+ global_batch_size: 8
+ tensor_model_parallel_size: 1
+ pipeline_model_parallel_size: 1
+ resume_from_checkpoint: null
+ pipeline_model_parallel_split_rank: 0
+ make_vocab_size_divisible_by: 128
+ megatron_amp_O2: false
+ grad_allreduce_chunk_size_mb: 125
+ grad_div_ar_fusion: true
+ gradient_as_bucket_view: true
+ seq_length: 512
+ max_position_embeddings: 512
+ tokenizer:
+   library: sentencepiece
+   type: null
+   model: nemo:ce65b6d8f4fb4975955e935db699cba3_t5_small_tokenizer.model
+   vocab_file: null
+   merge_file: null
+   num_sentinel_tokens: 100
+   sentencepiece_legacy: true
+   add_sentinel_tokens_in_reverse_order: true
+   add_sentinel_tokens_first: true
+ embedding_init_method_std: 0.02
+ embedding_dropout: 0.1
+ share_token_embeddings: true
+ share_decoder_tokens_head_embeddings: false
+ tokens_head_bias: false
+ native_amp_init_scale: 4294967296
+ native_amp_growth_interval: 1000
+ fp16_lm_cross_entropy: false
+ seed: 1234
+ use_cpu_initialization: false
+ apex_transformer_log_level: 30
+ data:
+   data_prefix: null
+   index_mapping_dir: null
+   data_impl: mmap
+   splits_string: 949,45,5
+   seq_length: 512
+   seq_length_dec: 128
+   skip_warmup: true
+   num_workers: 0
+   dataloader_type: single
+   masked_lm_prob: 0.15
+   dataset_type: t5
+   short_seq_prob: 0.0
+   max_ngram_size: 10
+   mean_ngram_size: null
+   geometric_dist: true
+   permutation: false
+   whole_word_masking: false
+   favor_longer_ngrams: false
+   respect_document_boundaries: true
+ optim:
+   name: fused_adam
+   lr: 0.0001
+   betas:
+     - 0.9
+     - 0.999
+   eps: 1.0e-08
+   weight_decay: 0.01
+   sched:
+     name: WarmupAnnealing
+     min_lr: 1.0e-05
+     last_epoch: -1
+     warmup_ratio: 0.01
+ precision: bf16
+ target: nemo.collections.nlp.models.language_modeling.megatron_t5_model.MegatronT5Model
+ nemo_version: 1.11.0rc0
+ library: huggingface-t5v1_1 # options ['huggingface-t5v1_1', 'nemo-megatron']
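
For reference, a small sketch of loading the base yaml above with OmegaConf (the config library Nemo uses) and inspecting a few of the fields the conversion script edits; the field names are taken directly from the file above, and the edited value is only illustrative:

```python
from omegaconf import OmegaConf

# Load the base config shipped in this repo and inspect a few fields.
cfg = OmegaConf.load("hf_t5v1_1_base_config.yaml")
print(cfg.encoder.num_layers, cfg.encoder.megatron_legacy, cfg.precision)

# Illustrative edit before packaging into a .nemo file.
cfg.encoder.num_layers = 36
print(OmegaConf.to_yaml(cfg.encoder))
```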
nemo_checkpoints/megatron_ul2--val_loss=2.54-step=7000-consumed_samples=14557920.0.ckpt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:cbce1edc1a23b6f3db975f8bf876ca9d32a3d86a0018a594fd96bfaffbcbf261
+ size 7730365530
nemo_checkpoints/megatron_ul2--val_loss=6.59-step=150-consumed_samples=309920.0-last.ckpt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:fbeabc36b95c4a017afbfae0baeefbca0a5bf3445a24bd6706c9ae7b22327df0
+ size 7730362572
nemo_config/ul2-base-nl36/megatron.ul2-base-nl36.unigram-64k-pretok-small_data.all-clean.config.yaml ADDED
@@ -0,0 +1,195 @@
+ defaults:
+   - .@model.encoder: megatron_model_ul2base_config
+   - .@model.decoder: megatron_model_ul2base_config
+
+ name: megatron_ul2
+ restore_from_path: null # used when starting from a .nemo file
+
+ trainer:
+   devices: 1
+   num_nodes: 1
+   accelerator: gpu
+   precision: 16
+   logger: False # logger provided by exp_manager
+   enable_checkpointing: False
+   replace_sampler_ddp: False
+   max_epochs: -1 # PTL default. In practice, max_steps will be reached first.
+   max_steps: 524288 # consumed_samples = global_step * micro_batch_size * data_parallel_size * accumulate_grad_batches
+   log_every_n_steps: 100
+   val_check_interval: 1000
+   limit_val_batches: 30
+   limit_test_batches: 500
+   accumulate_grad_batches: 1
+   gradient_clip_val: 1.0
+
+ exp_manager:
+   explicit_log_dir: null
+   exp_dir: /project/scratch/p200097/nemo_experiments/
+   name: megatron.ul2-base-nl36.unigram-64k-pretok-small_data.all-clean
+   create_wandb_logger: False
+   wandb_logger_kwargs:
+     project: null
+     name: null
+   resume_if_exists: True
+   resume_ignore_no_checkpoint: True
+   create_checkpoint_callback: True
+   checkpoint_callback_params:
+     monitor: val_loss
+     save_top_k: 10
+     mode: min
+     always_save_nemo: False # saves nemo file during validation, not implemented for model parallel
+     filename: '${name}--{val_loss:.2f}-{step}-{consumed_samples}'
+     model_parallel_size: ${multiply:${model.tensor_model_parallel_size}, ${model.pipeline_model_parallel_size}}
+
+ model:
+   # model parallelism
+   micro_batch_size: 10
+   # 4 GPUS * 24 nodes = 96 GPUS
+   # 96 GPUS * 7 micro_batch_size = 672 batch_size
+   # 672 * 3 = 2016 global_batch_size
+   global_batch_size: 2080 # will use more micro batches to reach global batch size
+   tensor_model_parallel_size: 1
+   pipeline_model_parallel_size: 1
+   resume_from_checkpoint: null # manually set the checkpoint file to load from
+   pipeline_model_parallel_split_rank: 0 # rank at which decoder starts.
+
+   # model architecture
+   make_vocab_size_divisible_by: 128 # Pad the vocab size to be divisible by this value for computation efficiency.
+
+   megatron_amp_O2: False # use AMP with O2 style mixed precision instead of native amp on-the-fly weight autocasting.
+   grad_allreduce_chunk_size_mb: 125
+   grad_div_ar_fusion: True # Fuse grad division into torch.distributed.all_reduce
+   gradient_as_bucket_view: True # Allocate gradients in a contiguous bucket to save memory (less fragmentation and buffer memory)
+
+   seq_length: 512
+   max_position_embeddings: ${.seq_length}
+
+
+   tokenizer:
+     library: 'huggingface'
+     type: 'KBLab/unigram-64k-pretok-small_data-tokenizer'
+     model: null
+     vocab_file: null
+     merge_file: null
+     num_sentinel_tokens: 256
+     sentencepiece_legacy: True # Legacy=True allows you to add special tokens to sentencepiece tokenizers.
+
+   # tokenizer:
+   #   library: 'megatron'
+   #   type: 'BertWordPieceCase'
+   #   model: null
+   #   vocab_file: null
+   #   merge_file: null
+   #   num_sentinel_tokens: 100
+   #   sentencepiece_legacy: True # Legacy=True allows you to add special tokens to sentencepiece tokenizers.
+
+   # weight init
+   embedding_init_method_std: 0.02 # Standard deviation of the zero mean normal distribution used for weight initialization.
+
+   # embedding dropout
+   embedding_dropout: 0.1
+
+   # embedding sharing
+   share_token_embeddings: True # If True share encoder/decoder embeddings
+   share_decoder_tokens_head_embeddings: True # If True share decoder embeddings and decoder projection to logits
+
+   # token head
+   tokens_head_bias: False
+
+   # precision
+   native_amp_init_scale: 4294967296 # 2 ** 32
+   native_amp_growth_interval: 1000
+   fp16_lm_cross_entropy: False # Move the cross entropy unreduced loss calculation for lm head to fp16
+
+   # miscellaneous
+   seed: 1234
+   use_cpu_initialization: False # Init weights on the CPU (slow for large models)
+   apex_transformer_log_level: 30 # Python logging level displays logs with severity greater than or equal to this
+
+   data:
+     # Path to data must be specified by the user.
+     # Can be overridden from the CLI: "model.data.data_prefix=[.5,/raid/data/pile/my-t5_00_text_document,.5,/raid/data/pile/my-t5_01_text_document]",
+     # Or see example below:
+     # data_prefix:
+     #   - .5
+     #   - /raid/data/pile/my-t5_00_text_document
+     #   - .5
+     #   - /raid/data/pile/my-t5_01_text_document
+     data_prefix:
+       - 0.005
+       - /project/scratch/p200097/data/unigram-64k-pretok-small_data/wikipedia-unigram-64k-pretok-small_data_text_sentence
+       - 0.035
+       - /project/scratch/p200097/data/unigram-64k-pretok-small_data/edepos_html-unigram-64k-pretok-small_data_text_sentence
+       - 0.030
+       - /project/scratch/p200097/data/unigram-64k-pretok-small_data/oscar-unigram-64k-pretok-small_data_text_sentence
+       - 0.105
+       - /project/scratch/p200097/data/unigram-64k-pretok-small_data/kw3-2017-unigram-64k-pretok-small_data_text_sentence
+       - 0.177
+       - /project/scratch/p200097/data/unigram-64k-pretok-small_data/issues-unigram-64k-pretok-small_data_text_sentence
+       - 0.648
+       - /project/scratch/p200097/data/unigram-64k-pretok-small_data/mc4-unigram-64k-pretok-small_data_text_sentence
+     index_mapping_dir: /project/scratch/p200097/data/unigram-64k-pretok-small_data/npy_files_ul2/ # path to save index mapping .npy files, by default will save in the same location as data_prefix
+     data_impl: mmap
+     # data_impl_kwargs: # currently used only for text_mmap, csv_mmap (should be data_impl dependent)
+     #   # defaults for text_memmap
+     #   newline_int: 10 # byte-value of newline (Use ord('\n') to get value)
+     #   header_lines: 0 # skip first N header lines
+     #   workers: null # number of workers when creating missing index files (null defaults to cpu_num // 2)
+     #   sort_dataset_paths: False # if True datasets will be sorted by name
+     #   # defaults for csv_memmap
+     #   newline_int: 10 # byte-value of newline
+     #   header_lines: 1 # skip first N header lines
+     #   workers: null # number of workers when creating missing index files (null defaults to cpu_num // 2)
+     #   sort_dataset_paths: False # if True datasets will be sorted by name
+     #   data_col: 1 # column to use for data
+     #   data_sep: ',' # string to split text into columns
+     splits_string: 996,2,2
+     seq_length: ${model.seq_length}
+     seq_length_dec: ${model.seq_length}
+     skip_warmup: True
+     num_workers: 32
+     dataloader_type: single # cyclic
+     masked_lm_prob: 0.15
+     extreme_masked_lm_prob: 0.5
+     dataset_type: 'ul2'
+     short_seq_prob: 0.0
+     max_ngram_size: 10
+     extreme_max_ngram_size: 128
+     extreme_min_ngram_size: 32
+     extreme_mean_ngram_size: 64
+     ngram_span_length_distribution: 'geometric'
+     extreme_ngram_span_length_distribution: 'truncated_normal'
+     prefix_lm_pivot_mean: 0.25
+     mean_ngram_size: 3
+     permutation: False
+     whole_word_masking: True
+     favor_longer_ngrams: False
+     respect_document_boundaries: True # If true, a single training example cannot cross document boundaries, increasing the fraction of <pad> tokens within a batch.
+
+   optim:
+     name: fused_adam
+     lr: 0.001
+     weight_decay: 0.01
+     betas:
+       - 0.9
+       - 0.999
+     eps: 1e-8
+     sched:
+       name: CosineAnnealing
+       warmup_steps: 1600
+       constant_steps: 30000 #40000
+       min_lr: 5e-6
+
+   # optim:
+   #   name: fused_adam
+   #   lr: 0.0001
+   #   betas:
+   #     - 0.9
+   #     - 0.999
+   #   eps: 1e-8
+   #   weight_decay: 0.01
+   #   sched:
+   #     name: WarmupAnnealing
+   #     min_lr: 0.00001
+   #     last_epoch: -1
+   #     warmup_ratio: 0.005
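
As a rough sanity check, the consumed sample counts in the checkpoint filenames line up with the global batch size configured above (see the comment on `max_steps`):

```python
# consumed_samples ~= global_step * global_batch_size; data parallelism and
# gradient accumulation are already folded into global_batch_size here.
global_batch_size = 2080
print(14557920 / global_batch_size)  # 6999.0 -> the step=7000 checkpoint
print(309920 / global_batch_size)    # 149.0  -> the step=150 checkpoint
```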
nemo_config/ul2-base-nl36/megatron_model_ul2base_config.yaml ADDED
@@ -0,0 +1,40 @@
+ num_layers: 36 # For perceiver models, this is the number of cross-attention blocks. Each layer has 1 cross-attention and "num_self_attention_per_cross_attention" self-attention layers.
+ hidden_size: 768
+ ffn_hidden_size: 2048 # Transformer FFN hidden size. Usually 4 * hidden_size. Since we use Swiglu, which uses an extra projection weight matrix, we use 2/3 * 4 * hidden_size (see https://arxiv.org/abs/2002.05202)
+ num_attention_heads: 12
+ init_method_std: 0.02 # Standard deviation of the zero mean normal distribution used for weight initialization.
+ hidden_dropout: 0.0 # Dropout probability for hidden state transformer. "Dropout is set to 0 during pretraining" - UL2 paper
+ attention_dropout: 0.0 # Dropout probability in the attention layer. "Dropout is set to 0 during pretraining" - UL2 paper
+ ffn_dropout: 0.0 # Dropout probability in the feed-forward layer. "Dropout is set to 0 during pretraining" - UL2 paper
+ position_embedding_type: 'relative' # Position embedding type. Options ['learned_absolute', 'relative', 'alibi']
+ relative_attention_num_buckets: 32 # Relative position number of buckets for computing the bias
+ relative_attention_max_distance: 128 # max_distance to keep relative distance in the attention_num_buckets.
+ relative_position_bias_self_attention_only: True # whether to use relative position bias for self-attention only.
+ kv_channels: null # Projection weights dimension in multi-head attention. Set to hidden_size // num_attention_heads if null
+ apply_query_key_layer_scaling: True # scale Q * K^T by 1 / layer-number.
+ layernorm_epsilon: 1e-5
+ persist_layer_norm: True # Use of persistent fused layer norm kernel.
+ bias_activation_fusion: False # Use a kernel that fuses the bias addition from weight matrices with the subsequent activation function.
+ grad_div_ar_fusion: True # Fuse grad division into torch.distributed.all_reduce
+ masked_softmax_fusion: True # Use a kernel that fuses the attention softmax with its mask.
+ bias_dropout_add_fusion: False # Use a kernel that fuses the bias addition, dropout and residual connection addition.
+ bias: False # Whether to use bias terms in all weight matrices.
+ normalization: 'rmsnorm' # Normalization layer to use. Options are 'layernorm', 'rmsnorm'
+ arch: 'transformer' # Options: ['transformer', 'perceiver']
+ activation: 'swiglu' # Options ['gelu', 'geglu', 'swiglu', 'reglu', 'squared-relu', 'fast-geglu', 'fast-swiglu', 'fast-reglu']
+ headscale: False # Whether to learn extra parameters that scale the output of each self-attention head.
+ transformer_block_type: 'pre_ln' # Options ['pre_ln', 'post_ln', 'normformer']
+ hidden_steps: 32 # Number of latent vectors to use for perceiver encoders
+ num_self_attention_per_cross_attention: 1 # Number of self-attention layers for every cross-attention layer.
+ openai_gelu: False # Use OpenAI's GELU instead of the default GeLU
+ onnx_safe: False # Use work-arounds for known problems with Torch ONNX exporter.
+ fp32_residual_connection: False # Use FP32 for residual connections.
+ activations_checkpoint_method: null # 'uniform', 'block'
+ activations_checkpoint_num_layers: 1
+ activations_checkpoint_granularity: null # SELECTIVE: https://github.com/NVIDIA/NeMo/pull/4380
+ megatron_legacy: False # Whether to use the legacy Megatron model. This affects the way q,k,v is partitioned from the mixed q,k,v layer in ParallelAttention. This needs to be True for models converted from HF.
+ normalize_attention_scores: True # Whether to scale the output Q * K^T by 1 / sqrt(hidden_size_per_head). This arg is provided as a configuration option mostly for compatibility with models that have been weight-converted from HF. You almost always want to set this to True.
+ num_moe_experts: 1 # When >1, FFNs are changed to MoE layers
+ moe_frequency: 1 # every Nth ffn layer will be made MoE
+ moe_dropout: 0.0 # Dropout value for MoE layers
+ # https://github.com/NVIDIA/NeMo/blob/main/scripts/nlp_language_modeling/hf_t5v1_1_base_config.yaml
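
The `ffn_hidden_size` comment above can be checked directly: with a gated activation such as SwiGLU, the usual 4 * hidden_size feed-forward width is scaled by 2/3 to keep the parameter count comparable, and `kv_channels: null` falls back to hidden_size // num_attention_heads:

```python
# Quick check of the derived sizes in the config above.
hidden_size = 768
num_attention_heads = 12
print(int(2 / 3 * 4 * hidden_size))        # 2048 -> ffn_hidden_size
print(hidden_size // num_attention_heads)  # 64   -> kv_channels when left as null
```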
nemo_singularity.def ADDED
@@ -0,0 +1,11 @@
+ BootStrap: docker
+ From: nvcr.io/nvidia/nemo:23.02
+
+ %environment
+     export LC_ALL=C
+
+ %post
+     cd /usr/local/lib/python3.8/dist-packages/nemo/collections/nlp/data/language_modeling/megatron
+     make
+     pip install accelerate
+
test_ul2_hf.py ADDED
@@ -0,0 +1,62 @@
+ import torch
+ from transformers import AutoTokenizer, T5ForConditionalGeneration, T5Tokenizer
+ from nemo.collections.nlp.models.language_modeling.megatron_t5_model import MegatronT5Model
+ from nemo.collections.nlp.data.language_modeling.megatron.ul2_dataset import UL2Dataset
+ from pytorch_lightning.trainer.trainer import Trainer
+
+
+ def load_nemo_megatron_model(checkpoint_path, devices=1, num_nodes=1, accelerator="gpu"):
+     trainer = Trainer(devices=devices, num_nodes=num_nodes, accelerator=accelerator)
+     model = MegatronT5Model.load_from_checkpoint(checkpoint_path, trainer=trainer)
+
+     return model
+
+
+ #### Huggingface ####
+ tokenizer = AutoTokenizer.from_pretrained("ul2-base-nl36-swedish")
+ model = T5ForConditionalGeneration.from_pretrained("ul2-base-nl36-swedish")
+
+ # "Hunden bet mannen i" means "The dog bit the man in".
+ input_ids = tokenizer(
+     "<extra_id_r> Hunden bet mannen i <extra_id_0>", return_tensors="pt", return_token_type_ids=False
+ )
+ # Predict with HF
+ with torch.no_grad():
+     outputs_hf = model(
+         input_ids=input_ids.input_ids,
+         attention_mask=input_ids.attention_mask,
+         decoder_input_ids=input_ids.input_ids,
+         decoder_attention_mask=input_ids.attention_mask,
+     )
+
+
+ # Argmax to get the most probable token id
+ output_tokens_hf = outputs_hf[0].argmax(dim=-1)
+
+ #### Nemo ####
+ model_nemo = load_nemo_megatron_model("nemo_checkpoints/megatron_ul2--val_loss=2.54-step=7000-consumed_samples=14557920.0.ckpt")
+ model_nemo.eval()
+
+ tokenizer_nemo = model_nemo.tokenizer.tokenizer
+ input_ids_nemo = tokenizer_nemo("<extra_id_r> Hunden bet mannen i <extra_id_0>", return_tensors="pt").to("cuda")
+
+ # Predict with Nemo
+ with torch.no_grad():
+     outputs_nemo = model_nemo(
+         encoder_input_ids=input_ids_nemo.input_ids,
+         decoder_input_ids=input_ids_nemo.input_ids,
+         encoder_attn_mask=input_ids_nemo.attention_mask,
+         decoder_attn_mask=input_ids_nemo.attention_mask,
+     )
+ # Argmax to get the most probable token
+ output_tokens = outputs_nemo.argmax(dim=-1)
+
+
+ #### Compare both outputs ####
+ print(f"Nemo logits: {outputs_nemo[0]}")
+ print(f"Huggingface logits: {outputs_hf[0]}")
+ print(f"Are logits equal: {torch.allclose(outputs_nemo[0], outputs_hf[0].to('cuda'))}")
+
+ # Decode tokens
+ print(f"Huggingface output: {tokenizer.batch_decode(output_tokens_hf)}")
+ print(f"Nemo output: {tokenizer_nemo.batch_decode(output_tokens)}")  # Reasonable output for undertrained model