Layer name mismatch between global_step checkpoint and HF model
Hi! I'm running into issues when exporting a ZeRO checkpoint to fp16 and loading it through the HF interface.
I started finetuning the model from these optimizer states, then converted the final checkpoint with the zero_to_fp32.py
script that Megatron-DeepSpeed creates (with some modifications, e.g. I had to rename the layers from n.something.something
to h.n.something.something
, which are the layer names in the HF-loadable model).
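For reference, the renaming can be done with something like the minimal sketch below (the checkpoint file name and the regex are assumptions on my part; the idea is just that keys of the form n.something get the h. prefix, as described above):

```python
# Minimal sketch of the key renaming (file name is an assumption):
# keys like "3.input_layernorm.weight" become "h.3.input_layernorm.weight",
# which is what the HF BloomForCausalLM state dict expects for the blocks.
import re
import torch

state_dict = torch.load("pytorch_model.bin", map_location="cpu")

renamed = {}
for key, tensor in state_dict.items():
    # prefix keys that start with a block index, e.g. "3.self_attention.dense.weight"
    if re.match(r"^\d+\.", key):
        renamed[f"h.{key}"] = tensor
    else:
        renamed[key] = tensor

torch.save(renamed, "pytorch_model.bin")
```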
However, even after that, there is still a mismatch between some layer names in the checkpoint and the HF model. Here's an example of the warnings I get when I try to load the finetuned model in HF:
>>> from transformers import AutoModelForCausalLM
>>> model = AutoModelForCausalLM.from_pretrained("1b1/")
Some weights of the model checkpoint at 1b1/ were not used when initializing BloomForCausalLM: ['h.25.mlp.dense_4h_to_h.bias', 'h.26.post_attention_layernorm.weight', 'h.25.mlp.dense_h_to_4h.bias', 'h.25.self_attention.dense.weight', 'h.25.post_attention_layernorm.weight', 'h.28.weight', 'h.25.self_attention.dense.bias', 'h.26.mlp.dense_h_to_4h.bias', 'h.24.input_layernorm.bias', 'h.24.self_attention.query_key_value.bias', 'h.26.post_attention_layernorm.bias', 'h.25.post_attention_layernorm.bias', 'h.25.self_attention.query_key_value.bias', 'h.28.bias', 'h.26.input_layernorm.bias', 'h.24.post_attention_layernorm.weight', 'h.26.input_layernorm.weight', 'h.24.mlp.dense_4h_to_h.bias', 'h.24.mlp.dense_h_to_4h.bias', 'h.26.self_attention.query_key_value.weight', 'h.25.input_layernorm.bias', 'h.tied_modules.embed.word_embeddings.norm.weight', 'h.25.mlp.dense_4h_to_h.weight', 'h.26.self_attention.dense.bias', 'h.24.self_attention.dense.bias', 'h.26.self_attention.query_key_value.bias', 'h.24.self_attention.query_key_value.weight', 'h.25.self_attention.query_key_value.weight', 'h.24.mlp.dense_4h_to_h.weight', 'h.24.post_attention_layernorm.bias', 'h.25.mlp.dense_h_to_4h.weight', 'h.24.mlp.dense_h_to_4h.weight', 'h.26.self_attention.dense.weight', 'h.26.mlp.dense_h_to_4h.weight', 'h.26.mlp.dense_4h_to_h.weight', 'h.tied_modules.embed.word_embeddings.norm.bias', 'h.24.input_layernorm.weight', 'h.25.input_layernorm.weight', 'h.26.mlp.dense_4h_to_h.bias', 'h.24.self_attention.dense.weight', 'h.tied_modules.embed.word_embeddings.weight']
- This IS expected if you are initializing BloomForCausalLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BloomForCausalLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BloomForCausalLM were not initialized from the model checkpoint at 1b1/ and are newly initialized: ['h.2.mlp.dense_4h_to_h.weight', 'h.1.post_attention_layernorm.weight', 'h.2.post_attention_layernorm.weight', 'h.0.input_layernorm.bias', 'h.0.self_attention.query_key_value.weight', 'h.0.self_attention.query_key_value.bias', 'h.0.post_attention_layernorm.weight', 'h.1.post_attention_layernorm.bias', 'h.1.mlp.dense_h_to_4h.weight', 'word_embeddings_layernorm.weight', 'h.1.self_attention.query_key_value.weight', 'h.1.self_attention.dense.weight', 'h.2.input_layernorm.weight', 'h.1.self_attention.dense.bias', 'h.2.post_attention_layernorm.bias', 'h.0.input_layernorm.weight', 'h.2.self_attention.dense.bias', 'h.2.mlp.dense_h_to_4h.bias', 'h.0.mlp.dense_h_to_4h.bias', 'h.0.self_attention.dense.bias', 'h.1.input_layernorm.weight', 'h.1.input_layernorm.bias', 'h.2.self_attention.dense.weight', 'word_embeddings_layernorm.bias', 'h.0.self_attention.dense.weight', 'h.0.mlp.dense_4h_to_h.bias', 'h.1.self_attention.query_key_value.bias', 'h.0.mlp.dense_h_to_4h.weight', 'word_embeddings.weight', 'h.2.input_layernorm.bias', 'h.1.mlp.dense_h_to_4h.bias', 'h.0.mlp.dense_4h_to_h.weight', 'h.2.mlp.dense_4h_to_h.bias', 'h.2.self_attention.query_key_value.bias', 'h.1.mlp.dense_4h_to_h.weight', 'h.2.mlp.dense_h_to_4h.weight', 'h.2.self_attention.query_key_value.weight', 'ln_f.bias', 'h.0.post_attention_layernorm.bias', 'ln_f.weight', 'h.1.mlp.dense_4h_to_h.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
In practice, the model's performance also seems to be severely degraded. Do you have any idea why this is happening and why the checkpoint ends up with different layer names than the model available on HF?
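For completeness, this is how I listed exactly which parameter names differ between the checkpoint and the HF model (a quick sketch; it assumes the converted weights live in 1b1/pytorch_model.bin):

```python
# Compare checkpoint keys against the keys the HF model actually expects
# (the path 1b1/pytorch_model.bin is an assumption about where the converted weights are).
import torch
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("1b1/")
model = AutoModelForCausalLM.from_config(config)  # randomly initialized, only used for its key names

ckpt_keys = set(torch.load("1b1/pytorch_model.bin", map_location="cpu").keys())
model_keys = set(model.state_dict().keys())

print("in checkpoint but not in model:", sorted(ckpt_keys - model_keys))
print("in model but not in checkpoint:", sorted(model_keys - ckpt_keys))
```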
Can you try this conversion script instead: https://github.com/huggingface/transformers/blob/main/src/transformers/models/bloom/convert_bloom_original_checkpoint_to_pytorch.py
🧐
This script worked perfectly! Thanks for the quick fix.
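For anyone landing here later, a quick sanity check after running the converter (the output directory is a placeholder, and I'm assuming the base model is bloom-1b1 going by the 1b1/ directory name): loading should produce none of the "weights not used" / "newly initialized" warnings, and generation should look reasonable.

```python
# Sanity check after conversion (paths/model id are assumptions).
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("converted-model/")
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-1b1")

inputs = tokenizer("The capital of France is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```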