Missing model_weights/model.rm_head._extra_state
#1 opened by keminglu
Thanks for sharing such a great reward model!
I am trying to reproduce the results on RewardBench and serve this checkpoint using the official Docker image built from the Dockerfile, which has the following package versions:
nemo_aligner 0.4.0.dev0
nemo_toolkit 2.0.0rc0
megatron_core 0.7.0
transformer-engine 1.7.0.dev0+a51ff54
transformers 4.40.2
However, I ran into the following error about a missing checkpoint shard:
load
    sharded_objects, sharded_state_dict = load_sharded_objects(
  File "/home/data/lukeming.lkm/NeMo-Aligner/build_env/Megatron-LM/megatron/core/dist_checkpointing/serialization.py", line 227, in load_sharded_objects
    return dict_list_map_inplace(load_sharded_object, sharded_objects), sharded_state_dict
  File "/home/data/lukeming.lkm/NeMo-Aligner/build_env/Megatron-LM/megatron/core/dist_checkpointing/dict_utils.py", line 180, in dict_list_map_inplace
    x[k] = dict_list_map_inplace(f, v)
  File "/home/data/lukeming.lkm/NeMo-Aligner/build_env/Megatron-LM/megatron/core/dist_checkpointing/dict_utils.py", line 180, in dict_list_map_inplace
    x[k] = dict_list_map_inplace(f, v)
  File "/home/data/lukeming.lkm/NeMo-Aligner/build_env/Megatron-LM/megatron/core/dist_checkpointing/dict_utils.py", line 184, in dict_list_map_inplace
    return f(x)
  File "/home/data/lukeming.lkm/NeMo-Aligner/build_env/Megatron-LM/megatron/core/dist_checkpointing/serialization.py", line 224, in load_sharded_object
    raise CheckpointingException(err_msg) from e
megatron.core.dist_checkpointing.core.CheckpointingException: Object shard /cpfs_01/296bu4pgyxye1ubw3fj/data/shared/Group-m6/lukeming.lkm/ckpts/public/Nemotron-4-340B-Reward/model_weights/model.rm_head._extra_state/shard_0_1.pt not found
It looks like there is no model.rm_head._extra_state part in the checkpoint.
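For anyone who wants to confirm this on their own download, here is a rough sketch of how the check could be done. It assumes only what the error path above suggests: that each object lives in a sub-directory of model_weights named after its key, and that the file being looked for is shard_0_1.pt. The function names are made up for illustration and are not part of Megatron-LM or NeMo-Aligner.

```python
from pathlib import Path


def extra_state_entries(model_weights_dir):
    """Return the sorted names of sub-directories ending in '._extra_state'."""
    root = Path(model_weights_dir)
    return sorted(
        p.name
        for p in root.iterdir()
        if p.is_dir() and p.name.endswith("._extra_state")
    )


def shard_exists(model_weights_dir, key, shard="shard_0_1.pt"):
    """Check for <model_weights_dir>/<key>/<shard>.

    The default shard name is taken from the error message above; other
    checkpoints may use different shard file names (assumption).
    """
    return (Path(model_weights_dir) / key / shard).is_file()
```

Calling shard_exists on the local model_weights directory with the key "model.rm_head._extra_state" should return False if the shard is absent, and extra_state_entries shows which _extra_state entries the checkpoint does contain.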
Hi, the issue was caused by the wrong container, which we initially pointed to. We have fixed the model card so that it now gives the path to the correct container. Please try it and let us know if you have further questions.
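Since the cause was a container mismatch, it may help to print the package versions inside whichever container is in use and compare them with the list in the original post. The sketch below uses only the standard library. The distribution names are copied from that list and are assumed to match how the packages are registered in the environment. Which versions the corrected container ships is not stated in this thread, so the output is only a starting point for comparison.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names copied from the version list in the original post.
PACKAGES = (
    "nemo_aligner",
    "nemo_toolkit",
    "megatron_core",
    "transformer-engine",
    "transformers",
)


def installed_versions(names=PACKAGES):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # Not installed, or registered under a different name.
            found[name] = None
    return found
```

Printing the result of installed_versions() in the old and the new container side by side should make any difference in the toolchain visible.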
keminglu changed discussion status to closed