What to put in CHECKPOINT_PATH?

#8
by zero1zero

When specifying my checkpoint as meditron, what should I be putting in the associated directory, as defined here (I've replaced the paths with my own):

CHECKPOINTS = {
    ("pmc", 7): "/pure-mlo-scratch/alhernan/megatron-data/checkpoints/llamaPMC-7b-tp4-pp1",
    ("baseline", 7): "/pure-mlo-scratch/alhernan/megatron-data/checkpoints/llama2-7b-tp4-pp1",
    ("baseline", 70): "/pure-mlo-scratch/alhernan/megatron-data/checkpoints/llama2-70b-tp8-pp8",
    ("meditron", 7): "/pure-mlo-scratch/trial-runs/meditron-7b/checkpoints/llama2-7b-tp4-pp1",
    ("meditron", 70): "/pure-mlo-scratch/trial-runs/meditron-70b/checkpoints/llama2-70b-tp8-pp8"
}
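
For reference, my assumption is that these are Megatron-format checkpoint directories rather than Hugging Face model directories, i.e. something along these lines (layout based on my understanding of the standard Megatron checkpoint format, so possibly not exact):

llama2-7b-tp4-pp1/
    latest_checkpointed_iteration.txt
    release/  (or iter_NNNNNNN/)
        mp_rank_00/model_optim_rng.pt
        mp_rank_01/model_optim_rng.pt
        mp_rank_02/model_optim_rng.pt
        mp_rank_03/model_optim_rng.pt

Is that the expected layout?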

For context, here are the errors I'm running into, which I'm guessing are related to my checkpoint format. Everything appears to run correctly, without errors, up until this point:

Done! Now finalizing.
Status path: /workspace/summarization/data/scratch/alhernan/megatron-data/checkpoints/instructed/llama-2-7b-tp4-pp1-meditron-summarization-seq4096/.status.txt
Trying to infer the number of documents in the dataset

Settings:
RANK=0
ADDR=localhost
N_NODES=1
DATA_ARGS=--train_data_path /workspace/data/scratch/zechen/meditron/benchmarks/ft_preprocessed/tokenized/summarization/summarization --valid_data_path /workspace/data/scratch/zechen/meditron/benchmarks/ft_preprocessed/tokenized/summarization-val/summarization-val
CHECKPOINT_PATH=/workspace/data/scratch/trial-runs/meditron-7b/checkpoints/llama2-7b-tp4-pp1
TRAINED_PATH=/workspace/data/scratch/alhernan/megatron-data/checkpoints/instructed/llama-2-7b-tp4-pp1-meditron-summarization-seq4096
MODEL=llama2
TP=4
PP=1
MICRO_BATCH=1
GLOBAL_BATCH=64
INSTRUCT=1
COMMON_ARGS=--use_flash_attn --no_bias_gelu_fusion --seq_length 4096 --max_position_embeddings 4096 --log_interval 1 --save_interval 200 --eval_interval 100 --eval_iters 10 --hidden_dropout 0.0 --position_embedding_type rotary --no_bias_dropout_fusion --use_checkpoint_args --attention_dropout 0.0 --adam_beta1 0.9 --adam_beta2 0.95 --adam_eps 1e-5 --lr_decay_style cosine --lr_warmup_fraction 0.1 --lr 2e-5 --min_lr 2e-6 --weight_decay 0.1 --sequence_parallel --recompute_granularity selective --log_timers_to_tensorboard --scalar_loss_mask=0.0 --rope_scaling_factor 1.0 --variable_seq_lengths --data_type instruction --metrics all --finetune --train_iters 1000
EXTRA_ARGS=--vocab_file=/workspace/data/scratch/llama/tokenizer.model --use_rms_norm --glu_activation swiglu --no_tie_embed_logits --vocab_extra_ids_list [bib_ref],[/bib_ref],[fig_ref],[/fig_ref],[bib],[/bib],[fig],[/fig],[table],[/table],[formula],[/formula],<|im_start|>,<|im_end|> --vocab_file=/workspace/data/scratch/llama2/Llama-2-7b-hf/tokenizer.model --layernorm_epsilon 1e-5

Checkpoint not found to provide arguments, using provided arguments.
Traceback (most recent call last):
  File "/workspace/Megatron-LLM/finetune.py", line 259, in <module>
    initialize_megatron(extra_args, args_defaults)
  File "/workspace/data/Megatron-LLM/megatron/initialize.py", line 45, in initialize_megatron
    megatron.arguments.validate_args(args, args_defaults)
  File "/workspace/data/Megatron-LLM/megatron/arguments.py", line 221, in validate_args
    assert args.encoder_num_layers is not None, \
AssertionError: either num_layers or encoder_num_layers should be specified
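
For what it's worth, COMMON_ARGS above includes --use_checkpoint_args, so my reading is that num_layers and the other model hyperparameters are supposed to be read from the checkpoint itself; the "Checkpoint not found to provide arguments" line suggests nothing Megatron-readable is being picked up at CHECKPOINT_PATH, which would leave num_layers unset and trip the assertion. Listing the directory should show whether it holds Megatron mp_rank_* shards or a Hugging Face export (config.json, *.bin / *.safetensors):

ls /workspace/data/scratch/trial-runs/meditron-7b/checkpoints/llama2-7b-tp4-pp1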

My initial run parameters:

!python `pwd`/data/meditron/finetuning/sft.py \
    --checkpoint=meditron \
    --size=7 \
    --run_name=summarization \
    --data data/train.jsonl \
    --val data/test.jsonl \
    --micro_batch=1 \
    --nodes=1 \
    --save_interval=200 \
    --pp=1 \
    --seq 4096

Was able to track down my issue here. I was using HF checkpoints, and it sounds like I need the Megatron-specific version of the weights. Documentation is here: https://epfllm.github.io/Megatron-LLM/guide/weights_conversion.html. I haven't gone through these steps yet, but this is almost certainly my issue.
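
In case it helps anyone else, the rough shape of the fix as I understand it is to first convert the Hugging Face weights to an unsharded Megatron checkpoint, then reshard them to the tensor/pipeline-parallel layout the run expects (tp4/pp1 for the 7B config above). The exact script names and flags may differ from what's below (I'm going from the guide's outline, not a run I've completed), and the paths are placeholders, so double-check against the linked documentation:

# 1) HF -> Megatron conversion (unsharded)
python weights_conversion/hf_to_megatron.py llama2 --size=7 \
    --out=/path/to/megatron/meditron-7b-unsharded \
    --cache-dir=/path/to/hf/meditron-7b

# 2) Reshard to match the CHECKPOINTS entry, e.g. tp4-pp1 for the 7B run
python tools/checkpoint_util.py --model_type llama2 \
    --load_dir=/path/to/megatron/meditron-7b-unsharded \
    --save_dir=/path/to/megatron/llama2-7b-tp4-pp1 \
    --target_tensor_parallel_size 4 \
    --target_pipeline_parallel_size 1

After that, the ("meditron", 7) entry in CHECKPOINTS would point at the resharded directory.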
