Save and restore from checkpoint

Opened by sanchit-gandhi

Question from @Martha-987:

How can I save and restore training checkpoints when the connection is lost in Google Colab? Colab disconnects the runtime before I finish my training... I want to save my training progress and resume it in another Colab session.

Did you set push_to_hub=True when training the first time? If so, you should be able to find your intermediate model checkpoints on the Hub.
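For reference, here is a minimal hedged sketch of the relevant arguments (the output_dir and step counts are placeholders, not the actual values used in this thread):

```python
from transformers import Seq2SeqTrainingArguments

# Minimal sketch: push_to_hub=True uploads intermediate checkpoints to the Hub
# under your namespace, so they survive a Colab disconnect.
training_args = Seq2SeqTrainingArguments(
    output_dir="./my-whisper-model",   # placeholder repo name
    push_to_hub=True,
    evaluation_strategy="steps",
    save_steps=500,                    # a checkpoint is saved every save_steps
    eval_steps=500,
    max_steps=4000,
)
```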

Let's say they got pushed to "Martha-987/my-whisper-model" and the last checkpoint saved was "checkpoint-2000". You can resume training from your last checkpoint by loading the model from it:

model = WhisperForConditionalGeneration.from_pretrained("Martha-987/my-whisper-model/checkpoint-2000")

When you launch training, simply set:

trainer.train(resume_from_checkpoint=True)
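Putting the two steps together, here is a hedged sketch of the full resume flow; the trainer setup mirrors the fine-tuning notebook, and names like training_args, common_voice, data_collator, compute_metrics and processor are assumed to be defined as in the first session:

```python
from transformers import WhisperForConditionalGeneration, Seq2SeqTrainer

# Load the last checkpoint that was pushed to the Hub
# (repo name and checkpoint number are the placeholders from this thread)
model = WhisperForConditionalGeneration.from_pretrained(
    "Martha-987/my-whisper-model/checkpoint-2000"
)

# Rebuild the trainer as in the first session -- the arguments below are
# assumed to be defined exactly as in the fine-tuning notebook.
trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=common_voice["train"],
    eval_dataset=common_voice["test"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=processor.feature_extractor,
)
trainer.train(resume_from_checkpoint=True)
```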

However, if you did not reach the first save_step, you'll need to restart training unfortunately :/

How can I reach the first save step when training stops because of the Colab limit, or because the runtime disconnects? And I can't train without a GPU, as my data is 38,000 samples :/

Thanks, Prof. Sanchit, for your cooperation... Please, I have a question: what does checkpoint-2000 mean in
model = WhisperForConditionalGeneration.from_pretrained("Martha-987/my-whisper-model/checkpoint-2000") ?
I applied it and it gave me an error.

How can I reach the first save step when training stops because of the Colab limit?

You can try using save_steps=250 and eval_steps=250 to save and evaluate more frequently

You can also try loading just 25% of your training data if the corpus is large (see https://huggingface.co/docs/datasets/loading#slice-splits):

train_25pct_ds = datasets.load_dataset("bookcorpus", split="train[:25%]")
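Applied to the Common Voice data used in this thread, that might look like the following sketch (the "ar" config is an assumption based on the model card below, and the dataset is gated, so an auth token may be required):

```python
from datasets import load_dataset

# Hedged sketch: load only the first 25% of the Arabic train split.
# Dataset and config names are taken from this thread; use_auth_token=True
# may be required since Common Voice 11 is a gated dataset.
train_25pct = load_dataset(
    "mozilla-foundation/common_voice_11_0",
    "ar",
    split="train[:25%]",
    use_auth_token=True,
)
```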

Otherwise you might need Colab Pro or a local GPU.

If you want to fine-tune Whisper with free GPU compute, sign up to the Whisper fine-tuning event! https://twitter.com/sanchitgandhi99/status/1592188332171493377

Hey @Taqwa - checkpoint-2000 would be saved once 2000 steps of training were completed. In @Martha-987's case, we did not get to 2000 train steps, so the checkpoint wasn't saved.

@sanchit-gandhi thanks for your help!
But when I finish training, in what file is the checkpoint saved (what is the name of the file)?

In the Seq2SeqTrainingArguments, you will have set output_dir to a repo name of your choice, for example "whisper-small-hi".

Training will save locally to "whisper-small-hi".

If you set push_to_hub=True, it will also save to "Martha-987/whisper-small-hi".
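To make that concrete, here is a hedged illustration of what the local folder typically looks like after training (the file names are the usual Trainer outputs, not verified against this particular repo):

```python
import os

# List the contents of the local output_dir -- after passing one or more
# save steps you would typically see checkpoint-* sub-folders alongside
# the final files, e.g.:
#   checkpoint-500/  checkpoint-1000/  config.json  pytorch_model.bin  README.md
for name in sorted(os.listdir("whisper-small-hi")):
    print(name)
```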

Hope that answers your question!

@sanchit-gandhi
Thanks a lot! But when I try to push to the Hub, I get this error:

(remote: Sorry, your push was rejected during YAML metadata verification:
remote: - Error: "model-index[0].results[0].dataset.config" must be a string )

Why do I get it?

Can you open the "README.md" file created and paste the contents? I think some of the info is malformatted. We can fix this :)

@sanchit-gandhi
"README.md " not created unfortunately /:

@sanchit-gandhi
(attached screenshots: Capture1.JPG and Capture2.JPG)
And the model card too :/
Can you tell me what the problem is, please?

The README.md will be saved locally under the folder "whisper-small-ar"

The error message you've sent means that something is formatted incorrectly in the README - we can't push the README.md to the Hub while this is the case.

If you copy and paste, say, the first 50 lines of the README, I can point out where it's wrong and we can fix it for you.

@sanchit-gandhi
Thanks a lot, I solved the README.md problem.
But now I've trained a model with max_steps=500, save_steps=125, and eval_steps=125, and it was created successfully.
My new problem is how to resume training from the last checkpoint-500 when my model repo doesn't contain the checkpoint-500 folder, so I get an error:
(attached screenshot: error.PNG)

Hey @Martha-987! Cool to see that the README upload has worked 🤗 Usually, Trainer would create subdirectories in the model repo at each save step, e.g. if you save every 125 steps, there would be subdirectories checkpoint-125, checkpoint-250, checkpoint-375, etc.

Because of the model card issue it might be that the Trainer upload was affected. I can't see this folder structure in your repo https://huggingface.co/Martha-987/whisper-small-Arabic-aar/tree/main

Either way, it looks like the weights (pytorch_model.bin) were uploaded at the last commit 'End of training'

You can load these weights simply through:

model = WhisperForConditionalGeneration.from_pretrained("Martha-987/whisper-small-Arabic-aar")
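Once loaded, those weights behave like any other pre-trained Whisper checkpoint; for example, here is a hedged transcription sketch (the audio file name is a placeholder):

```python
from transformers import pipeline

# Run speech recognition with the fine-tuned weights from the Hub repo
asr = pipeline(
    "automatic-speech-recognition",
    model="Martha-987/whisper-small-Arabic-aar",
)
print(asr("sample.mp3")["text"])  # "sample.mp3" is a placeholder audio file
```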

Hope that answers your question!

Hey Prof. @sanchit-gandhi, please, I face the same problem as Martha. When I run the model for 1000 or 500 steps and need to start from that checkpoint, it gives me an error, although the checkpoint-125, 250, and 375 folders and the pytorch_model.bin were created... Where is the problem?
Do you mean the last checkpoint is saved in pytorch_model.bin?
Please, can you tell me what the problem with the model card is? I have the same problem as Martha.
Thanks, Prof...

@sanchit-gandhi
Do you mean that
model = WhisperForConditionalGeneration.from_pretrained("Martha-987/whisper-small-Arabic-aar") will load the last checkpoint I trained??
And how do I solve the model card problem so that all the folders get uploaded?
I saved the checkpoint-500 folder to my device manually, but I can't load it back into my model as a folder :/
Can you help me? :/

Hey @Taqwa and @Martha-987 ,

The HF Trainer creates sub-directories for the model weights during training, where the subdirectories correspond to the checkpoints at which we save the weights (whisper-small-Arabic-aar/checkpoint-125, whisper-small-Arabic-aar/checkpoint-250, whisper-small-Arabic-aar/checkpoint-375, etc). At the end of training, the 'best' model weights are saved to the root directory (whisper-small-Arabic-aar). These model weights are under the file pytorch_model.bin. So if we load from Martha-987/whisper-small-Arabic-aar, we'll load the 'best' weights from your previous model training! Does that help answer your question?
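If the checkpoint folders exist locally, a hedged way to locate the most recent one is the Trainer's own helper:

```python
from transformers.trainer_utils import get_last_checkpoint

# Returns the path of the newest checkpoint-* sub-directory, or None if
# no checkpoint folder exists yet (folder name taken from this thread)
last_checkpoint = get_last_checkpoint("whisper-small-Arabic-aar")
print(last_checkpoint)  # e.g. "whisper-small-Arabic-aar/checkpoint-375"
```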

And how do I solve the model card problem so that all the folders get uploaded?

Could you try installing Transformers from main? The next time you create a model card during training, it should upload correctly 🤗:

pip uninstall -y transformers
pip install git+https://github.com/huggingface/transformers

Or in a Colab cell:

! pip uninstall -y transformers
! pip install git+https://github.com/huggingface/transformers
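After re-installing, a quick check that the development version is active:

```python
import transformers

# A build installed from main should report a .dev version,
# e.g. something like "4.26.0.dev0"
print(transformers.__version__)
```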

If you have access to a pre-existing model card on your local device (README.md) that you'd like to upload, could you copy and paste the contents here? I can direct you as to what changes you require in order to upload it.

@sanchit-gandhi

Thanks a lot for your help.
Yes, I see the checkpoint folders, but they aren't pushed to the Hub with the other files, and when the session disconnects I can't save them.
So when I use ./whisper-small-Arabic-aar/checkpoint-500 in another session, it gives me an error.

How can I get the checkpoint folders? Should I download the best one manually to my local device and upload it to the model when I open another session?

Here is my README.md:

---
language:
- ar
license: apache-2.0
tags:
- hf-asr-leaderboard
- generated_from_trainer
datasets:
- mozilla-foundation/common_voice_11_0
metrics:
- wer
model-index:
- name: Whisper Small Ar- Martha
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice 11.0
      type: mozilla-foundation/common_voice_11_0
      args: 'config: ar, split: test'
    metrics:
    - name: Wer
      type: wer
      value: 50.11110090900743
---

<!-- This model card has been generated automatically according to the information the Trainer had access to. You
should probably proofread and complete it, then remove this comment. -->

# Whisper Small Ar- Martha

This model is a fine-tuned version of [openai/whisper-small](https://huggingface.co/openai/whisper-small) on the Common Voice 11.0 dataset.
It achieves the following results on the evaluation set:
- Loss: 0.3743
- Wer: 50.1111

## Model description

More information needed

## Intended uses & limitations

More information needed

## Training and evaluation data

More information needed

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 16
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 500
- training_steps: 1000
- mixed_precision_training: Native AMP

### Training results

| Training Loss | Epoch | Step | Validation Loss | Wer     |
|:-------------:|:-----:|:----:|:---------------:|:-------:|
| 0.26          | 0.42  | 1000 | 0.3743          | 50.1111 |


### Framework versions

- Transformers 4.26.0.dev0
- Pytorch 1.13.0+cu116
- Datasets 2.7.1
- Tokenizers 0.13.2

Could you replace the bit between --- and --- with:

language:
- ar
license: apache-2.0
tags:
- hf-asr-leaderboard
- generated_from_trainer
datasets:
- mozilla-foundation/common_voice_11_0
metrics:
- wer
model-index:
- name: Whisper Small Ar- Martha
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice 11.0
      type: mozilla-foundation/common_voice_11_0
      config: ar
      split: test
    metrics:
    - name: Wer
      type: wer
      value: 50.11110090900743

and then save the updated README.md file? This will fix the README.md such that you can upload it to the Hub.

Then you can push to the Hub manually using the command line:

cd /PATH/TO/YOUR/REPO
git add .
git commit -m "add readme"
git push

Or in a Google Colab code cell:

%cd /PATH/TO/YOUR/REPO  # use %cd (not !cd): !cd doesn't persist across cells/lines in Colab
!git add .
!git commit -m "add readme"
!git push
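Alternatively, here is a hedged sketch using the huggingface_hub Python client to upload just the fixed README (the local path and repo id are the ones from this thread; adjust as needed):

```python
from huggingface_hub import HfApi

# Upload the corrected README.md directly, without going through git
api = HfApi()
api.upload_file(
    path_or_fileobj="whisper-small-Arabic-aar/README.md",  # local path
    path_in_repo="README.md",
    repo_id="Martha-987/whisper-small-Arabic-aar",
)
```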

How can I get the checkpoint folders?

This should happen automatically!
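If you also want the intermediate checkpoint-* folders pushed to the Hub (not just the final weights), the Trainer supports a hub_strategy argument; here is a hedged sketch, assuming a recent transformers version:

```python
from transformers import Seq2SeqTrainingArguments

# hub_strategy="all_checkpoints" pushes every checkpoint-* sub-folder to the
# Hub repo, so an interrupted Colab session can resume from the newest one.
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-Arabic-aar",
    push_to_hub=True,
    hub_strategy="all_checkpoints",
    evaluation_strategy="steps",
    save_steps=125,
    eval_steps=125,
    max_steps=500,
)
```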
