Using official training example, model was neither saved nor pushed to repo
Hello, I am working on training a model based on the official training example; my model repository is here: https://huggingface.co/nakajimayoshi/ddpm-iris-256/tree/main/
I was able to train the model successfully, and the training logs and samples were uploaded, but the model was neither saved in the runtime (as a .bin or .pth file) nor pushed to my repository. I have made no modifications to the training loop, only to the training config and the dataset loading pipeline. The modified training config is shown below, followed by a sketch of the dataset loading change:
from dataclasses import dataclass

@dataclass
class TrainingConfig:
    image_size = 256  # the generated image resolution
    train_batch_size = 16
    eval_batch_size = 16  # how many images to sample during evaluation
    num_epochs = 50
    gradient_accumulation_steps = 1
    learning_rate = 1e-4
    lr_warmup_steps = 500
    save_image_epochs = 10
    dataset_name = 'imagefolder'
    save_model_epochs = 30
    mixed_precision = 'fp16'  # `no` for float32, `fp16` for automatic mixed precision
    output_dir = 'ddpm-iris-256'  # the model name locally and on the HF Hub
    push_to_hub = True  # whether to upload the saved model to the HF Hub
    hub_private_repo = False
    overwrite_output_dir = True  # overwrite the old model when re-running the notebook
    seed = 0

config = TrainingConfig()
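The dataset loading change is essentially the standard imagefolder pipeline from the tutorial. This is only a sketch; the local data_dir path is a placeholder:

from datasets import load_dataset
from torchvision import transforms

# "imagefolder" builds a dataset from images in a local directory; the path below is a placeholder
dataset = load_dataset(config.dataset_name, data_dir='./iris_images', split='train')

preprocess = transforms.Compose([
    transforms.Resize((config.image_size, config.image_size)),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize([0.5], [0.5]),
])

def transform(examples):
    # Convert each PIL image to a normalized tensor expected by the UNet
    images = [preprocess(image.convert('RGB')) for image in examples['image']]
    return {'images': images}

dataset.set_transform(transform)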
On my repository you can see that the logs and samples were uploaded, but none of the model checkpoints were uploaded, nor can I find them in my Google Colab runtime. Any help is appreciated. Thanks
I have found a workaround for this issue:
The issue is in the training loop:
if accelerator.is_main_process:
    pipeline = DDPMPipeline(unet=accelerator.unwrap_model(model), scheduler=noise_scheduler)

    if (epoch + 1) % config.save_image_epochs == 0 or epoch == config.num_epochs - 1:
        evaluate(config, epoch, pipeline)

    if (epoch + 1) % config.save_model_epochs == 0 or epoch == config.num_epochs - 1:
        if config.push_to_hub:
            repo.push_to_hub(commit_message=f"Epoch {epoch}", blocking=True)
        else:
            pipeline.save_pretrained(config.output_dir)  # this never gets called
Because config.push_to_hub is True, the if branch always runs and the else branch is never reached, so pipeline.save_pretrained(config.output_dir) is never called. Since nothing is ever written to config.output_dir, the push only uploads the logs and samples. I solved this by moving the method call out of the else branch and saving on every epoch:
if accelerator.is_main_process:
    pipeline = DDPMPipeline(unet=accelerator.unwrap_model(model), scheduler=noise_scheduler)
    pipeline.save_pretrained(config.output_dir)  # moved to here

    if (epoch + 1) % config.save_image_epochs == 0 or epoch == config.num_epochs - 1:
        evaluate(config, epoch, pipeline)

    if (epoch + 1) % config.save_model_epochs == 0 or epoch == config.num_epochs - 1:
        if config.push_to_hub:
            repo.push_to_hub(commit_message=f"Epoch {epoch}", blocking=True)
        else:
            print('saving..')  # replaced with a print to see if this branch gets called
Note that I could have removed the nested if statement entirely and always pushed to the Hub, but to avoid any unexpected behavior I left the structure as is and only moved the method call.
Saving on every epoch slows training down, but at the very least the model doesn't get lost.
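If you want to avoid saving on every epoch, an alternative (an untested sketch, not something I ran) is to keep the original epoch conditions but always call save_pretrained before pushing, so the weights land in output_dir and get included in the pushed commit:

if accelerator.is_main_process:
    pipeline = DDPMPipeline(unet=accelerator.unwrap_model(model), scheduler=noise_scheduler)

    if (epoch + 1) % config.save_image_epochs == 0 or epoch == config.num_epochs - 1:
        evaluate(config, epoch, pipeline)

    if (epoch + 1) % config.save_model_epochs == 0 or epoch == config.num_epochs - 1:
        pipeline.save_pretrained(config.output_dir)  # write the weights locally first
        if config.push_to_hub:
            repo.push_to_hub(commit_message=f"Epoch {epoch}", blocking=True)

This way the checkpoint is only written every save_model_epochs epochs, and push_to_hub uploads it along with the logs and samples.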