Is it possible to train/fine-tune this model on a dataset of image-text pairs?

#3
by ghpkishore - opened

I want to generate images for a specific purpose. The current large-scale models are not that great at generating images for social media ads (a process that I want to automate). Is there a way to fine-tune this model on a dataset of 1,000 image-caption pairs of social media ads and then generate images with it?
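For concreteness, here is a minimal sketch of how such a set of image-caption pairs could be gathered before any fine-tuning. The folder layout, the `metadata.csv` file name, and its `file_name` / `caption` columns are assumptions made for illustration, not something this model or its repository requires.

```python
import csv
from pathlib import Path


def load_caption_pairs(image_dir, metadata_name="metadata.csv"):
    """Return (image_path, caption) pairs listed in a CSV file.

    Assumed (hypothetical) layout: image_dir/metadata.csv has the columns
    file_name and caption, and the image files sit in the same folder.
    """
    image_dir = Path(image_dir)
    pairs = []
    with open(image_dir / metadata_name, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            image_path = image_dir / row["file_name"]
            caption = row["caption"].strip()
            # Skip rows whose image is missing or whose caption is empty.
            if image_path.is_file() and caption:
                pairs.append((image_path, caption))
    return pairs
```

Calling `load_caption_pairs("ads/")` on a folder of ad images would give a list that a training script could iterate over; how those pairs are then fed to the model depends on the training code, which is not covered here.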

I got this message when I ran the model in Colab:

Some weights of AutoencoderKL were not initialized from the model checkpoint at /root/.cache/huggingface/diffusers/models--CompVis--ldm-text2im-large-256/snapshots/a1f6625cc7e0f69a350ad16a157263409979f533/vqvae and are newly initialized: ['encoder.down.0.downsample.Conv2d_0.bias', 'encoder.down.2.downsample.Conv2d_0.bias', 'encoder.down.1.downsample.Conv2d_0.weight', 'encoder.down.0.downsample.Conv2d_0.weight', 'encoder.down.2.downsample.Conv2d_0.weight', 'encoder.down.1.downsample.Conv2d_0.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference

We're working on it with @anton-l :-) We'll keep you posted!

Hi @patrickvonplaten ,

Basically what I had in mind was this:

Problem Statement:
"Creating ads for social media from prompts, using diffusion models such as ldm or any other model."

Parameters (input):
The prompt will contain the text the image needs to include, the occasion (for example, New Year or Independence Day), the company's iconography such as its logo, the target audience (say Hispanic, Asian, etc.), and gender.

Output:
A varied set of images that correspond to the input, so that the company can then choose which one works.
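As a rough illustration of the input side, the parameters above could be collected in a small structure and turned into several prompt strings, one per desired variation. Everything here (the `AdBrief` class, its field names, the prompt wording, and the `styles` list) is hypothetical; the resulting strings would simply be passed to whichever text-to-image pipeline is used.

```python
from dataclasses import dataclass


@dataclass
class AdBrief:
    """Hypothetical container for the ad parameters listed above."""
    text: str           # text the image should contain
    occasion: str       # e.g. "New Year"
    brand: str = ""     # company whose logo should appear
    audience: str = ""  # target audience
    gender: str = ""

    def to_prompt(self):
        parts = [f"social media ad for {self.occasion}"]
        if self.brand:
            parts.append(f"featuring the {self.brand} logo")
        # Combine audience and gender into one phrase when either is given.
        target = " ".join(p for p in (self.audience, self.gender) if p)
        if target:
            parts.append(f"aimed at a {target} audience")
        parts.append(f'with the text "{self.text}"')
        return ", ".join(parts)


def prompt_variants(brief, styles):
    """Return one prompt per style so the outputs differ visibly."""
    return [f"{brief.to_prompt()}, {style}" for style in styles]
```

For example, `prompt_variants(AdBrief(text="Happy New Year", occasion="New Year", brand="Acme"), ["flat illustration", "photo"])` yields two prompts that differ only in the trailing style phrase.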

Is it possible to do this? If so, how would I go about it?

Marking @multimodalart as well in case they can help me out.
