Human Feedback Learning

#26
by awan12 - opened

Hi, I was wondering if you are willing to share more details about the human feedback training. In the report, Section 4.1, you say that you generate 140k artist-style images with SDXL for the style optimization. Why are images needed for this part, since the loss seems to rely only on the prompt c? Or do you include the L_instance loss at the same time? Do you also include the diffusion loss at this stage? In Section 3.3, for one-step generation, do you use all the data / losses (including human feedback / structural) during training? Thanks so much!

ByteDance org

Hi, @awan12
The style optimization would require generated images because the input to the VGG network is an image rather than a prompt.
No pretraining loss is added during RLHF fine-tuning.
For more information, please refer to our UniFL (https://arxiv.org/abs/2404.05595).
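
To illustrate the point, here is a generic sketch of a VGG-feature style loss (an illustrative example, not our exact implementation). The loss operates on image tensors, so generated images would be needed as input:

```python
import torch.nn.functional as F
import torchvision.models as models

# Generic sketch of a VGG-feature style loss (illustrative only):
# the inputs are image tensors, not prompts, which is why generated
# images would be required for this kind of objective.
vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def gram_matrix(feat):
    b, c, h, w = feat.shape
    f = feat.reshape(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def vgg_style_loss(generated, reference, layers=(3, 8, 15, 22)):
    # Compare Gram matrices of intermediate VGG features of the generated
    # image against those of a style reference image.
    loss, x, y = 0.0, generated, reference
    for i, layer in enumerate(vgg):
        x, y = layer(x), layer(y)
        if i in layers:
            loss = loss + F.mse_loss(gram_matrix(x), gram_matrix(y))
    return loss
```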

Hi @Yanzuo,
Thanks for the help and for the pointer to the paper! In the Hyper-SD report, I only saw mention of the aesthetic loss based on the ImageReward model and the perceptual loss based on the segmentation model (which also requires the ground-truth image). Does that mean that L_perceptual also includes a style component using VGG as in UniFL, and is it fairly critical?
Finally, for structural tuning, do you think it is important to have the ground-truth segmentation annotation, or can the segmentation model be used for both the generated and the ground-truth images? If annotation is needed, I guess the structural tuning is done separately from the aesthetic tuning, since you don't have annotations for the SDXL-generated images? Thank you so much again!

ByteDance org

Hi, @awan12
Sorry for the confusion and thank you for pointing out the errors in the article.
We removed the VGG loss after finding that it was not useful here, so in fact no images were generated for this.
Instead, the perceptual loss used the GT images from COCO.
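
As a rough illustration, the idea can be sketched like this (all names here are placeholders, not our actual code): the generated image is passed through a frozen segmentation model and supervised with the COCO ground-truth annotation of the corresponding source image.

```python
import torch.nn.functional as F

# Minimal sketch of a segmentation-based perceptual feedback loss
# (placeholder names; not the exact Hyper-SD / UniFL implementation).
def perceptual_feedback_loss(seg_model, decode, pred_latent, gt_masks):
    image = decode(pred_latent)               # generated image from the model's prediction
    logits = seg_model(image)                 # per-pixel class logits, shape (B, C, H, W)
    # Supervise with the COCO ground-truth masks of the source image,
    # encouraging the generation to preserve structure / layout.
    return F.cross_entropy(logits, gt_masks)  # gt_masks: (B, H, W) class indices
```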
As for the structural tuning, we have also been exploring this aspect recently.
We tried the approach you mentioned as well, but it didn't work well, so no better method has been found yet.
Thanks for your attention to our work!

Hi @Yanzuo!
Thanks for all the details, I really appreciate it. No errors in the article, I just want to make sure I understand :) To confirm: there is L_aesthetic (which incorporates ImageReward and only requires prompts) and L_perceptual (which uses the segmentation model + GT annotations), and during human feedback training there is no distillation loss. Is the model trained with L_aesthetic and L_perceptual at the same time over COCO, or in separate stages, possibly using another source of prompts for the L_aesthetic portion? Thank you so much for any additional details you can provide!
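
To make the question concrete, here is a rough pseudo-version of what I imagine a combined step might look like (all names, interfaces, and the weighting are my guesses, not the paper's code):

```python
import torch.nn.functional as F

# Hypothetical combined human-feedback step (my guess, not the actual code):
# L_aesthetic from an ImageReward-style reward model on (prompt, generated image),
# L_perceptual from a segmentation model against COCO GT masks.
def feedback_step(student, reward_model, seg_model, decode, batch, w_aes=1.0, w_per=1.0):
    prompts, gt_masks = batch["prompts"], batch["gt_masks"]
    images = decode(student.generate(prompts))        # generated images for this batch

    # Aesthetic feedback: maximize the reward score, i.e. minimize its negative.
    l_aesthetic = -reward_model.score(prompts, images).mean()

    # Perceptual feedback: segmentation loss against the COCO annotations.
    l_perceptual = F.cross_entropy(seg_model(images), gt_masks)

    # No distillation / pretraining loss in this stage (as discussed above).
    return w_aes * l_aesthetic + w_per * l_perceptual
```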
