Training Info

by alexgenovese - opened

Hello there,

I really like your work. I'd like to ask what the training process was to get this result. I suppose it is based on manually tagging a subset of photos?

I mean:

  • 50 Angle 1 photos with CLIP caption + TAG1 (...a1...)
  • 50 Angle 2 photos with CLIP caption + TAG2 (...a2...)

Thank you for sharing it.

Sorry it took a while to respond. I haven't been on in a while, so I only just saw this message.

Training consisted of gathering images of people in various camera shots and angles. I then sorted the images so that each group contained only one camera angle and shot, and made a hypernetwork for just that shot and angle. Rinse and repeat 30 times to cover all shots and angles within 180 degrees, which gave me 30 hypernetworks. I then created the most photoreal checkpoint model I could by merging over 20 different models and experimenting a lot along the way. After that, I made wildcards to simplify putting the whole thing together and to label the camera shots by angle and shot (distance to subject).

Captioning was done manually. I used a self-imposed captioning schema to label every image in natural spoken English for ease of use, and I made sure to describe every image in great detail for maximum flexibility during inference.

Anyway, there was a lot of other stuff I did along the way, but I've rambled long enough. This was a passion project that took a year, tens of thousands of images, and way too much of my time and money lol... I learned a lot making Prometheus.
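For anyone wanting to try something similar, the sorting step described above (one group of images per camera shot and angle, with one hypernetwork trained per group) could be sketched roughly as follows. This is only an illustration and not the tooling actually used for Prometheus: the record layout, the file names, and the shot and angle labels are all hypothetical, and the labels are assumed to have been assigned by hand beforehand.

```python
from collections import defaultdict


def group_by_shot_and_angle(records):
    """Bucket image paths so each bucket holds exactly one (shot, angle) pair.

    Each bucket would then serve as the training set for one hypernetwork.
    """
    groups = defaultdict(list)
    for rec in records:
        key = (rec["shot"], rec["angle"])
        groups[key].append(rec["path"])
    return dict(groups)


# Hypothetical, hand-labeled records; a real dataset would be far larger.
records = [
    {"path": "img_001.jpg", "shot": "close-up", "angle": "front"},
    {"path": "img_002.jpg", "shot": "close-up", "angle": "front"},
    {"path": "img_003.jpg", "shot": "full body", "angle": "left profile"},
]

groups = group_by_shot_and_angle(records)
for (shot, angle), paths in sorted(groups.items()):
    print(f"{shot} / {angle}: {len(paths)} image(s)")
```

With 30 distinct shot and angle combinations, this kind of grouping would yield 30 buckets, matching the 30 hypernetworks mentioned above.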

No worries, thank you for sharing it.
