Adding text-based context

#1
by MoralHazard - opened

Do any of your models deal well with text context in addition to the task string? I would like to include image tags and terse character descriptions as context to improve the generations.

Hello! I’ve not trained models with that specific objective, though that’s not a bad idea.

The closest thing I can think of is the “florence-2-base-regions” model with the task <CAPTION_TO_PHRASE_GROUNDING>, which works surprisingly well.

Both this model and the regions model should also produce pretty good results with the task <GPT4V_CAPTION>, if you’re looking for long-form captions.

Thanks! What is the syntax for adding context to a task? Just the task token + prompt? Also, I'm trying to do this with NSFW images, so I've only tried a couple of the task tokens so far.

I'm also considering training Florence-2 myself for this purpose if I can get access to an appropriate captioned dataset. Are you willing to share how much time/hardware it takes to train the base and large versions?

Just task + ‘ ‘ + caption/context! I pass them in as a single string prompt.
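In code, that concatenation looks like the sketch below. (A minimal illustration: the `build_prompt` helper and the example caption are my own, not an official Florence-2 API; the task tokens are the ones mentioned in this thread.)

```python
def build_prompt(task_token: str, context: str = "") -> str:
    """Combine a Florence-2 task token with optional text context.

    The model expects a single string: the task token, then a space,
    then the caption/context (if any).
    """
    if not context:
        return task_token
    return f"{task_token} {context}"

# Grounding a caption: the context string is the caption to ground.
prompt = build_prompt("<CAPTION_TO_PHRASE_GROUNDING>", "a man wearing a red jacket")
print(prompt)  # <CAPTION_TO_PHRASE_GROUNDING> a man wearing a red jacket
```

That single string is what gets passed as the text input to the processor alongside the image.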

As for resources, I can do a batch size of two on a T4 in Colab for both the large and base models. It takes a decent bit longer to train the large model - I found the extra params the large model offers aren’t worth the additional compute, but that’s just my perspective.

One callout: I’ve only trained on “real” NSFW images (no anime), so depending on your use case, you may get varying results. I’ve also focused mainly on gay/male datasets (there are already so many straight NSFW models - why add another?), so again, results may vary.

Also, I HIGHLY recommend training in fp16 - it makes the process much faster, and if you’re quantizing (ONNX/MLX), it makes performance a bit more consistent.

I also recommend training on both NSFW and SFW datasets at the same time with a high number of accumulation steps. I find it helps the model become more resilient to noise in the data and “blends” together the knowledge and style of the captions.

E.g., train with an NSFW dataset and something like the GPT4V dataset, with 128-512 accumulation steps and a batch size of 2.
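The mixed-dataset schedule above can be sketched in a few lines (pure Python; the round-robin interleave and the stand-in sample tuples are illustrative assumptions, not the exact training script):

```python
BATCH_SIZE = 2
ACCUM_STEPS = 256  # anywhere in the suggested 128-512 range
# Effective batch size seen by each optimizer step:
EFFECTIVE_BATCH = BATCH_SIZE * ACCUM_STEPS  # 512 samples per update

def interleave(*datasets):
    """Alternate samples from each dataset so every accumulation
    window mixes NSFW and SFW (e.g. GPT4V-style) examples."""
    for group in zip(*datasets):
        yield from group

# Illustrative stand-ins for real (image, caption) datasets:
nsfw = [(f"nsfw_img_{i}", "nsfw caption") for i in range(4)]
sfw = [(f"sfw_img_{i}", "gpt4v-style caption") for i in range(4)]

mixed = list(interleave(nsfw, sfw))
# In the training loop you would accumulate gradients over
# ACCUM_STEPS micro-batches of BATCH_SIZE before each optimizer step.
```

The point of the large accumulation window is that each weight update averages gradients over hundreds of samples drawn from both sources, which is what smooths over noise and caption-style differences.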

Thanks so much! I've trained the base and large models on the ToriiGate dataset with the tags and character descriptions as prompts, and I'm getting promising results: NSFW features are recognized and characters are accurately identified in the caption (my main interest is getting who's doing what to whom, which can't be inferred from booru tags).

I've been doing LoRA. Is it worth it to do a full fine-tune?

I’ve personally not trained any LoRAs; I’ve only done full training.

Since Florence wasn’t trained on any anime (to my knowledge) or NSFW content, it might be worthwhile to train the entire model, if you have the compute resources.
