Adding text-based context

#1
by MoralHazard - opened

Do any of your models deal well with text context in addition to the task string? I would like to include image tags and terse character descriptions as context to improve the generations.

Hello! I’ve not trained models with that specific objective, though that’s not a bad idea.

The closest thing I can think of is the “florence-2-base-regions” model with the task <CAPTION_TO_PHRASE_GROUNDING>, which works surprisingly well.

Both this model and the regions model should also produce pretty good results with the task <GPT4V_CAPTION>, if you’re looking for long-form captions.

Thanks! What is the syntax for adding context to a task? Just the task token + prompt? Also, I'm trying to do this with NSFW images, so I've only tried a couple of the task tokens so far.

I'm also considering training Florence-2 myself for this purpose if I can get access to an appropriate captioned dataset. Are you willing to share how much time/hardware it takes to train the base and large versions?

Just task + ‘ ‘ + caption/context! I pass them in as a single string prompt.
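In code, that concatenation looks like the sketch below. (A minimal illustration: the `build_prompt` helper and the example caption are my own, not an official Florence-2 API; the task tokens are the ones mentioned in this thread.)

```python
def build_prompt(task_token: str, context: str = "") -> str:
    """Combine a Florence-2 task token with optional text context.

    The model expects a single string: the task token, then a space,
    then the caption/context (if any).
    """
    if not context:
        return task_token
    return f"{task_token} {context}"

# Grounding a caption: the context string is the caption to ground.
prompt = build_prompt("<CAPTION_TO_PHRASE_GROUNDING>", "a man wearing a red jacket")
print(prompt)  # <CAPTION_TO_PHRASE_GROUNDING> a man wearing a red jacket
```

That single string is what gets passed as the text input to the processor alongside the image.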

As for resources, I can do a batch size of two on a T4 in Colab for both the large and base models. It takes a decent bit longer to train the large model - I found the extra params the large model offers aren’t worth the additional compute, but that’s just my perspective.

One callout: I’ve only trained on “real” NSFW images (no anime), so depending on your use case, you may get varying results. I’ve also focused mainly on gay/male datasets (there are already so many straight NSFW models - why add another?), so again, results may vary.

Also, I HIGHLY recommend training in fp16 - it makes the process much faster, and if you’re quantizing (ONNX/MLX), it makes performance a bit more consistent.

I also recommend training on both NSFW and SFW datasets at the same time with a high number of accumulation steps. I find it helps the model become more resilient to noise in the data and “blends” together the knowledge and style of the captions.

E.g., train with an NSFW dataset and something like the GPT4V dataset, with 128-512 accumulation steps and a batch size of 2.
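The mixed-dataset schedule above can be sketched in a few lines (pure Python; the round-robin interleave and the stand-in sample tuples are illustrative assumptions, not the exact training script):

```python
BATCH_SIZE = 2
ACCUM_STEPS = 256  # anywhere in the suggested 128-512 range
# Effective batch size seen by each optimizer step:
EFFECTIVE_BATCH = BATCH_SIZE * ACCUM_STEPS  # 512 samples per update

def interleave(*datasets):
    """Alternate samples from each dataset so every accumulation
    window mixes NSFW and SFW (e.g. GPT4V-style) examples."""
    for group in zip(*datasets):
        yield from group

# Illustrative stand-ins for real (image, caption) datasets:
nsfw = [(f"nsfw_img_{i}", "nsfw caption") for i in range(4)]
sfw = [(f"sfw_img_{i}", "gpt4v-style caption") for i in range(4)]

mixed = list(interleave(nsfw, sfw))
# In the training loop you would accumulate gradients over
# ACCUM_STEPS micro-batches of BATCH_SIZE before each optimizer step.
```

The point of the large accumulation window is that each weight update averages gradients over hundreds of samples drawn from both sources, which is what smooths over noise and caption-style differences.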

Thanks so much! I've trained the base and large models on the ToriiGate dataset with the tags and character descriptions as prompts, and I'm getting promising results: NSFW features are recognized and characters are accurately identified in the caption (my main interest is getting who's doing what to whom, which can't be inferred from booru tags).

I've been doing LoRA. Is it worth it to do a full fine-tune?

I’ve personally not trained any LoRAs; I’ve only done full training.

Since Florence wasn’t trained on any anime (to my knowledge) or NSFW content, it might be worthwhile to train the entire model, if you have the compute resources.
