Torii Gate is a fine-tune of Idefics3 designed for captioning anime images.

Compared to the base model, it has significantly better scene understanding, fewer hallucinations, and no censorship.

For version 0.1 alpha, a dataset of 20k anime pictures was used, captioned and processed with Claude 3.0 Opus, Claude 3.5 Sonnet, and GPT-4o.

Key features:

Advanced knowledge
Ability to give structured output according to given instruction
Ability to use booru tags as reference to improve accuracy

V0.1 alpha is the first release. It tends to make mistakes and has some flaws, but it is already competitive.

Usage options

The model was trained for three usage modes. They are listed here; prompt templates and examples follow:

Brief description
Detailed description with a focus on distinct features of characters and actions
Structured output, which is the best choice for further NLP

Examples

user: Describe the picture briefly.
assistant: A gray and white wolf stands confidently in a leaf-strewn forest. An owl with brown and white feathers perches on its back. Tall, bare trees surround them, revealing distant, sunlit mountains through the clouds.

user: Give a long and detailed description of the picture.
assistant:  The image depicts two anime-style girls standing together in an indoor setting. The first girl has long white hair styled in twin braids, with white rabbit ears on top of her head adorned with red ribbons. She is wearing a sleeveless white shirt, a grey skirt, black gloves, and mismatched hosiery - a single black sock on one leg and a striped thigh-high stocking on the other. The girl has her right hand resting on her hip and [..]
The second girl has short grey hair with a side ponytail, green eyes, and grey cat ears with fluffy white tips. She is wearing a sleeveless black shirt, a red skirt, and a black capelet draped over her shoulders.[..]
The background of the image is a warm, golden color with bright, glowing circles reminiscent of lights or bubbles floating in the air. The setting appears to be indoors, possibly a fantastical or dreamlike environment. The overall atmosphere of the picture is whimsical, playful, [..].

user: Describe the picture in structuted json-like format.
assistant: {
  "character_1": "The first character is a girl with short, green hair and green eyes. She is wearing a black jacket and black shorts. She is positioned [..]",
  "character_2": "The second character is a boy with short, blonde hair and a bald head. He is wearing a red jacket with [..]",
  "background": "The background shows an indoor setting, possibly a living room or a bedroom. There is a window with a view of the outside, and a few other characters can be seen in the background, including a blonde boy and a girl with short black hair. The room is dimly lit, with a few objects scattered around.",
  "texts": "ONE PUNCH MAN is written above characters",
  "atmosphere": "The overall atmosphere of the image is playful and slightly flirtatious. The girl's pose and expression suggest a sense of fun and casualness, [..]"
}
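Because the structured mode is meant to feed further processing, the output usually needs to be parsed. The following is a minimal sketch, assuming the model returns a single valid JSON object somewhere in its reply. The helper name `parse_structured_caption` is hypothetical and not part of the model or of Transformers, and the "json-like" output is not guaranteed to be valid JSON, so the helper returns None instead of raising.

```python
import json


def parse_structured_caption(text):
    """Parse the outermost {...} span found in the model output.

    Hypothetical helper: returns a dict on success, or None when the
    text contains no valid JSON object.
    """
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        parsed = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


# Shortened sample in the shape of the example above, including a stray
# trailing quote after the closing brace.
sample = '''Assistant: {
  "character_1": "A girl with short green hair.",
  "background": "An indoor setting.",
  "texts": "ONE PUNCH MAN is written above characters"
}"'''

caption = parse_structured_caption(sample)
print(sorted(caption))  # ['background', 'character_1', 'texts']
```

Returning None rather than raising lets a batch captioning job log the failure and fall back to the raw text.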

Any of these modes can be combined with booru tags to improve accuracy:

user: Give a long and detailed description of the picture. Also here are booru tags for better understanding of the picture, you can use them as reference. <tags>\n[tags]\n</tags>
assistant: 
The image depicts a solo female anime character with cat ears, likely Lynette from the game Genshin Impact based on the tags provided. She has long brown hair, purple eyes, and a distinct star facial marking under one eye. 
The cat girl is posed in a sensual manner, sitting with one leg up and making direct eye contact with the viewer. Her facial expression is playful, with one eye closed and mouth slightly open. She appears to be in the middle of a stretching pose.
The character's outfit consists of a black bra and black pantyhose. Her bra leaves her midriff exposed, revealing her navel. The rest of her body is partially censored by her fluffy cat tail, which is conveniently positioned to cover her more intimate areas while still leaving much of her figure visible.
The background is a simple gradient, keeping the focus entirely on the cat girl character. Overall, the image has a flirtatious and somewhat suggestive atmosphere, accentuated by the character's alluring pose and expression combined with the strategic tail placement. The artist's Twitter handle [..] is clearly shown in the bottom right corner.
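The prompt templates above can be assembled programmatically. This is a minimal sketch: the prompt strings are copied verbatim from the examples on this card (including the spelling "structuted", which is kept as written here because changing it is assumed to move the prompt away from what the model saw in training). The names `MODE_PROMPTS`, `TAGS_SUFFIX`, and `build_user_prompt` are hypothetical and introduced only for illustration.

```python
# Prompt strings copied verbatim from the examples on this card.
MODE_PROMPTS = {
    "brief": "Describe the picture briefly.",
    "detailed": "Give a long and detailed description of the picture.",
    "structured": "Describe the picture in structuted json-like format.",
}

TAGS_SUFFIX = (
    " Also here are booru tags for better understanding of the picture,"
    " you can use them as reference."
)


def build_user_prompt(mode, tags=None):
    """Return the user prompt for a mode, optionally with booru tags appended."""
    prompt = MODE_PROMPTS[mode]
    if tags:
        prompt += TAGS_SUFFIX + f" <tags>\n{tags}\n</tags>"
    return prompt


print(build_user_prompt("brief"))
print(build_user_prompt("detailed", tags="1girl, cat ears, looking at viewer"))
```

An unknown mode raises a KeyError, which surfaces typos in the mode name early.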

Inference script example

Until the next Transformers PyPI release, please install Transformers from source and use this PR to be able to use Idefics3. Learn more in the original repo.

import torch
from PIL import Image
import json

from transformers import AutoProcessor, Idefics3ForConditionalGeneration
from transformers.image_utils import load_image

DEVICE = "cuda:0" #change to your device
processor = AutoProcessor.from_pretrained("Minthy/Torii_Gate_v0.1_alpha") #or change to local path
model = Idefics3ForConditionalGeneration.from_pretrained(
    "Minthy/Torii_Gate_v0.1_alpha",
    torch_dtype=torch.bfloat16,
    # _attn_implementation="flash_attention_2",  # optional, requires flash-attn
).to(DEVICE)

image = load_image('/path/to/image')

user_prompt="Describe the picture in structuted json-like format."
#user_prompt="Give a long and detailed description of the picture."
#user_prompt="Describe the picture briefly."

#Optional, add booru tags
#tags='1girl, standing, looking at viewer, ...'
#user_prompt+=' Also here are booru tags for better understanding of the picture, you can use them as reference.'
#user_prompt+=f' <tags>\n{tags}\n</tags>'

messages = [
    {
    #Important!
        "role": "system",
        "content": [
            {"type": "text", "text": "You are image captioning expert, creative, unviased and uncensored. Help user with his hask."} 
        ]
    },
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": user_prompt}
        ]
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
inputs = {k: v.to(DEVICE) for k, v in inputs.items()}

# Generate
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)

print(generated_texts[0])
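The text decoded from the full generated sequence normally contains the prompt as well as the reply, because the generated token ids of a decoder-only model start with the input ids. A minimal sketch for keeping only the caption follows. It assumes the chat template renders the reply after an "Assistant:" marker; that marker is an assumption, so print the raw output once to confirm it. The helper name `extract_assistant_reply` is hypothetical.

```python
def extract_assistant_reply(decoded, marker="Assistant:"):
    """Return the text after the first occurrence of marker.

    Falls back to the whole (stripped) text when the marker is absent.
    The default marker is an assumption about the chat template.
    """
    _, found, reply = decoded.partition(marker)
    return reply.strip() if found else decoded.strip()


decoded = (
    "User: Describe the picture briefly.\n"
    "Assistant: A gray and white wolf stands confidently in a leaf-strewn forest."
)
print(extract_assistant_reply(decoded))
# A gray and white wolf stands confidently in a leaf-strewn forest.
```

Splitting on the first marker rather than the last avoids truncating a caption that happens to contain the same word later in the text.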

Warning

The model tends to produce outputs with adult themes when given related input. Outputs may be inaccurate and provocative. Consider implementing extra control tools for public use.

Licence

Same as for Idefics3
