Trying IDEFICS on a New Yorker cartoon dataset

Community Article Published September 23, 2023

Earlier this year, I generated descriptions of cartoons using Salesforce's LAVIS library and their instruction-tuned multimodal model. Their approach combined a BLIP-2 visual encoder, Vicuna (LLaMA v1 plus delta weights), and additional InstructBLIP weights. This was a bit tedious because of the limited release of LLaMA v1 and the need to re-assemble the model in memory.

IDEFICS is a complete multimodal instruct model from Hugging Face, and I'm excited to try it out on the same task.

First I started an A100 Colab Pro notebook and loaded the model as described in the IDEFICS blog post:

import torch
from transformers import IdeficsForVisionText2Text, AutoProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

model = IdeficsForVisionText2Text.from_pretrained(
  "HuggingFaceM4/idefics-9b-instruct",
  torch_dtype=torch.bfloat16,
).to(device)
processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics-9b-instruct")

Here's how to grab the New Yorker cartoons and caption-matching dataset:

from datasets import load_dataset
cartoons = load_dataset("jmhessel/newyorker_caption_contest", "matching")

I can grab a single PIL image from the dataset like this:

example = cartoons['train'][0]
image = example['image']  # a PIL image
# example['caption_choices'] holds the candidate captions
image  # as the last expression in a Colab cell, this displays the image

In the previous project I had time to think about prompts. Most image-captioning examples use some variation of "Describe the image in detail" - but because these images already have a distinctive style as New Yorker cartoons, that prompt usually just returns the same surface facts in each response. To get enough detail to serve as hints for picking the matching joke or caption, I state up front that it's a cartoon, ask for specifics, and hint that the premise may be unusual.

Here's the prompt in instruct / chat format:

prompts = [
    [
        "User: Describe all characters and setting of this cartoon in detail. It may be sardonic or absurdist.",
        image,
        "<end_of_utterance>",
    ],
]
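One tweak worth trying, borrowed from the IDEFICS blog example (so treat the exact tokens as an assumption): appending "\nAssistant:" after the end-of-utterance token, which cues the model to reply directly instead of generating the role tag itself. A sketch of that variant, with a string standing in for the PIL image so the snippet is self-contained:

```python
# Variant of the prompt that ends with an explicit "Assistant:" cue,
# as in the IDEFICS blog example. In practice, use the PIL `image`
# from the dataset in place of this placeholder string.
image = "<PIL image of the cartoon>"  # placeholder for illustration

prompts_with_cue = [
    [
        "User: Describe all characters and setting of this cartoon in detail. It may be sardonic or absurdist.",
        image,
        "<end_of_utterance>",
        "\nAssistant:",
    ],
]
```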

Let's follow through with the rest of the generation code from the HF blog:

inputs = processor(prompts, return_tensors="pt").to(device)
bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

generated_ids = model.generate(**inputs, bad_words_ids=bad_words_ids, max_length=100)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
for i, t in enumerate(generated_text):
    print(f"{i}:\n{t}\n")

The result:

[Cartoon image - alt text: "giraffe cartoon from The New Yorker"]

User: Describe all characters and setting of this cartoon in detail. It may be sardonic or absurdist.

Assistant: The image features two giraffes standing in a living room. One giraffe is sitting on a couch, while the other is standing near a coffee table. The living room is furnished with a TV mounted on the wall, a lamp on a side table, and a potted plant on the floor

Passes the vibe check!
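Note that batch_decode returns the whole conversation, prompt included. A small helper (hypothetical, not part of the blog code) to keep only the model's reply:

```python
def assistant_reply(text: str) -> str:
    """Return only the text after the last 'Assistant:' marker.

    batch_decode gives back the full prompt plus the generated reply,
    so we split on the role tag the model emits.
    """
    marker = "Assistant:"
    if marker in text:
        return text.rsplit(marker, 1)[1].strip()
    return text.strip()  # fall back to the raw decoding

print(assistant_reply("User: Describe this cartoon.\nAssistant: Two giraffes in a living room."))
# → Two giraffes in a living room.
```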

Colab link: https://colab.research.google.com/drive/15kd17YRdbVayggA-ZCYiXTYzZG4w8zUd?usp=sharing

Future work

Admittedly, the first few cartoon descriptions were not as accurate, so I did a little cherry-picking, but they were all in the right zone. I think there's an opportunity to reword the prompt or try a few-shot example with this chat-like format.
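A few-shot version might interleave a worked example before the target image, reusing the same chat-like list format. A sketch of how that prompt could be laid out - the example description text is made up for illustration, and strings stand in for the PIL images so the structure is self-contained:

```python
# Sketch of a few-shot prompt in the same chat-like format: one worked
# example turn, then the target image with an "Assistant:" cue.
example_image = "<PIL image of a previously described cartoon>"  # placeholder
target_image = "<PIL image of the cartoon to describe>"          # placeholder

question = "Describe all characters and setting of this cartoon in detail. It may be sardonic or absurdist."

few_shot_prompt = [
    [
        f"User: {question}",
        example_image,
        "<end_of_utterance>",
        "\nAssistant: Two giraffes relax in a furnished living room, one on the couch and one by the coffee table.",
        "<end_of_utterance>",
        f"\nUser: {question}",
        target_image,
        "<end_of_utterance>",
        "\nAssistant:",
    ],
]
```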