Trying IDEFICS on a New Yorker cartoon dataset

Community Article Published September 23, 2023

Earlier this year, I generated descriptions of cartoons using Salesforce's LAVIS library and their instruction-tuned multimodal model. Their approach combined a BLIP-2 visual encoder, Vicuna (LLaMA v1 plus delta weights), and additional InstructBLIP weights. This was a bit tedious because of the limited release of LLaMA v1 and the need to re-assemble the model in memory.

IDEFICS is a complete multimodal instruct model from Hugging Face, and I'm excited to try it out on the same task.

First I started an A100 Colab Pro notebook and loaded the model as described in the IDEFICS blog post:

import torch
from transformers import IdeficsForVisionText2Text, AutoProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

model = IdeficsForVisionText2Text.from_pretrained(
  "HuggingFaceM4/idefics-9b-instruct",
  torch_dtype=torch.bfloat16,
).to(device)
processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics-9b-instruct")

Here's how to grab the New Yorker cartoons and caption-matching dataset:

from datasets import load_dataset
cartoons = load_dataset("jmhessel/newyorker_caption_contest", "matching")

I can grab a single PIL image from the dataset like this:

example = cartoons['train'][0]
image = example['image']  # a PIL image
# example['caption_choices'] holds the candidate captions
image  # as the last expression in a Colab cell, this displays the image

In the previous project I had time to think about prompts. Most image-captioning examples use some variation of "Describe the image in detail" - but because these images already have a distinctive style as New Yorker cartoons, that prompt usually just returns the same surface facts in each response. To get enough detail to serve as hints for picking the matching joke or caption, I state up front that it's a cartoon, ask for specifics, and hint that the premise may be unusual.

Here's the prompt in instruct / chat format:

prompts = [
    [
        "User: Describe all characters and setting of this cartoon in detail. It may be sardonic or absurdist.",
        image,
        "<end_of_utterance>",
    ],
]
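One tweak worth trying, borrowed from the IDEFICS blog example (so treat the exact tokens as an assumption): appending "\nAssistant:" after the end-of-utterance token, which cues the model to reply directly instead of generating the role tag itself. A sketch of that variant, with a string standing in for the PIL image so the snippet is self-contained:

```python
# Variant of the prompt that ends with an explicit "Assistant:" cue,
# as in the IDEFICS blog example. In practice, use the PIL `image`
# from the dataset in place of this placeholder string.
image = "<PIL image of the cartoon>"  # placeholder for illustration

prompts_with_cue = [
    [
        "User: Describe all characters and setting of this cartoon in detail. It may be sardonic or absurdist.",
        image,
        "<end_of_utterance>",
        "\nAssistant:",
    ],
]
```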

Let's follow through with the rest of the generation code from the HF blog:

inputs = processor(prompts, return_tensors="pt").to(device)
bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

generated_ids = model.generate(**inputs, bad_words_ids=bad_words_ids, max_length=100)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
for i, t in enumerate(generated_text):
    print(f"{i}:\n{t}\n")

The result:

[Cartoon image - alt text: "giraffe cartoon from The New Yorker"]

User: Describe all characters and setting of this cartoon in detail. It may be sardonic or absurdist.

Assistant: The image features two giraffes standing in a living room. One giraffe is sitting on a couch, while the other is standing near a coffee table. The living room is furnished with a TV mounted on the wall, a lamp on a side table, and a potted plant on the floor

Passes the vibe check!
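Note that batch_decode returns the whole conversation, prompt included. A small helper (hypothetical, not part of the blog code) to keep only the model's reply:

```python
def assistant_reply(text: str) -> str:
    """Return only the text after the last 'Assistant:' marker.

    batch_decode gives back the full prompt plus the generated reply,
    so we split on the role tag the model emits.
    """
    marker = "Assistant:"
    if marker in text:
        return text.rsplit(marker, 1)[1].strip()
    return text.strip()  # fall back to the raw decoding

print(assistant_reply("User: Describe this cartoon.\nAssistant: Two giraffes in a living room."))
# → Two giraffes in a living room.
```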

Colab link: https://colab.research.google.com/drive/15kd17YRdbVayggA-ZCYiXTYzZG4w8zUd?usp=sharing

Future work

Admittedly, the first few cartoon descriptions were not as accurate, so I did a little cherry-picking, but they were all in the right zone. I think there's an opportunity to reword the prompt or try a few-shot example with this chat-like format.
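A few-shot version might interleave a worked example before the target image, reusing the same chat-like list format. A sketch of how that prompt could be laid out - the example description text is made up for illustration, and strings stand in for the PIL images so the structure is self-contained:

```python
# Sketch of a few-shot prompt in the same chat-like format: one worked
# example turn, then the target image with an "Assistant:" cue.
example_image = "<PIL image of a previously described cartoon>"  # placeholder
target_image = "<PIL image of the cartoon to describe>"          # placeholder

question = "Describe all characters and setting of this cartoon in detail. It may be sardonic or absurdist."

few_shot_prompt = [
    [
        f"User: {question}",
        example_image,
        "<end_of_utterance>",
        "\nAssistant: Two giraffes relax in a furnished living room, one on the couch and one by the coffee table.",
        "<end_of_utterance>",
        f"\nUser: {question}",
        target_image,
        "<end_of_utterance>",
        "\nAssistant:",
    ],
]
```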