Image tasks with IDEFICS

While individual tasks can be tackled by fine-tuning specialized models, an alternative approach that has recently emerged and gained popularity is to use large models for a diverse set of tasks without fine-tuning. For instance, large language models can handle such NLP tasks as summarization, translation, classification, and more. This approach is no longer limited to a single modality, such as text, and in this guide, we will illustrate how you can solve image-text tasks with a large multimodal model called IDEFICS.

IDEFICS is an open-access vision and language model based on Flamingo, a state-of-the-art visual language model initially developed by DeepMind. The model accepts arbitrary sequences of image and text inputs and generates coherent text as output. It can answer questions about images, describe visual content, create stories grounded in multiple images, and so on. IDEFICS comes in two variants, with 80 billion and 9 billion parameters, both of which are available on the 🤗 Hub. For each variant, you can also find fine-tuned instructed versions of the model adapted for conversational use cases.

This model is exceptionally versatile and can be used for a wide range of image and multimodal tasks. However, being a large model means it requires significant computational resources and infrastructure. It is up to you to decide whether this approach suits your use case better than fine-tuning specialized models for each individual task.

In this guide, you’ll learn how to:

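Load IDEFICS and load the quantized version of the model
Use IDEFICS for:
    Image captioning
    Prompted image captioning
    Few-shot prompting
    Visual question answering
    Image classification
    Image-guided text generation
Run inference in batch mode
Run IDEFICS instruct for conversational use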

Before you begin, make sure you have all the necessary libraries installed.

pip install -q bitsandbytes sentencepiece accelerate transformers

To run the following examples with a non-quantized version of the model checkpoint you will need at least 20GB of GPU memory.

Loading the model

Let’s start by loading the model’s 9-billion-parameter checkpoint:

>>> checkpoint = "HuggingFaceM4/idefics-9b"

Just like for other Transformers models, you need to load a processor and the model itself from the checkpoint. The IDEFICS processor wraps a LlamaTokenizer and IDEFICS image processor into a single processor to take care of preparing text and image inputs for the model.

>>> import torch

>>> from transformers import IdeficsForVisionText2Text, AutoProcessor

>>> processor = AutoProcessor.from_pretrained(checkpoint)

>>> model = IdeficsForVisionText2Text.from_pretrained(checkpoint, dtype=torch.bfloat16, device_map="auto")

Setting device_map to "auto" will automatically determine how to load and store the model weights in the most optimized manner given existing devices.

Quantized model

If high-memory device availability is an issue, you can load the quantized version of the model. To load the model in 4-bit precision, pass a BitsAndBytesConfig to the from_pretrained method and the model will be compressed on the fly while loading. The processor is loaded as before.

>>> import torch
>>> from transformers import IdeficsForVisionText2Text, AutoProcessor, BitsAndBytesConfig

>>> quantization_config = BitsAndBytesConfig(
...     load_in_4bit=True,
...     bnb_4bit_compute_dtype=torch.float16,
... )

>>> processor = AutoProcessor.from_pretrained(checkpoint)

>>> model = IdeficsForVisionText2Text.from_pretrained(
...     checkpoint,
...     quantization_config=quantization_config,
...     device_map="auto"
... )
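To get a feel for how much memory each loading option needs, you can estimate the size of the weights from the parameter count and the number of bits stored per parameter. The sketch below is a back-of-the-envelope calculation, not a measurement: the helper name `estimate_weight_memory_gb` is hypothetical, the parameter count is assumed to be exactly 9 billion, and activations, the key-value cache, and framework overhead are ignored.

```python
def estimate_weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_per_param = bits_per_param / 8
    return num_params * bytes_per_param / 1e9


NUM_PARAMS = 9e9  # assumption: nominal size of the 9B checkpoint

for label, bits in [("float32", 32), ("bfloat16", 16), ("4-bit", 4)]:
    print(f"{label}: ~{estimate_weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

At 16 bits per parameter the weights alone come to about 18 GB, which is in line with the 20GB figure quoted earlier. At 4 bits the same weights would take roughly 4.5 GB, although real usage is likely to be higher because not every layer is necessarily quantized.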

Now that you have the model loaded in one of the suggested ways, let’s move on to exploring tasks that you can use IDEFICS for.

Image captioning

Image captioning is the task of predicting a caption for a given image. A common application is to help visually impaired people navigate different situations, for instance, to explore image content online.

To illustrate the task, get an image to be captioned, e.g.:

Photo by Hendo Wang.

IDEFICS accepts text and image prompts. However, to caption an image, you do not have to provide a text prompt to the model, only the preprocessed input image. Without a text prompt, the model will start generating text from the BOS (beginning-of-sequence) token, thus creating a caption.

As image input to the model, you can use either an image object (PIL.Image) or a URL from which the image can be retrieved.

>>> prompt = [
...     "https://images.unsplash.com/photo-1583160247711-2191776b4b91?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3542&q=80",
... ]

>>> inputs = processor(prompt, return_tensors="pt").to(model.device)
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, max_new_tokens=10, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> print(generated_text[0])
A puppy in a flower bed

It is a good idea to include bad_words_ids in the call to generate to avoid errors that can arise when you increase max_new_tokens: without it, the model may try to generate a new <image> or <fake_token_around_image> token even though it is not producing an image. You can set it on the fly as in this guide, or store it in the GenerationConfig as described in the Text generation strategies guide.
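The same three steps (process the prompt, generate, decode) recur in every example of this guide, so it can be convenient to wrap them in one function that always applies bad_words_ids. The following is a minimal sketch under that assumption. The name `generate_text` is hypothetical and not part of Transformers, and it only relies on the processor and model calls already shown above.

```python
def generate_text(model, processor, prompt, max_new_tokens=10, **generate_kwargs):
    """Process one prompt, generate a continuation, and return the decoded text."""
    inputs = processor(prompt, return_tensors="pt").to(model.device)
    # Forbid the image placeholder tokens so longer generations cannot emit them.
    bad_words_ids = processor.tokenizer(
        ["<image>", "<fake_token_around_image>"], add_special_tokens=False
    ).input_ids
    generated_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        bad_words_ids=bad_words_ids,
        **generate_kwargs,
    )
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
```

With the model and processor loaded as above, `generate_text(model, processor, prompt)` would replace the four lines of the captioning example, and extra keyword arguments such as `num_beams` are forwarded to generate unchanged.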

Prompted image captioning

You can extend image captioning by providing a text prompt, which the model will continue given the image. Let’s take another image to illustrate:

Photo by Denys Nevozhai.

Textual and image prompts can be passed to the model’s processor as a single list to create appropriate inputs.

>>> prompt = [
...     "https://images.unsplash.com/photo-1543349689-9a4d426bee8e?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3501&q=80",
...     "This is an image of ",
... ]

>>> inputs = processor(prompt, return_tensors="pt").to(model.device)
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, max_new_tokens=10, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> print(generated_text[0])
This is an image of the Eiffel Tower in Paris, France.

Few-shot prompting

While IDEFICS demonstrates great zero-shot results, your task may require a certain format of the caption, or come with other restrictions or requirements that increase the task’s complexity. Few-shot prompting can be used to enable in-context learning. By providing examples in the prompt, you can steer the model to generate results that mimic the format of given examples.

Let’s use the previous image of the Eiffel Tower as an example for the model and build a prompt that demonstrates to the model that in addition to learning what the object in an image is, we would also like to get some interesting information about it. Then let’s see if we can get the same response format for an image of the Statue of Liberty:

Photo by Juan Mayobre.

>>> prompt = ["User:",
...            "https://images.unsplash.com/photo-1543349689-9a4d426bee8e?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3501&q=80",
...            "Describe this image.\nAssistant: An image of the Eiffel Tower at night. Fun fact: the Eiffel Tower is the same height as an 81-storey building.\n",
...            "User:",
...            "https://images.unsplash.com/photo-1524099163253-32b7f0256868?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3387&q=80",
...            "Describe this image.\nAssistant:"
...            ]

>>> inputs = processor(prompt, return_tensors="pt").to(model.device)
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, max_new_tokens=30, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> print(generated_text[0])
User: Describe this image.
Assistant: An image of the Eiffel Tower at night. Fun fact: the Eiffel Tower is the same height as an 81-storey building. 
User: Describe this image.
Assistant: An image of the Statue of Liberty. Fun fact: the Statue of Liberty is 151 feet tall.

Notice that just from a single example (i.e., 1-shot) the model has learned how to perform the task. For more complex tasks, feel free to experiment with a larger number of examples (e.g., 3-shot, 5-shot, etc.).
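When experimenting with more shots, building the interleaved prompt by hand gets tedious. The sketch below assembles the same User/Assistant layout used above from a list of (image, answer) pairs. The helper name `build_few_shot_prompt` and the example.com URLs are hypothetical placeholders, not part of the original guide.

```python
def build_few_shot_prompt(examples, query_image, instruction="Describe this image."):
    """Build an interleaved prompt from (image, answer) examples plus a query image."""
    prompt = []
    for image, answer in examples:
        prompt += ["User:", image, f"{instruction}\nAssistant: {answer}\n"]
    # The final turn ends with "Assistant:" so that the model completes the answer.
    prompt += ["User:", query_image, f"{instruction}\nAssistant:"]
    return prompt


# Placeholder URLs; substitute real image URLs or PIL.Image objects.
examples = [
    (
        "https://example.com/eiffel-tower.jpg",
        "An image of the Eiffel Tower at night. Fun fact: the Eiffel Tower "
        "is the same height as an 81-storey building.",
    ),
]
prompt = build_few_shot_prompt(examples, "https://example.com/statue-of-liberty.jpg")
```

The resulting list has the same structure as the hand-written 1-shot prompt, so it can be passed to the processor in the same way. Adding more pairs to `examples` yields a 3-shot or 5-shot prompt.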

Visual question answering

Visual Question Answering (VQA) is the task of answering open-ended questions based on an image. Similar to image captioning, it can be used in accessibility applications, but also in education (reasoning about visual materials), customer service (questions about products based on images), and image retrieval.

Let’s get a new image for this task:

Photo by Jarritos Mexican Soda.

You can steer the model from image captioning to visual question answering by prompting it with appropriate instructions:

>>> prompt = [
...     "Instruction: Provide an answer to the question. Use the image to answer.\n",
...     "https://images.unsplash.com/photo-1623944889288-cd147dbb517c?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3540&q=80",
...     "Question: Where are these people and what's the weather like? Answer:"
... ]

>>> inputs = processor(prompt, return_tensors="pt").to(model.device)
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, max_new_tokens=20, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> print(generated_text[0])
Instruction: Provide an answer to the question. Use the image to answer.
 Question: Where are these people and what's the weather like? Answer: They're in a park in New York City, and it's a beautiful day.
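Note that the decoded text contains the prompt followed by the model’s continuation. If you only need the answer, one simple option is to keep whatever follows the last "Answer:" marker. This is a minimal string-processing sketch, and the helper name `extract_completion` is hypothetical.

```python
def extract_completion(generated_text: str, marker: str = "Answer:") -> str:
    """Return the text after the last occurrence of `marker`, without surrounding whitespace."""
    # rpartition returns the whole string as the tail when the marker is absent,
    # so the full text is returned in that case.
    tail = generated_text.rpartition(marker)[2]
    return tail.strip()


generated = (
    "Instruction: Provide an answer to the question. Use the image to answer.\n"
    " Question: Where are these people and what's the weather like? "
    "Answer: They're in a park in New York City, and it's a beautiful day."
)
print(extract_completion(generated))
```

For the output shown above, this prints only the sentence beginning with "They're in a park".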

Image classification

IDEFICS is capable of classifying images into different categories without being explicitly trained on data containing labeled examples from those specific categories. Given a list of categories and using its image and text understanding capabilities, the model can infer which category the image likely belongs to.

Say, we have this image of a vegetable stand:

Photo by Peter Wendt.

We can instruct the model to classify the image into one of the categories that we have:

>>> categories = ['animals','vegetables', 'city landscape', 'cars', 'office']
>>> prompt = [f"Instruction: Classify the following image into a single category from the following list: {categories}.\n",
...     "https://images.unsplash.com/photo-1471193945509-9ad0617afabf?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3540&q=80",    
...     "Category: "
... ]

>>> inputs = processor(prompt, return_tensors="pt").to(model.device)
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, max_new_tokens=6, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> print(generated_text[0])
Instruction: Classify the following image into a single category from the following list: ['animals', 'vegetables', 'city landscape', 'cars', 'office'].
Category: Vegetables

In the example above, we instruct the model to classify the image into a single category. However, you can also prompt the model to do rank classification.
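Because the model answers in free text (here "Vegetables" with a capital letter), you may want to map the output back onto your own label list. The sketch below does this with a case-insensitive prefix match on the text after "Category:". The helper name `parse_category` is hypothetical, and the matching rule is an assumption that will not suit label sets where one label is a prefix of another.

```python
def parse_category(generated_text, categories, marker="Category:"):
    """Map the text after the last `marker` onto one of `categories`, ignoring case."""
    prediction = generated_text.rpartition(marker)[2].strip().lower()
    for category in categories:
        if prediction.startswith(category.lower()):
            return category
    return None  # the model answered with something outside the list


categories = ["animals", "vegetables", "city landscape", "cars", "office"]
generated = (
    "Instruction: Classify the following image into a single category from the "
    f"following list: {categories}.\nCategory: Vegetables"
)
print(parse_category(generated, categories))
```

Returning None for unmatched answers makes it easy to spot cases where the model ignored the list.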

Image-guided text generation

For more creative applications, you can use image-guided text generation to generate text based on an image. This can be useful to create descriptions of products, ads, descriptions of a scene, etc.

Let’s prompt IDEFICS to write a story based on a simple image of a red door:

Image of a red door with a pumpkin on the steps

Photo by Craig Tidball.

>>> prompt = ["Instruction: Use the image to write a story. \n",
...     "https://images.unsplash.com/photo-1517086822157-2b0358e7684a?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=2203&q=80",
...     "Story: \n"]

>>> inputs = processor(prompt, return_tensors="pt").to(model.device)
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, num_beams=2, max_new_tokens=200, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> print(generated_text[0]) 
Instruction: Use the image to write a story. 
 Story: 
Once upon a time, there was a little girl who lived in a house with a red door.  She loved her red door.  It was the prettiest door in the whole world.

One day, the little girl was playing in her yard when she noticed a man standing on her doorstep.  He was wearing a long black coat and a top hat.

The little girl ran inside and told her mother about the man.

Her mother said, “Don’t worry, honey.  He’s just a friendly ghost.”

The little girl wasn’t sure if she believed her mother, but she went outside anyway.

When she got to the door, the man was gone.

The next day, the little girl was playing in her yard again when she noticed the man standing on her doorstep.

He was wearing a long black coat and a top hat.

The little girl ran

Looks like IDEFICS noticed the pumpkin on the doorstep and went with a spooky Halloween story about a ghost.

For longer outputs like this, you will greatly benefit from tweaking the text generation strategy. This can help you significantly improve the quality of the generated output. Check out Text generation strategies to learn more.
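As one possible starting point, the generation settings can be collected in a dictionary and passed to generate. All keys below are standard generate parameters, but the values are illustrative assumptions that have not been tuned for IDEFICS.

```python
# Illustrative values only (assumptions, not tuned for IDEFICS).
generation_kwargs = {
    "max_new_tokens": 200,
    "do_sample": True,          # sample from the distribution instead of beam search
    "top_p": 0.9,               # nucleus sampling: keep the smallest token set with 90% of the mass
    "temperature": 0.8,         # values below 1 sharpen the distribution
    "no_repeat_ngram_size": 3,  # never produce the same 3-gram twice
}

# With the model, inputs, and bad_words_ids from the example above:
# generated_ids = model.generate(**inputs, bad_words_ids=bad_words_ids, **generation_kwargs)
```

Settings such as no_repeat_ngram_size target the kind of repetition visible in the story above, where the same sentence about the man in the black coat appears twice.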

Running inference in batch mode

All of the earlier sections illustrated IDEFICS for a single example. In a very similar fashion, you can run inference for a batch of examples by passing a list of prompts:

>>> prompts = [
...     [   "https://images.unsplash.com/photo-1543349689-9a4d426bee8e?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3501&q=80",
...         "This is an image of ",
...     ],
...     [   "https://images.unsplash.com/photo-1623944889288-cd147dbb517c?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3540&q=80",
...         "This is an image of ",
...     ],
...     [   "https://images.unsplash.com/photo-1471193945509-9ad0617afabf?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3540&q=80",
...         "This is an image of ",
...     ],
... ]

>>> inputs = processor(prompts, return_tensors="pt").to(model.device)
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, max_new_tokens=10, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> for i, t in enumerate(generated_text):
...     print(f"{i}:\n{t}\n") 
0:
This is an image of the Eiffel Tower in Paris, France.

1:
This is an image of a couple on a picnic blanket.

2:
This is an image of a vegetable stand.
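If you have more prompts than fit in memory at once, you can split them into fixed-size batches and run each batch separately. The helper below is a plain-Python sketch. The name `chunked` is hypothetical, and a suitable batch size depends on your hardware.

```python
def chunked(items, batch_size):
    """Yield successive slices of `items` with at most `batch_size` elements each."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Each element stands in for one prompt list like those shown above.
all_prompts = [["image-1", "This is an image of "],
               ["image-2", "This is an image of "],
               ["image-3", "This is an image of "]]
batches = list(chunked(all_prompts, batch_size=2))
print([len(batch) for batch in batches])
```

Each batch can then be passed to the processor and to generate exactly as in the example above, and the decoded texts concatenated in order.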

IDEFICS instruct for conversational use

For conversational use cases, you can find fine-tuned instructed versions of the model on the 🤗 Hub: HuggingFaceM4/idefics-80b-instruct and HuggingFaceM4/idefics-9b-instruct.

These checkpoints are the result of fine-tuning the respective base models on a mixture of supervised and instruction fine-tuning datasets, which boosts the downstream performance while making the models more usable in conversational settings.

Using and prompting the instruct checkpoints is very similar to using the base models:

>>> import torch
>>> from transformers import IdeficsForVisionText2Text, AutoProcessor

>>> checkpoint = "HuggingFaceM4/idefics-9b-instruct"
>>> model = IdeficsForVisionText2Text.from_pretrained(checkpoint, dtype=torch.bfloat16, device_map="auto")
>>> processor = AutoProcessor.from_pretrained(checkpoint)

>>> prompts = [
...     [
...         "User: What is in this image?",
...         "https://upload.wikimedia.org/wikipedia/commons/8/86/Id%C3%A9fix.JPG",
...         "<end_of_utterance>",
...         "\nAssistant: This picture depicts Idefix, the dog of Obelix in Asterix and Obelix. Idefix is running on the ground.<end_of_utterance>",
...         "\nUser:",
...         "https://static.wikia.nocookie.net/asterix/images/2/25/R22b.gif/revision/latest?cb=20110815073052",
...         "And who is that?<end_of_utterance>",
...         "\nAssistant:",
...     ],
... ]

>>> # --batched mode
>>> inputs = processor(prompts, add_end_of_utterance_token=False, return_tensors="pt").to(model.device)
>>> # --single sample mode
>>> # inputs = processor(prompts[0], return_tensors="pt").to(model.device)

>>> # Generation args
>>> exit_condition = processor.tokenizer("<end_of_utterance>", add_special_tokens=False).input_ids
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, eos_token_id=exit_condition, bad_words_ids=bad_words_ids, max_length=100)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> for i, t in enumerate(generated_text):
...     print(f"{i}:\n{t}\n")
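For longer conversations, the turn structure with <end_of_utterance> markers can be generated from a list of turns instead of being written by hand. The sketch below follows the layout of the example above, with the simplifying assumption that an image, when present, always comes right after the role prefix. The helper name `build_chat_prompt` and the example.com URL are hypothetical.

```python
END_OF_UTTERANCE = "<end_of_utterance>"


def build_chat_prompt(turns):
    """Build a prompt list from (role, text, image_or_None) turns, ending with an open Assistant turn."""
    prompt = []
    for index, (role, text, image) in enumerate(turns):
        prefix = "" if index == 0 else "\n"  # every turn after the first starts on a new line
        if image is None:
            prompt.append(f"{prefix}{role}: {text}{END_OF_UTTERANCE}")
        else:
            prompt += [f"{prefix}{role}:", image, f"{text}{END_OF_UTTERANCE}"]
    prompt.append("\nAssistant:")  # left open for the model to complete
    return prompt


chat = build_chat_prompt([
    ("User", "What is in this image?", "https://example.com/first.jpg"),  # placeholder URL
    ("Assistant", "This picture depicts Idefix, the dog of Obelix.", None),
    ("User", "And what is he doing?", None),
])
```

The returned list can be wrapped in an outer list and passed to the processor with add_end_of_utterance_token=False, as in the batched call above.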
