Multimodal IDEFICS: Unveiling the Transparency & Power of Open Visual Language Models

Published January 8, 2024


Introduction

In a landscape where innovation is frequently obscured by intricate proprietary walls, the advent of IDEFICS (short for Image-aware Decoder Enhanced à la Flamingo with Interleaved Cross-attentionS) represents a breakthrough in open visual language models.


Definitions:

  1. Hugging Face: A platform renowned for democratizing access to AI models, fostering collaboration, and facilitating the dissemination of groundbreaking technologies.

  2. IDEFICS Model: A state-of-the-art open-access visual language model, IDEFICS is an 80-billion parameter model (with a smaller 9-billion parameter variant) that accepts arbitrary sequences of images and text and generates coherent text in response. Built on DeepMind's Flamingo architecture, it offers an open counterpart to closed multimodal systems such as GPT-4.

Advantages of IDEFICS:

  1. Transparency Initiatives: IDEFICS champions transparency by relying solely on publicly available data and models, distinguishing itself from closed-source counterparts. Its release of tools for exploring the training data, together with its open disclosure of the technical challenges encountered during development, fosters a collaborative ethos within the AI community.

  2. Adversarial Prompt Evaluation: A notable facet of IDEFICS is its proactive approach to mitigating harm: the model was evaluated against adversarial prompts prior to release, underscoring its developers' commitment to ethical AI.

Applications of IDEFICS:

IDEFICS possesses a wide array of applications, particularly excelling in tasks that intertwine image and text inputs. Some prominent applications include:

  1. Visual Question Answering: IDEFICS adeptly answers questions based on images, rendering it invaluable for image-based quizzes and information retrieval.

  2. Image Captioning: Capable of generating descriptive captions for images, IDEFICS significantly enhances accessibility and comprehension of visual content (a minimal captioning sketch follows this list).

  3. Story Generation: Leveraging multiple images, IDEFICS crafts narratives and stories, showcasing its creative potential in storytelling applications.

  4. Text Generation: Beyond its primary multimodal focus, IDEFICS can generate text independently of visual inputs, demonstrating versatility across various natural language understanding and generation tasks.

  5. Custom Data Fine-tuning: Users can fine-tune the base models with custom data, tailoring IDEFICS' responses to specific use cases (a hedged fine-tuning sketch appears after the code example below).

  6. Instruction Following: Instructed versions of IDEFICS excel at following user instructions, making them ideal for chatbots and conversational AI.

These applications underscore the versatility and adaptability of IDEFICS, positioning it as a multifaceted tool capable of catering to a diverse range of tasks merging visual and textual inputs.
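
To make the image-captioning use case concrete, here is a minimal sketch built on the same transformers API shown in full in the code section below. The checkpoint name and image URL are reused from that example; the prompt wording is illustrative, not prescribed.

import torch
from transformers import IdeficsForVisionText2Text, AutoProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
checkpoint = "HuggingFaceM4/idefics-9b-instruct"

model = IdeficsForVisionText2Text.from_pretrained(checkpoint, torch_dtype=torch.bfloat16).to(device)
processor = AutoProcessor.from_pretrained(checkpoint)

# Single-image captioning: the image URL is interleaved directly into the prompt.
prompt = [
    "User:",
    "https://upload.wikimedia.org/wikipedia/commons/8/86/Id%C3%A9fix.JPG",
    "Describe this image in one sentence.<end_of_utterance>",
    "\nAssistant:",
]

inputs = processor(prompt, return_tensors="pt").to(device)
generated_ids = model.generate(**inputs, max_length=80)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])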

Code Implementation:

The integration of IDEFICS into the Hugging Face Hub and its support in recent versions of transformers mark a crucial milestone. The following example shows how to run inference with the 9-billion parameter instructed checkpoint:

import torch
from transformers import IdeficsForVisionText2Text, AutoProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

checkpoint = "HuggingFaceM4/idefics-9b-instruct"
model = IdeficsForVisionText2Text.from_pretrained(checkpoint, torch_dtype=torch.bfloat16).to(device)
processor = AutoProcessor.from_pretrained(checkpoint)

# We feed to the model an arbitrary sequence of text strings and images. Images can be either URLs or PIL Images.
prompts = [
    [
        "User: What is in this image?",
        "https://upload.wikimedia.org/wikipedia/commons/8/86/Id%C3%A9fix.JPG",
        "<end_of_utterance>",

        "\nAssistant: This picture depicts Idefix, the dog of Obelix in Asterix and Obelix. Idefix is running on the ground.<end_of_utterance>",

        "\nUser:",
        "https://static.wikia.nocookie.net/asterix/images/2/25/R22b.gif/revision/latest?cb=20110815073052",
        "And who is that?<end_of_utterance>",

        "\nAssistant:",
    ],
]

# --batched mode
inputs = processor(prompts, add_end_of_utterance_token=False, return_tensors="pt").to(device)
# --single sample mode
# inputs = processor(prompts[0], return_tensors="pt").to(device)

# Generation args: stop decoding at "<end_of_utterance>" and prevent the model
# from emitting the special image-placeholder tokens in its text output.
exit_condition = processor.tokenizer("<end_of_utterance>", add_special_tokens=False).input_ids
bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

generated_ids = model.generate(**inputs, eos_token_id=exit_condition, bad_words_ids=bad_words_ids, max_length=100)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
for i, t in enumerate(generated_text):
    print(f"{i}:\n{t}\n")

Conclusion:

IDEFICS isn't merely a model; it's a testament to the potential unlocked through collaborative openness. As it stands on the shoulders of Flamingo, it beckons the AI community to embrace transparency, ethical AI, and a shared vision of progress.

The journey of IDEFICS is not a solitary one; it's a collective endeavor, inviting enthusiasts and experts alike to shape the future of open AI models.

Stay connected and support my work through various platforms:

Medium: You can read my latest articles and insights on Medium at https://medium.com/@andysingal

PayPal: Enjoyed my article? Buy me a coffee! https://paypal.me/alphasingal?country.x=US&locale.x=en_US

Requests and questions: If you have a project in mind that you’d like me to work on or if you have any questions about the concepts I’ve explained, don’t hesitate to let me know. I’m always looking for new ideas for future Notebooks and I love helping to resolve any doubts you might have.

Resources: