MIDEFICS

MIDEFICS (Medical Image-aware Decoder Enhanced à la Flamingo with Interleaved Cross-attentionS) is a fine-tuned version of IDEFICS-9b-instruct, which is itself a version of the IDEFICS-9b model fine-tuned for instruction following.

MIDEFICS has been fine-tuned specifically for medical question answering about images. It can describe visual content (diagnosing), generate recommendations, and act as a text-only medical language model when no image is supplied.

This is the LoRA adapter model.

Model Details

Model type: Multi-modal model (image+text)
Language(s) (NLP): en
License: MIT
Parent Model: idefics-9b-instruct
Resources for more information:
- Description of OBELICS: OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents
- Original Paper: Flamingo: a Visual Language Model for Few-Shot Learning

IDEFICS is a large English multimodal model that takes sequences of interleaved images and text as input and produces text as output. It shows strong in-context few-shot learning, with performance reported as on par with proprietary closed-source models. This makes IDEFICS a solid foundation for fine-tuning multimodal models on custom data.

IDEFICS leverages two pre-trained unimodal open-access models to facilitate the fusion of visual and textual modalities. Transformer blocks with newly initialized parameters bridge the gap between the vision encoder and the language model. Training data comprises a blend of image-text pairs and unstructured multimodal web documents.

IDEFICS-instruct is obtained by further training IDEFICS on Supervised Fine-Tuning and Instruction Fine-Tuning datasets. This improves downstream performance, making idefics-9b-instruct a strong model at its 9-billion-parameter scale, and makes the model better suited to conversation.

MIDEFICS is obtained by further training idefics-9b-instruct on Supervised Fine-Tuning and Medical Conversation Fine-Tuning datasets. This additional training substantially improves downstream medical performance.

Uses

The model can run inference on multimodal medical tasks in which the input is a text query or question accompanied by one or more images. It has been fine-tuned specifically for medical question answering.
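
As an illustration of the input described above, the sketch below assembles one question and its images into the interleaved list of strings and images that the IDEFICS processor accepts. The helper name `build_medical_prompt` and the example URLs are hypothetical, and the `User:` / `<end_of_utterance>` / `Assistant:` layout is assumed to match the quick-start prompt in this card.

```python
from typing import List, Sequence


def build_medical_prompt(question: str, images: Sequence[object]) -> List[object]:
    """Build a single-turn interleaved prompt: question, images, then the assistant cue.

    `images` may hold URLs (str) or PIL Images; they are passed through unchanged.
    """
    prompt: List[object] = [f"User: {question}"]
    prompt.extend(images)                # one or more images follow the question
    prompt.append("<end_of_utterance>")  # closes the user turn
    prompt.append("\nAssistant:")        # cue for the model to start its answer
    return prompt


# Placeholder URLs, not real medical images.
example = build_medical_prompt(
    "What abnormality is visible in these scans?",
    ["https://example.com/scan_front.png", "https://example.com/scan_side.png"],
)
print(example)
```

Because the end-of-utterance token is written into the prompt by hand, a prompt built this way is meant to be passed to the processor with `add_end_of_utterance_token=False`.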

The model can be fine-tuned further on additional data, which may improve performance.
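
One possible starting point for further fine-tuning is to attach the LoRA adapter to the parent model with its weights left trainable. This is a minimal sketch, not the authors' training setup: it assumes the `peft` library is installed, that the parent model lives at the Hub ID `HuggingFaceM4/idefics-9b-instruct`, and the helper name `load_trainable_midefics` is hypothetical. The training loop, data, and hyperparameters are out of scope here.

```python
def load_trainable_midefics(
    base_id: str = "HuggingFaceM4/idefics-9b-instruct",  # assumed Hub ID of the parent model
    adapter_id: str = "WinterSchool/Midefics-lora-v3",   # adapter checkpoint named in this card
):
    """Load the parent model and attach the LoRA adapter with trainable weights."""
    # Heavy dependencies are imported lazily so that defining this helper stays cheap.
    import torch
    from peft import PeftModel
    from transformers import IdeficsForVisionText2Text

    base = IdeficsForVisionText2Text.from_pretrained(base_id, torch_dtype=torch.bfloat16)
    # is_trainable=True keeps the adapter weights unfrozen so training can continue.
    model = PeftModel.from_pretrained(base, adapter_id, is_trainable=True)
    model.print_trainable_parameters()  # reports how many parameters will be updated
    return model
```

Calling `load_trainable_midefics()` downloads the full 9-billion-parameter parent model, so it needs a machine with enough memory for it.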

How to Get Started with the Model

The quick-start code provided for the base and instruct IDEFICS models can also be used to get started with the MIDEFICS model:

import torch
from transformers import IdeficsForVisionText2Text, AutoProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

checkpoint = "WinterSchool/Midefics-lora-v3"
model = IdeficsForVisionText2Text.from_pretrained(checkpoint, torch_dtype=torch.bfloat16).to(device)
processor = AutoProcessor.from_pretrained(checkpoint)

# The model accepts an arbitrary sequence of text strings and images. Images can be either URLs or PIL Images.
prompts = [
    [
        "User: What is in this image?",
        "https://upload.wikimedia.org/wikipedia/commons/8/86/Id%C3%A9fix.JPG",
        "<end_of_utterance>",

        "\nAssistant: This picture depicts Idefix, the dog of Obelix in Asterix and Obelix. Idefix is running on the ground.<end_of_utterance>",

        "\nUser:",
        "https://static.wikia.nocookie.net/asterix/images/2/25/R22b.gif/revision/latest?cb=20110815073052",
        "And who is that?<end_of_utterance>",

        "\nAssistant:",
    ],
]

# --batched mode
inputs = processor(prompts, add_end_of_utterance_token=False, return_tensors="pt").to(device)
# --single sample mode
# inputs = processor(prompts[0], return_tensors="pt").to(device)

# Generation args: stop at <end_of_utterance> and keep the model from emitting image placeholder tokens
exit_condition = processor.tokenizer("<end_of_utterance>", add_special_tokens=False).input_ids
bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

generated_ids = model.generate(**inputs, eos_token_id=exit_condition, bad_words_ids=bad_words_ids, max_length=100)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
for i, t in enumerate(generated_text):
    print(f"{i}:\n{t}\n")
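
The decoded text contains the whole conversation, prompt included. If only the model's final answer is needed, it can be cut out with plain string handling. The sketch below assumes the answer is whatever follows the last `Assistant:` marker; the helper name `extract_last_reply` and the sample string are hypothetical.

```python
def extract_last_reply(generated: str, marker: str = "Assistant:") -> str:
    """Return the text after the last `marker`, or the whole text if it is absent."""
    _, found, tail = generated.rpartition(marker)
    return tail.strip() if found else generated.strip()


# Hand-written sample standing in for one decoded generation.
sample = (
    "User: What is in this image? \n"
    "Assistant: A small white dog. \n"
    "User: And who is that? \n"
    "Assistant: A large man carrying a menhir."
)
print(extract_last_reply(sample))
```

Applied to the quick-start output, this would be called on each entry of `generated_text`.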
