Transformers documentation

Idefics2

Hugging Face's logo
Join the Hugging Face community

and get access to the augmented documentation experience

to get started

Idefics2

Overview

The Idefics2 model was created by the Hugging Face M4 team and authored by Léo Tronchon, Hugo Laurencon, Victor Sanh. The accompanying blog post can be found here.

Idefics2 is an open multimodal model that accepts arbitrary sequences of image and text inputs and produces text outputs. The model can answer questions about images, describe visual content, create stories grounded on multiple images, or simply behave as a pure language model without visual inputs. It improves upon IDEFICS-1, notably on document understanding, OCR, or visual reasoning. Idefics2 is lightweight (8 billion parameters) and treats images in their native aspect ratio and resolution, which allows for varying inference efficiency.

Tips:

  • Each sample can contain multiple images, and the number of images can vary between samples. The processor will pad the inputs to the maximum number of images in a batch for input to the model.
  • The processor has a do_image_splitting option. If True, each input image will be split into 4 sub-images, and concatenated with the original to form 5 images. This is useful for increasing model performance. Make sure processor.image_processor.do_image_splitting is set to False if the model was not trained with this option.
  • text passed to the processor should have the <image> tokens where the images should be inserted. And <end_of_utterance> at the end of each utterance if the text is a chat message.
  • The processor has its own apply_chat_template method to convert chat messages to text that can then be passed as text to the processor.

Example of how to use the processor on chat messages:

import requests
from PIL import Image
from transformers import Idefics2Processor, Idefics2ForConditionalGeneration

url_1 = "http://images.cocodataset.org/val2017/000000039769.jpg"
url_2 = "http://images.cocodataset.org/val2017/000000219578.jpg"

image_1 = Image.open(requests.get(url_1, stream=True).raw)
image_2 = Image.open(requests.get(url_2, stream=True).raw)
images = [image_1, image_2]

messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "What’s the difference between these two images?"},
        {"type": "image"},
        {"type": "image"},
    ],
}]

processor = Idefics2Processor.from_pretrained("HuggingFaceM4/idefics2-8b")
model = Idefics2ForConditionalGeneration.from_pretrained("HuggingFaceM4/idefics2-8b")

text = processor.apply_chat_template(messages)
# "User: What’s the difference between these two images?<image><image><end_of_utterance>\n"
print(text)

inputs = processor(images=images, text=text)

generated_text = model.generate(**inputs)

This model was contributed by amyeroberts. The original code can be found here.

Idefics2Config

class transformers.Idefics2Config

< >

( use_cache = True image_token_id = 32001 tie_word_embeddings = False vision_config = None perceiver_config = None text_config = None **kwargs )

Parameters

  • use_cache (bool, optional, defaults to True) — Whether or not the model should cache the key/value pairs of the attention mechanism.
  • image_token_id (int, optional, defaults to 32001) — The id of the “image” token.
  • tie_word_embeddings (bool, optional, defaults to False) — Whether or not to tie the word embeddings with the token embeddings.
  • vision_config (IdeficsVisionConfig or dict, optional) — Custom vision config or dict
  • perceiver_config (IdeficsPerceiverConfig or dict, optional) — Custom perceiver config or dict
  • text_config (MistralConfig or dict, optional) — Custom text config or dict for the text model

This is the configuration class to store the configuration of a Idefics2Model. It is used to instantiate a Idefics2 model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the model of the Idefics2 HuggingFaceM4/idefics2-8b architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.

Example:

>>> from transformers import Idefics2Model, Idefics2Config
>>> # Initializing configuration
>>> configuration = Idefics2Config()
>>> # Initializing a model from the configuration
>>> model = Idefics2Model(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config

Idefics2Model

class transformers.Idefics2Model

< >

( config: Idefics2Config )

Parameters

  • config (Idefics2Config or Idefics2VisionConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

Idefics2 model consisting of a SIGLIP vision encoder and Mistral language decoder This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads etc.)

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.

forward

< >

( input_ids: LongTensor = None attention_mask: Optional = None position_ids: Optional = None past_key_values: Optional = None inputs_embeds: Optional = None pixel_values: Optional = None pixel_attention_mask: Optional = None image_hidden_states: Optional = None use_cache: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None )

Parameters

  • input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it.

    Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details.

    What are input IDs?

  • attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:

    • 1 for tokens that are not masked,
    • 0 for tokens that are masked.

    What are attention masks?

    Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details.

    If past_key_values is used, optionally only the last decoder_input_ids have to be input (see past_key_values).

    If you want to change padding behavior, you should read modeling_opt._prepare_decoder_attention_mask and modify to your needs. See diagram 1 in the paper for more information on the default strategy.

    • 1 indicates the head is not masked,
    • 0 indicates the head is masked.
  • position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence tokens in the position embeddings. Selected in the range [0, config.n_positions - 1]. What are position IDs?
  • past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head)) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head).

    Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.

    If past_key_values are used, the user can optionally input only the last decoder_input_ids (those that don’t have their past key value states given to this model) of shape (batch_size, 1) instead of all decoder_input_ids of shape (batch_size, sequence_length).

  • inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.
  • pixel_values (torch.FloatTensor of shape (batch_size, num_channels, image_size, image_size)) -- The tensors corresponding to the input images. Pixel values can be obtained using [AutoImageProcessor](/docs/transformers/v4.40.1/en/model_doc/auto#transformers.AutoImageProcessor). See [CLIPImageProcessor.__call__()](/docs/transformers/v4.40.1/en/model_doc/fuyu#transformers.FuyuImageProcessor.__call__) for details ([]LlavaProcessor`] uses CLIPImageProcessor for processing images).
  • pixel_attention_mask (torch.Tensor of shape (batch_size, image_size, image_size), optional) — Mask to avoid performing attention on padding pixel indices.
  • image_hidden_states (torch.FloatTensor of shape (batch_size, num_channels, image_size, image_size)) — The hidden states of the image encoder after modality projection and perceiver resampling.
  • use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
  • output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
  • output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
  • return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.

The Idefics2Model forward method, overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Inputs fed to the model can have an arbitrary number of images. To account for this, pixel_values fed to the model have image padding -> (batch_size, max_num_images, 3, max_heights, max_widths) where max_num_images is the maximum number of images among the batch_size samples in the batch.

Padding images are not needed beyond padding the pixel_values at the entrance of the model. For efficiency, we only pass through the vision_model’s forward the real images by discarding the padding images i.e. pixel_values of size (image_batch_size, 3, height, width) where image_batch_size would be 7 when num_images_per_sample=[1, 3, 1, 2] and max_num_images would be 3.

Idefics2ForConditionalGeneration

class transformers.Idefics2ForConditionalGeneration

< >

( config )

Parameters

  • config (Idefics2Config or Idefics2VisionConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The Idefics2 Model with a language modeling head. It is made up a SigLIP vision encoder, with a language modeling head on top. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads etc.)

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.

forward

< >

( input_ids: LongTensor = None attention_mask: Optional = None position_ids: Optional = None past_key_values: Optional = None inputs_embeds: Optional = None pixel_values: Optional = None pixel_attention_mask: Optional = None image_hidden_states: Optional = None labels: Optional = None use_cache: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None ) transformers.models.idefics2.modeling_idefics2.Idefics2CausalLMOutputWithPast or tuple(torch.FloatTensor)

Parameters

  • input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it.

    Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details.

    What are input IDs?

  • attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:

    • 1 for tokens that are not masked,
    • 0 for tokens that are masked.

    What are attention masks?

    Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details.

    If past_key_values is used, optionally only the last decoder_input_ids have to be input (see past_key_values).

    If you want to change padding behavior, you should read modeling_opt._prepare_decoder_attention_mask and modify to your needs. See diagram 1 in the paper for more information on the default strategy.

    • 1 indicates the head is not masked,
    • 0 indicates the head is masked.
  • position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence tokens in the position embeddings. Selected in the range [0, config.n_positions - 1]. What are position IDs?
  • past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head)) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head).

    Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.

    If past_key_values are used, the user can optionally input only the last decoder_input_ids (those that don’t have their past key value states given to this model) of shape (batch_size, 1) instead of all decoder_input_ids of shape (batch_size, sequence_length).

  • inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.
  • pixel_values (torch.FloatTensor of shape (batch_size, num_channels, image_size, image_size)) -- The tensors corresponding to the input images. Pixel values can be obtained using [AutoImageProcessor](/docs/transformers/v4.40.1/en/model_doc/auto#transformers.AutoImageProcessor). See [CLIPImageProcessor.__call__()](/docs/transformers/v4.40.1/en/model_doc/fuyu#transformers.FuyuImageProcessor.__call__) for details ([]LlavaProcessor`] uses CLIPImageProcessor for processing images).
  • pixel_attention_mask (torch.Tensor of shape (batch_size, image_size, image_size), optional) — Mask to avoid performing attention on padding pixel indices.
  • image_hidden_states (torch.FloatTensor of shape (batch_size, num_channels, image_size, image_size)) — The hidden states of the image encoder after modality projection and perceiver resampling.
  • use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
  • output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
  • output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
  • return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.

    Args — labels (torch.LongTensor of shape (batch_size, sequence_length), optional): Labels for computing the masked language modeling loss. Indices should either be in [0, ..., config.vocab_size] or -100 (see input_ids docstring). Tokens with indices set to -100 are ignored (masked), the loss is only computed for the tokens with labels in [0, ..., config.vocab_size].

Returns

transformers.models.idefics2.modeling_idefics2.Idefics2CausalLMOutputWithPast or tuple(torch.FloatTensor)

A transformers.models.idefics2.modeling_idefics2.Idefics2CausalLMOutputWithPast or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (Idefics2Config) and inputs.

  • loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Language modeling loss (for next-token prediction).
  • logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
  • past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head)) Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
  • hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
  • attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
  • image_hidden_states (tuple(torch.FloatTensor), optional) — Tuple of torch.FloatTensor (one for the output of the image embeddings, (batch_size, num_images, sequence_length, hidden_size). image_hidden_states of the model produced by the vision encoder, and optionally by the perceiver

The Idefics2ForConditionalGeneration forward method, overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Example:

>>> import requests
>>> import torch
>>> from PIL import Image
>>> from io import BytesIO

>>> from transformers import AutoProcessor, AutoModelForVision2Seq
>>> from transformers.image_utils import load_image

>>> # Note that passing the image urls (instead of the actual pil images) to the processor is also possible
>>> image1 = load_image("https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg")
>>> image2 = load_image("https://cdn.britannica.com/59/94459-050-DBA42467/Skyline-Chicago.jpg")
>>> image3 = load_image("https://cdn.britannica.com/68/170868-050-8DDE8263/Golden-Gate-Bridge-San-Francisco.jpg")

>>> processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b-base")
>>> model = AutoModelForVision2Seq.from_pretrained("HuggingFaceM4/idefics2-8b-base", device_map="auto")

>>> BAD_WORDS_IDS = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids
>>> EOS_WORDS_IDS = [processor.tokenizer.eos_token_id]

>>> # Create inputs
>>> prompts = [
...   "<image>In this image, we can see the city of New York, and more specifically the Statue of Liberty.<image>In this image,",
...   "In which city is that bridge located?<image>",
... ]
>>> images = [[image1, image2], [image3]]
>>> inputs = processor(text=prompts, padding=True, return_tensors="pt").to("cuda")

>>> # Generate
>>> generated_ids = model.generate(**inputs, bad_words_ids=BAD_WORDS_IDS, max_new_tokens=20)
>>> generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)

>>> print(generated_texts)
['In this image, we can see the city of New York, and more specifically the Statue of Liberty. In this image, we can see the city of New York, and more specifically the Statue of Liberty.\n\n', 'In which city is that bridge located?\n\nThe bridge is located in the city of Pittsburgh, Pennsylvania.\n\n\nThe bridge is']

Idefics2ImageProcessor

class transformers.Idefics2ImageProcessor

< >

( do_convert_rgb: bool = True do_resize: bool = True size: Dict = None resample: Resampling = <Resampling.BILINEAR: 2> do_rescale: bool = True rescale_factor: float = 0.00392156862745098 do_normalize: bool = True image_mean: Union = None image_std: Union = None do_pad: bool = True do_image_splitting: bool = False **kwargs )

Parameters

  • do_convert_rgb (bool, optional, defaults to True) — Whether to convert the image to RGB. This is useful if the input image is of a different format e.g. RGBA. Only has an effect if the input image is in the PIL format.
  • do_resize (bool, optional, defaults to True) — Whether to resize the image. The longest edge of the image is resized to be <= size["longest_edge"], with the shortest edge resized to keep the input aspect ratio, with a minimum size of size["shortest_edge"].
  • size (Dict, optional) — Controls the size of the output image. This is a dictionary containing the keys “shortest_edge” and “longest_edge”.
  • resample (Resampling, optional, defaults to Resampling.BILINEAR) — Resampling filter to use when resizing the image.
  • do_rescale (bool, optional, defaults to True) — Whether to rescale the image. If set to True, the image is rescaled to have pixel values between 0 and 1.
  • rescale_factor (float, optional, defaults to 1/255) — Rescale factor to rescale the image by if do_rescale is set to True.
  • do_normalize (bool, optional, defaults to True) — Whether to normalize the image. If set to True, the image is normalized to have a mean of image_mean and a standard deviation of image_std.
  • image_mean (float or List[float], optional, defaults to IDEFICS_STANDARD_MEAN) — Mean to use if normalizing the image. This is a float or list of floats the length of the number of channels in the image. Can be overridden by the image_mean parameter in the preprocess method. Can be overridden by the image_mean parameter in the preprocess method.
  • image_std (float or List[float], optional, defaults to IDEFICS_STANDARD_STD) — Standard deviation to use if normalizing the image. This is a float or list of floats the length of the number of channels in the image. Can be overridden by the image_std parameter in the preprocess method. Can be overridden by the image_std parameter in the preprocess method.
  • do_pad (bool, optional, defaults to True) — Whether or not to pad the images to the largest height and width in the batch and number of images per sample in the batch, such that the returned tensor is of shape (batch_size, max_num_images, num_channels, max_height, max_width).
  • do_image_splitting (bool, optional, defaults to False) — Whether to split the image into a sequence 4 equal sub-images concatenated with the original image. That strategy was first introduced in https://arxiv.org/abs/2311.06607.

Constructs a Idefics image processor.

preprocess

< >

( images: Union do_convert_rgb: Optional = None do_resize: Optional = None size: Optional = None resample: Resampling = None do_rescale: Optional = None rescale_factor: Optional = None do_normalize: Optional = None image_mean: Union = None image_std: Union = None do_pad: Optional = None do_image_splitting: Optional = None return_tensors: Union = None input_data_format: Optional = None data_format: Optional = <ChannelDimension.FIRST: 'channels_first'> )

Parameters

  • images (ImageInput) — A list of images to preprocess.
  • do_convert_rgb (bool, optional, defaults to self.do_convert_rgb) — Whether to convert the image to RGB.
  • do_resize (bool, optional, defaults to self.do_resize) — Whether to resize the image.
  • size (Dict[str, int], optional, defaults to self.size) — Size of the image after resizing. Shortest edge of the image is resized to size[“shortest_edge”], with the longest edge resized to keep the input aspect ratio.
  • resample (int, optional, defaults to self.resample) — Resampling filter to use if resizing the image. This can be one of the enum PILImageResampling. Only has an effect if do_resize is set to True.
  • do_rescale (bool, optional, defaults to self.do_rescale) — Whether to rescale the image.
  • rescale_factor (float, optional, defaults to self.rescale_factor) — Rescale factor to rescale the image by if do_rescale is set to True.
  • do_normalize (bool, optional, defaults to self.do_normalize) — Whether to normalize the image.
  • image_mean (float or List[float], optional, defaults to self.image_mean) — Image mean to use for normalization. Only has an effect if do_normalize is set to True.
  • image_std (float or List[float], optional, defaults to self.image_std) — Image standard deviation to use for normalization. Only has an effect if do_normalize is set to True.
  • do_pad (bool, optional, defaults to self.do_pad) — Whether or not to pad the images to the largest height and width in the batch.
  • do_image_splitting (bool, optional, defaults to self.do_image_splitting) — Whether to split the image into a sequence 4 equal sub-images concatenated with the original image. That strategy was first introduced in https://arxiv.org/abs/2311.06607.
  • return_tensors (str or TensorType, optional) — The type of tensors to return. Can be one of:
    • Unset: Return a list of np.ndarray.
    • TensorType.TENSORFLOW or 'tf': Return a batch of type tf.Tensor.
    • TensorType.PYTORCH or 'pt': Return a batch of type torch.Tensor.
    • TensorType.NUMPY or 'np': Return a batch of type np.ndarray.
    • TensorType.JAX or 'jax': Return a batch of type jax.numpy.ndarray.
  • data_format (ChannelDimension or str, optional, defaults to ChannelDimension.FIRST) — The channel dimension format for the output image. Can be one of:
    • "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
    • "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
    • Unset: Use the channel dimension format of the input image.
  • input_data_format (ChannelDimension or str, optional) — The channel dimension format for the input image. If unset, the channel dimension format is inferred from the input image. Can be one of:
    • "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
    • "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
    • "none" or ChannelDimension.NONE: image in (height, width) format.

Preprocess a batch of images.

Idefics2Processor

class transformers.Idefics2Processor

< >

( image_processor tokenizer = None image_seq_len: int = 64 **kwargs )

Parameters

  • image_processor (Idefics2ImageProcessor) — An instance of Idefics2ImageProcessor. The image processor is a required input.
  • tokenizer (PreTrainedTokenizerBase, optional) — An instance of PreTrainedTokenizerBase. This should correspond with the model’s text model. The tokenizer is a required input.
  • image_seq_len (int, optional, defaults to 64) — The length of the image sequence i.e. the number of tokens per image in the input. This parameter is used to build the string from the input prompt and image tokens and should match the config.perceiver_config.resampler_n_latents value for the model used.

Constructs a IDEFICS2 processor which wraps a LLama tokenizer and IDEFICS2 image processor into a single processor.

IdeficsProcessor offers all the functionalities of Idefics2ImageProcessor and LlamaTokenizerFast. See the docstring of call() and decode() for more information.

__call__

< >

( text: Union = None images: Union = None image_seq_len: Optional = None padding: Union = False truncation: Union = None max_length: Optional = None is_split_into_words: bool = False add_special_tokens: bool = True return_tensors: Union = None )

Parameters

  • text (Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]], optional) — The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).

    Wherever an image token, <image> is encountered it is expanded to <fake_token_around_image> + <image> image_seq_len `.

  • images (PIL.Image.Image, np.ndarray, torch.Tensor, List[PIL.Image.Image], List[np.ndarray], List[torch.Tensor], optional) — The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch tensor. If is of type List[ImageInput], it’s assumed that this is for a single prompt i.e. of batch size 1.
  • image_seq_len (int, optional) — The length of the image sequence. If not provided, the default value is used.
  • padding (Union[bool, str, PaddingStrategy], optional, defaults to False) — Padding strategy applied to the input ids. See PreTrainedTokenizerFast.pad() for more information.
  • truncation (Union[bool, str, TruncationStrategy], optional) — Truncation strategy applied to the input ids. See PreTrainedTokenizerFast.truncate for more information.
  • max_length (int, optional) — Maximum length of the returned list and optionally padding/truncation length. See PreTrainedTokenizerFast.call() for more information.
  • is_split_into_words (bool, optional, defaults to False) — Whether the input text is split into words or not. If set to True, the tokenizer will skip the tokenization process and assume the input is already tokenized.
  • add_special_tokens (bool, optional, defaults to True) — Whether to add special tokens or not. See PreTrainedTokenizerFast.call() for more information.
  • return_tensors (Union[str, TensorType], optional) — If set, will return tensors of a particular framework. See PreTrainedTokenizerFast.call() for more information.

Processes the input prompts and returns a BatchEncoding.

Example:

>>> import requests
>>> from transformers import Idefics2Processor
>>> from transformers.image_utils import load_image

>>> processor = Idefics2Processor.from_pretrained("HuggingFaceM4/idefics2-8b", image_seq_len=2)
>>> processor.image_processor.do_image_splitting = False  # Force as False to simplify the example

>>> url1 = "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
>>> url2 = "https://cdn.britannica.com/59/94459-050-DBA42467/Skyline-Chicago.jpg"

>>> image1, image2 = load_image(url1), load_image(url2)
>>> images = [[image1], [image2]]

>>> text = [
...     "<image>In this image, we see",
...     "bla bla bla<image>",
... ]
>>> outputs = processor(text=text, images=images, return_tensors="pt", padding=True)
>>> input_ids = outputs.input_ids
>>> input_tokens = processor.tokenizer.batch_decode(input_ids)
>>> print(input_tokens)
['<s><fake_token_around_image><image><image><fake_token_around_image> In this image, we see', '<s> bla bla bla<fake_token_around_image><image><image><fake_token_around_image>']
< > Update on GitHub