Fuyu
Overview
The Fuyu model was created by ADEPT, and authored by Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus Odena, Arushi Somani, Sağnak Taşırlar.
The authors introduced Fuyu-8B, a decoder-only multimodal model based on the classic transformers architecture, with query and key normalization. A linear encoder is added to create multimodal embeddings from image inputs.
By treating image tokens like text tokens and using a special image-newline character, the model knows when an image line ends. Image positional embeddings are removed. This avoids the need for different training phases for various image resolutions. With 8 billion parameters and licensed under CC-BY-NC, Fuyu-8B is notable for its ability to handle both text and images, its impressive context size of 16K, and its overall performance.
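The row-by-row layout described above can be sketched in plain Python. This is only an illustration, not the library's implementation: the placeholder strings `<patch>` and `<image-newline>` and the helper name `layout_image_tokens` are hypothetical, and the sketch assumes image dimensions that are exact multiples of the patch size.

```python
from typing import List


def layout_image_tokens(height: int, width: int, patch_size: int = 30) -> List[str]:
    """Lay out one placeholder per patch in raster order and end every row
    of patches with a newline marker (token names are hypothetical)."""
    if height % patch_size or width % patch_size:
        raise ValueError("this sketch assumes dimensions divisible by patch_size")
    rows, cols = height // patch_size, width // patch_size
    tokens: List[str] = []
    for _ in range(rows):
        tokens.extend(["<patch>"] * cols)   # one placeholder per patch in the row
        tokens.append("<image-newline>")    # marks the end of this row of patches
    return tokens


tokens = layout_image_tokens(height=60, width=90)
print(len(tokens))  # 2 rows * (3 patches + 1 newline) = 8
```

Because the end of each row is marked explicitly, the sequence carries the image width without a separate positional embedding for image patches.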
The Fuyu models were trained using bfloat16, but the original inference uses float16. The checkpoints uploaded on the hub use torch_dtype = 'float16', which will be used by the AutoModel API to cast the checkpoints from torch.float32 to torch.float16.
The dtype of the online weights is mostly irrelevant, unless you are using torch_dtype="auto" when initializing a model using model = AutoModelForCausalLM.from_pretrained("path", torch_dtype = "auto"). The reason is that the model will first be downloaded (using the dtype of the checkpoints online), then it will be cast to the default dtype of torch (becomes torch.float32). Users should specify the torch_dtype they want, and if they don’t it will be torch.float32.
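The selection rule in the paragraph above can be summarized with a small stand-alone function. This is a sketch of the described behavior only: `resolve_dtype` is a hypothetical helper, not a transformers function, and dtypes are represented as plain strings so the example runs without PyTorch.

```python
from typing import Optional


def resolve_dtype(checkpoint_dtype: str, torch_dtype: Optional[str] = None) -> str:
    """Mirror the rule described above, using dtype names as plain strings.

    - torch_dtype=None   -> fall back to the framework default (float32)
    - torch_dtype="auto" -> keep the dtype stored in the checkpoint
    - anything else      -> use the dtype the caller asked for
    """
    if torch_dtype is None:
        return "float32"
    if torch_dtype == "auto":
        return checkpoint_dtype
    return torch_dtype


print(resolve_dtype("float16"))              # float32
print(resolve_dtype("float16", "auto"))      # float16
print(resolve_dtype("float16", "bfloat16"))  # bfloat16
```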
Fine-tuning the model in float16 is not recommended and is known to produce nan; as such, the model should be fine-tuned in bfloat16.
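One common reason float16 training produces nan is its narrow range: values above roughly 65504 overflow, while bfloat16 keeps the exponent range of float32. The source does not say this is the specific failure mode for Fuyu, so treat the following as a general illustration. It uses only the standard library; `fits_float16` and `round_trip_bfloat16` are hypothetical helpers, and bfloat16 is emulated by truncating a float32 to its top 16 bits.

```python
import math
import struct


def fits_float16(x: float) -> bool:
    """Return True if x can be packed as a finite IEEE half-precision value."""
    try:
        struct.pack("<e", x)  # "e" is the half-precision format code
        return True
    except (OverflowError, struct.error):
        return False


def round_trip_bfloat16(x: float) -> float:
    """Emulate bfloat16 by keeping the top 16 bits of the float32 encoding
    (truncation rather than round-to-nearest, which is enough for a range check)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


value = 70000.0
print(fits_float16(value))                        # False: beyond the float16 range
print(math.isfinite(round_trip_bfloat16(value)))  # True: still finite in bfloat16
```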
Tips:
- To convert the model, you need to clone the original repository using git clone https://github.com/persimmon-ai-labs/adept-inference, then get the checkpoints:
git clone https://github.com/persimmon-ai-labs/adept-inference
wget path/to/fuyu-8b-model-weights.tar
tar -xvf fuyu-8b-model-weights.tar
python src/transformers/models/fuyu/convert_fuyu_weights_to_hf.py --input_dir /path/to/downloaded/fuyu/weights/ --output_dir /output/path \
--pt_model_path /path/to/fuyu_8b_release/iter_0001251/mp_rank_00/model_optim_rng.pt \
--ada_lib_path /path/to/adept-inference
For the chat model:
wget https://axtkn4xl5cip.objectstorage.us-phoenix-1.oci.customer-oci.com/n/axtkn4xl5cip/b/adept-public-data/o/8b_chat_model_release.tar
tar -xvf 8b_chat_model_release.tar
Then, the model can be loaded via:
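from transformers import FuyuForCausalLM
# from_pretrained is a classmethod: it reads the config saved next to the weights,
# so there is no need to instantiate a randomly initialized model first.
model = FuyuForCausalLM.from_pretrained('/output/path')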
Inputs need to be passed through a specific Processor to have the correct formats. A processor requires an image_processor and a tokenizer. Hence, inputs can be loaded via:
import io
import requests
from PIL import Image
from transformers import AutoTokenizer
from transformers.models.fuyu.processing_fuyu import FuyuProcessor
from transformers.models.fuyu.image_processing_fuyu import FuyuImageProcessor
tokenizer = AutoTokenizer.from_pretrained('adept-hf-collab/fuyu-8b')
image_processor = FuyuImageProcessor()
processor = FuyuProcessor(image_processor=image_processor, tokenizer=tokenizer)
text_prompt = "Generate a coco-style caption.\n"
bus_image_url = "https://huggingface.co/datasets/hf-internal-testing/fixtures-captioning/resolve/main/bus.png"
bus_image_pil = Image.open(io.BytesIO(requests.get(bus_image_url).content))
inputs_to_model = processor(text=text_prompt, images=bus_image_pil)
This model was contributed by Molbap. The original code can be found here.
Fuyu uses a sentencepiece based tokenizer, with a Unigram model. It supports bytefallback, which is only available in tokenizers==0.14.0 for the fast tokenizer. The LlamaTokenizer is used as it is a standard wrapper around sentencepiece.
The authors suggest using the following prompt for image captioning:
"Generate a coco-style caption.\n"
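The byte fallback mentioned above can be pictured with a toy example. This is a deliberately simplified sketch, not the Unigram algorithm or the tokenizers API: it works character by character against a hypothetical vocabulary, and emits one `<0xNN>` token per UTF-8 byte for anything outside it, following the SentencePiece naming convention for byte tokens.

```python
from typing import List, Set


def tokenize_with_byte_fallback(text: str, vocab: Set[str]) -> List[str]:
    """Character-level toy tokenizer: emit the character if it is in the
    vocabulary, otherwise emit one <0xNN> token per UTF-8 byte."""
    tokens: List[str] = []
    for char in text:
        if char in vocab:
            tokens.append(char)
        else:
            # Unknown character: fall back to its raw UTF-8 bytes.
            tokens.extend(f"<0x{byte:02X}>" for byte in char.encode("utf-8"))
    return tokens


toy_vocab = {"a", "b", " "}  # hypothetical vocabulary for illustration
print(tokenize_with_byte_fallback("a é", toy_vocab))
# ['a', ' ', '<0xC3>', '<0xA9>']
```

With byte fallback, no input maps to an unknown token, because every character can be expressed through its bytes.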
FuyuConfig
class transformers.FuyuConfig
( vocab_size = 262144 hidden_size = 4096 intermediate_size = 16384 num_hidden_layers = 36 num_attention_heads = 64 hidden_act = 'relu2' max_position_embeddings = 16384 image_size = 300 patch_size = 30 num_channels = 3 initializer_range = 0.02 layer_norm_eps = 1e-05 use_cache = True tie_word_embeddings = False rope_theta = 25000.0 rope_scaling = None qk_layernorm = True hidden_dropout = 0.0 attention_dropout = 0.0 partial_rotary_factor = 0.5 pad_token_id = None bos_token_id = 1 eos_token_id = 2 text_config = None **kwargs )
Parameters
- vocab_size (int, optional, defaults to 262144) — Vocabulary size of the Fuyu model. Defines the number of different tokens that can be represented by the input_ids passed when calling FuyuForCausalLM.
- hidden_size (int, optional, defaults to 4096) — Dimension of the hidden representations.
- intermediate_size (int, optional, defaults to 16384) — Dimension of the MLP representations.
- num_hidden_layers (int, optional, defaults to 36) — Number of hidden layers in the Transformer decoder.
- num_attention_heads (int, optional, defaults to 64) — Number of attention heads for each attention layer in the Transformer decoder.
- hidden_act (str or function, optional, defaults to "relu2") — The non-linear activation function (function or string) in the decoder.
- max_position_embeddings (int, optional, defaults to 16384) — The maximum sequence length that this model might ever be used with.
- image_size (int, optional, defaults to 300) — The input image size.
- patch_size (int, optional, defaults to 30) — The input vision transformer encoding patch size.
- num_channels (int, optional, defaults to 3) — The input image number of channels.
- initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
- layer_norm_eps (float, optional, defaults to 1e-05) — The epsilon used by the rms normalization layers.
- use_cache (bool, optional, defaults to True) — Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True.
- tie_word_embeddings (bool, optional, defaults to False) — Whether to tie input and output embeddings.
- rope_theta (float, optional, defaults to 25000.0) — The base period of the RoPE embeddings.
- rope_scaling (Dict, optional) — Dictionary containing the scaling configuration for the RoPE embeddings. Currently supports two scaling strategies: linear and dynamic. Their scaling factor must be a float greater than 1. The expected format is {"type": strategy name, "factor": scaling factor}. When using this flag, don’t update max_position_embeddings to the expected new maximum. See the following thread for more information on how these scaling strategies behave: https://www.reddit.com/r/LocalFuyu/comments/14mrgpr/dynamically_scaled_rope_further_increases/. This is an experimental feature, subject to breaking API changes in future versions.
- qk_layernorm (bool, optional, defaults to True) — Whether or not to normalize the queries and keys after projecting the hidden states.
- hidden_dropout (float, optional, defaults to 0.0) — The dropout ratio after applying the MLP to the hidden states.
- attention_dropout (float, optional, defaults to 0.0) — The dropout ratio after computing the attention scores.
- partial_rotary_factor (float, optional, defaults to 0.5) — Fraction of the queries and keys which will have rotary embedding.
- pad_token_id (int, optional) — The id of the padding token.
- bos_token_id (int, optional, defaults to 1) — The id of the beginning-of-sequence token.
- eos_token_id (Union[int, List[int]], optional, defaults to 2) — The id of the end-of-sequence token. Optionally, use a list to set multiple end-of-sequence tokens.
- text_config (dict, optional) — Dictionary of configuration options used to initialize the language model configuration.
This is the configuration class to store the configuration of a FuyuForCausalLM. It is used to instantiate a Fuyu model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of adept/fuyu-8b.
Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.
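To make the defaults above more concrete, the following sketch derives a few quantities from them. It uses a plain dataclass (`ToyFuyuConfig`, a hypothetical stand-in, not the real FuyuConfig) so that it runs without transformers. It assumes the per-head dimension is hidden_size divided by num_attention_heads and that partial_rotary_factor scales that per-head dimension; both are assumptions about the architecture rather than statements from this page.

```python
from dataclasses import dataclass


@dataclass
class ToyFuyuConfig:
    """Plain stand-in holding a few of the documented defaults (not the real FuyuConfig)."""
    hidden_size: int = 4096
    num_attention_heads: int = 64
    partial_rotary_factor: float = 0.5
    image_size: int = 300
    patch_size: int = 30
    num_channels: int = 3


cfg = ToyFuyuConfig()
head_dim = cfg.hidden_size // cfg.num_attention_heads                   # 4096 / 64
rotary_dims = int(cfg.partial_rotary_factor * head_dim)                 # assumed: half of each head
patches_per_side = cfg.image_size // cfg.patch_size                     # 300 / 30
patch_vector_dim = cfg.patch_size * cfg.patch_size * cfg.num_channels   # flattened patch length

print(head_dim, rotary_dims, patches_per_side ** 2, patch_vector_dim)
# 64 32 100 2700
```

The last value, 2700, is the length of each flattened patch, which matches the last dimension documented for image_patches (patch_size x patch_size x num_channels).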
FuyuForCausalLM
class transformers.FuyuForCausalLM
( config: FuyuConfig )
Parameters
- config (FuyuConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
Fuyu Model with a language modeling head on top for causal language modeling conditioned on image patches and text. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.)
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
( input_ids: LongTensor = None image_patches: Tensor = None image_patches_indices: Tensor = None attention_mask: Optional = None position_ids: Optional = None past_key_values: Optional = None inputs_embeds: Optional = None use_cache: Optional = None labels: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None ) → transformers.modeling_outputs.CausalLMOutputWithPast or tuple(torch.FloatTensor)
Parameters
- input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
- attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
  - 1 for tokens that are not masked,
  - 0 for tokens that are masked.
  Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
  If past_key_values is used, optionally only the last decoder_input_ids have to be input (see past_key_values).
  If you want to change padding behavior, you should read modeling_opt._prepare_decoder_attention_mask and modify to your needs. See diagram 1 in the paper for more information on the default strategy.
  - 1 indicates the head is not masked,
  - 0 indicates the head is masked.
- image_patches (torch.FloatTensor of shape (batch_size, num_total_patches, patch_size x patch_size x num_channels), optional) — Image patches to be used as continuous embeddings. The patches are flattened and then projected to the hidden size of the model.
- image_patches_indices (torch.LongTensor of shape (batch_size, num_total_patches + number_of_newline_tokens + number_of_text_tokens, patch_size x patch_size x num_channels), optional) — Indices indicating at which position the image_patches have to be inserted in input_embeds.
- position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.n_positions - 1].
- past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head).
  Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
  If past_key_values are used, the user can optionally input only the last decoder_input_ids (those that don’t have their past key value states given to this model) of shape (batch_size, 1) instead of all decoder_input_ids of shape (batch_size, sequence_length).
- inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.
- use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
- output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
- output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
- return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
- labels (torch.LongTensor of shape (batch_size, sequence_length), optional) — Labels for computing the masked language modeling loss. Indices should either be in [0, ..., config.vocab_size] or -100 (see input_ids docstring). Tokens with indices set to -100 are ignored (masked), the loss is only computed for the tokens with labels in [0, ..., config.vocab_size].
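The effect of setting labels to -100 can be illustrated without PyTorch. The sketch below is a hypothetical pure-Python helper (`causal_lm_loss`), not the model's loss code; it assumes the usual shift-by-one convention for causal language models, where the logits at position t score the token at position t + 1.

```python
import math
from typing import List

IGNORE_INDEX = -100


def causal_lm_loss(logits: List[List[float]], labels: List[int]) -> float:
    """Mean next-token cross-entropy over positions whose label is not -100.

    logits[t] scores the token at position t + 1, so labels are shifted by one.
    """
    total, count = 0.0, 0
    for step_logits, target in zip(logits[:-1], labels[1:]):
        if target == IGNORE_INDEX:
            continue  # masked position: contributes nothing to the loss
        max_logit = max(step_logits)
        # Numerically stable log-sum-exp over the vocabulary.
        log_norm = max_logit + math.log(sum(math.exp(x - max_logit) for x in step_logits))
        total += log_norm - step_logits[target]
        count += 1
    return total / count if count else 0.0


uniform_logits = [[0.0, 0.0, 0.0, 0.0]] * 3  # vocabulary of 4, sequence of 3
loss = causal_lm_loss(uniform_logits, labels=[-100, 2, -100])
print(round(loss, 4))  # 1.3863, i.e. log(4) for a uniform prediction
```

A typical use is to set the labels of image placeholder and prompt positions to -100 so that only the target text contributes to the loss.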
Returns
transformers.modeling_outputs.CausalLMOutputWithPast or tuple(torch.FloatTensor)
A transformers.modeling_outputs.CausalLMOutputWithPast or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (FuyuConfig) and inputs.
- loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Language modeling loss (for next-token prediction).
- logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
- past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head).
  Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
- hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).
  Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).
  Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The FuyuForCausalLM forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.
Examples:
>>> from transformers import FuyuProcessor, FuyuForCausalLM
>>> from PIL import Image
>>> import requests
>>> processor = FuyuProcessor.from_pretrained("adept/fuyu-8b")
>>> model = FuyuForCausalLM.from_pretrained("adept/fuyu-8b")
>>> url = "https://huggingface.co/datasets/hf-internal-testing/fixtures-captioning/resolve/main/bus.png"
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> prompt = "Generate a coco-style caption.\n"
>>> inputs = processor(text=prompt, images=image, return_tensors="pt")
>>> outputs = model(**inputs)
>>> generated_ids = model.generate(**inputs, max_new_tokens=7)
>>> generation_text = processor.batch_decode(generated_ids[:, -7:], skip_special_tokens=True)
>>> print(generation_text[0])
A blue bus parked on the side of a road.
FuyuImageProcessor
class transformers.FuyuImageProcessor
( do_resize: bool = True size: Optional = None resample: Resampling = <Resampling.BILINEAR: 2> do_pad: bool = True padding_value: float = 1.0 padding_mode: str = 'constant' do_normalize: bool = True image_mean: Union = 0.5 image_std: Union = 0.5 do_rescale: bool = True rescale_factor: float = 0.00392156862745098 patch_size: Optional = None **kwargs )
Parameters
- do_resize (bool, optional, defaults to True) — Whether to resize the image to size.
- size (Dict[str, int], optional, defaults to {"height": 1080, "width": 1920}) — Dictionary in the format {"height": int, "width": int} specifying the size of the output image.
- resample (PILImageResampling, optional, defaults to Resampling.BILINEAR) — PILImageResampling filter to use when resizing the image e.g. PILImageResampling.BILINEAR.
- do_pad (bool, optional, defaults to True) — Whether to pad the image to size.
- padding_value (float, optional, defaults to 1.0) — The value to pad the image with.
- padding_mode (str, optional, defaults to "constant") — The padding mode to use when padding the image.
- do_normalize (bool, optional, defaults to True) — Whether to normalize the image.
- image_mean (float, optional, defaults to 0.5) — The mean to use when normalizing the image.
- image_std (float, optional, defaults to 0.5) — The standard deviation to use when normalizing the image.
- do_rescale (bool, optional, defaults to True) — Whether to rescale the image.
- rescale_factor (float, optional, defaults to 1 / 255) — The factor to use when rescaling the image.
- patch_size (Dict[str, int], optional, defaults to {"height": 30, "width": 30}) — Dictionary in the format {"height": int, "width": int} specifying the size of the patches.
This class should handle the image processing part before the main FuyuForCausalLM. In particular, it should handle:
- Processing Images: Taking a batch of images as input. If the images are variable-sized, it resizes them based on the desired patch dimensions. The image output is always img_h, img_w of (1080, 1920). Then, it patches up these images using the patchify_image function.
- Creating Image Input IDs: For each patch, a placeholder ID is given to identify where these patches belong in a token sequence. For variable-sized images, each line of patches is terminated with a newline ID.
- Image Patch Indices: For each image patch, the code maintains an index where these patches should be inserted in a token stream.
Preprocess an image or a batch of images.
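The geometry behind these steps can be sketched as follows. This is an illustrative approximation, not the FuyuImageProcessor code: `fit_and_count_patches` is a hypothetical helper, and it assumes that images larger than the target are shrunk with their aspect ratio preserved, that smaller images are left at their size before padding, and that partially covered patches are completed by padding.

```python
import math
from typing import Tuple


def fit_and_count_patches(
    height: int,
    width: int,
    target: Tuple[int, int] = (1080, 1920),
    patch_size: int = 30,
) -> Tuple[int, int, int, int]:
    """Return (resized_h, resized_w, patch_rows, patch_cols) for one image,
    where the resized dimensions are those before any padding is applied."""
    target_h, target_w = target
    if height > target_h or width > target_w:
        # Shrink while preserving aspect ratio so both sides fit the target.
        scale = min(target_h / height, target_w / width)
        height, width = int(height * scale), int(width * scale)
    # Assumed: padding fills partial patches, so the grid is rounded up.
    rows = math.ceil(height / patch_size)
    cols = math.ceil(width / patch_size)
    return height, width, rows, cols


print(fit_and_count_patches(2160, 3840))  # (1080, 1920, 36, 64)
print(fit_and_count_patches(300, 450))    # (300, 450, 10, 15)
```

With one newline ID at the end of each row of patches, an image with a grid of rows x cols patches would occupy rows * (cols + 1) positions in the token sequence under these assumptions.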
FuyuProcessor
class transformers.FuyuProcessor
( image_processor tokenizer **kwargs )
Parameters
- image_processor (FuyuImageProcessor) — The image processor is a required input.
- tokenizer (LlamaTokenizerFast) — The tokenizer is a required input.
Constructs a Fuyu processor which wraps a Fuyu image processor and a Llama tokenizer into a single processor.
FuyuProcessor offers all the functionalities of FuyuImageProcessor and LlamaTokenizerFast. See the __call__() and decode() for more information.
__call__
( text = None images = None add_special_tokens: bool = True return_attention_mask: bool = True padding: Union = False truncation: Union = None max_length: Optional = None stride: int = 0 pad_to_multiple_of: Optional = None return_overflowing_tokens: bool = False return_special_tokens_mask: bool = False return_offsets_mapping: bool = False return_token_type_ids: bool = False return_length: bool = False verbose: bool = True return_tensors: Union = None **kwargs ) → FuyuBatchEncoding
Parameters
- text (str, List[str]) — The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).
- images (PIL.Image.Image, List[PIL.Image.Image]) — The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch tensor. Both channels-first and channels-last formats are supported.
Returns
FuyuBatchEncoding
A FuyuBatchEncoding with the following fields:
- input_ids — Tensor of token ids to be fed to a model. Returned when text is not None.
- image_patches — List of Tensor of image patches. Returned when images is not None.
- image_patches_indices — Tensor of indices where patch embeddings have to be inserted by the model.
- attention_mask — List of indices specifying which tokens should be attended to by the model when return_attention_mask=True.
Main method to prepare for the model one or several sequence(s) and image(s). This method forwards the text and kwargs arguments to LlamaTokenizerFast’s __call__() if text is not None to encode the text. To prepare the image(s), this method forwards the images and kwargs arguments to FuyuImageProcessor’s __call__() if images is not None. Please refer to the docstring of the above two methods for more information.