
Meta released this model on 2025-04-05 and added it to Hugging Face Transformers on the same day.

Llama4

PyTorch FlashAttention Tensor parallelism

Llama 4, developed by Meta, introduces a new auto-regressive Mixture-of-Experts (MoE) architecture. This generation comes in two flavors:

  • The highly capable Llama 4 Maverick, with 17B active parameters out of roughly 400B total, using 128 experts
  • The lighter-weight Llama 4 Scout, with 17B active parameters out of roughly 109B total, using only 16 experts

Both models leverage early fusion for native multimodality and can process text and image inputs. Maverick and Scout were each trained on up to 40 trillion tokens of data covering 200 languages (with dedicated fine-tuning support for 12 languages, including Arabic, Spanish, German, and Hindi).

Meta designed Llama 4 Scout to be broadly accessible: with 4-bit or 8-bit quantization, Scout can run in real time on a single server-grade GPU. The larger Llama 4 Maverick, by contrast, is available in BF16 and FP8 formats for high-performance compute. These models are released under the custom Llama 4 Community License Agreement, available in the model repositories.

You can find all the original Llama checkpoints on the meta-llama organization page on Hugging Face.

The Llama 4 model family comes in two sizes: 109B and 402B parameters. Both are very large models and cannot run on typical hardware. Below we outline a few ways to reduce memory usage.

For faster and more reliable downloads, we recommend installing the hf_xet dependency: pip install transformers[hf_xet]

The examples below show how to generate with Pipeline or AutoModel. Because some Llama 4 variants support a context length of up to 10 million tokens, we also include an example showing how to toggle the right attributes to enable very long-context generation.

Pipeline
AutoModel - Text only
AutoModel - Multimodal
AutoModel - Multimodal with multiple images
AutoModel - Long context
from transformers import pipeline
import torch

model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"

messages = [
    {"role": "user", "content": "마요네즈 레시피가 무엇인가요?"},
]

pipe = pipeline(
    "text-generation",
    model=model_id,
    device_map="auto",
    dtype=torch.bfloat16
)

output = pipe(messages, do_sample=False, max_new_tokens=200)
print(output[0]["generated_text"][-1]["content"])
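
For reference, here is a minimal sketch of the "AutoModel - Text only" path listed above, using the same checkpoint. It mirrors the tokenizer and generation calls used in the quantization example later on this page.

from transformers import AutoTokenizer, Llama4ForConditionalGeneration
import torch

model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    dtype=torch.bfloat16,
)

messages = [
    {"role": "user", "content": "What is the recipe for mayonnaise?"},
]
# Build the chat prompt, generate, and decode only the newly generated tokens.
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt", return_dict=True)
outputs = model.generate(**inputs.to(model.device), max_new_tokens=200)
print(tokenizer.batch_decode(outputs[:, inputs["input_ids"].shape[-1]:])[0])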

Efficiency: getting the most out of Llama 4

Attention methods

Changing the attention function away from the default can significantly improve compute performance and memory usage. Refer to the attention interface overview for an in-depth explanation of the interface.

Llama 4 models support the following attention methods at launch: eager, flex_attention, sdpa. We recommend flex_attention for best results. Switching the attention mechanism is done at model initialization:

Flex Attention
SDPA
Eager

Flex Attention delivers the best performance when the model handles long contexts.

Note: the example below uses both device_map="auto" and flex-attention. To run this example in tensor parallel mode, use torchrun.

We plan to make it possible to run device_map="auto" together with flex-attention without tensor parallelism in the future.

from transformers import Llama4ForConditionalGeneration
import torch

model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"

model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    attn_implementation="flex_attention",
    device_map="auto",
    dtype=torch.bfloat16,
)
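
The SDPA and Eager tabs follow the same pattern; only the attn_implementation string changes. A minimal sketch for SDPA:

from transformers import Llama4ForConditionalGeneration
import torch

model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"

# Same loading call as above, with "sdpa" (or "eager") as the attention implementation.
model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    attn_implementation="sdpa",
    device_map="auto",
    dtype=torch.bfloat16,
)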

Quantization

Quantization reduces the memory footprint of large models by converting their weights to lower precision. See the quantization overview for the available quantization backends. FBGEMM and LLM-Compressor are supported at the moment, with more methods coming soon.

See below for examples using both methods:

Here is an example of loading a BF16 model in FP8 using the FBGEMM approach:

FBGEMM
LLM-Compressor
from transformers import AutoTokenizer, Llama4ForConditionalGeneration, FbgemmFp8Config
import torch

model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "user", "content": "당신은 누구신가요?"},
]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt", return_dict=True)

model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    dtype=torch.bfloat16,
    quantization_config=FbgemmFp8Config()
)

outputs = model.generate(**inputs.to(model.device), max_new_tokens=100)
outputs = tokenizer.batch_decode(outputs[:, inputs["input_ids"].shape[-1]:])
print(outputs[0])
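
For the LLM-Compressor path, the simplest route is to load a checkpoint that has already been quantized to FP8 with LLM-Compressor (compressed-tensors); no extra quantization config is needed at load time. A minimal sketch, assuming the released FP8 Maverick repository name and that the compressed-tensors package is installed:

from transformers import Llama4ForConditionalGeneration
import torch

# Assumed repository name of the FP8 checkpoint produced with LLM-Compressor / compressed-tensors.
model = Llama4ForConditionalGeneration.from_pretrained(
    "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8",
    device_map="auto",
    dtype=torch.bfloat16,
)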

Offloading

With CPU offloading enabled, the model moves components to the CPU when GPU memory runs out. At inference time, the different components are dynamically loaded and unloaded between the GPU and CPU. This lets you load the model on a smaller machine as long as there is enough CPU memory, although inference may be slower due to the communication overhead.

To enable CPU offloading, set device_map to "auto" when loading the model:

from transformers import Llama4ForConditionalGeneration
import torch

model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"

model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    dtype=torch.bfloat16,
)

Llama4Config

class transformers.Llama4Config

< >

( vision_config = None text_config = None boi_token_index = 200080 eoi_token_index = 200081 image_token_index = 200092 tie_word_embeddings = False **kwargs )

Parameters

  • vision_config (Llama4VisionConfig, optional) — The Llama4 Vision config.
  • text_config (Llama4TextConfig, optional) — The Llama4 Text config.
  • boi_token_index (int, optional, defaults to 200080) — The begin-of-image token index to wrap the image prompt.
  • eoi_token_index (int, optional, defaults to 200081) — The end-of-image token index to wrap the image prompt.
  • image_token_index (int, optional, defaults to 200092) — The image token index to encode the image prompt.
  • tie_word_embeddings (bool, optional, defaults to False) — Whether the model’s input and output word embeddings should be tied.

This is the configuration class to store the configuration of a Llama4Model. It is used to instantiate a Llama4 model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the Llama4 109B.

e.g. meta-llama/Llama-4-Scout-17B-16E

Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.

>>> from transformers import Llama4Model, Llama4Config

>>> # Initializing a Llama4 109B style configuration
>>> configuration = Llama4Config()

>>> # Initializing a model from the Llama4 109B style configuration
>>> model = Llama4Model(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config

Llama4TextConfig

class transformers.Llama4TextConfig

< >

( vocab_size = 202048 hidden_size = 5120 intermediate_size = 8192 intermediate_size_mlp = 16384 num_hidden_layers = 48 num_attention_heads = 40 num_key_value_heads = 8 head_dim = 128 hidden_act = 'silu' max_position_embeddings = 131072 initializer_range = 0.02 rms_norm_eps = 1e-05 use_cache = True pad_token_id = None bos_token_id = 1 eos_token_id = 2 tie_word_embeddings = False attention_dropout = 0.0 num_experts_per_tok = 1 num_local_experts = 16 moe_layers = None interleave_moe_layer_step = 1 use_qk_norm = True output_router_logits = False router_aux_loss_coef = 0.001 router_jitter_noise = 0.0 rope_parameters: typing.Union[transformers.modeling_rope_utils.RopeParameters, dict[transformers.modeling_rope_utils.RopeParameters], NoneType] = None no_rope_layers = None no_rope_layer_interval = 4 attention_chunk_size = 8192 layer_types = None attn_temperature_tuning = True floor_scale = 8192 attn_scale = 0.1 **kwargs )

Parameters

  • vocab_size (int, optional, defaults to 202048) — Vocabulary size of the Llama4 text model. Defines the maximum number of different tokens that can be represented by the inputs_ids passed when calling Llama4TextModel.
  • hidden_size (int, optional, defaults to 5120) — Dimensionality of the embeddings and hidden states.
  • intermediate_size (int, optional, defaults to 8192) — Dimensionality of the “intermediate” (often named feed-forward) layer in the Transformer encoder.
  • intermediate_size_mlp (int, optional, defaults to 16384) — Dimensionality of the feed-forward layer in the dense (non-MoE) MLP blocks.
  • num_hidden_layers (int, optional, defaults to 48) — Number of hidden layers in the Transformer encoder.
  • num_attention_heads (int, optional, defaults to 40) — Number of attention heads for each attention layer in the Transformer encoder.
  • num_key_value_heads (int, optional, defaults to 8) — This is the number of key_value heads that should be used to implement Grouped Query Attention. If not specified, will default to num_attention_heads.
  • head_dim (int, optional, defaults to 128) — The dimension of each attention head.
  • hidden_act (str or Callable, optional, defaults to "silu") — The non-linear activation function (function or string) in the encoder and pooler.
  • max_position_embeddings (int, optional, defaults to 131072) — The maximum sequence length that this model might ever be used with.
  • initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
  • rms_norm_eps (float, optional, defaults to 1e-05) — The epsilon used by the rms normalization layers.
  • use_cache (bool, optional, defaults to True) — Whether or not the model should return the last key/values attentions.
  • pad_token_id (int, optional) — The id of the padding token.
  • bos_token_id (int, optional, defaults to 1) — The id of the beginning of sentence token.
  • eos_token_id (int, optional, defaults to 2) — The id of the end of sentence token.
  • tie_word_embeddings (bool, optional, defaults to False) — Whether to tie weight embeddings
  • attention_dropout (float, optional, defaults to 0.0) — The dropout ratio for the attention probabilities.
  • num_experts_per_tok (int, optional, defaults to 1) — The number of experts each token is routed to.
  • num_local_experts (int, optional, defaults to 16) — The number of experts in each MoE layer.
  • moe_layers (list[int], optional) — Indices of the decoder layers that use MoE blocks. If not set, it is derived from interleave_moe_layer_step.
  • interleave_moe_layer_step (int, optional, defaults to 1) — Step size with which MoE layers are interleaved among the decoder layers.
  • use_qk_norm (bool, optional, defaults to True) — Whether to normalize the query and key states in the attention layers.
  • output_router_logits (bool, optional, defaults to False) — Whether or not the router logits should be returned by the model.
  • router_aux_loss_coef (float, optional, defaults to 0.001) — The auxiliary loss coefficient for the router load-balancing loss.
  • router_jitter_noise (float, optional, defaults to 0.0) — Amount of noise to add to the router.
  • rope_parameters (RopeParameters, optional) — Dictionary containing the configuration parameters for the RoPE embeddings. The dictionary should contain a value for rope_theta and optionally parameters used for scaling in case you want to use RoPE with longer max_position_embeddings.
  • no_rope_layers (list[int], optional) — List with at least the same length as the number of layers in the model. A 1 at an index position indicates that the corresponding layer will use RoPE, while a 0 indicates that it’s a NoPE layer.
  • no_rope_layer_interval (int, optional, defaults to 4) — If no_rope_layers is None, it will be created using a NoPE layer every no_rope_layer_interval layers.
  • attention_chunk_size (int, optional, defaults to 8192) — Size of the chunks used by the local (chunked) attention layers.
  • layer_types (list, optional) — Attention pattern for each layer.
  • attn_temperature_tuning (bool, optional, defaults to True) — Whether to dynamically scale the attention temperature for each query token based on sequence length. Recommended for long sequences (e.g., >32k tokens) to maintain stable output results.
  • floor_scale (int, optional, defaults to 8192) — Scale factor used by the attention temperature tuning mechanism.
  • attn_scale (float, optional, defaults to 0.1) — Scaling coefficient used by the attention temperature tuning mechanism.

This is the configuration class to store the configuration of a Llama4TextModel. It is used to instantiate a Llama4 text model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the Llama4 109B.

e.g. meta-llama/Llama-4-Scout-17B-16E

Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.

Example:
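
A minimal usage sketch, mirroring the Llama4Config example above:

>>> from transformers import Llama4TextModel, Llama4TextConfig

>>> # Initializing a Llama4 text configuration with default values
>>> configuration = Llama4TextConfig()

>>> # Initializing a text model from that configuration
>>> model = Llama4TextModel(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config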

Llama4VisionConfig

class transformers.Llama4VisionConfig

< >

( hidden_size: typing.Optional[int] = 768 hidden_act: typing.Optional[str] = 'gelu' num_hidden_layers: typing.Optional[int] = 34 num_attention_heads: typing.Optional[int] = 16 num_channels: typing.Optional[int] = 3 intermediate_size: typing.Optional[int] = 5632 vision_output_dim: typing.Optional[int] = 7680 image_size: typing.Optional[int] = 448 patch_size: typing.Optional[int] = 14 norm_eps: typing.Optional[float] = 1e-05 vision_feature_select_strategy: typing.Optional[str] = 'default' initializer_range: typing.Optional[float] = 0.02 pixel_shuffle_ratio: typing.Optional[float] = 0.5 projector_input_dim: typing.Optional[int] = 4096 projector_output_dim: typing.Optional[int] = 4096 multi_modal_projector_bias: typing.Optional[bool] = False projector_dropout: typing.Optional[float] = 0.0 attention_dropout: typing.Optional[float] = 0.0 rope_parameters: typing.Union[transformers.modeling_rope_utils.RopeParameters, dict[transformers.modeling_rope_utils.RopeParameters], NoneType] = None **kwargs )

Parameters

  • hidden_size (int, optional, defaults to 768) — Dimensionality of the encoder layers and the pooler layer.
  • hidden_act (str or function, optional, defaults to "gelu") — The non-linear activation function (function or string) in the encoder and pooler. If string, "gelu", "relu", "selu", "gelu_new" and "quick_gelu" are supported.
  • num_hidden_layers (int, optional, defaults to 34) — Number of hidden layers in the Transformer encoder.
  • num_attention_heads (int, optional, defaults to 16) — Number of attention heads for each attention layer in the Transformer encoder.
  • num_channels (int, optional, defaults to 3) — Number of channels in the input image.
  • intermediate_size (int, optional, defaults to 5632) — Dimensionality of the “intermediate” (often named feed-forward) layer in the Transformer encoder.
  • vision_output_dim (int, optional, defaults to 7680) — Dimensionality of the vision model output. Includes output of transformer encoder with intermediate layers and global transformer encoder.
  • image_size (int, optional, defaults to 448) — The size (resolution) of each image tile.
  • patch_size (int, optional, defaults to 14) — The size (resolution) of each patch.
  • norm_eps (float, optional, defaults to 1e-05) — The epsilon used by the layer normalization layers.
  • vision_feature_select_strategy (str, optional, defaults to "default") — The feature selection strategy used to select the vision feature from the vision backbone. Can be one of "default" or "full".
  • initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
  • pixel_shuffle_ratio (float, optional, defaults to 0.5) — Ratio used by the pixel shuffle operation to downsample the vision features.
  • projector_input_dim (int, optional, defaults to 4096) — Input dimensionality of the multi-modal projector.
  • projector_output_dim (int, optional, defaults to 4096) — Output dimensionality of the multi-modal projector.
  • multi_modal_projector_bias (bool, optional, defaults to False) — Whether to use a bias in the multi-modal projector.
  • projector_dropout (float, optional, defaults to 0.0) — Dropout probability for the multi-modal projector.
  • attention_dropout (float, optional, defaults to 0.0) — The dropout ratio for the attention probabilities.
  • rope_parameters (RopeParameters, optional) — RoPE Parameters

This is the configuration class to store the configuration of a Llama4VisionModel. It is used to instantiate a Llama4 vision model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the Llama4 109B.

e.g. meta-llama/Llama-4-Scout-17B-16E

Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.
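
A minimal usage sketch in the same style as the examples above:

>>> from transformers import Llama4VisionModel, Llama4VisionConfig

>>> # Initializing a Llama4 vision configuration with default values
>>> configuration = Llama4VisionConfig()

>>> # Initializing a vision model from that configuration
>>> model = Llama4VisionModel(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config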

Llama4Processor

class transformers.Llama4Processor

< >

( image_processor = None tokenizer = None patch_size: int = 14 pixel_shuffle_ratio: float = 0.5 fake_image_token = '<|image|>' image_token = '<|image|>' start_of_image_token = '<|image_start|>' end_of_image_token = '<|image_end|>' patch_token = '<|patch|>' tile_x_separator_token = '<|tile_x_separator|>' tile_y_separator_token = '<|tile_y_separator|>' chat_template = '{{- bos_token }}\n{%- if custom_tools is defined %}\n {%- set tools = custom_tools %}\n{%- endif %}\n{%- if not tools_in_user_message is defined %}\n {%- set tools_in_user_message = true %}\n{%- endif %}\n{%- if not date_string is defined %}\n {%- if strftime_now is defined %}\n {%- set date_string = strftime_now("%d %b %Y") %}\n {%- else %}\n {%- set date_string = "26 Jul 2024" %}\n {%- endif %}\n{%- endif %}\n{%- if not tools is defined %}\n {%- set tools = none %}\n{%- endif %}\n\n{#- This block extracts the system message, so we can slot it into the right place. #}\n{%- if messages[0][\'role\'] == \'system\' %} \n {%- if messages[0][\'content\'] is string %}\n {%- set system_message = messages[0][\'content\']|trim %}\n {%- else %}\n {#- FIXME: The processor requires an array, always. #}\n {%- set system_message = messages[0][\'content\'][0][\'text\']|trim %}\n {%- endif %}\n {%- set messages = messages[1:] %}\n {%- set user_supplied_system_message = true %}\n{%- else %}\n {%- set system_message = "" %}\n {%- set user_supplied_system_message = false %}\n{%- endif %}\n\n{#- System message if the user supplied one #}\n{%- if user_supplied_system_message %}\n {{- "<|header_start|>system<|header_end|>\n\n" }}\n {%- if tools is not none %}\n {{- "Environment: ipython\n" }}\n {%- endif %}\n {%- if tools is not none and not tools_in_user_message %}\n {{- "You have access to the following functions. To call a function, please respond with JSON for a function call." 
}}\n {{- \'Respond in the format {"name": function name, "parameters": dictionary of argument name and its value}.\' }}\n {{- "Do not use variables.\n\n" }}\n {%- for t in tools %}\n {{- t | tojson(indent=4) }}\n {{- "\n\n" }}\n {%- endfor %}\n {%- endif %}\n {{- system_message }}\n {{- "<|eot|>" }}\n{%- endif %}\n\n{#- Custom tools are passed in a user message with some extra guidance #}\n{%- if tools_in_user_message and not tools is none %}\n {#- Extract the first user message so we can plug it in here #}\n {%- if messages | length != 0 %}\n {%- set first_user_message = messages[0][\'content\']|trim %}\n {%- set messages = messages[1:] %}\n {%- else %}\n {{- raise_exception("Cannot put tools in the first user message when there\'s no first user message!") }}\n{%- endif %}\n {{- \'<|header_start|>user<|header_end|>\n\n\' -}}\n {{- "Given the following functions, please respond with a JSON for a function call " }}\n {{- "with its proper arguments that best answers the given prompt.\n\n" }}\n {{- \'Respond in the format {"name": function name, "parameters": dictionary of argument name and its value}.\' }}\n {{- "Do not use variables.\n\n" }}\n {%- for t in tools %}\n {{- t | tojson(indent=4) }}\n {{- "\n\n" }}\n {%- endfor %}\n {{- first_user_message + "<|eot|>"}}\n{%- endif %}\n\n{%- for message in messages %}\n {%- if not (message.role == \'ipython\' or message.role == \'tool\' or \'tool_calls\' in message) %}\n {{- \'<|header_start|>\' + message[\'role\'] + \'<|header_end|>\n\n\' }}\n {%- if message[\'content\'] is string %}\n {{- message[\'content\'] }}\n {%- else %}\n {%- for content in message[\'content\'] %}\n {%- if content[\'type\'] == \'image\' %}\n {{- \'<|image|>\' }}\n {%- elif content[\'type\'] == \'text\' %}\n {{- content[\'text\'] }}\n {%- endif %}\n {%- endfor %}\n {%- endif %}\n {{- "<|eot|>" }}\n {%- elif \'tool_calls\' in message and message.tool_calls|length > 0 %}\n {{- \'<|header_start|>assistant<|header_end|>\n\n\' -}}\n {{- \'<|python_start|>\' }}\n {%- if message[\'content\'] is string %}\n {{- message[\'content\'] }}\n {%- else %}\n {%- for content in message[\'content\'] %}\n {%- if content[\'type\'] == \'image\' %}\n {{- \'<|image|>\' }}\n {%- elif content[\'type\'] == \'text\' %}\n {{- content[\'text\'] }}\n {%- endif %}\n {%- endfor %}\n {%- endif %}\n {{- \'<|python_end|>\' }}\n {%- for tool_call in message.tool_calls %}\n {{- \'{"name": "\' + tool_call.function.name + \'", \' }}\n {{- \'"parameters": \' }}\n {{- tool_call.function.arguments | tojson }}\n {{- "}" }}\n {%- endfor %}\n {{- "<|eot|>" }}\n {%- elif message.role == "tool" or message.role == "ipython" %}\n {{- "<|header_start|>ipython<|header_end|>\n\n" }}\n {%- if message.content is mapping or message.content is iterable %}\n {{- message.content | tojson }}\n {%- else %}\n {{- message.content }}\n {%- endif %}\n {{- "<|eot|>" }}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- \'<|header_start|>assistant<|header_end|>\n\n\' }}\n{%- endif %}\n' **kwargs )

Parameters

  • image_processor (AutoImageProcessor, optional) — The image processor is a required input.
  • tokenizer ([PreTrainedTokenizer, PreTrainedTokenizerFast], optional) — The tokenizer is a required input.
  • patch_size (int, optional, defaults to 14) — The size of image patches for tokenization.
  • pixel_shuffle_ratio (float, optional, defaults to 0.5) — The pixel shuffle ratio used when downsampling image patches.
  • fake_image_token (str, optional, defaults to "<|image|>") — Placeholder image token in the prompt that is expanded into the full image token sequence during processing.
  • image_token (str, optional, defaults to "<|image|>") — The token to be used to represent an image in the text.
  • start_of_image_token (str, optional, defaults to "<|image_start|>") — The token to be used to represent the start of an image in the text.
  • end_of_image_token (str, optional, defaults to "<|image_end|>") — The token to be used to represent the end of an image in the text.
  • patch_token (str, optional, defaults to "<|patch|>") — The token to be used to represent an image patch in the text.
  • tile_x_separator_token (str, optional, defaults to "<|tile_x_separator|>") — The token used to separate image tiles along the horizontal axis.
  • tile_y_separator_token (str, optional, defaults to "<|tile_y_separator|>") — The token used to separate rows of image tiles.
  • chat_template (str, optional) — A Jinja template which will be used to convert lists of messages in a chat into a tokenizable string.

Constructs a Llama4 processor which wraps an AutoImageProcessor and a PreTrainedTokenizerFast tokenizer into a single processor that inherits both the image processor and tokenizer functionalities. See the __call__() and decode() for more information.
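
A minimal sketch of building a multimodal prompt through the processor's chat template (the checkpoint name and image URL are reused from examples elsewhere on this page; fetching the image requires network access):

>>> from transformers import Llama4Processor

>>> processor = Llama4Processor.from_pretrained("meta-llama/Llama-4-Scout-17B-16E-Instruct")
>>> messages = [
...     {"role": "user", "content": [
...         {"type": "image", "url": "https://www.ilankelman.org/stopsigns/australia.jpg"},
...         {"type": "text", "text": "What's the content of the image?"},
...     ]},
... ]
>>> # Tokenizes the text and processes the image in one call.
>>> inputs = processor.apply_chat_template(
...     messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
... )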

Llama4ImageProcessorFast

class transformers.Llama4ImageProcessorFast

< >

( **kwargs: typing_extensions.Unpack[transformers.models.llama4.image_processing_llama4_fast.Llama4ImageProcessorKwargs] )

Constructs a fast Llama4 image processor.

preprocess

< >

( images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']] **kwargs: typing_extensions.Unpack[transformers.models.llama4.image_processing_llama4_fast.Llama4ImageProcessorKwargs] ) <class 'transformers.image_processing_base.BatchFeature'>

Parameters

  • images (Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']]) — Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If passing in images with pixel values between 0 and 1, set do_rescale=False.
  • do_convert_rgb (bool, optional) — Whether to convert the image to RGB.
  • do_resize (bool, optional) — Whether to resize the image.
  • size (Annotated[Union[int, list[int], tuple[int, ...], dict[str, int], NoneType], None]) — Describes the maximum input dimensions to the model.
  • crop_size (Annotated[Union[int, list[int], tuple[int, ...], dict[str, int], NoneType], None]) — Size of the output image after applying center_crop.
  • resample (Annotated[Union[PILImageResampling, int, NoneType], None]) — Resampling filter to use if resizing the image. This can be one of the enum PILImageResampling. Only has an effect if do_resize is set to True.
  • do_rescale (bool, optional) — Whether to rescale the image.
  • rescale_factor (float, optional) — Rescale factor to rescale the image by if do_rescale is set to True.
  • do_normalize (bool, optional) — Whether to normalize the image.
  • image_mean (Union[float, list[float], tuple[float, ...], NoneType]) — Image mean to use for normalization. Only has an effect if do_normalize is set to True.
  • image_std (Union[float, list[float], tuple[float, ...], NoneType]) — Image standard deviation to use for normalization. Only has an effect if do_normalize is set to True.
  • do_pad (bool, optional) — Whether to pad the image. Padding is done either to the largest size in the batch or to a fixed square size per image. The exact padding strategy depends on the model.
  • pad_size (Annotated[Union[int, list[int], tuple[int, ...], dict[str, int], NoneType], None]) — The size in {"height": int, "width" int} to pad the images to. Must be larger than any image size provided for preprocessing. If pad_size is not provided, images will be padded to the largest height and width in the batch. Applied only when do_pad=True.
  • do_center_crop (bool, optional) — Whether to center crop the image.
  • data_format (Union[~image_utils.ChannelDimension, str, NoneType]) — Only ChannelDimension.FIRST is supported. Added for compatibility with slow processors.
  • input_data_format (Union[~image_utils.ChannelDimension, str, NoneType]) — The channel dimension format for the input image. If unset, the channel dimension format is inferred from the input image. Can be one of:
    • "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
    • "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
    • "none" or ChannelDimension.NONE: image in (height, width) format.
  • device (Annotated[str, None], optional) — The device to process the images on. If unset, the device is inferred from the input images.
  • return_tensors (Annotated[Union[str, ~utils.generic.TensorType, NoneType], None]) — Returns stacked tensors if set to "pt", otherwise returns a list of tensors.
  • disable_grouping (bool, optional) — Whether to disable grouping of images by size to process them individually and not in batches. If None, will be set to True if the images are on CPU, and False otherwise. This choice is based on empirical observations, as detailed here: https://github.com/huggingface/transformers/pull/38157
  • max_patches (int, optional, defaults to 16) — The maximum number of patches to be extracted from the image. Can be overridden by the max_patches parameter in the preprocess method.
  • resize_to_max_canvas (bool, optional, defaults to False) — Whether to resize the image to the maximum canvas size. If True, picks the canvas that allows the largest resizing without distortion. If False, downsamples as little as possible, including no resizing at all, but never upsamples, unless the image is smaller than the patch size.

Returns

<class 'transformers.image_processing_base.BatchFeature'>

  • data (dict) — Dictionary of lists/arrays/tensors returned by the call method (‘pixel_values’, etc.).
  • tensor_type (Union[None, str, TensorType], optional) — You can give a tensor_type here to convert the lists of integers in PyTorch/Numpy Tensors at initialization.
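
A minimal sketch of calling the fast image processor directly (the checkpoint name and image URL are reused from other examples on this page):

>>> from PIL import Image
>>> import requests
>>> from transformers import AutoImageProcessor

>>> image_processor = AutoImageProcessor.from_pretrained("meta-llama/Llama-4-Scout-17B-16E-Instruct")
>>> url = "https://www.ilankelman.org/stopsigns/australia.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> batch = image_processor(images=image, return_tensors="pt")
>>> pixel_values = batch["pixel_values"]  # stacked image tiles ready for the vision tower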

rescale_and_normalize

< >

( images: torch.Tensor do_rescale: bool rescale_factor: float do_normalize: bool image_mean: typing.Union[float, list[float]] image_std: typing.Union[float, list[float]] )

Rescale and normalize images. This override rescales and normalizes the images in torch.bfloat16, as in the original implementation.

Llama4ForConditionalGeneration

class transformers.Llama4ForConditionalGeneration

< >

( config: Llama4Config )

forward

< >

( input_ids: typing.Optional[torch.LongTensor] = None pixel_values: typing.Optional[torch.FloatTensor] = None attention_mask: typing.Optional[torch.Tensor] = None position_ids: typing.Optional[torch.LongTensor] = None past_key_values: typing.Optional[transformers.cache_utils.Cache] = None inputs_embeds: typing.Optional[torch.FloatTensor] = None vision_feature_layer: typing.Union[list[int], int, NoneType] = None vision_feature_select_strategy: typing.Optional[str] = None labels: typing.Optional[torch.LongTensor] = None use_cache: typing.Optional[bool] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None cache_position: typing.Optional[torch.LongTensor] = None logits_to_keep: typing.Union[int, torch.Tensor] = 0 **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] ) transformers.models.llama4.modeling_llama4.Llama4CausalLMOutputWithPast or tuple(torch.FloatTensor)

Parameters

  • input_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default.

    Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details.

    What are input IDs?

  • pixel_values (torch.FloatTensor of shape (batch_size, num_channels, image_size, image_size), optional) — The tensors corresponding to the input images. Pixel values can be obtained using Llama4ImageProcessorFast. See Llama4ImageProcessorFast.__call__ for details (Llama4Processor uses Llama4ImageProcessorFast for processing images).
  • attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:

    • 1 for tokens that are not masked,
    • 0 for tokens that are masked.

    What are attention masks?

  • position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence tokens in the position embeddings. Selected in the range [0, config.n_positions - 1].

    What are position IDs?

  • past_key_values (~cache_utils.Cache, optional) — Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists in the past_key_values returned by the model at a previous stage of decoding, when use_cache=True or config.use_cache=True.

    Only Cache instance is allowed as input, see our kv cache guide. If no past_key_values are passed, DynamicCache will be initialized by default.

    The model will output the same cache format that is fed as input.

    If past_key_values are used, the user is expected to input only unprocessed input_ids (those that don’t have their past key value states given to this model) of shape (batch_size, unprocessed_length) instead of all input_ids of shape (batch_size, sequence_length).

  • inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.
  • vision_feature_layer (Union[list[int], int, NoneType]) — The index of the layer to select the vision feature. If multiple indices are provided, the vision feature of the corresponding indices will be concatenated to form the vision features.
  • vision_feature_select_strategy (str, optional) — The feature selection strategy used to select the vision feature from the vision backbone. Can be one of "default" or "full".
  • labels (torch.LongTensor of shape (batch_size, sequence_length), optional) — Labels for computing the masked language modeling loss. Indices should either be in [0, ..., config.vocab_size] or -100 (see input_ids docstring). Tokens with indices set to -100 are ignored (masked), the loss is only computed for the tokens with labels in [0, ..., config.vocab_size].
  • use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
  • output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
  • output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
  • return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
  • cache_position (torch.LongTensor of shape (sequence_length), optional) — Indices depicting the position of the input sequence tokens in the sequence. Contrarily to position_ids, this tensor is not affected by padding. It is used to update the cache in the correct position and to infer the complete sequence length.
  • logits_to_keep (Union[int, torch.Tensor], defaults to 0) — If an int, compute logits for the last logits_to_keep tokens. If 0, calculate logits for all input_ids (special case). Only last token logits are needed for generation, and calculating them only for that token can save memory, which becomes pretty significant for long sequences or large vocabulary size. If a torch.Tensor, must be 1D corresponding to the indices to keep in the sequence length dimension. This is useful when using packed tensor format (single dimension for batch and sequence length).

Returns

transformers.models.llama4.modeling_llama4.Llama4CausalLMOutputWithPast or tuple(torch.FloatTensor)

A transformers.models.llama4.modeling_llama4.Llama4CausalLMOutputWithPast or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (Llama4Config) and inputs.

  • loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Language modeling loss (for next-token prediction).

  • logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).

  • past_key_values (Cache, optional, returned when use_cache=True is passed or when config.use_cache=True) — It is a Cache instance. For more details, see our kv cache guide.

    Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.

  • hidden_states (tuple[torch.FloatTensor], optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.

  • attentions (tuple[torch.FloatTensor], optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

  • image_hidden_states (torch.FloatTensor, optional) — A torch.FloatTensor of size (batch_size, num_images, sequence_length, hidden_size). image_hidden_states of the model produced by the vision encoder and after projecting the last hidden state.

The Llama4ForConditionalGeneration forward method, overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Example:

>>> from transformers import AutoProcessor, Llama4ForConditionalGeneration
>>> import torch

>>> model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"
>>> model = Llama4ForConditionalGeneration.from_pretrained(model_id, device_map="auto", dtype=torch.bfloat16)
>>> processor = AutoProcessor.from_pretrained(model_id)

>>> url = "https://www.ilankelman.org/stopsigns/australia.jpg"
>>> messages = [
...     {"role": "user", "content": [
...         {"type": "image", "url": url},
...         {"type": "text", "text": "What's the content of the image?"},
...     ]},
... ]
>>> inputs = processor.apply_chat_template(
...     messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
... ).to(model.device)

>>> # Generate and decode only the newly generated tokens
>>> generate_ids = model.generate(**inputs, max_new_tokens=15)
>>> print(processor.batch_decode(generate_ids[:, inputs["input_ids"].shape[-1]:], skip_special_tokens=True)[0])

get_image_features

< >

( pixel_values: FloatTensor vision_feature_select_strategy: str **kwargs ) image_features (torch.Tensor)

Parameters

  • pixel_values (torch.FloatTensor of shape (batch_size, channels, height, width)) — The tensors corresponding to the input images.
  • vision_feature_select_strategy (str) — The feature selection strategy used to select the vision feature from the vision backbone. Can be one of "default" or "full".

Returns

image_features (torch.Tensor)

Image feature tensor of shape (num_images, image_length, embed_dim).

Obtains the image's last hidden states from the vision tower and applies multimodal projection.

get_placeholder_mask

< >

( input_ids: LongTensor inputs_embeds: FloatTensor image_features: FloatTensor )

Obtains multimodal placeholder mask from input_ids or inputs_embeds, and checks that the placeholder token count is equal to the length of multimodal features. If the lengths are different, an error is raised.


Llama4ForCausalLM

class transformers.Llama4ForCausalLM

< >

( config: Llama4TextConfig )

forward

< >

( input_ids: typing.Optional[torch.LongTensor] = None attention_mask: typing.Optional[torch.Tensor] = None position_ids: typing.Optional[torch.LongTensor] = None past_key_values: typing.Optional[transformers.cache_utils.Cache] = None inputs_embeds: typing.Optional[torch.FloatTensor] = None labels: typing.Optional[torch.LongTensor] = None use_cache: typing.Optional[bool] = None cache_position: typing.Optional[torch.LongTensor] = None logits_to_keep: typing.Union[int, torch.Tensor] = 0 **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] ) transformers.modeling_outputs.CausalLMOutputWithPast or tuple(torch.FloatTensor)

Parameters

  • input_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default.

    Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details.

    What are input IDs?

  • attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:

    • 1 for tokens that are not masked,
    • 0 for tokens that are masked.

    What are attention masks?

  • position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence tokens in the position embeddings. Selected in the range [0, config.n_positions - 1].

    What are position IDs?

  • past_key_values (~cache_utils.Cache, optional) — Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists in the past_key_values returned by the model at a previous stage of decoding, when use_cache=True or config.use_cache=True.

    Only Cache instance is allowed as input, see our kv cache guide. If no past_key_values are passed, DynamicCache will be initialized by default.

    The model will output the same cache format that is fed as input.

    If past_key_values are used, the user is expected to input only unprocessed input_ids (those that don’t have their past key value states given to this model) of shape (batch_size, unprocessed_length) instead of all input_ids of shape (batch_size, sequence_length).

  • inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.
  • labels (torch.LongTensor of shape (batch_size, sequence_length), optional) — Labels for computing the masked language modeling loss. Indices should either be in [0, ..., config.vocab_size] or -100 (see input_ids docstring). Tokens with indices set to -100 are ignored (masked), the loss is only computed for the tokens with labels in [0, ..., config.vocab_size].
  • use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
  • cache_position (torch.LongTensor of shape (sequence_length), optional) — Indices depicting the position of the input sequence tokens in the sequence. Contrarily to position_ids, this tensor is not affected by padding. It is used to update the cache in the correct position and to infer the complete sequence length.
  • logits_to_keep (Union[int, torch.Tensor], defaults to 0) — If an int, compute logits for the last logits_to_keep tokens. If 0, calculate logits for all input_ids (special case). Only last token logits are needed for generation, and calculating them only for that token can save memory, which becomes pretty significant for long sequences or large vocabulary size. If a torch.Tensor, must be 1D corresponding to the indices to keep in the sequence length dimension. This is useful when using packed tensor format (single dimension for batch and sequence length).

Returns

transformers.modeling_outputs.CausalLMOutputWithPast or tuple(torch.FloatTensor)

A transformers.modeling_outputs.CausalLMOutputWithPast or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (Llama4Config) and inputs.

  • loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Language modeling loss (for next-token prediction).

  • logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).

  • past_key_values (Cache, optional, returned when use_cache=True is passed or when config.use_cache=True) — It is a Cache instance. For more details, see our kv cache guide.

    Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.

  • hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.

  • attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

The Llama4ForCausalLM forward method, overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Example:

>>> from transformers import AutoTokenizer, Llama4ForCausalLM

>>> model = Llama4ForCausalLM.from_pretrained("meta-llama4/Llama4-2-7b-hf")
>>> tokenizer = AutoTokenizer.from_pretrained("meta-llama4/Llama4-2-7b-hf")

>>> prompt = "Hey, are you conscious? Can you talk to me?"
>>> inputs = tokenizer(prompt, return_tensors="pt")

>>> # Generate
>>> generate_ids = model.generate(inputs.input_ids, max_length=30)
>>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
"Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."

Llama4TextModel

class transformers.Llama4TextModel

< >

( config: Llama4TextConfig )

Parameters

  • config (Llama4TextConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The bare Llama4 Text Model outputting raw hidden-states without any specific head on top.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.

forward

< >

( input_ids: typing.Optional[torch.LongTensor] = None attention_mask: typing.Optional[torch.Tensor] = None position_ids: typing.Optional[torch.LongTensor] = None past_key_values: typing.Optional[transformers.cache_utils.Cache] = None inputs_embeds: typing.Optional[torch.FloatTensor] = None use_cache: typing.Optional[bool] = None cache_position: typing.Optional[torch.LongTensor] = None **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] ) transformers.modeling_outputs.BaseModelOutputWithPast or tuple(torch.FloatTensor)

Parameters

  • input_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default.

    Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details.

    What are input IDs?

  • attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:

    • 1 for tokens that are not masked,
    • 0 for tokens that are masked.

    What are attention masks?

  • position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence tokens in the position embeddings. Selected in the range [0, config.n_positions - 1].

    What are position IDs?

  • past_key_values (~cache_utils.Cache, optional) — Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists in the past_key_values returned by the model at a previous stage of decoding, when use_cache=True or config.use_cache=True.

    Only Cache instance is allowed as input, see our kv cache guide. If no past_key_values are passed, DynamicCache will be initialized by default.

    The model will output the same cache format that is fed as input.

    If past_key_values are used, the user is expected to input only unprocessed input_ids (those that don’t have their past key value states given to this model) of shape (batch_size, unprocessed_length) instead of all input_ids of shape (batch_size, sequence_length).

  • inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.
  • use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
  • cache_position (torch.LongTensor of shape (sequence_length), optional) — Indices depicting the position of the input sequence tokens in the sequence. Contrarily to position_ids, this tensor is not affected by padding. It is used to update the cache in the correct position and to infer the complete sequence length.

Returns

transformers.modeling_outputs.BaseModelOutputWithPast or tuple(torch.FloatTensor)

A transformers.modeling_outputs.BaseModelOutputWithPast or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (Llama4Config) and inputs.

  • last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model.

    If past_key_values is used only the last hidden-state of the sequences of shape (batch_size, 1, hidden_size) is output.

  • past_key_values (Cache, optional, returned when use_cache=True is passed or when config.use_cache=True) — It is a Cache instance. For more details, see our kv cache guide.

    Contains pre-computed hidden-states (key and values in the self-attention blocks and optionally if config.is_encoder_decoder=True in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.

  • hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.

  • attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

The Llama4TextModel forward method, overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.



Llama4VisionModel

class transformers.Llama4VisionModel

< >

( config: Llama4VisionConfig )

forward

< >

( pixel_values: Tensor attention_mask: typing.Optional[torch.Tensor] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None )

Example:

>>> from PIL import Image
>>> import requests
>>> from transformers import AutoImageProcessor, Llama4VisionModel

>>> checkpoint = "meta-llama/Llama-4-Scout-17B-16E-Instruct"
>>> model = Llama4VisionModel.from_pretrained(checkpoint)
>>> image_processor = AutoImageProcessor.from_pretrained(checkpoint)

>>> url = "https://www.ilankelman.org/stopsigns/australia.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> inputs = image_processor(images=image, return_tensors="pt")

>>> output = model(pixel_values=inputs["pixel_values"])
>>> print(output.last_hidden_state.shape)

get_input_embeddings

< >

( )

This function is used to fetch the first embedding layer to activate grads on inputs.
