
Meta released this model on 2025-04-05 and added it to Hugging Face Transformers on the same day.

Llama4

PyTorch FlashAttention Tensor parallelism

Llama 4, developed by Meta, introduces a new auto-regressive Mixture-of-Experts (MoE) architecture. This generation comes in two flavors:

  • The highly capable Llama 4 Maverick, with 17B active parameters out of roughly 400B total, using 128 experts
  • The lighter-weight Llama 4 Scout, with 17B active parameters out of roughly 109B total, using only 16 experts

Both models leverage early fusion for native multimodality and can process text and image inputs. Maverick and Scout were each trained on up to 40 trillion tokens of data covering 200 languages (with dedicated fine-tuning support for 12 languages, including Arabic, Spanish, German, and Hindi).

Meta designed Llama 4 Scout to be broadly accessible: with 4-bit or 8-bit quantization, Scout can run in real time on a single server-grade GPU. The larger Llama 4 Maverick, by contrast, is available in BF16 and FP8 formats for high-performance compute. These models are released under the custom Llama 4 Community License Agreement, available in the model repositories.

You can find all the original Llama checkpoints on the meta-llama organization page on Hugging Face.

The Llama 4 model family comes in two sizes: 109B and 402B parameters. Both are very large models and cannot run on typical hardware. Below we outline a few ways to reduce memory usage.

For faster and more reliable downloads, we recommend installing the hf_xet dependency: pip install transformers[hf_xet]

The examples below show how to generate with Pipeline or AutoModel. Because some Llama 4 variants support a context length of up to 10 million tokens, we also include an example showing how to toggle the right attributes to enable very long-context generation.

Pipeline
AutoModel - Text only
AutoModel - Multimodal
AutoModel - Multimodal with multiple images
AutoModel - Long context
from transformers import pipeline
import torch

model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"

messages = [
    {"role": "user", "content": "마요네즈 레시피가 무엇인가요?"},
]

pipe = pipeline(
    "text-generation",
    model=model_id,
    device_map="auto",
    dtype=torch.bfloat16
)

output = pipe(messages, do_sample=False, max_new_tokens=200)
print(output[0]["generated_text"][-1]["content"])
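
For reference, here is a minimal sketch of the "AutoModel - Text only" path listed above, using the same checkpoint. It mirrors the tokenizer and generation calls used in the quantization example later on this page.

from transformers import AutoTokenizer, Llama4ForConditionalGeneration
import torch

model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    dtype=torch.bfloat16,
)

messages = [
    {"role": "user", "content": "What is the recipe for mayonnaise?"},
]
# Build the chat prompt, generate, and decode only the newly generated tokens.
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt", return_dict=True)
outputs = model.generate(**inputs.to(model.device), max_new_tokens=200)
print(tokenizer.batch_decode(outputs[:, inputs["input_ids"].shape[-1]:])[0])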

Efficiency: getting the most out of Llama 4

Attention methods

Changing the attention function away from the default can significantly improve compute performance and memory usage. Refer to the attention interface overview for an in-depth explanation of the interface.

Llama 4 models support the following attention methods at launch: eager, flex_attention, sdpa. We recommend flex_attention for best results. Switching the attention mechanism is done at model initialization:

Flex Attention
SDPA
Eager

Flex Attention delivers the best performance when the model handles long contexts.

Note: the example below uses both device_map="auto" and flex-attention. To run this example in tensor parallel mode, use torchrun.

We plan to make it possible to run device_map="auto" together with flex-attention without tensor parallelism in the future.

from transformers import Llama4ForConditionalGeneration
import torch

model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"

model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    attn_implementation="flex_attention",
    device_map="auto",
    dtype=torch.bfloat16,
)
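
The SDPA and Eager tabs follow the same pattern; only the attn_implementation string changes. A minimal sketch for SDPA:

from transformers import Llama4ForConditionalGeneration
import torch

model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"

# Same loading call as above, with "sdpa" (or "eager") as the attention implementation.
model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    attn_implementation="sdpa",
    device_map="auto",
    dtype=torch.bfloat16,
)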

Quantization

Quantization reduces the memory footprint of large models by converting their weights to lower precision. See the quantization overview for the available quantization backends. FBGEMM and LLM-Compressor are supported at the moment, with more methods coming soon.

See below for examples using both methods:

Here is an example of loading a BF16 model in FP8 using the FBGEMM approach:

FBGEMM
LLM-Compressor
from transformers import AutoTokenizer, Llama4ForConditionalGeneration, FbgemmFp8Config
import torch

model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "user", "content": "당신은 누구신가요?"},
]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt", return_dict=True)

model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    dtype=torch.bfloat16,
    quantization_config=FbgemmFp8Config()
)

outputs = model.generate(**inputs.to(model.device), max_new_tokens=100)
outputs = tokenizer.batch_decode(outputs[:, inputs["input_ids"].shape[-1]:])
print(outputs[0])
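
For the LLM-Compressor path, the simplest route is to load a checkpoint that has already been quantized to FP8 with LLM-Compressor (compressed-tensors); no extra quantization config is needed at load time. A minimal sketch, assuming the released FP8 Maverick repository name and that the compressed-tensors package is installed:

from transformers import Llama4ForConditionalGeneration
import torch

# Assumed repository name of the FP8 checkpoint produced with LLM-Compressor / compressed-tensors.
model = Llama4ForConditionalGeneration.from_pretrained(
    "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8",
    device_map="auto",
    dtype=torch.bfloat16,
)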

Offloading

With CPU offloading enabled, the model moves components to the CPU when GPU memory runs out. At inference time, the different components are dynamically loaded and unloaded between the GPU and CPU. This lets you load the model on a smaller machine as long as there is enough CPU memory, although inference may be slower due to the communication overhead.

To enable CPU offloading, set device_map to "auto" when loading the model:

from transformers import Llama4ForConditionalGeneration
import torch

model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"

model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    dtype=torch.bfloat16,
)

Llama4Config

class transformers.Llama4Config

< >

( vision_config = None text_config = None boi_token_index = 200080 eoi_token_index = 200081 image_token_index = 200092 tie_word_embeddings = False **kwargs )

Parameters

  • vision_config (Llama4VisionConfig, optional) — The Llama4 Vision config.
  • text_config (Llama4TextConfig, optional) — The Llama4 Text config.
  • boi_token_index (int, optional, defaults to 200080) — The begin-of-image token index to wrap the image prompt.
  • eoi_token_index (int, optional, defaults to 200081) — The end-of-image token index to wrap the image prompt.
  • image_token_index (int, optional, defaults to 200092) — The image token index to encode the image prompt.
  • tie_word_embeddings (bool, optional, defaults to False) — Whether the model’s input and output word embeddings should be tied.

This is the configuration class to store the configuration of a Llama4Model. It is used to instantiate a Llama4 model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the Llama4 109B.

e.g. meta-llama/Llama-4-Scout-17B-16E

Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.

>>> from transformers import Llama4Model, Llama4Config

>>> # Initializing a Llama4 109B style configuration
>>> configuration = Llama4Config()

>>> # Initializing a model from the Llama4 109B style configuration
>>> model = Llama4Model(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config

Llama4TextConfig

class transformers.Llama4TextConfig

< >

( vocab_size = 202048 hidden_size = 5120 intermediate_size = 8192 intermediate_size_mlp = 16384 num_hidden_layers = 48 num_attention_heads = 40 num_key_value_heads = 8 head_dim = 128 hidden_act = 'silu' max_position_embeddings = 131072 initializer_range = 0.02 rms_norm_eps = 1e-05 use_cache = True pad_token_id = None bos_token_id = 1 eos_token_id = 2 tie_word_embeddings = False attention_dropout = 0.0 num_experts_per_tok = 1 num_local_experts = 16 moe_layers = None interleave_moe_layer_step = 1 use_qk_norm = True output_router_logits = False router_aux_loss_coef = 0.001 router_jitter_noise = 0.0 rope_parameters: typing.Union[transformers.modeling_rope_utils.RopeParameters, dict[transformers.modeling_rope_utils.RopeParameters], NoneType] = None no_rope_layers = None no_rope_layer_interval = 4 attention_chunk_size = 8192 layer_types = None attn_temperature_tuning = True floor_scale = 8192 attn_scale = 0.1 **kwargs )

Parameters

  • vocab_size (int, optional, defaults to 202048) — Vocabulary size of the Llama4 text model. Defines the maximum number of different tokens that can be represented by the inputs_ids passed when calling Llama4TextModel.
  • hidden_size (int, optional, defaults to 5120) — Dimensionality of the embeddings and hidden states.
  • intermediate_size (int, optional, defaults to 8192) — Dimensionality of the “intermediate” (often named feed-forward) layer in the Transformer encoder.
  • intermediate_size_mlp (int, optional, defaults to 16384) — Dimensionality of the feed-forward layer in the dense (non-MoE) MLP blocks.
  • num_hidden_layers (int, optional, defaults to 48) — Number of hidden layers in the Transformer encoder.
  • num_attention_heads (int, optional, defaults to 40) — Number of attention heads for each attention layer in the Transformer encoder.
  • num_key_value_heads (int, optional, defaults to 8) — This is the number of key_value heads that should be used to implement Grouped Query Attention. If not specified, will default to num_attention_heads.
  • head_dim (int, optional, defaults to 128) — The dimension of each attention head.
  • hidden_act (str or Callable, optional, defaults to "silu") — The non-linear activation function (function or string) in the encoder and pooler.
  • max_position_embeddings (int, optional, defaults to 131072) — The maximum sequence length that this model might ever be used with.
  • initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
  • rms_norm_eps (float, optional, defaults to 1e-05) — The epsilon used by the rms normalization layers.
  • use_cache (bool, optional, defaults to True) — Whether or not the model should return the last key/values attentions.
  • pad_token_id (int, optional) — The id of the padding token.
  • bos_token_id (int, optional, defaults to 1) — The id of the beginning of sentence token.
  • eos_token_id (int, optional, defaults to 2) — The id of the end of sentence token.
  • tie_word_embeddings (bool, optional, defaults to False) — Whether to tie weight embeddings
  • attention_dropout (float, optional, defaults to 0.0) — The dropout ratio for the attention probabilities.
  • num_experts_per_tok (int, optional, defaults to 1) — The number of experts each token is routed to.
  • num_local_experts (int, optional, defaults to 16) — The number of experts in each MoE layer.
  • moe_layers (list[int], optional) — Indices of the decoder layers that use MoE blocks. If not set, it is derived from interleave_moe_layer_step.
  • interleave_moe_layer_step (int, optional, defaults to 1) — Step size with which MoE layers are interleaved among the decoder layers.
  • use_qk_norm (bool, optional, defaults to True) — Whether to normalize the query and key states in the attention layers.
  • output_router_logits (bool, optional, defaults to False) — Whether or not the router logits should be returned by the model.
  • router_aux_loss_coef (float, optional, defaults to 0.001) — The auxiliary loss coefficient for the router load-balancing loss.
  • router_jitter_noise (float, optional, defaults to 0.0) — Amount of noise to add to the router.
  • rope_parameters (RopeParameters, optional) — Dictionary containing the configuration parameters for the RoPE embeddings. The dictionary should contain a value for rope_theta and optionally parameters used for scaling in case you want to use RoPE with longer max_position_embeddings.
  • no_rope_layers (list[int], optional) — List with at least the same length as the number of layers in the model. A 1 at an index position indicates that the corresponding layer will use RoPE, while a 0 indicates that it’s a NoPE layer.
  • no_rope_layer_interval (int, optional, defaults to 4) — If no_rope_layers is None, it will be created using a NoPE layer every no_rope_layer_interval layers.
  • attention_chunk_size (int, optional, defaults to 8192) — Size of the chunks used by the local (chunked) attention layers.
  • layer_types (list, optional) — Attention pattern for each layer.
  • attn_temperature_tuning (bool, optional, defaults to True) — Whether to dynamically scale the attention temperature for each query token based on sequence length. Recommended for long sequences (e.g., >32k tokens) to maintain stable output results.
  • floor_scale (int, optional, defaults to 8192) — Scale factor used by the attention temperature tuning mechanism.
  • attn_scale (float, optional, defaults to 0.1) — Scaling coefficient used by the attention temperature tuning mechanism.

This is the configuration class to store the configuration of a Llama4TextModel. It is used to instantiate a Llama4 text model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the Llama4 109B.

e.g. meta-llama/Llama-4-Scout-17B-16E

Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.

Example:
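
A minimal usage sketch, mirroring the Llama4Config example above:

>>> from transformers import Llama4TextModel, Llama4TextConfig

>>> # Initializing a Llama4 text configuration with default values
>>> configuration = Llama4TextConfig()

>>> # Initializing a text model from that configuration
>>> model = Llama4TextModel(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config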

Llama4VisionConfig

class transformers.Llama4VisionConfig

< >

( hidden_size: typing.Optional[int] = 768 hidden_act: typing.Optional[str] = 'gelu' num_hidden_layers: typing.Optional[int] = 34 num_attention_heads: typing.Optional[int] = 16 num_channels: typing.Optional[int] = 3 intermediate_size: typing.Optional[int] = 5632 vision_output_dim: typing.Optional[int] = 7680 image_size: typing.Optional[int] = 448 patch_size: typing.Optional[int] = 14 norm_eps: typing.Optional[float] = 1e-05 vision_feature_select_strategy: typing.Optional[str] = 'default' initializer_range: typing.Optional[float] = 0.02 pixel_shuffle_ratio: typing.Optional[float] = 0.5 projector_input_dim: typing.Optional[int] = 4096 projector_output_dim: typing.Optional[int] = 4096 multi_modal_projector_bias: typing.Optional[bool] = False projector_dropout: typing.Optional[float] = 0.0 attention_dropout: typing.Optional[float] = 0.0 rope_parameters: typing.Union[transformers.modeling_rope_utils.RopeParameters, dict[transformers.modeling_rope_utils.RopeParameters], NoneType] = None **kwargs )

Parameters

  • hidden_size (int, optional, defaults to 768) — Dimensionality of the encoder layers and the pooler layer.
  • hidden_act (str or function, optional, defaults to "gelu") — The non-linear activation function (function or string) in the encoder and pooler. If string, "gelu", "relu", "selu", "gelu_new" and "quick_gelu" are supported.
  • num_hidden_layers (int, optional, defaults to 34) — Number of hidden layers in the Transformer encoder.
  • num_attention_heads (int, optional, defaults to 16) — Number of attention heads for each attention layer in the Transformer encoder.
  • num_channels (int, optional, defaults to 3) — Number of channels in the input image.
  • intermediate_size (int, optional, defaults to 5632) — Dimensionality of the “intermediate” (often named feed-forward) layer in the Transformer encoder.
  • vision_output_dim (int, optional, defaults to 7680) — Dimensionality of the vision model output. Includes output of transformer encoder with intermediate layers and global transformer encoder.
  • image_size (int, optional, defaults to 448) — The size (resolution) of each image tile.
  • patch_size (int, optional, defaults to 14) — The size (resolution) of each patch.
  • norm_eps (float, optional, defaults to 1e-05) — The epsilon used by the layer normalization layers.
  • vision_feature_select_strategy (str, optional, defaults to "default") — The feature selection strategy used to select the vision feature from the vision backbone. Can be one of "default" or "full".
  • initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
  • pixel_shuffle_ratio (float, optional, defaults to 0.5) — Ratio used by the pixel shuffle operation to downsample the vision features.
  • projector_input_dim (int, optional, defaults to 4096) — Input dimensionality of the multi-modal projector.
  • projector_output_dim (int, optional, defaults to 4096) — Output dimensionality of the multi-modal projector.
  • multi_modal_projector_bias (bool, optional, defaults to False) — Whether to use a bias in the multi-modal projector.
  • projector_dropout (float, optional, defaults to 0.0) — Dropout probability for the multi-modal projector.
  • attention_dropout (float, optional, defaults to 0.0) — The dropout ratio for the attention probabilities.
  • rope_parameters (RopeParameters, optional) — RoPE Parameters

This is the configuration class to store the configuration of a Llama4VisionModel. It is used to instantiate a Llama4 vision model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the Llama4 109B.

e.g. meta-llama/Llama-4-Scout-17B-16E

Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.
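
A minimal usage sketch in the same style as the examples above:

>>> from transformers import Llama4VisionModel, Llama4VisionConfig

>>> # Initializing a Llama4 vision configuration with default values
>>> configuration = Llama4VisionConfig()

>>> # Initializing a vision model from that configuration
>>> model = Llama4VisionModel(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config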

Llama4Processor

class transformers.Llama4Processor

< >

( image_processor = None tokenizer = None patch_size: int = 14 pixel_shuffle_ratio: float = 0.5 fake_image_token = '<|image|>' image_token = '<|image|>' start_of_image_token = '<|image_start|>' end_of_image_token = '<|image_end|>' patch_token = '<|patch|>' tile_x_separator_token = '<|tile_x_separator|>' tile_y_separator_token = '<|tile_y_separator|>' chat_template = '{{- bos_token }}\n{%- if custom_tools is defined %}\n {%- set tools = custom_tools %}\n{%- endif %}\n{%- if not tools_in_user_message is defined %}\n {%- set tools_in_user_message = true %}\n{%- endif %}\n{%- if not date_string is defined %}\n {%- if strftime_now is defined %}\n {%- set date_string = strftime_now("%d %b %Y") %}\n {%- else %}\n {%- set date_string = "26 Jul 2024" %}\n {%- endif %}\n{%- endif %}\n{%- if not tools is defined %}\n {%- set tools = none %}\n{%- endif %}\n\n{#- This block extracts the system message, so we can slot it into the right place. #}\n{%- if messages[0][\'role\'] == \'system\' %} \n {%- if messages[0][\'content\'] is string %}\n {%- set system_message = messages[0][\'content\']|trim %}\n {%- else %}\n {#- FIXME: The processor requires an array, always. #}\n {%- set system_message = messages[0][\'content\'][0][\'text\']|trim %}\n {%- endif %}\n {%- set messages = messages[1:] %}\n {%- set user_supplied_system_message = true %}\n{%- else %}\n {%- set system_message = "" %}\n {%- set user_supplied_system_message = false %}\n{%- endif %}\n\n{#- System message if the user supplied one #}\n{%- if user_supplied_system_message %}\n {{- "<|header_start|>system<|header_end|>\n\n" }}\n {%- if tools is not none %}\n {{- "Environment: ipython\n" }}\n {%- endif %}\n {%- if tools is not none and not tools_in_user_message %}\n {{- "You have access to the following functions. To call a function, please respond with JSON for a function call." 
}}\n {{- \'Respond in the format {"name": function name, "parameters": dictionary of argument name and its value}.\' }}\n {{- "Do not use variables.\n\n" }}\n {%- for t in tools %}\n {{- t | tojson(indent=4) }}\n {{- "\n\n" }}\n {%- endfor %}\n {%- endif %}\n {{- system_message }}\n {{- "<|eot|>" }}\n{%- endif %}\n\n{#- Custom tools are passed in a user message with some extra guidance #}\n{%- if tools_in_user_message and not tools is none %}\n {#- Extract the first user message so we can plug it in here #}\n {%- if messages | length != 0 %}\n {%- set first_user_message = messages[0][\'content\']|trim %}\n {%- set messages = messages[1:] %}\n {%- else %}\n {{- raise_exception("Cannot put tools in the first user message when there\'s no first user message!") }}\n{%- endif %}\n {{- \'<|header_start|>user<|header_end|>\n\n\' -}}\n {{- "Given the following functions, please respond with a JSON for a function call " }}\n {{- "with its proper arguments that best answers the given prompt.\n\n" }}\n {{- \'Respond in the format {"name": function name, "parameters": dictionary of argument name and its value}.\' }}\n {{- "Do not use variables.\n\n" }}\n {%- for t in tools %}\n {{- t | tojson(indent=4) }}\n {{- "\n\n" }}\n {%- endfor %}\n {{- first_user_message + "<|eot|>"}}\n{%- endif %}\n\n{%- for message in messages %}\n {%- if not (message.role == \'ipython\' or message.role == \'tool\' or \'tool_calls\' in message) %}\n {{- \'<|header_start|>\' + message[\'role\'] + \'<|header_end|>\n\n\' }}\n {%- if message[\'content\'] is string %}\n {{- message[\'content\'] }}\n {%- else %}\n {%- for content in message[\'content\'] %}\n {%- if content[\'type\'] == \'image\' %}\n {{- \'<|image|>\' }}\n {%- elif content[\'type\'] == \'text\' %}\n {{- content[\'text\'] }}\n {%- endif %}\n {%- endfor %}\n {%- endif %}\n {{- "<|eot|>" }}\n {%- elif \'tool_calls\' in message and message.tool_calls|length > 0 %}\n {{- \'<|header_start|>assistant<|header_end|>\n\n\' -}}\n {{- \'<|python_start|>\' }}\n {%- if message[\'content\'] is string %}\n {{- message[\'content\'] }}\n {%- else %}\n {%- for content in message[\'content\'] %}\n {%- if content[\'type\'] == \'image\' %}\n {{- \'<|image|>\' }}\n {%- elif content[\'type\'] == \'text\' %}\n {{- content[\'text\'] }}\n {%- endif %}\n {%- endfor %}\n {%- endif %}\n {{- \'<|python_end|>\' }}\n {%- for tool_call in message.tool_calls %}\n {{- \'{"name": "\' + tool_call.function.name + \'", \' }}\n {{- \'"parameters": \' }}\n {{- tool_call.function.arguments | tojson }}\n {{- "}" }}\n {%- endfor %}\n {{- "<|eot|>" }}\n {%- elif message.role == "tool" or message.role == "ipython" %}\n {{- "<|header_start|>ipython<|header_end|>\n\n" }}\n {%- if message.content is mapping or message.content is iterable %}\n {{- message.content | tojson }}\n {%- else %}\n {{- message.content }}\n {%- endif %}\n {{- "<|eot|>" }}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- \'<|header_start|>assistant<|header_end|>\n\n\' }}\n{%- endif %}\n' **kwargs )

Parameters

  • image_processor (AutoImageProcessor, optional) — The image processor is a required input.
  • tokenizer ([PreTrainedTokenizer, PreTrainedTokenizerFast], optional) — The tokenizer is a required input.
  • patch_size (int, optional, defaults to 14) — The size of image patches for tokenization.
  • pixel_shuffle_ratio (float, optional, defaults to 0.5) — The pixel shuffle ratio used when downsampling image patches.
  • fake_image_token (str, optional, defaults to "<|image|>") — Placeholder image token in the prompt that is expanded into the full image token sequence during processing.
  • image_token (str, optional, defaults to "<|image|>") — The token to be used to represent an image in the text.
  • start_of_image_token (str, optional, defaults to "<|image_start|>") — The token to be used to represent the start of an image in the text.
  • end_of_image_token (str, optional, defaults to "<|image_end|>") — The token to be used to represent the end of an image in the text.
  • patch_token (str, optional, defaults to "<|patch|>") — The token to be used to represent an image patch in the text.
  • tile_x_separator_token (str, optional, defaults to "<|tile_x_separator|>") — The token used to separate image tiles along the horizontal axis.
  • tile_y_separator_token (str, optional, defaults to "<|tile_y_separator|>") — The token used to separate rows of image tiles.
  • chat_template (str, optional) — A Jinja template which will be used to convert lists of messages in a chat into a tokenizable string.

Constructs a Llama4 processor which wraps an AutoImageProcessor and a PreTrainedTokenizerFast tokenizer into a single processor that inherits both the image processor and tokenizer functionalities. See the __call__() and decode() for more information.
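
A minimal sketch of building a multimodal prompt through the processor's chat template (the checkpoint name and image URL are reused from examples elsewhere on this page; fetching the image requires network access):

>>> from transformers import Llama4Processor

>>> processor = Llama4Processor.from_pretrained("meta-llama/Llama-4-Scout-17B-16E-Instruct")
>>> messages = [
...     {"role": "user", "content": [
...         {"type": "image", "url": "https://www.ilankelman.org/stopsigns/australia.jpg"},
...         {"type": "text", "text": "What's the content of the image?"},
...     ]},
... ]
>>> # Tokenizes the text and processes the image in one call.
>>> inputs = processor.apply_chat_template(
...     messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
... )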

Llama4ImageProcessorFast

class transformers.Llama4ImageProcessorFast

< >

( **kwargs: typing_extensions.Unpack[transformers.models.llama4.image_processing_llama4_fast.Llama4ImageProcessorKwargs] )

Constructs a fast Llama4 image processor.

preprocess

< >

( images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']] **kwargs: typing_extensions.Unpack[transformers.models.llama4.image_processing_llama4_fast.Llama4ImageProcessorKwargs] ) <class 'transformers.image_processing_base.BatchFeature'>

Parameters

  • images (Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']]) — Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If passing in images with pixel values between 0 and 1, set do_rescale=False.
  • do_convert_rgb (bool, optional) — Whether to convert the image to RGB.
  • do_resize (bool, optional) — Whether to resize the image.
  • size (Annotated[Union[int, list[int], tuple[int, ...], dict[str, int], NoneType], None]) — Describes the maximum input dimensions to the model.
  • crop_size (Annotated[Union[int, list[int], tuple[int, ...], dict[str, int], NoneType], None]) — Size of the output image after applying center_crop.
  • resample (Annotated[Union[PILImageResampling, int, NoneType], None]) — Resampling filter to use if resizing the image. This can be one of the enum PILImageResampling. Only has an effect if do_resize is set to True.
  • do_rescale (bool, optional) — Whether to rescale the image.
  • rescale_factor (float, optional) — Rescale factor to rescale the image by if do_rescale is set to True.
  • do_normalize (bool, optional) — Whether to normalize the image.
  • image_mean (Union[float, list[float], tuple[float, ...], NoneType]) — Image mean to use for normalization. Only has an effect if do_normalize is set to True.
  • image_std (Union[float, list[float], tuple[float, ...], NoneType]) — Image standard deviation to use for normalization. Only has an effect if do_normalize is set to True.
  • do_pad (bool, optional) — Whether to pad the image. Padding is done either to the largest size in the batch or to a fixed square size per image. The exact padding strategy depends on the model.
  • pad_size (Annotated[Union[int, list[int], tuple[int, ...], dict[str, int], NoneType], None]) — The size in {"height": int, "width" int} to pad the images to. Must be larger than any image size provided for preprocessing. If pad_size is not provided, images will be padded to the largest height and width in the batch. Applied only when do_pad=True.
  • do_center_crop (bool, optional) — Whether to center crop the image.
  • data_format (Union[~image_utils.ChannelDimension, str, NoneType]) — Only ChannelDimension.FIRST is supported. Added for compatibility with slow processors.
  • input_data_format (Union[~image_utils.ChannelDimension, str, NoneType]) — The channel dimension format for the input image. If unset, the channel dimension format is inferred from the input image. Can be one of:
    • "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
    • "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
    • "none" or ChannelDimension.NONE: image in (height, width) format.
  • device (Annotated[str, None], optional) — The device to process the images on. If unset, the device is inferred from the input images.
  • return_tensors (Annotated[Union[str, ~utils.generic.TensorType, NoneType], None]) — Returns stacked tensors if set to "pt", otherwise returns a list of tensors.
  • disable_grouping (bool, optional) — Whether to disable grouping of images by size to process them individually and not in batches. If None, will be set to True if the images are on CPU, and False otherwise. This choice is based on empirical observations, as detailed here: https://github.com/huggingface/transformers/pull/38157
  • max_patches (int, optional, defaults to 16) — The maximum number of patches to be extracted from the image. Can be overridden by the max_patches parameter in the preprocess method.
  • resize_to_max_canvas (bool, optional, defaults to False) — Whether to resize the image to the maximum canvas size. If True, picks the canvas that allows the largest resizing without distortion. If False, downsamples as little as possible, including no resizing at all, but never upsamples, unless the image is smaller than the patch size.

Returns

<class 'transformers.image_processing_base.BatchFeature'>

  • data (dict) — Dictionary of lists/arrays/tensors returned by the call method (‘pixel_values’, etc.).
  • tensor_type (Union[None, str, TensorType], optional) — You can give a tensor_type here to convert the lists of integers in PyTorch/Numpy Tensors at initialization.
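
A minimal sketch of calling the fast image processor directly (the checkpoint name and image URL are reused from other examples on this page):

>>> from PIL import Image
>>> import requests
>>> from transformers import AutoImageProcessor

>>> image_processor = AutoImageProcessor.from_pretrained("meta-llama/Llama-4-Scout-17B-16E-Instruct")
>>> url = "https://www.ilankelman.org/stopsigns/australia.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> batch = image_processor(images=image, return_tensors="pt")
>>> pixel_values = batch["pixel_values"]  # stacked image tiles ready for the vision tower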

rescale_and_normalize

< >

( images: torch.Tensor do_rescale: bool rescale_factor: float do_normalize: bool image_mean: typing.Union[float, list[float]] image_std: typing.Union[float, list[float]] )

Rescale and normalize images. This override rescales and normalizes the images in torch.bfloat16, as in the original implementation.

Llama4ForConditionalGeneration

class transformers.Llama4ForConditionalGeneration

< >

( config: Llama4Config )

forward

< >

( input_ids: typing.Optional[torch.LongTensor] = None pixel_values: typing.Optional[torch.FloatTensor] = None attention_mask: typing.Optional[torch.Tensor] = None position_ids: typing.Optional[torch.LongTensor] = None past_key_values: typing.Optional[transformers.cache_utils.Cache] = None inputs_embeds: typing.Optional[torch.FloatTensor] = None vision_feature_layer: typing.Union[list[int], int, NoneType] = None vision_feature_select_strategy: typing.Optional[str] = None labels: typing.Optional[torch.LongTensor] = None use_cache: typing.Optional[bool] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None cache_position: typing.Optional[torch.LongTensor] = None logits_to_keep: typing.Union[int, torch.Tensor] = 0 **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] ) transformers.models.llama4.modeling_llama4.Llama4CausalLMOutputWithPast or tuple(torch.FloatTensor)

Parameters

  • input_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default.

    Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details.

    What are input IDs?

  • pixel_values (torch.FloatTensor of shape (batch_size, num_channels, image_size, image_size), optional) — The tensors corresponding to the input images. Pixel values can be obtained using Llama4ImageProcessorFast. See Llama4ImageProcessorFast.__call__ for details (Llama4Processor uses Llama4ImageProcessorFast for processing images).
  • attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:

    • 1 for tokens that are not masked,
    • 0 for tokens that are masked.

    What are attention masks?

  • position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence tokens in the position embeddings. Selected in the range [0, config.n_positions - 1].

    What are position IDs?

  • past_key_values (~cache_utils.Cache, optional) — Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists in the past_key_values returned by the model at a previous stage of decoding, when use_cache=True or config.use_cache=True.

    Only Cache instance is allowed as input, see our kv cache guide. If no past_key_values are passed, DynamicCache will be initialized by default.

    The model will output the same cache format that is fed as input.

    If past_key_values are used, the user is expected to input only unprocessed input_ids (those that don’t have their past key value states given to this model) of shape (batch_size, unprocessed_length) instead of all input_ids of shape (batch_size, sequence_length).

  • inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.
  • vision_feature_layer (Union[list[int], int, NoneType]) — The index of the layer to select the vision feature. If multiple indices are provided, the vision feature of the corresponding indices will be concatenated to form the vision features.
  • vision_feature_select_strategy (str, optional) — The feature selection strategy used to select the vision feature from the vision backbone. Can be one of "default" or "full".
  • labels (torch.LongTensor of shape (batch_size, sequence_length), optional) — Labels for computing the masked language modeling loss. Indices should either be in [0, ..., config.vocab_size] or -100 (see input_ids docstring). Tokens with indices set to -100 are ignored (masked), the loss is only computed for the tokens with labels in [0, ..., config.vocab_size].
  • use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
  • output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
  • output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
  • return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
  • cache_position (torch.LongTensor of shape (sequence_length), optional) — Indices depicting the position of the input sequence tokens in the sequence. Contrarily to position_ids, this tensor is not affected by padding. It is used to update the cache in the correct position and to infer the complete sequence length.
  • logits_to_keep (Union[int, torch.Tensor], defaults to 0) — If an int, compute logits for the last logits_to_keep tokens. If 0, calculate logits for all input_ids (special case). Only last token logits are needed for generation, and calculating them only for that token can save memory, which becomes pretty significant for long sequences or large vocabulary size. If a torch.Tensor, must be 1D corresponding to the indices to keep in the sequence length dimension. This is useful when using packed tensor format (single dimension for batch and sequence length).

Returns

transformers.models.llama4.modeling_llama4.Llama4CausalLMOutputWithPast or tuple(torch.FloatTensor)

A transformers.models.llama4.modeling_llama4.Llama4CausalLMOutputWithPast or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (Llama4Config) and inputs.

  • loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Language modeling loss (for next-token prediction).

  • logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).

  • past_key_values (Cache, optional, returned when use_cache=True is passed or when config.use_cache=True) — It is a Cache instance. For more details, see our kv cache guide.

    Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.

  • hidden_states (tuple[torch.FloatTensor], optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.

  • attentions (tuple[torch.FloatTensor], optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

  • image_hidden_states (torch.FloatTensor, optional) — A torch.FloatTensor of size (batch_size, num_images, sequence_length, hidden_size). image_hidden_states of the model produced by the vision encoder and after projecting the last hidden state.

The Llama4ForConditionalGeneration forward method, overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Example:

>>> from transformers import AutoProcessor, Llama4ForConditionalGeneration
>>> import torch

>>> model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"
>>> model = Llama4ForConditionalGeneration.from_pretrained(model_id, device_map="auto", dtype=torch.bfloat16)
>>> processor = AutoProcessor.from_pretrained(model_id)

>>> url = "https://www.ilankelman.org/stopsigns/australia.jpg"
>>> messages = [
...     {"role": "user", "content": [
...         {"type": "image", "url": url},
...         {"type": "text", "text": "What's the content of the image?"},
...     ]},
... ]
>>> inputs = processor.apply_chat_template(
...     messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
... ).to(model.device)

>>> # Generate and decode only the newly generated tokens
>>> generate_ids = model.generate(**inputs, max_new_tokens=15)
>>> print(processor.batch_decode(generate_ids[:, inputs["input_ids"].shape[-1]:], skip_special_tokens=True)[0])

get_image_features

< >

( pixel_values: FloatTensor vision_feature_select_strategy: str **kwargs ) image_features (torch.Tensor)

Parameters

  • pixel_values (torch.FloatTensor of shape (batch_size, channels, height, width)) — The tensors corresponding to the input images.
  • vision_feature_select_strategy (str) — The feature selection strategy used to select the vision feature from the vision backbone. Can be one of "default" or "full".

Returns

image_features (torch.Tensor)

Image feature tensor of shape (num_images, image_length, embed_dim).

Obtains the image's last hidden states from the vision tower and applies multimodal projection.

get_placeholder_mask

< >

( input_ids: LongTensor inputs_embeds: FloatTensor image_features: FloatTensor )

Obtains multimodal placeholder mask from input_ids or inputs_embeds, and checks that the placeholder token count is equal to the length of multimodal features. If the lengths are different, an error is raised.


Llama4ForCausalLM

class transformers.Llama4ForCausalLM

< >

( config: Llama4TextConfig )

forward

< >

( input_ids: typing.Optional[torch.LongTensor] = None attention_mask: typing.Optional[torch.Tensor] = None position_ids: typing.Optional[torch.LongTensor] = None past_key_values: typing.Optional[transformers.cache_utils.Cache] = None inputs_embeds: typing.Optional[torch.FloatTensor] = None labels: typing.Optional[torch.LongTensor] = None use_cache: typing.Optional[bool] = None cache_position: typing.Optional[torch.LongTensor] = None logits_to_keep: typing.Union[int, torch.Tensor] = 0 **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] ) transformers.modeling_outputs.CausalLMOutputWithPast or tuple(torch.FloatTensor)

Parameters

  • input_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default.

    Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details.

    What are input IDs?

  • attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:

    • 1 for tokens that are not masked,
    • 0 for tokens that are masked.

    What are attention masks?

  • position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence tokens in the position embeddings. Selected in the range [0, config.n_positions - 1].

    What are position IDs?

  • past_key_values (~cache_utils.Cache, optional) — Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists in the past_key_values returned by the model at a previous stage of decoding, when use_cache=True or config.use_cache=True.

    Only Cache instance is allowed as input, see our kv cache guide. If no past_key_values are passed, DynamicCache will be initialized by default.

    The model will output the same cache format that is fed as input.

    If past_key_values are used, the user is expected to input only unprocessed input_ids (those that don’t have their past key value states given to this model) of shape (batch_size, unprocessed_length) instead of all input_ids of shape (batch_size, sequence_length).

  • inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.
  • labels (torch.LongTensor of shape (batch_size, sequence_length), optional) — Labels for computing the masked language modeling loss. Indices should either be in [0, ..., config.vocab_size] or -100 (see input_ids docstring). Tokens with indices set to -100 are ignored (masked), the loss is only computed for the tokens with labels in [0, ..., config.vocab_size].
  • use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
  • cache_position (torch.LongTensor of shape (sequence_length), optional) — Indices depicting the position of the input sequence tokens in the sequence. Contrarily to position_ids, this tensor is not affected by padding. It is used to update the cache in the correct position and to infer the complete sequence length.
  • logits_to_keep (Union[int, torch.Tensor], defaults to 0) — If an int, compute logits for the last logits_to_keep tokens. If 0, calculate logits for all input_ids (special case). Only last token logits are needed for generation, and calculating them only for that token can save memory, which becomes pretty significant for long sequences or large vocabulary size. If a torch.Tensor, must be 1D corresponding to the indices to keep in the sequence length dimension. This is useful when using packed tensor format (single dimension for batch and sequence length).

Returns

transformers.modeling_outputs.CausalLMOutputWithPast or tuple(torch.FloatTensor)

A transformers.modeling_outputs.CausalLMOutputWithPast or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (Llama4Config) and inputs.

  • loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Language modeling loss (for next-token prediction).

  • logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).

  • past_key_values (Cache, optional, returned when use_cache=True is passed or when config.use_cache=True) — It is a Cache instance. For more details, see our kv cache guide.

    Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.

  • hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.

  • attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

The Llama4ForCausalLM forward method, overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Example:

>>> from transformers import AutoTokenizer, Llama4ForCausalLM

>>> model = Llama4ForCausalLM.from_pretrained("meta-llama4/Llama4-2-7b-hf")
>>> tokenizer = AutoTokenizer.from_pretrained("meta-llama4/Llama4-2-7b-hf")

>>> prompt = "Hey, are you conscious? Can you talk to me?"
>>> inputs = tokenizer(prompt, return_tensors="pt")

>>> # Generate
>>> generate_ids = model.generate(inputs.input_ids, max_length=30)
>>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
"Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."

Llama4TextModel

class transformers.Llama4TextModel

< >

( config: Llama4TextConfig )

Parameters

  • config (Llama4TextConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The bare Llama4 Text Model outputting raw hidden-states without any specific head on top.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.

forward

< >

( input_ids: typing.Optional[torch.LongTensor] = None attention_mask: typing.Optional[torch.Tensor] = None position_ids: typing.Optional[torch.LongTensor] = None past_key_values: typing.Optional[transformers.cache_utils.Cache] = None inputs_embeds: typing.Optional[torch.FloatTensor] = None use_cache: typing.Optional[bool] = None cache_position: typing.Optional[torch.LongTensor] = None **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] ) transformers.modeling_outputs.BaseModelOutputWithPast or tuple(torch.FloatTensor)

Parameters

  • input_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default.

    Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details.

    What are input IDs?

  • attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:

    • 1 for tokens that are not masked,
    • 0 for tokens that are masked.

    What are attention masks?

  • position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence tokens in the position embeddings. Selected in the range [0, config.n_positions - 1].

    What are position IDs?

  • past_key_values (~cache_utils.Cache, optional) — Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists in the past_key_values returned by the model at a previous stage of decoding, when use_cache=True or config.use_cache=True.

    Only Cache instance is allowed as input, see our kv cache guide. If no past_key_values are passed, DynamicCache will be initialized by default.

    The model will output the same cache format that is fed as input.

    If past_key_values are used, the user is expected to input only unprocessed input_ids (those that don’t have their past key value states given to this model) of shape (batch_size, unprocessed_length) instead of all input_ids of shape (batch_size, sequence_length).

  • inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.
  • use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
  • cache_position (torch.LongTensor of shape (sequence_length), optional) — Indices depicting the position of the input sequence tokens in the sequence. Contrarily to position_ids, this tensor is not affected by padding. It is used to update the cache in the correct position and to infer the complete sequence length.

Returns

transformers.modeling_outputs.BaseModelOutputWithPast or tuple(torch.FloatTensor)

A transformers.modeling_outputs.BaseModelOutputWithPast or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (Llama4Config) and inputs.

  • last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model.

    If past_key_values is used only the last hidden-state of the sequences of shape (batch_size, 1, hidden_size) is output.

  • past_key_values (Cache, optional, returned when use_cache=True is passed or when config.use_cache=True) — It is a Cache instance. For more details, see our kv cache guide.

    Contains pre-computed hidden-states (key and values in the self-attention blocks and optionally if config.is_encoder_decoder=True in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.

  • hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.

  • attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

The Llama4TextModel forward method, overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.



Llama4VisionModel

class transformers.Llama4VisionModel

< >

( config: Llama4VisionConfig )

forward

< >

( pixel_values: Tensor attention_mask: typing.Optional[torch.Tensor] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None )

Example:

>>> from PIL import Image
>>> import requests
>>> from transformers import AutoImageProcessor, Llama4VisionModel

>>> checkpoint = "meta-llama/Llama-4-Scout-17B-16E-Instruct"
>>> model = Llama4VisionModel.from_pretrained(checkpoint)
>>> image_processor = AutoImageProcessor.from_pretrained(checkpoint)

>>> url = "https://www.ilankelman.org/stopsigns/australia.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> inputs = image_processor(images=image, return_tensors="pt")

>>> output = model(pixel_values=inputs["pixel_values"])
>>> print(output.last_hidden_state.shape)

get_input_embeddings

< >

( )

This function is used to fetch the first embedding layer to activate grads on inputs.
