Loading models
Transformers provides many pretrained models that are ready to use with a single line of code. All you need is a model class and its from_pretrained() method.
Call from_pretrained() to download and load a model’s weights and configuration stored on the Hugging Face Hub.
The from_pretrained() method loads weights stored in the safetensors file format if they’re available. Traditionally, PyTorch model weights are serialized with the pickle utility, which is known to be insecure. Safetensors files are more secure and faster to load.
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", dtype="auto", device_map="auto")
This guide explains how models are loaded, the different ways you can load a model, how to overcome memory issues for really big models, and how to load custom models.
Models and configurations
All models have a configuration.py file with specific attributes like the number of hidden layers, vocabulary size, activation function, and more. You’ll also find a modeling.py file that defines the layers and mathematical operations taking place inside each layer. The modeling.py file takes the model attributes in configuration.py and builds the model accordingly. At this point, you have a model with random weights that needs to be trained to output meaningful results.
An architecture refers to the model’s skeleton and a checkpoint refers to the model’s weights for a given architecture. For example, BERT is an architecture while google-bert/bert-base-uncased is a checkpoint. You’ll see the term model used interchangeably with architecture and checkpoint.
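The relationship between a configuration, an architecture, and a checkpoint can be illustrated with a toy sketch in plain Python. ToyConfig, build_model, and load_checkpoint are hypothetical names invented for this illustration and are not part of Transformers. The weights are plain lists of floats rather than tensors.

```python
import random
from dataclasses import dataclass


@dataclass
class ToyConfig:
    """Hypothetical stand-in for a configuration class (not a Transformers class)."""
    vocab_size: int = 8
    hidden_size: int = 4
    num_hidden_layers: int = 2


def build_model(config: ToyConfig, seed: int = 0) -> dict:
    """Build the architecture described by the config, filled with random weights."""
    rng = random.Random(seed)

    def rand(n: int) -> list:
        return [rng.uniform(-0.1, 0.1) for _ in range(n)]

    weights = {"embed.weight": rand(config.vocab_size * config.hidden_size)}
    for i in range(config.num_hidden_layers):
        weights[f"layers.{i}.weight"] = rand(config.hidden_size * config.hidden_size)
    return weights


def load_checkpoint(model: dict, checkpoint: dict) -> None:
    """Overwrite the random weights with checkpoint values for the same architecture."""
    if model.keys() != checkpoint.keys():
        raise ValueError("checkpoint does not match the architecture")
    for name, values in checkpoint.items():
        if len(values) != len(model[name]):
            raise ValueError(f"size mismatch for {name}")
        model[name] = list(values)


config = ToyConfig()
model = build_model(config)  # architecture with random weights
# Pretend these are trained weights saved for the same architecture.
checkpoint = {name: [0.5] * len(w) for name, w in model.items()}
load_checkpoint(model, checkpoint)  # architecture with checkpoint weights
print(sorted(model))
```

The same architecture can be paired with many different checkpoints, as long as the parameter names and sizes match.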
There are two general types of models you can load:
- A barebones model, like AutoModel or LlamaModel, that outputs hidden states.
- A model with a specific head attached, like AutoModelForCausalLM or LlamaForCausalLM, for performing specific tasks.
Model classes
To get a pretrained model, you need to load the weights into the model. This is done by calling from_pretrained() which accepts weights from the Hugging Face Hub or a local directory.
There are two kinds of model classes: the AutoModel class and model-specific classes.
The AutoModel class is a convenient way to load an architecture without needing to know the exact model class name because there are many models available. It automatically selects the correct model class based on the configuration file. You only need to know the task and checkpoint you want to use.
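The idea of selecting a class from the configuration can be sketched as a lookup table keyed by the architecture name. This is a simplified illustration only: the registry, the toy classes, and resolve_model_class are hypothetical and do not reflect how Transformers implements the lookup internally. The sketch assumes the configuration exposes a model_type field.

```python
class ToyLlamaForCausalLM:
    """Hypothetical placeholder for a model-specific class."""


class ToyMistralForCausalLM:
    """Hypothetical placeholder for a model-specific class."""


# Hypothetical registry for one task: architecture name -> model class.
CAUSAL_LM_REGISTRY = {
    "llama": ToyLlamaForCausalLM,
    "mistral": ToyMistralForCausalLM,
}


def resolve_model_class(config: dict, registry: dict) -> type:
    """Pick the model class for a task based on the config's architecture name."""
    model_type = config["model_type"]
    try:
        return registry[model_type]
    except KeyError:
        raise ValueError(
            f"architecture {model_type!r} is not supported for this task"
        ) from None


model_class = resolve_model_class({"model_type": "llama"}, CAUSAL_LM_REGISTRY)
print(model_class.__name__)
```

A separate registry per task explains why an architecture must be supported for a given task before it can be loaded through the corresponding Auto class.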
AutoModel classes make it easy to switch between models or tasks, as long as the architecture is supported for a given task.
For example, the same model can be used for separate tasks.
from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification, AutoModelForQuestionAnswering
# use the same API for 3 different tasks
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForSequenceClassification.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForQuestionAnswering.from_pretrained("meta-llama/Llama-2-7b-hf")
In other cases, you may want to quickly try out several different models for a task.
from transformers import AutoModelForCausalLM
# use the same API to load 3 different models
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
model = AutoModelForCausalLM.from_pretrained("google/gemma-7b")
Large models
Large pretrained models require a lot of memory to load. The loading process involves:
- creating a model with random weights
- loading the pretrained weights
- placing the pretrained weights on the model
You need enough memory to hold two copies of the model weights (random and pretrained) which may not be possible depending on your hardware. In distributed training environments, this is even more challenging because each process loads a pretrained model.
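The cost of holding two copies can be estimated with simple arithmetic. The sketch below is a rough estimate that counts weights only. It assumes a round figure of 7 billion parameters and 4 bytes per parameter (torch.float32), and weights_memory_gb is a hypothetical helper, not a Transformers function.

```python
def weights_memory_gb(num_params: int, bytes_per_param: int = 4, copies: int = 1) -> float:
    """Estimate the memory needed for model weights alone, in gigabytes (1e9 bytes)."""
    return num_params * bytes_per_param * copies / 1e9


NUM_PARAMS = 7_000_000_000  # assumed round figure for a "7B" model

one_copy = weights_memory_gb(NUM_PARAMS)              # 28.0
two_copies = weights_memory_gb(NUM_PARAMS, copies=2)  # 56.0

print(f"one copy: {one_copy} GB, two copies: {two_copies} GB")
```

Under these assumptions, naive loading briefly needs about 56 GB for a model whose weights occupy 28 GB.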
Transformers reduces some of these memory-related challenges with fast initialization, sharded checkpoints, Accelerate’s Big Model Inference feature, and supporting lower bit data types.
Sharded checkpoints
The save_pretrained() method automatically shards a checkpoint when it is larger than the max_shard_size limit.
Shards are loaded one at a time, which limits peak memory usage to the model size plus the size of the largest shard.
The max_shard_size parameter defaults to 5GB for each shard because it is easier to run on free-tier GPU instances without running out of memory.
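To make the idea of sharding concrete, here is a minimal sketch of a greedy split that also builds an index with metadata and weight_map keys. It is a simplified illustration, not the implementation used by save_pretrained(). The shard_checkpoint function and the tensor names and sizes are hypothetical, and sizes are given in arbitrary units instead of bytes.

```python
def shard_checkpoint(tensor_sizes: dict, max_shard_size: int) -> dict:
    """Greedily group tensors into shards and return an index describing them."""
    shards = [[]]
    current_size = 0
    for name, size in tensor_sizes.items():
        # Start a new shard when this tensor would push the current one over
        # the limit. A tensor larger than the limit still gets its own shard.
        if shards[-1] and current_size + size > max_shard_size:
            shards.append([])
            current_size = 0
        shards[-1].append(name)
        current_size += size

    total = len(shards)
    weight_map = {}
    for i, names in enumerate(shards, start=1):
        filename = f"model-{i:05d}-of-{total:05d}.safetensors"
        for name in names:
            weight_map[name] = filename

    return {
        "metadata": {"total_size": sum(tensor_sizes.values())},
        "weight_map": weight_map,
    }


# Hypothetical tensor sizes in arbitrary units.
sizes = {"embed.weight": 4, "layers.0.weight": 3, "layers.1.weight": 3, "lm_head.weight": 4}
index = shard_checkpoint(sizes, max_shard_size=7)
print(index["weight_map"])
```

With a limit of 7 units, the first two tensors fill the first shard and the remaining two go into the second.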
For example, create a sharded checkpoint for BioMistral/BioMistral-7B with save_pretrained().
from transformers import AutoModel
import tempfile
import os
model = AutoModel.from_pretrained("BioMistral/BioMistral-7B")
with tempfile.TemporaryDirectory() as tmp_dir:
    model.save_pretrained(tmp_dir, max_shard_size="5GB")
    print(sorted(os.listdir(tmp_dir)))
Reload the sharded checkpoint with from_pretrained().
with tempfile.TemporaryDirectory() as tmp_dir:
    model.save_pretrained(tmp_dir)
    new_model = AutoModel.from_pretrained(tmp_dir)
Sharded checkpoints can also be directly loaded with load_sharded_checkpoint().
from transformers.modeling_utils import load_sharded_checkpoint
with tempfile.TemporaryDirectory() as tmp_dir:
    model.save_pretrained(tmp_dir, max_shard_size="5GB")
    load_sharded_checkpoint(model, tmp_dir)
The save_pretrained() method creates an index file that maps parameter names to the files they’re stored in. The index file has two keys, metadata and weight_map.
import json
with tempfile.TemporaryDirectory() as tmp_dir:
    model.save_pretrained(tmp_dir, max_shard_size="5GB")
    with open(os.path.join(tmp_dir, "model.safetensors.index.json"), "r") as f:
        index = json.load(f)
print(index.keys())
The metadata key provides the total model size.
index["metadata"]
{'total_size': 28966928384}
The weight_map key maps each parameter to the shard it’s stored in.
index["weight_map"]
{'lm_head.weight': 'model-00006-of-00006.safetensors',
 'model.embed_tokens.weight': 'model-00001-of-00006.safetensors',
 'model.layers.0.input_layernorm.weight': 'model-00001-of-00006.safetensors',
 'model.layers.0.mlp.down_proj.weight': 'model-00001-of-00006.safetensors',
 ...
}
Big Model Inference
Make sure you have Accelerate v0.9.0 and PyTorch v1.9.0 or later installed to use this feature!
from_pretrained() is supercharged with Accelerate’s Big Model Inference feature.
Big Model Inference creates a model skeleton on the PyTorch meta device. The meta device doesn’t store any real data, only the metadata.
Randomly initialized weights are only created when the pretrained weights are loaded to avoid maintaining two copies of the model in memory at the same time. The maximum memory usage is only the size of the model.
Learn more about device placement in Designing a device map.
Big Model Inference’s second feature relates to how weights are loaded and dispatched in the model skeleton. Model weights are dispatched across all available devices, starting with the fastest device (usually the GPU) and then offloading any remaining weights to slower devices (CPU and hard drive).
Combined, both features reduce memory usage and loading times for big pretrained models.
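The dispatch order described above (fastest device first, then slower ones) can be sketched as a greedy assignment. This is a simplified illustration and not Accelerate's actual algorithm. The assign_devices function, the layer sizes, and the memory budgets are all hypothetical, and the last device in the list is assumed to have unlimited capacity.

```python
def assign_devices(layer_sizes: dict, budgets: dict) -> dict:
    """Assign layers to devices in order, spilling over to slower devices when full.

    `budgets` lists devices from fastest to slowest. The last device acts as a
    fallback and is assumed to fit whatever remains.
    """
    devices = list(budgets)
    remaining = dict(budgets)
    device_map = {}
    idx = 0
    for name, size in layer_sizes.items():
        # Move on to the next slower device while the current one cannot fit the layer.
        while idx < len(devices) - 1 and remaining[devices[idx]] < size:
            idx += 1
        device = devices[idx]
        remaining[device] -= size
        device_map[name] = device
    return device_map


# Hypothetical sizes and budgets in arbitrary units: GPU 0, then CPU, then disk.
layers = {f"model.layers.{i}": 2 for i in range(5)}
budgets = {0: 5, "cpu": 4, "disk": float("inf")}

placement = assign_devices(layers, budgets)
print(placement)
```

The sketch never moves back to a faster device once it has spilled over, which keeps consecutive layers on the same device.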
Set device_map to "auto" to enable Big Model Inference.
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("google/gemma-7b", device_map="auto")
You can also manually assign layers to a device in device_map. It should map all model parameters to a device, but you don’t have to detail where all the submodules of a layer go if the entire layer is on the same device.
device_map = {"model.layers.1": 0, "model.layers.14": 1, "model.layers.31": "cpu", "lm_head": "disk"}
Access the hf_device_map attribute to see how a model is distributed across devices.
model.hf_device_map
Model data type
PyTorch model weights are initialized in torch.float32 by default. Loading a model in a different data type, like torch.float16, requires additional memory because the model is loaded again in the desired data type.
Explicitly set the dtype parameter to directly initialize the model in the desired data type instead of loading the weights twice (torch.float32 then torch.float16). You could also set dtype="auto" to automatically load the weights in the data type they are stored in.
import torch
from transformers import AutoModelForCausalLM
gemma = AutoModelForCausalLM.from_pretrained("google/gemma-7b", dtype=torch.float16)
The dtype parameter can also be configured in AutoConfig for models instantiated from scratch.
import torch
from transformers import AutoConfig, AutoModel
my_config = AutoConfig.from_pretrained("google/gemma-2b", dtype=torch.float16)
model = AutoModel.from_config(my_config)
Custom models
Custom models build on Transformers’ configuration and modeling classes, support the AutoClass API, and are loaded with from_pretrained(). The difference is that the modeling code is not from Transformers.
Take extra precaution when loading a custom model. While the Hub includes malware scanning for every repository, you should still be careful to avoid inadvertently executing malicious code.
Set trust_remote_code=True in from_pretrained() to load a custom model.
from transformers import AutoModelForImageClassification
model = AutoModelForImageClassification.from_pretrained("sgugger/custom-resnet50d", trust_remote_code=True)
As an extra layer of security, load a custom model from a specific revision to avoid loading model code that may have changed. The commit hash can be copied from the model’s commit history.
commit_hash = "ed94a7c6247d8aedce4647f00f20de6875b5b292"
model = AutoModelForImageClassification.from_pretrained(
    "sgugger/custom-resnet50d", trust_remote_code=True, revision=commit_hash
)
Refer to the Customize models guide for more information.
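A branch name such as main can move to new commits, whereas a full commit hash always identifies the same snapshot. As a small optional safeguard, you could check that a revision looks like a full commit hash before passing it on. The sketch below assumes a 40-character hexadecimal hash like the one in the example above. is_pinned_revision is a hypothetical helper, not part of Transformers.

```python
import re

# Assumption: a full commit hash is 40 lowercase hexadecimal characters.
_FULL_COMMIT_HASH = re.compile(r"[0-9a-f]{40}")


def is_pinned_revision(revision: str) -> bool:
    """Return True if the revision looks like a full commit hash, not a branch or tag."""
    return _FULL_COMMIT_HASH.fullmatch(revision) is not None


print(is_pinned_revision("ed94a7c6247d8aedce4647f00f20de6875b5b292"))
print(is_pinned_revision("main"))
```

The check only validates the format of the string. It does not confirm that the commit exists in the repository.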