Serialization

huggingface_hub contains helpers to help ML libraries serialize model weights in a standardized way. This part of the library is still under development and will be improved in future releases. The goal is to harmonize how weights are serialized on the Hub, both to remove code duplication across libraries and to foster conventions on the Hub.

Save torch state dict

The main helper of the serialization module takes a state dictionary as input (i.e. a mapping between layer names and related tensors), splits it into several shards while creating a proper index in the process, and saves everything to disk. At the moment, only torch tensors are supported. Under the hood, it delegates the logic to split the state dictionary to split_torch_state_dict_into_shards().

huggingface_hub.save_torch_state_dict

( state_dict: Dict[str, torch.Tensor], save_directory: Union[str, Path], safe_serialization: bool = True, filename_pattern: Optional[str] = None, max_shard_size: Union[int, str] = '5GB' )

Parameters

  • state_dict (Dict[str, torch.Tensor]) — The state dictionary to save.
  • save_directory (str or Path) — The directory in which the model will be saved.
  • safe_serialization (bool, optional) — Whether to save as safetensors, which is the default behavior. If False, the shards are saved as pickle. Safe serialization is recommended for security reasons. Saving as pickle is deprecated and will be removed in a future version.
  • filename_pattern (str, optional) — The pattern used to generate the file names in which the model will be saved. The pattern must be a string that can be formatted with filename_pattern.format(suffix=...) and must contain the keyword suffix. Defaults to "model{suffix}.safetensors" or "pytorch_model{suffix}.bin" depending on the safe_serialization parameter.
  • max_shard_size (int or str, optional) — The maximum size of each shard, in bytes. Defaults to 5GB.

Save a model state dictionary to the disk.

The model state dictionary is split into shards so that each shard is smaller than a given size. The shards are saved in the save_directory with the given filename_pattern. If the model is too big to fit in a single shard, an index file is saved in the save_directory to indicate where each tensor is saved. This helper uses split_torch_state_dict_into_shards() under the hood. If safe_serialization is True, the shards are saved as safetensors (the default). Otherwise, the shards are saved as pickle.

Before saving the model, the save_directory is cleaned of any previous shard files.

If one of the model’s tensors is bigger than max_shard_size, it will end up in its own shard which will have a size greater than max_shard_size.

Example:

>>> from huggingface_hub import save_torch_state_dict
>>> model = ... # A PyTorch model

# Save state dict to "path/to/folder". The model will be split into shards of 5GB each and saved as safetensors.
>>> state_dict = model.state_dict()
>>> save_torch_state_dict(state_dict, "path/to/folder")
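
The documented filename_pattern and max_shard_size parameters can also be overridden. The snippet below is a minimal sketch based on the parameters described above; the folder path, pattern and shard size are arbitrary placeholders.

# Save with smaller shards and a custom file name pattern.
>>> save_torch_state_dict(
...     state_dict,
...     "path/to/folder",
...     filename_pattern="my_model{suffix}.safetensors",
...     max_shard_size="2GB",
... )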

Split state dict into shards

The serialization module also contains low-level helpers to split a state dictionary into several shards, while creating a proper index in the process. These helpers are available for torch and tensorflow tensors and are designed to be easily extended to any other ML framework.

split_tf_state_dict_into_shards

huggingface_hub.split_tf_state_dict_into_shards

( state_dict: Dict[str, Tensor], filename_pattern: str = 'tf_model{suffix}.h5', max_shard_size: Union[int, str] = '5GB' ) → StateDictSplit

Parameters

  • state_dict (Dict[str, Tensor]) — The state dictionary to save.
  • filename_pattern (str, optional) — The pattern used to generate the file names in which the model will be saved. The pattern must be a string that can be formatted with filename_pattern.format(suffix=...) and must contain the keyword suffix. Defaults to "tf_model{suffix}.h5".
  • max_shard_size (int or str, optional) — The maximum size of each shard, in bytes. Defaults to 5GB.

Returns

StateDictSplit

A StateDictSplit object containing the shards and the index to retrieve them.

Split a model state dictionary into shards so that each shard is smaller than a given size.

The shards are determined by iterating through the state_dict in the order of its keys. There is no optimization made to make each shard as close as possible to the maximum size passed. For example, if the limit is 10GB and we have tensors of sizes [6GB, 6GB, 2GB, 6GB, 2GB, 2GB] they will get sharded as [6GB], [6+2GB], [6+2+2GB] and not [6+2+2GB], [6+2GB], [6GB].

If one of the model’s tensors is bigger than max_shard_size, it will end up in its own shard which will have a size greater than max_shard_size.
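
As a minimal sketch of how this helper can be used (the layer names, tensor shapes and shard size below are placeholders, and writing each shard to an .h5 file is left to the calling framework):

>>> import tensorflow as tf
>>> from huggingface_hub import split_tf_state_dict_into_shards

>>> state_dict = {"dense/kernel": tf.zeros((1024, 1024)), "dense/bias": tf.zeros((1024,))}
>>> state_dict_split = split_tf_state_dict_into_shards(state_dict, max_shard_size=1_000_000)  # ~1MB, to force sharding in this tiny example
>>> state_dict_split.is_sharded           # True if more than one shard was produced
>>> state_dict_split.filename_to_tensors  # mapping of shard file name -> tensor names
>>> state_dict_split.tensor_to_filename   # reverse mapping, used to build the index file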

split_torch_state_dict_into_shards

huggingface_hub.split_torch_state_dict_into_shards

( state_dict: Dict[str, torch.Tensor], filename_pattern: str = 'model{suffix}.safetensors', max_shard_size: Union[int, str] = '5GB' ) → StateDictSplit

Parameters

  • state_dict (Dict[str, torch.Tensor]) — The state dictionary to save.
  • filename_pattern (str, optional) — The pattern used to generate the file names in which the model will be saved. The pattern must be a string that can be formatted with filename_pattern.format(suffix=...) and must contain the keyword suffix. Defaults to "model{suffix}.safetensors".
  • max_shard_size (int or str, optional) — The maximum size of each shard, in bytes. Defaults to 5GB.

Returns

StateDictSplit

A StateDictSplit object containing the shards and the index to retrieve them.

Split a model state dictionary into shards so that each shard is smaller than a given size.

The shards are determined by iterating through the state_dict in the order of its keys. There is no optimization made to make each shard as close as possible to the maximum size passed. For example, if the limit is 10GB and we have tensors of sizes [6GB, 6GB, 2GB, 6GB, 2GB, 2GB] they will get sharded as [6GB], [6+2GB], [6+2+2GB] and not [6+2+2GB], [6+2GB], [6GB].

To save a model state dictionary to disk, see save_torch_state_dict(), which uses this helper under the hood.

If one of the model’s tensors is bigger than max_shard_size, it will end up in its own shard which will have a size greater than max_shard_size.

Example:

>>> import json
>>> import os
>>> from typing import Dict
>>> import torch
>>> from safetensors.torch import save_file as safe_save_file
>>> from huggingface_hub import split_torch_state_dict_into_shards

>>> def save_state_dict(state_dict: Dict[str, torch.Tensor], save_directory: str):
...     state_dict_split = split_torch_state_dict_into_shards(state_dict)
...     for filename, tensors in state_dict_split.filename_to_tensors.items():
...         shard = {tensor: state_dict[tensor] for tensor in tensors}
...         safe_save_file(
...             shard,
...             os.path.join(save_directory, filename),
...             metadata={"format": "pt"},
...         )
...     if state_dict_split.is_sharded:
...         index = {
...             "metadata": state_dict_split.metadata,
...             "weight_map": state_dict_split.tensor_to_filename,
...         }
...         with open(os.path.join(save_directory, "model.safetensors.index.json"), "w") as f:
...             f.write(json.dumps(index, indent=2))
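
Once defined, the helper above can be called on any PyTorch state dict (the model and path below are placeholders):

>>> model = ...  # a PyTorch model
>>> save_state_dict(model.state_dict(), "path/to/folder")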

split_state_dict_into_shards_factory

This is the underlying factory from which each framework-specific helper is derived. In practice, you are not expected to use this factory directly, unless you need to adapt it to a framework that is not yet supported. If that is the case, please let us know by opening a new issue on the huggingface_hub repo.

huggingface_hub.split_state_dict_into_shards_factory

( state_dict: Dict[str, Tensor], get_tensor_size: Callable, filename_pattern: str, get_storage_id: Callable = <lambda>, max_shard_size: Union[int, str] = '5GB' ) → StateDictSplit

Parameters

  • state_dict (Dict[str, Tensor]) — The state dictionary to save.
  • get_tensor_size (Callable[[Tensor], int]) — A function that returns the size of a tensor in bytes.
  • get_storage_id (Callable[[Tensor], Optional[Any]], optional) — A function that returns a unique identifier to a tensor storage. Multiple different tensors can share the same underlying storage. This identifier is guaranteed to be unique and constant for this tensor’s storage during its lifetime. Two tensor storages with non-overlapping lifetimes may have the same id.
  • filename_pattern (str, optional) — The pattern used to generate the file names in which the model will be saved. The pattern must be a string that can be formatted with filename_pattern.format(suffix=...) and must contain the keyword suffix.
  • max_shard_size (int or str, optional) — The maximum size of each shard, in bytes. Defaults to 5GB.

Returns

StateDictSplit

A StateDictSplit object containing the shards and the index to retrieve them.

Split a model state dictionary into shards so that each shard is smaller than a given size.

The shards are determined by iterating through the state_dict in the order of its keys. There is no optimization made to make each shard as close as possible to the maximum size passed. For example, if the limit is 10GB and we have tensors of sizes [6GB, 6GB, 2GB, 6GB, 2GB, 2GB] they will get sharded as [6GB], [6+2GB], [6+2+2GB] and not [6+2+2GB], [6+2GB], [6GB].

If one of the model’s tensors is bigger than max_shard_size, it will end up in its own shard which will have a size greater than max_shard_size.
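
As an illustration only, the sketch below adapts the factory to a dictionary of NumPy arrays. The get_numpy_tensor_size helper and the .npy filename pattern are hypothetical choices made for this example and are not part of the library.

>>> import numpy as np
>>> from huggingface_hub import split_state_dict_into_shards_factory

>>> def get_numpy_tensor_size(tensor: np.ndarray) -> int:
...     return tensor.nbytes  # size of the array in bytes

>>> state_dict = {"layer.weight": np.zeros((1024, 1024), dtype=np.float32)}  # placeholder
>>> state_dict_split = split_state_dict_into_shards_factory(
...     state_dict,
...     get_tensor_size=get_numpy_tensor_size,
...     filename_pattern="model{suffix}.npy",  # must contain the {suffix} keyword
... )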

Helpers

get_torch_storage_id

huggingface_hub.get_torch_storage_id

( tensor: torch.Tensor )

Return a unique identifier to a tensor storage.

Multiple different tensors can share the same underlying storage. For example, “meta” tensors all share the same storage, and thus their identifier will all be equal. This identifier is guaranteed to be unique and constant for this tensor’s storage during its lifetime. Two tensor storages with non-overlapping lifetimes may have the same id.

Taken from https://github.com/huggingface/transformers/blob/1ecf5f7c982d761b4daaa96719d162c324187c64/src/transformers/pytorch_utils.py#L278.
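
For example, a tensor and a view on it share the same underlying storage, so this minimal sketch (with arbitrary shapes) would be expected to return the same identifier for both:

>>> import torch
>>> from huggingface_hub import get_torch_storage_id

>>> base = torch.zeros(10)
>>> view = base[:5]  # a view shares the storage of the base tensor
>>> get_torch_storage_id(base) == get_torch_storage_id(view)  # expected to be True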
