Hub Python Library documentation

Cache-system reference

You are viewing v0.11.0 version. A newer version v0.27.1 is available.

The caching system was updated in v0.8.0 to become the central cache-system shared across libraries that depend on the Hub. Read the cache-system guide for a detailed presentation of caching at Hugging Face.

Helpers

cached_assets_path

huggingface_hub.cached_assets_path

( library_name: str namespace: str = 'default' subfolder: str = 'default' assets_dir: typing.Union[str, pathlib.Path, NoneType] = None )

Parameters

  • library_name (str) — Name of the library that will manage the cache folder. Example: "dataset".
  • namespace (str, optional, defaults to “default”) — Namespace to which the data belongs. Example: "SQuAD".
  • subfolder (str, optional, defaults to “default”) — Subfolder in which the data will be stored. Example: extracted.
  • assets_dir (str, Path, optional) — Path to the folder where assets are cached. This must not be the same folder where Hub files are cached. Defaults to HF_HOME / "assets" if not provided. Can also be set with the HUGGINGFACE_ASSETS_CACHE environment variable.

Return a folder path to cache arbitrary files.

huggingface_hub provides a canonical folder path to store assets. This is the recommended way to integrate a cache in a downstream library, as it benefits from the built-in tools to scan and delete the cache properly.

A distinction is made between files cached from the Hub and assets. Files from the Hub are cached in a git-aware manner and entirely managed by huggingface_hub. See related documentation. All other files that a downstream library caches are considered to be “assets” (files downloaded from external sources, extracted from a .tar archive, preprocessed for training,…).

Once the folder path is generated, it is guaranteed to exist and to be a directory. The path is based on 3 levels of depth: the library name, a namespace and a subfolder. These 3 levels grant flexibility while allowing huggingface_hub to expect folders when scanning/deleting parts of the assets cache. Within a library, all namespaces are expected to share the same subset of subfolder names, but this is not a mandatory rule. The downstream library then has full control over which file structure to adopt within its cache. Namespace and subfolder are optional (each defaults to a "default/" folder) but the library name is mandatory, as every downstream library should manage its own cache.

Expected tree:

    assets/
    └── datasets/
    │   ├── SQuAD/
    │   │   ├── downloaded/
    │   │   ├── extracted/
    │   │   └── processed/
    │   ├── Helsinki-NLP--tatoeba_mt/
    │       ├── downloaded/
    │       ├── extracted/
    │       └── processed/
    └── transformers/
        ├── default/
        │   ├── something/
        ├── bert-base-cased/
        │   ├── default/
        │   └── training/
    hub/
    └── models--julien-c--EsperBERTo-small/
        ├── blobs/
        │   ├── (...)
        │   ├── (...)
        ├── refs/
        │   └── (...)
        └── [ 128]  snapshots/
            ├── 2439f60ef33a0d46d85da5001d52aeda5b00ce9f/
            │   ├── (...)
            └── bbc77c8132af1cc5cf678da3f1ddf2de43606d48/
                └── (...)

Example:

>>> from huggingface_hub import cached_assets_path

>>> cached_assets_path(library_name="datasets", namespace="SQuAD", subfolder="download")
PosixPath('/home/wauplin/.cache/huggingface/assets/datasets/SQuAD/download')

>>> cached_assets_path(library_name="datasets", namespace="SQuAD", subfolder="extracted")
PosixPath('/home/wauplin/.cache/huggingface/assets/datasets/SQuAD/extracted')

>>> cached_assets_path(library_name="datasets", namespace="Helsinki-NLP/tatoeba_mt")
PosixPath('/home/wauplin/.cache/huggingface/assets/datasets/Helsinki-NLP--tatoeba_mt/default')

>>> cached_assets_path(library_name="datasets", assets_dir="/tmp/tmp123456")
PosixPath('/tmp/tmp123456/datasets/default/default')
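
The path construction shown in the examples above can be approximated with the standard library alone. The following is a minimal sketch, not the library's implementation: sketch_assets_path is a hypothetical helper, and the replacement of "/" by "--" is only inferred from the Helsinki-NLP/tatoeba_mt example.

```python
import tempfile
from pathlib import Path


def sketch_assets_path(assets_dir, library_name, namespace="default", subfolder="default"):
    """Hypothetical stand-in for cached_assets_path, for illustration only."""
    # Assumption: "/" in any part is replaced by "--" so each part stays a single folder.
    parts = [part.replace("/", "--") for part in (library_name, namespace, subfolder)]
    path = Path(assets_dir).joinpath(*parts)
    # The returned folder is created if needed, so it always exists.
    path.mkdir(parents=True, exist_ok=True)
    return path


with tempfile.TemporaryDirectory() as tmp:
    path = sketch_assets_path(tmp, "datasets", namespace="Helsinki-NLP/tatoeba_mt")
    relative = path.relative_to(tmp).as_posix()
    is_dir = path.is_dir()

print(relative)  # datasets/Helsinki-NLP--tatoeba_mt/default
```

In a real integration, call cached_assets_path directly so that the assets land in the shared cache and can be scanned and deleted with the built-in tools.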

scan_cache_dir

huggingface_hub.scan_cache_dir

( cache_dir: typing.Union[str, pathlib.Path, NoneType] = None )

Parameters

  • cache_dir (str or Path, optional) — Cache directory to scan. Defaults to the default HF cache directory.

Raises

CacheNotFound or ValueError

  • CacheNotFound — If the cache directory does not exist.

  • ValueError — If the cache directory is a file, instead of a directory.

Scan the entire HF cache-system and return an HFCacheInfo structure.

Use scan_cache_dir to scan your cache-system programmatically. The cache is scanned repo by repo. If a repo is corrupted, a CorruptedCacheException is raised internally, then caught and returned in the HFCacheInfo structure. Only valid repos get a proper report.

>>> from huggingface_hub import scan_cache_dir

>>> hf_cache_info = scan_cache_dir()
>>> hf_cache_info
HFCacheInfo(
    size_on_disk=3398085269,
    repos=frozenset({
        CachedRepoInfo(
            repo_id='t5-small',
            repo_type='model',
            repo_path=PosixPath(...),
            size_on_disk=970726914,
            nb_files=11,
            revisions=frozenset({
                CachedRevisionInfo(
                    commit_hash='d78aea13fa7ecd06c29e3e46195d6341255065d5',
                    size_on_disk=970726339,
                    snapshot_path=PosixPath(...),
                    files=frozenset({
                        CachedFileInfo(
                            file_name='config.json',
                            size_on_disk=1197,
                            file_path=PosixPath(...),
                            blob_path=PosixPath(...),
                        ),
                        CachedFileInfo(...),
                        ...
                    }),
                ),
                CachedRevisionInfo(...),
                ...
            }),
        ),
        CachedRepoInfo(...),
        ...
    }),
    warnings=[
        CorruptedCacheException("Snapshots dir doesn't exist in cached repo: ..."),
        CorruptedCacheException(...),
        ...
    ],
)

You can also print a detailed report directly from the huggingface-cli using:

> huggingface-cli scan-cache
REPO ID                     REPO TYPE SIZE ON DISK NB FILES REFS                LOCAL PATH
--------------------------- --------- ------------ -------- ------------------- -------------------------------------------------------------------------
glue                        dataset         116.3K       15 1.17.0, main, 2.4.0 /Users/lucain/.cache/huggingface/hub/datasets--glue
google/fleurs               dataset          64.9M        6 main, refs/pr/1     /Users/lucain/.cache/huggingface/hub/datasets--google--fleurs
Jean-Baptiste/camembert-ner model           441.0M        7 main                /Users/lucain/.cache/huggingface/hub/models--Jean-Baptiste--camembert-ner
bert-base-cased             model             1.9G       13 main                /Users/lucain/.cache/huggingface/hub/models--bert-base-cased
t5-base                     model            10.1K        3 main                /Users/lucain/.cache/huggingface/hub/models--t5-base
t5-small                    model           970.7M       11 refs/pr/1, main     /Users/lucain/.cache/huggingface/hub/models--t5-small

Done in 0.0s. Scanned 6 repo(s) for a total of 3.4G.
Got 1 warning(s) while scanning. Use -vvv to print details.

Returns: an HFCacheInfo object.
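
The "scan repo by repo and collect warnings" behavior described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the library's code: SketchCorruptedCache, sketch_scan_repo and sketch_scan are hypothetical names, the folder-name parsing is inferred from the LOCAL PATH column of the CLI report, and turning every "--" back into "/" is a simplification.

```python
import tempfile
from pathlib import Path


class SketchCorruptedCache(Exception):
    """Hypothetical stand-in for CorruptedCacheException."""


def sketch_scan_repo(repo_path):
    # Folder names look like "<type>s--<repo id>", e.g. "datasets--google--fleurs".
    if "--" not in repo_path.name:
        raise SketchCorruptedCache(f"Not a valid cache folder name: {repo_path}")
    repo_type, repo_id = repo_path.name.split("--", maxsplit=1)
    repo_type = repo_type[:-1]  # "models" -> "model"
    repo_id = repo_id.replace("--", "/")  # simplification, see lead-in
    if not (repo_path / "snapshots").is_dir():
        raise SketchCorruptedCache(f"Snapshots dir doesn't exist in cached repo: {repo_path}")
    return repo_type, repo_id


def sketch_scan(cache_dir):
    repos, warnings = [], []
    for repo_path in sorted(Path(cache_dir).iterdir()):
        try:
            repos.append(sketch_scan_repo(repo_path))
        except SketchCorruptedCache as error:
            # A corrupted repo is reported as a warning instead of stopping the scan.
            warnings.append(error)
    return repos, warnings


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "datasets--google--fleurs" / "snapshots").mkdir(parents=True)
    (Path(tmp) / "models--t5-small").mkdir()  # no snapshots folder: treated as corrupted
    repos, warnings = sketch_scan(tmp)

print(repos)          # [('dataset', 'google/fleurs')]
print(len(warnings))  # 1
```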

Data structures

All structures are built and returned by scan_cache_dir() and are immutable.

HFCacheInfo

class huggingface_hub.HFCacheInfo

( size_on_disk: int repos: typing.FrozenSet[huggingface_hub.utils._cache_manager.CachedRepoInfo] warnings: typing.List[huggingface_hub.utils._cache_manager.CorruptedCacheException] )

Parameters

  • size_on_disk (int) — Sum of all valid repo sizes in the cache-system.
  • repos (FrozenSet[CachedRepoInfo]) — Set of CachedRepoInfo describing all valid cached repos found on the cache-system while scanning.
  • warnings (List[CorruptedCacheException]) — List of CorruptedCacheException that occurred while scanning the cache. Those exceptions are captured so that the scan can continue. Corrupted repos are skipped from the scan.

Frozen data structure holding information about the entire cache-system.

This data structure is returned by scan_cache_dir() and is immutable.

Here size_on_disk is equal to the sum of all repo sizes (only blobs). However if some cached repos are corrupted, their sizes are not taken into account.

delete_revisions

( *revisions: str )

Prepare the strategy to delete one or more revisions cached locally.

Input revisions can be any revision hash. If a revision hash is not found in the local cache, a warning is emitted but no error is raised. Revisions can be from different cached repos since hashes are unique across repos.

Examples:

>>> from huggingface_hub import scan_cache_dir
>>> cache_info = scan_cache_dir()
>>> delete_strategy = cache_info.delete_revisions(
...     "81fd1d6e7847c99f5862c9fb81387956d99ec7aa"
... )
>>> print(f"Will free {delete_strategy.expected_freed_size_str}.")
Will free 7.9K.
>>> delete_strategy.execute()
Cache deletion done. Saved 7.9K.

>>> from huggingface_hub import scan_cache_dir
>>> scan_cache_dir().delete_revisions(
...     "81fd1d6e7847c99f5862c9fb81387956d99ec7aa",
...     "e2983b237dccf3ab4937c97fa717319a9ca1a96d",
...     "6c0e6080953db56375760c0471a8c5f2929baf11",
... ).execute()
Cache deletion done. Saved 8.6G.

delete_revisions returns a DeleteCacheStrategy object that needs to be executed. The DeleteCacheStrategy is not meant to be modified but allows having a dry run before actually executing the deletion.
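
To see why a dry run is useful, note that blobs can be shared across revisions, so deleting a revision does not necessarily free all the files it links to. The sketch below illustrates one way such a strategy could be computed. SketchDeleteStrategy, sketch_delete_revisions and the toy data are hypothetical, and the real DeleteCacheStrategy holds more fields than shown here.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class SketchDeleteStrategy:
    """Hypothetical, simplified stand-in for DeleteCacheStrategy."""
    expected_freed_size: int
    blobs: FrozenSet[str]


def sketch_delete_revisions(revisions, blob_sizes, *hashes):
    # revisions maps a commit hash to the set of blob names it links to.
    kept = {h: blobs for h, blobs in revisions.items() if h not in hashes}
    still_used = set().union(*kept.values())
    to_delete = set()
    for h in hashes:
        # Unknown hashes are silently ignored in this sketch.
        to_delete |= revisions.get(h, set()) - still_used
    freed = sum(blob_sizes[blob] for blob in to_delete)
    return SketchDeleteStrategy(expected_freed_size=freed, blobs=frozenset(to_delete))


# Toy cache: blob2 is shared by both revisions.
revisions = {"rev_a": {"blob1", "blob2"}, "rev_b": {"blob2", "blob3"}}
blob_sizes = {"blob1": 100, "blob2": 5000, "blob3": 20}

strategy = sketch_delete_revisions(revisions, blob_sizes, "rev_a")
print(strategy.expected_freed_size)  # 100
print(sorted(strategy.blobs))        # ['blob1']
```

Nothing is removed when the strategy is built. Inspecting expected_freed_size first and only then calling execute() on the real object follows the same two-step pattern.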

CachedRepoInfo

class huggingface_hub.CachedRepoInfo

( repo_id: str repo_type: typing.Literal['model', 'dataset', 'space'] repo_path: Path size_on_disk: int nb_files: int revisions: typing.FrozenSet[huggingface_hub.utils._cache_manager.CachedRevisionInfo] last_accessed: float last_modified: float )

Parameters

  • repo_id (str) — Repo id of the repo on the Hub. Example: "google/fleurs".
  • repo_type (Literal["dataset", "model", "space"]) — Type of the cached repo.
  • repo_path (Path) — Local path to the cached repo.
  • size_on_disk (int) — Sum of the blob file sizes in the cached repo.
  • nb_files (int) — Total number of blob files in the cached repo.
  • revisions (FrozenSet[CachedRevisionInfo]) — Set of CachedRevisionInfo describing all revisions cached in the repo.
  • last_accessed (float) — Timestamp of the last time a blob file of the repo has been accessed.
  • last_modified (float) — Timestamp of the last time a blob file of the repo has been modified/created.

Frozen data structure holding information about a cached repository.

size_on_disk is not necessarily the sum of all revision sizes, because a file shared by several revisions is stored only once. Besides, only blobs are taken into account, not the (negligible) size of folders and symlinks.
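
A small worked example makes the difference concrete. The data below is hypothetical: two revisions link to the same large blob, so the repo size counts it once while each revision size counts it in full.

```python
# Hypothetical data: each revision lists (file name, blob id, size in bytes).
revisions = {
    "rev_a": [("config.json", "blob1", 1197), ("model.bin", "blob2", 500_000)],
    "rev_b": [("config.json", "blob3", 1200), ("model.bin", "blob2", 500_000)],
}

# Size of each revision: every blob it links to is counted.
revision_sizes = {rev: sum(size for _, _, size in files) for rev, files in revisions.items()}

# Size of the repo: each blob is counted once, even if several revisions link to it.
unique_blobs = {blob: size for files in revisions.values() for _, blob, size in files}
repo_size = sum(unique_blobs.values())

print(sum(revision_sizes.values()))  # 1002397
print(repo_size)                     # 502397
```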

The reliability of last_accessed and last_modified can depend on the OS you are using. See the Python documentation for more details.

size_on_disk_str

( )

(property) Sum of the blob file sizes as a human-readable string.

Example: “42.2K”.
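
The human-readable strings on this page are consistent with decimal (base-1000) units: 3398085269 bytes is reported as 3.4G and 970726914 bytes as 970.7M. The helper below is a hypothetical sketch that reproduces those values; it is not claimed to be the library's formatter.

```python
def sketch_format_size(num):
    """Hypothetical helper: format a byte count with decimal (base-1000) units."""
    value = float(num)
    for unit in ["", "K", "M", "G", "T", "P", "E", "Z"]:
        if abs(value) < 1000.0:
            return f"{value:.1f}{unit}"
        value /= 1000.0
    return f"{value:.1f}Y"


print(sketch_format_size(970726914))   # 970.7M
print(sketch_format_size(3398085269))  # 3.4G
```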

refs

( )

(property) Mapping between refs and revision data structures.
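
Since each revision carries the set of refs pointing to it, such a mapping can be obtained by inverting that relation. The sketch below uses a reduced, hypothetical SketchRevision class and shortened placeholder hashes; it only illustrates the idea.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class SketchRevision:
    """Hypothetical, reduced stand-in for CachedRevisionInfo."""
    commit_hash: str
    refs: FrozenSet[str]


revisions = frozenset({
    SketchRevision("d78aea13", frozenset({"main", "refs/pr/1"})),
    SketchRevision("9338f7b6", frozenset()),  # detached: no ref points to it
})

# Invert the relation: one entry per ref, pointing to its revision.
refs = {ref: revision for revision in revisions for ref in revision.refs}

print(sorted(refs))              # ['main', 'refs/pr/1']
print(refs["main"].commit_hash)  # d78aea13
```

Detached revisions do not appear in the mapping, since no ref points to them.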

CachedRevisionInfo

class huggingface_hub.CachedRevisionInfo

( commit_hash: str snapshot_path: Path size_on_disk: int files: typing.FrozenSet[huggingface_hub.utils._cache_manager.CachedFileInfo] refs: typing.FrozenSet[str] last_modified: float )

Parameters

  • commit_hash (str) — Hash of the revision (unique). Example: "9338f7b671827df886678df2bdd7cc7b4f36dffd".
  • snapshot_path (Path) — Path to the revision directory in the snapshots folder. It contains the exact tree structure as the repo on the Hub.
  • files (FrozenSet[CachedFileInfo]) — Set of CachedFileInfo describing all files contained in the snapshot.
  • refs (FrozenSet[str]) — Set of refs pointing to this revision. If the revision has no refs, it is considered detached. Example: {"main", "2.4.0"} or {"refs/pr/1"}.
  • size_on_disk (int) — Sum of the blob file sizes that are symlink-ed by the revision.
  • last_modified (float) — Timestamp of the last time the revision has been created/modified.

Frozen data structure holding information about a revision.

A revision corresponds to a folder in the snapshots folder and is populated with the exact tree structure as the repo on the Hub but contains only symlinks. A revision can be either referenced by 1 or more refs or be “detached” (no refs).

last_accessed cannot be determined correctly on a single revision as blob files are shared across revisions.

size_on_disk is not necessarily the sum of all file sizes because of possible duplicated files. Besides, only blobs are taken into account, not the (negligible) size of folders and symlinks.
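
The snapshot/blob relationship can be reproduced in a temporary folder. This is a sketch assuming a platform where symlinks can be created (typically POSIX); the blob name and the shortened commit hash are placeholders.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp).resolve() / "models--t5-small"
    blob_path = repo / "blobs" / "abc123"       # placeholder blob name
    snapshot = repo / "snapshots" / "d78aea13"  # placeholder (shortened) commit hash
    blob_path.parent.mkdir(parents=True)
    snapshot.mkdir(parents=True)
    blob_path.write_text('{"d_model": 512}')

    # The file in the snapshot is only a symlink to the blob.
    file_path = snapshot / "config.json"
    file_path.symlink_to(blob_path)

    is_symlink = file_path.is_symlink()
    same_target = file_path.resolve() == blob_path
    size_on_disk = blob_path.stat().st_size  # size of the blob, not of the symlink

print(is_symlink, same_target, size_on_disk)  # True True 16
```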

size_on_disk_str

( )

(property) Sum of the blob file sizes as a human-readable string.

Example: “42.2K”.

nb_files

( )

(property) Total number of files in the revision.

CachedFileInfo

class huggingface_hub.CachedFileInfo

( file_name: str file_path: Path blob_path: Path size_on_disk: int blob_last_accessed: float blob_last_modified: float )

Parameters

  • file_name (str) — Name of the file. Example: config.json.
  • file_path (Path) — Path of the file in the snapshots directory. The file path is a symlink referring to a blob in the blobs folder.
  • blob_path (Path) — Path of the blob file. This is equivalent to file_path.resolve().
  • size_on_disk (int) — Size of the blob file in bytes.
  • blob_last_accessed (float) — Timestamp of the last time the blob file has been accessed (from any revision).
  • blob_last_modified (float) — Timestamp of the last time the blob file has been modified/created.

Frozen data structure holding information about a single cached file.

The reliability of blob_last_accessed and blob_last_modified can depend on the OS you are using. See the Python documentation for more details.
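
Timestamps of this kind are floats expressed in seconds since the epoch. The sketch below reads them with os.stat on a temporary file. It assumes, without the page stating it, that the cached values come from the same st_atime and st_mtime fields, whose precision and update policy vary across operating systems and filesystems.

```python
import os
import tempfile
from datetime import datetime, timezone

with tempfile.NamedTemporaryFile() as tmp_file:
    tmp_file.write(b"hello")
    tmp_file.flush()
    stats = os.stat(tmp_file.name)

# st_atime / st_mtime are floats (seconds since the epoch).
last_accessed = stats.st_atime
last_modified = stats.st_mtime

# Convert to a timezone-aware datetime for display.
print(datetime.fromtimestamp(last_modified, tz=timezone.utc).isoformat())
```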

size_on_disk_str

( )

(property) Size of the blob file as a human-readable string.

Example: “42.2K”.

DeleteCacheStrategy

class huggingface_hub.DeleteCacheStrategy

( expected_freed_size: int blobs: typing.FrozenSet[pathlib.Path] refs: typing.FrozenSet[pathlib.Path] repos: typing.FrozenSet[pathlib.Path] snapshots: typing.FrozenSet[pathlib.Path] )

Parameters

  • expected_freed_size (int) — Expected freed size once the strategy is executed.
  • blobs (FrozenSet[Path]) — Set of blob file paths to be deleted.
  • refs (FrozenSet[Path]) — Set of reference file paths to be deleted.
  • repos (FrozenSet[Path]) — Set of entire repo paths to be deleted.
  • snapshots (FrozenSet[Path]) — Set of snapshots to be deleted (directory of symlinks).

Frozen data structure holding the strategy to delete cached revisions.

This object is not meant to be instantiated programmatically but to be returned by delete_revisions(). See documentation for usage example.

expected_freed_size_str

( )

(property) Expected size that will be freed as a human-readable string.

Example: “42.2K”.

Exceptions

CorruptedCacheException

class huggingface_hub.CorruptedCacheException

( )

Exception for any unexpected structure in the Hugging Face cache-system.