Datasets documentation

Cloud storage

🤗 Datasets supports access to cloud storage providers through an S3 filesystem implementation: filesystems.S3FileSystem. You can save and load datasets from your Amazon S3 bucket in a Pythonic way. Take a look at the following table for the other supported cloud storage providers:

Storage provider       Filesystem implementation
Amazon S3              s3fs
Google Cloud Storage   gcsfs
Azure Blob/DataLake    adlfs
Dropbox                dropboxdrivefs
Google Drive           gdrivefs

This guide will show you how to save and load datasets to an S3 bucket with s3fs, but other filesystem implementations can be used in the same way. Examples are also included for Google Cloud Storage and Azure Blob Storage.
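All of these implementations follow the fsspec interface, so a filesystem instance can also be created generically from its protocol name. Below is a minimal sketch (the bucket name is the hypothetical public bucket used in the listing example later in this guide), assuming the matching implementation, s3fs, is installed:

>>> import fsspec

# create a filesystem by protocol name; requires the matching implementation
# from the table above (here s3fs for "s3")
>>> fs = fsspec.filesystem('s3', anon=True)

# list a hypothetical public bucket, just like datasets.filesystems.S3FileSystem would
>>> fs.ls('public-datasets')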

Amazon S3

Listing datasets

  1. Install the S3 dependency with 🤗 Datasets:
pip install datasets[s3]
  2. List files from a public S3 bucket with s3.ls:
>>> import datasets
>>> s3 = datasets.filesystems.S3FileSystem(anon=True)
>>> s3.ls('public-datasets/imdb/train')
['dataset_info.json', 'dataset.arrow', 'state.json']

Access a private S3 bucket by entering your aws_access_key_id and aws_secret_access_key:

>>> import datasets
>>> s3 = datasets.filesystems.S3FileSystem(key=aws_access_key_id, secret=aws_secret_access_key)  
>>> s3.ls('my-private-datasets/imdb/train')  
['dataset_info.json', 'dataset.arrow', 'state.json']
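Listing is just one of the operations the filesystem object supports; the usual fsspec file methods are available as well. A minimal sketch, reusing the hypothetical private bucket and file names from the listing above:

# a sketch of other common fsspec operations on the same filesystem object
>>> s3.info('my-private-datasets/imdb/train/dataset.arrow')

# read a file directly from the bucket
>>> with s3.open('my-private-datasets/imdb/train/state.json') as f:
...     state = f.read()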

Saving datasets

After you have processed your dataset, you can save it to S3 with Dataset.save_to_disk():

>>> from datasets.filesystems import S3FileSystem

# create S3FileSystem instance with your credentials
>>> s3 = S3FileSystem(key=aws_access_key_id, secret=aws_secret_access_key)

# saves encoded_dataset to your s3 bucket
>>> encoded_dataset.save_to_disk('s3://my-private-datasets/imdb/train', fs=s3)

Remember to include your aws_access_key_id and aws_secret_access_key whenever you are interacting with a private S3 bucket.
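Instead of hardcoding the keys, you may prefer to read them from the standard AWS environment variables; a minimal sketch, assuming AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are set in your environment:

>>> import os
>>> from datasets.filesystems import S3FileSystem

# a sketch: take the credentials from the usual AWS environment variables
>>> s3 = S3FileSystem(
...     key=os.environ['AWS_ACCESS_KEY_ID'],
...     secret=os.environ['AWS_SECRET_ACCESS_KEY'],
... )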

Save your dataset with botocore.session.Session and a custom AWS profile:

>>> import botocore
>>> from datasets.filesystems import S3FileSystem

# creates a botocore session with the provided AWS profile
>>> s3_session = botocore.session.Session(profile='my_profile_name')

# create S3FileSystem instance with s3_session
>>> s3 = S3FileSystem(session=s3_session)  

# saves encoded_dataset to your s3 bucket
>>> encoded_dataset.save_to_disk('s3://my-private-datasets/imdb/train', fs=s3)
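s3fs can also resolve a named profile from your AWS credentials file on its own, which may be a shorter alternative to building the botocore session yourself; a sketch, assuming the same profile name:

>>> from datasets.filesystems import S3FileSystem

# a sketch: let s3fs pick up the named AWS profile directly
>>> s3 = S3FileSystem(profile='my_profile_name')

# saves encoded_dataset to your s3 bucket
>>> encoded_dataset.save_to_disk('s3://my-private-datasets/imdb/train', fs=s3)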

Loading datasets

When you are ready to use your dataset again, reload it with Dataset.load_from_disk():

>>> from datasets import load_from_disk
>>> from datasets.filesystems import S3FileSystem

# create S3FileSystem without credentials
>>> s3 = S3FileSystem(anon=True)  

# load encoded_dataset from s3 bucket
>>> dataset = load_from_disk('s3://a-public-datasets/imdb/train', fs=s3)

>>> print(len(dataset))
25000

Load with a botocore.session.Session and a custom AWS profile:

>>> import botocore
>>> from datasets.filesystems import S3FileSystem

# creates a botocore session with the provided AWS profile
>>> s3_session = botocore.session.Session(profile='my_profile_name')

# create S3FileSystem instance with s3_session
>>> s3 = S3FileSystem(session=s3_session)

# load encoded_dataset from s3 bucket
>>> dataset = load_from_disk('s3://my-private-datasets/imdb/train', fs=s3)

>>> print(len(dataset))
25000
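save_to_disk() and load_from_disk() also work with a DatasetDict, so every split can be stored under a single prefix; a minimal sketch, assuming the same hypothetical private bucket and credentials as above:

>>> from datasets import load_dataset, load_from_disk
>>> from datasets.filesystems import S3FileSystem

>>> s3 = S3FileSystem(key=aws_access_key_id, secret=aws_secret_access_key)

# a sketch: save and reload all splits at once as a DatasetDict
>>> dataset_dict = load_dataset('imdb')
>>> dataset_dict.save_to_disk('s3://my-private-datasets/imdb', fs=s3)
>>> reloaded = load_from_disk('s3://my-private-datasets/imdb', fs=s3)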

Google Cloud Storage

  1. Install the Google Cloud Storage implementation:
conda install -c conda-forge gcsfs
# or install with pip
pip install gcsfs
  2. Save your dataset (see the sketch after these steps for explicit credentials):
>>> import gcsfs

# create GCSFileSystem instance using default gcloud credentials with project
>>> gcs = gcsfs.GCSFileSystem(project='my-google-project')

# saves encoded_dataset to your gcs bucket
>>> encoded_dataset.save_to_disk('gcs://my-private-datasets/imdb/train', fs=gcs)
  3. Load your dataset:
>>> import gcsfs
>>> from datasets import load_from_disk

# create GCSFileSystem instance using default gcloud credentials with project
>>> gcs = gcsfs.GCSFileSystem(project='my-google-project')

# loads encoded_dataset from your gcs bucket
>>> dataset = load_from_disk('gcs://my-private-datasets/imdb/train', fs=gcs)
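If you are not relying on the default gcloud credentials, gcsfs also accepts an explicit token, for example a path to a service account key file; a minimal sketch, assuming a hypothetical key file:

>>> import gcsfs
>>> from datasets import load_from_disk

# a sketch: authenticate with a service account key file instead of the default gcloud credentials
# ('service-account-key.json' is a hypothetical path)
>>> gcs = gcsfs.GCSFileSystem(project='my-google-project', token='service-account-key.json')

# loads encoded_dataset from your gcs bucket
>>> dataset = load_from_disk('gcs://my-private-datasets/imdb/train', fs=gcs)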

Azure Blob Storage

  1. Install the Azure Blob Storage implementation:
conda install -c conda-forge adlfs
# or install with pip
pip install adlfs
  2. Save your dataset (see the sketch after these steps for connection-string authentication):
>>> import adlfs

# create AzureBlobFileSystem instance with account_name and account_key
>>> abfs = adlfs.AzureBlobFileSystem(account_name="XXXX", account_key="XXXX")

# saves encoded_dataset to your azure container
>>> encoded_dataset.save_to_disk('abfs://my-private-datasets/imdb/train', fs=abfs)
  3. Load your dataset:
>>> import adlfs
>>> from datasets import load_from_disk

# create AzureBlobFileSystem instance with account_name and account_key
>>> abfs = adlfs.AzureBlobFileSystem(account_name="XXXX", account_key="XXXX")

# loads encoded_dataset from your azure container
>>> dataset = load_from_disk('abfs://my-private-datasets/imdb/train', fs=abfs)
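Instead of an account name and key, adlfs can also authenticate with a full connection string taken from the Azure portal; a minimal sketch, with a hypothetical connection string:

>>> import adlfs
>>> from datasets import load_from_disk

# a sketch: authenticate with an Azure storage connection string instead of account_name/account_key
# (the connection string value is a placeholder)
>>> abfs = adlfs.AzureBlobFileSystem(connection_string="DefaultEndpointsProtocol=https;AccountName=XXXX;AccountKey=XXXX;EndpointSuffix=core.windows.net")

# loads encoded_dataset from your azure container
>>> dataset = load_from_disk('abfs://my-private-datasets/imdb/train', fs=abfs)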