Datasets documentation

Cloud storage

You are viewing v2.2.1 version. A newer version v3.2.0 is available.

🤗 Datasets supports access to cloud storage providers through an S3 filesystem implementation: filesystems.S3FileSystem. You can save and load datasets from your Amazon S3 bucket in a Pythonic way. Take a look at the following table for other supported cloud storage providers:

Storage provider        Filesystem implementation
Amazon S3               s3fs
Google Cloud Storage    gcsfs
Azure DataLake          adl
Azure Blob              abfs
Dropbox                 dropboxdrivefs
Google Drive            gdrivefs

This guide shows you how to save and load datasets with s3fs to an S3 bucket, but other filesystem implementations can be used similarly.

Listing datasets

  1. Install the S3 dependency with 🤗 Datasets (run this in your shell, not in the Python interpreter):
pip install datasets[s3]
  2. List files from a public S3 bucket with s3.ls:
>>> import datasets
>>> s3 = datasets.filesystems.S3FileSystem(anon=True)
>>> s3.ls('public-datasets/imdb/train')
['dataset_info.json','dataset.arrow','state.json']

Access a private S3 bucket by entering your aws_access_key_id and aws_secret_access_key:

>>> import datasets
>>> s3 = datasets.filesystems.S3FileSystem(key=aws_access_key_id, secret=aws_secret_access_key)  
>>> s3.ls('my-private-datasets/imdb/train')  
['dataset_info.json','dataset.arrow','state.json']
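
A listing like the ones above can be checked before you try to load from it. The following is a minimal sketch, assuming that a directory written by Dataset.save_to_disk() holds dataset_info.json, state.json, and at least one .arrow file. The helper name looks_like_saved_dataset is hypothetical and is not part of 🤗 Datasets.

```python
import posixpath

# Assumption: file names a saved dataset directory is expected to contain.
REQUIRED_FILES = {"dataset_info.json", "state.json"}


def looks_like_saved_dataset(listing):
    """Return True if a directory listing resembles a saved dataset.

    `listing` is a list of paths or bare file names, for example the
    result of s3.ls('bucket/path').
    """
    # Keep only the final path component so full keys and bare names both work.
    names = {posixpath.basename(path) for path in listing}
    has_arrow = any(name.endswith(".arrow") for name in names)
    return REQUIRED_FILES <= names and has_arrow


listing = [
    "public-datasets/imdb/train/dataset_info.json",
    "public-datasets/imdb/train/dataset.arrow",
    "public-datasets/imdb/train/state.json",
]
print(looks_like_saved_dataset(listing))
```

The check only looks at file names, so it cannot tell whether the files themselves are valid.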

Google Cloud Storage

Other filesystem implementations, like Google Cloud Storage, are used similarly:

  1. Install the Google Cloud Storage implementation (run this in your shell):
conda install -c conda-forge gcsfs
# or install with pip
pip install gcsfs
  2. Save your dataset:
>>> import gcsfs
>>> gcs = gcsfs.GCSFileSystem(project='my-google-project')

>>> # saves encoded_dataset to your GCS bucket
>>> encoded_dataset.save_to_disk('gcs://my-private-datasets/imdb/train', fs=gcs)
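
In the paths used in this guide, the provider shows up only as the URI scheme (s3:// or gcs://), while the bucket and the path inside it keep the same shape. As a rough illustration, the sketch below splits such a URI with the standard library. The helper name split_dataset_uri is hypothetical and not part of 🤗 Datasets or gcsfs.

```python
from urllib.parse import urlsplit


def split_dataset_uri(uri):
    """Split 'scheme://bucket/path' into a (scheme, bucket, path) tuple."""
    parts = urlsplit(uri)
    if not parts.scheme or not parts.netloc:
        raise ValueError(f"expected scheme://bucket/path, got {uri!r}")
    # The bucket is parsed as the network location; drop the leading slash
    # from the remainder to get the path inside the bucket.
    return parts.scheme, parts.netloc, parts.path.lstrip("/")


for uri in (
    "s3://my-private-datasets/imdb/train",
    "gcs://my-private-datasets/imdb/train",
):
    print(split_dataset_uri(uri))
```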

Saving datasets

After you have processed your dataset, you can save it to S3 with Dataset.save_to_disk():

>>> from datasets.filesystems import S3FileSystem

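>>> # create S3FileSystem instance with your credentials (anonymous access cannot write to a private bucket)
>>> s3 = S3FileSystem(key=aws_access_key_id, secret=aws_secret_access_key)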

>>> # saves encoded_dataset to your s3 bucket
>>> encoded_dataset.save_to_disk('s3://my-private-datasets/imdb/train', fs=s3)

Remember to include your aws_access_key_id and aws_secret_access_key whenever you are interacting with a private S3 bucket.
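
One way to avoid hard-coding these values is to read them from the environment. The sketch below assumes the credentials are exported under the standard AWS variable names AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY. The helper s3_credentials_from_env is hypothetical; it only builds the keyword arguments (key, secret, anon) that the examples in this guide pass to S3FileSystem.

```python
import os


def s3_credentials_from_env(environ=None):
    """Build keyword arguments for S3FileSystem from environment variables.

    Falls back to anonymous access when either variable is missing.
    """
    environ = os.environ if environ is None else environ
    key = environ.get("AWS_ACCESS_KEY_ID")
    secret = environ.get("AWS_SECRET_ACCESS_KEY")
    if key and secret:
        return {"key": key, "secret": secret}
    return {"anon": True}


kwargs = s3_credentials_from_env()
# With Datasets installed, these could be passed on as:
#   s3 = S3FileSystem(**kwargs)
# Print only the argument names so that no secret ends up in the output.
print(sorted(kwargs))
```

Falling back to anonymous access keeps the sketch short; for a private bucket you may prefer to raise an error instead, since an anonymous filesystem will fail later at write time.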

Save your dataset with botocore.session.Session and a custom AWS profile:

>>> import botocore.session
>>> from datasets.filesystems import S3FileSystem

>>> # creates a botocore session with the provided AWS profile
>>> s3_session = botocore.session.Session(profile='my_profile_name')

>>> # create S3FileSystem instance with s3_session
>>> s3 = S3FileSystem(session=s3_session)  

>>> # saves encoded_dataset to your s3 bucket
>>> encoded_dataset.save_to_disk('s3://my-private-datasets/imdb/train', fs=s3)
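
The profile name refers to a named section in the AWS shared credentials file, which is usually located at ~/.aws/credentials and uses an INI-style layout. As a minimal sketch, the snippet below parses an example of that layout with the standard library to confirm that a profile exists before it is used. The file contents are placeholders, and list_profiles is a hypothetical helper, not a botocore function.

```python
import configparser

# Hypothetical contents of ~/.aws/credentials; all values are placeholders.
EXAMPLE_CREDENTIALS = """\
[default]
aws_access_key_id = EXAMPLE_DEFAULT_KEY
aws_secret_access_key = EXAMPLE_DEFAULT_SECRET

[my_profile_name]
aws_access_key_id = EXAMPLE_PROFILE_KEY
aws_secret_access_key = EXAMPLE_PROFILE_SECRET
"""


def list_profiles(credentials_text):
    """Return the profile names defined in credentials-file text."""
    parser = configparser.ConfigParser()
    parser.read_string(credentials_text)
    return parser.sections()


profiles = list_profiles(EXAMPLE_CREDENTIALS)
print("my_profile_name" in profiles)
```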

Loading datasets

When you are ready to use your dataset again, reload it with Dataset.load_from_disk():

>>> from datasets import load_from_disk
>>> from datasets.filesystems import S3FileSystem

>>> # create S3FileSystem without credentials
>>> s3 = S3FileSystem(anon=True)  

>>> # load encoded_dataset from the S3 bucket
>>> dataset = load_from_disk('s3://a-public-datasets/imdb/train', fs=s3)

>>> print(len(dataset))
25000

Load your dataset with botocore.session.Session and a custom AWS profile:

>>> import botocore.session
>>> from datasets import load_from_disk
>>> from datasets.filesystems import S3FileSystem

>>> # creates a botocore session with the provided AWS profile
>>> s3_session = botocore.session.Session(profile='my_profile_name')

>>> # create S3FileSystem instance with s3_session
>>> s3 = S3FileSystem(session=s3_session)

>>> # load encoded_dataset from the S3 bucket
>>> dataset = load_from_disk('s3://my-private-datasets/imdb/train', fs=s3)

>>> print(len(dataset))
25000