Cloud storage

🤗 Datasets supports access to cloud storage providers through a S3 filesystem implementation: datasets.filesystems.S3FileSystem. You can save and load datasets from your Amazon S3 bucket in a Pythonic way. Take a look at the following table for other supported cloud storage providers:

Storage provider

Filesystem implementation

Amazon S3

s3fs

Google Cloud Storage

gcsfs

Azure DataLake

adl

Azure Blob

abfs

Dropbox

dropboxdrivefs

Google Drive

gdrivefs

This guide will show you how to save and load datasets with s3fs to a S3 bucket, but other filesystem implementations can be used similarly.

Listing datasets

  1. Install the S3 dependency with 🤗 Datasets:

>>> pip install datasets[s3]
  1. List files from a public S3 bucket with s3.ls:

>>> import datasets
>>> s3 = datasets.filesystems.S3FileSystem(anon=True)
>>> s3.ls('public-datasets/imdb/train')
['dataset_info.json.json','dataset.arrow','state.json']

Access a private S3 bucket by entering your aws_access_key_id and aws_secret_access_key:

>>> import datasets
>>> s3 = datasets.filesystems.S3FileSystem(key=aws_access_key_id, secret=aws_secret_access_key)
>>> s3.ls('my-private-datasets/imdb/train')
['dataset_info.json.json','dataset.arrow','state.json']

Google Cloud Storage

Other filesystem implementations, like Google Cloud Storage, are used similarly:

  1. Install the Google Cloud Storage implementation:

>>> conda install -c conda-forge gcsfs
# or install with pip
>>> pip install gcsfs
  1. Load your dataset:

>>> import gcsfs
>>> gcs = gcsfs.GCSFileSystem(project='my-google-project')

>>> # saves encoded_dataset to your s3 bucket
>>> encoded_dataset.save_to_disk('gcs://my-private-datasets/imdb/train', fs=gcs)

Saving datasets

After you have processed your dataset, you can save it to S3 with datasets.Dataset.save_to_disk():

>>> from datasets.filesystems import S3FileSystem

>>> # create S3FileSystem instance
>>> s3 = S3FileSystem(anon=True)

>>> # saves encoded_dataset to your s3 bucket
>>> encoded_dataset.save_to_disk('s3://my-private-datasets/imdb/train', fs=s3)

Tip

Remember to include your aws_access_key_id and aws_secret_access_key whenever you are interacting with a private S3 bucket.

Save your dataset with botocore.session.Session and a custom AWS profile:

>>> import botocore
>>> from datasets.filesystems import S3FileSystem

>>> # creates a botocore session with the provided AWS profile
>>> s3_session = botocore.session.Session(profile='my_profile_name')

>>> # create S3FileSystem instance with s3_session
>>> s3 = S3FileSystem(session=s3_session)

>>> # saves encoded_dataset to your s3 bucket
>>> encoded_dataset.save_to_disk('s3://my-private-datasets/imdb/train',fs=s3)

Loading datasets

When you are ready to use your dataset again, reload it with datasets.load_from_disk:

>>> from datasets import load_from_disk
>>> from datasets.filesystems import S3FileSystem

>>> # create S3FileSystem without credentials
>>> s3 = S3FileSystem(anon=True)

>>> # load encoded_dataset to from s3 bucket
>>> dataset = load_from_disk('s3://a-public-datasets/imdb/train',fs=s3)

>>> print(len(dataset))
>>> # 25000

Load with botocore.session.Session and custom AWS profile:

>>> import botocore
>>> from datasets.filesystems import S3FileSystem

>>> # create S3FileSystem instance with aws_access_key_id and aws_secret_access_key
>>> s3_session = botocore.session.Session(profile='my_profile_name')

>>> # create S3FileSystem instance with s3_session
>>> s3 = S3FileSystem(session=s3_session)

>>> # load encoded_dataset to from s3 bucket
>>> dataset = load_from_disk('s3://my-private-datasets/imdb/train',fs=s3)

>>> print(len(dataset))
>>> # 25000