Cloud storage
==============
🤗 Datasets supports access to cloud storage providers through an S3 filesystem implementation: :class:`datasets.filesystems.S3FileSystem`. You can save and load datasets from your Amazon S3 bucket in a Pythonic way. Take a look at the following table for other supported cloud storage providers:
.. list-table::
    :header-rows: 1

    * - Storage provider
      - Filesystem implementation
    * - Amazon S3
      - ``s3fs``
    * - Google Cloud Storage
      - ``gcsfs``
    * - Azure DataLake
      - ``adl``
    * - Azure Blob
      - ``abfs``
    * - Dropbox
      - ``dropboxdrivefs``
    * - Google Drive
      - ``gdrivefs``
This guide will show you how to save and load datasets with **s3fs** to an S3 bucket, but other filesystem implementations can be used similarly.
Listing datasets
----------------
1. Install the S3 dependency with 🤗 Datasets:
.. code::

    pip install datasets[s3]
2. List files from a public S3 bucket with ``s3.ls``:
.. code-block::

    >>> import datasets
    >>> s3 = datasets.filesystems.S3FileSystem(anon=True)
    >>> s3.ls('public-datasets/imdb/train')
    ['dataset_info.json', 'dataset.arrow', 'state.json']
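The filesystem instance also supports the standard ``fsspec`` file operations, so you can read files straight from the bucket. A minimal sketch, assuming the bucket layout shown above, that peeks at the split's ``state.json``:

.. code-block::

    >>> import json
    >>> # open state.json directly from the public bucket and parse it
    >>> with s3.open('public-datasets/imdb/train/state.json') as f:
    ...     state = json.load(f)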
Access a private S3 bucket by entering your ``aws_access_key_id`` and ``aws_secret_access_key``:
.. code-block::

    >>> import datasets
    >>> s3 = datasets.filesystems.S3FileSystem(key=aws_access_key_id, secret=aws_secret_access_key)
    >>> s3.ls('my-private-datasets/imdb/train')
    ['dataset_info.json', 'dataset.arrow', 'state.json']
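If you are working with temporary credentials (for example, from an assumed IAM role), ``s3fs`` also accepts a session token. A minimal sketch, assuming the three values are already bound to variables:

.. code-block::

    >>> import datasets
    >>> # temporary credentials additionally require a session token
    >>> s3 = datasets.filesystems.S3FileSystem(
    ...     key=aws_access_key_id, secret=aws_secret_access_key, token=aws_session_token
    ... )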
Google Cloud Storage
^^^^^^^^^^^^^^^^^^^^
Other filesystem implementations, like Google Cloud Storage, are used similarly:
1. Install the Google Cloud Storage implementation:
.. code-block::

    conda install -c conda-forge gcsfs
    # or install with pip
    pip install gcsfs
2. Save your dataset:
.. code-block::

    >>> import gcsfs
    >>> gcs = gcsfs.GCSFileSystem(project='my-google-project')
    >>> # saves encoded_dataset to your gcs bucket
    >>> encoded_dataset.save_to_disk('gcs://my-private-datasets/imdb/train', fs=gcs)
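When you need the dataset again, reload it with the same filesystem instance; a minimal sketch mirroring the S3 examples below:

.. code-block::

    >>> from datasets import load_from_disk
    >>> # loads encoded_dataset from your gcs bucket
    >>> dataset = load_from_disk('gcs://my-private-datasets/imdb/train', fs=gcs)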
Saving datasets
---------------
After you have processed your dataset, you can save it to S3 with :func:`datasets.Dataset.save_to_disk`:
.. code-block::

    >>> from datasets.filesystems import S3FileSystem
    >>> # create S3FileSystem instance with your credentials
    >>> s3 = S3FileSystem(key=aws_access_key_id, secret=aws_secret_access_key)
    >>> # saves encoded_dataset to your s3 bucket
    >>> encoded_dataset.save_to_disk('s3://my-private-datasets/imdb/train', fs=s3)
.. tip::

    Remember to include your ``aws_access_key_id`` and ``aws_secret_access_key`` whenever you are interacting with a private S3 bucket.
Save your dataset with ``botocore.session.Session`` and a custom AWS profile:
.. code-block::

    >>> import botocore
    >>> from datasets.filesystems import S3FileSystem
    >>> # creates a botocore session with the provided AWS profile
    >>> s3_session = botocore.session.Session(profile='my_profile_name')
    >>> # create S3FileSystem instance with s3_session
    >>> s3 = S3FileSystem(session=s3_session)
    >>> # saves encoded_dataset to your s3 bucket
    >>> encoded_dataset.save_to_disk('s3://my-private-datasets/imdb/train', fs=s3)
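To check that everything was written, you can list the dataset directory with the same filesystem instance; the file names below match the layout shown earlier:

.. code-block::

    >>> # list the files that save_to_disk produced
    >>> s3.ls('my-private-datasets/imdb/train')
    ['dataset_info.json', 'dataset.arrow', 'state.json']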
Loading datasets
----------------
When you are ready to use your dataset again, reload it with :obj:`datasets.load_from_disk`:
.. code-block::

    >>> from datasets import load_from_disk
    >>> from datasets.filesystems import S3FileSystem
    >>> # create S3FileSystem without credentials
    >>> s3 = S3FileSystem(anon=True)
    >>> # loads encoded_dataset from the public s3 bucket
    >>> dataset = load_from_disk('s3://a-public-datasets/imdb/train', fs=s3)
    >>> print(len(dataset))
    25000
Load with ``botocore.session.Session`` and a custom AWS profile:
.. code-block::

    >>> import botocore
    >>> from datasets import load_from_disk
    >>> from datasets.filesystems import S3FileSystem
    >>> # creates a botocore session with the provided AWS profile
    >>> s3_session = botocore.session.Session(profile='my_profile_name')
    >>> # create S3FileSystem instance with s3_session
    >>> s3 = S3FileSystem(session=s3_session)
    >>> # loads encoded_dataset from your s3 bucket
    >>> dataset = load_from_disk('s3://my-private-datasets/imdb/train', fs=s3)
    >>> print(len(dataset))
    25000
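You can also pass your credentials directly instead of a profile; a short sketch following the same pattern as the listing example above:

.. code-block::

    >>> from datasets import load_from_disk
    >>> from datasets.filesystems import S3FileSystem
    >>> # create S3FileSystem instance with aws_access_key_id and aws_secret_access_key
    >>> s3 = S3FileSystem(key=aws_access_key_id, secret=aws_secret_access_key)
    >>> # loads encoded_dataset from your private s3 bucket
    >>> dataset = load_from_disk('s3://my-private-datasets/imdb/train', fs=s3)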