FileSystems Integration for Cloud Storage¶

Supported Filesystems¶

Currently datasets offers an S3 filesystem implementation with datasets.filesystems.S3FileSystem. S3FileSystem is a subclass of s3fs.S3FileSystem, which is a known implementation of fsspec.

Furthermore, datasets supports all fsspec implementations. Currently known implementations are:

  • s3fs for Amazon S3 and other compatible stores

  • gcsfs for Google Cloud Storage

  • adl for Azure DataLake storage

  • abfs for Azure Blob service

  • dropbox for access to dropbox shares

  • gdrive to access Google Drive and shares (experimental)

Native support for these implementations will be added to datasets in the near future, but you can already use them in a similar way to s3fs.
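For instance, any registered implementation can be instantiated by its protocol name through fsspec.filesystem(). A minimal sketch, assuming gcsfs is installed and using placeholder project and bucket names:

>>> import fsspec
>>> # instantiate the "gcs" implementation by protocol name (requires gcsfs to be installed)
>>> fs = fsspec.filesystem('gcs', project='my-google-project')
>>> fs.ls('my-private-datasets/imdb/train')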

Examples:

Example using datasets.filesystems.S3FileSystem within datasets.

>>> pip install datasets[s3]

Listing files from a public s3 bucket.

>>> import datasets
>>> s3 = datasets.filesystems.S3FileSystem(anon=True)  
>>> s3.ls('public-datasets/imdb/train')  
['dataset_info.json','dataset.arrow','state.json']

Listing files from a private s3 bucket using aws_access_key_id and aws_secret_access_key.

>>> import datasets
>>> s3 = datasets.filesystems.S3FileSystem(key=aws_access_key_id, secret=aws_secret_access_key)  
>>> s3.ls('my-private-datasets/imdb/train')  
['dataset_info.json','dataset.arrow','state.json']
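The same filesystem instance can also open individual files from the bucket. A minimal sketch that reads the state.json file from the listing above (the bucket path is a placeholder):

>>> import json
>>> # open state.json directly from the bucket and parse it
>>> f = s3.open('my-private-datasets/imdb/train/state.json')
>>> state = json.load(f)
>>> f.close()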

Using S3FileSystem with a botocore.session.Session and a custom AWS profile.

>>> import botocore
>>> from datasets.filesystems import S3FileSystem
>>> s3_session = botocore.session.Session(profile='my_profile_name')
>>> s3 = S3FileSystem(session=s3_session)  
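Because s3fs also works with S3-compatible object stores, you can point S3FileSystem at a custom endpoint through client_kwargs. A minimal sketch; the endpoint URL and credentials are placeholders:

>>> from datasets.filesystems import S3FileSystem
>>> # targets an S3-compatible store (e.g. a self-hosted MinIO) instead of AWS
>>> s3 = S3FileSystem(key=aws_access_key_id, secret=aws_secret_access_key, client_kwargs={'endpoint_url': 'https://my-s3-compatible-endpoint'})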

Example using another fsspec implementation, such as gcsfs, within datasets.

>>> conda install -c conda-forge gcsfs
>>> # or
>>> pip install gcsfs
>>> import gcsfs
>>> gcs = gcsfs.GCSFileSystem(project='my-google-project') 
>>>
>>> # saves encoded_dataset to your gcs bucket
>>> encoded_dataset.save_to_disk('gcs://my-private-datasets/imdb/train', fs=gcs)  
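Loading the dataset back works the same way. A minimal sketch using the gcsfs instance from above:

>>> from datasets import load_from_disk
>>>
>>> # loads encoded_dataset back from your gcs bucket
>>> dataset = load_from_disk('gcs://my-private-datasets/imdb/train', fs=gcs)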

Saving a processed dataset to s3¶

Once you have your final dataset, you can save it to s3 and reuse it later using datasets.load_from_disk. Saving a dataset to s3 will upload several files to your bucket:

  • arrow files: they contain your dataset’s data

  • dataset_info.json: contains the description, citations, etc. of the dataset

  • state.json: contains the list of the arrow files and other information like the dataset format type, if any (for example torch or tensorflow)

Saving encoded_dataset to a private s3 bucket using aws_access_key_id and aws_secret_access_key.

>>> from datasets.filesystems import S3FileSystem
>>>
>>> # create S3FileSystem instance with aws_access_key_id and aws_secret_access_key
>>> s3 = S3FileSystem(key=aws_access_key_id, secret=aws_secret_access_key)  
>>>
>>> # saves encoded_dataset to your s3 bucket
>>> encoded_dataset.save_to_disk('s3://my-private-datasets/imdb/train',fs=s3)  
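To check that the upload succeeded, you can list the target path with the same filesystem instance; you should see the files described above:

>>> # lists the files uploaded by save_to_disk
>>> s3.ls('my-private-datasets/imdb/train')
['dataset_info.json','dataset.arrow','state.json']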

Saving encoded_dataset to a private s3 bucket using botocore.session.Session and custom AWS profile.

>>> import botocore
>>> from datasets.filesystems import S3FileSystem
>>>
>>> # creates a botocore session with the provided AWS profile
>>> s3_session = botocore.session.Session(profile='my_profile_name')
>>>
>>> # create S3FileSystem instance with s3_session
>>> s3 = S3FileSystem(session=s3_session)  
>>>
>>> # saves encoded_dataset to your s3 bucket
>>> encoded_dataset.save_to_disk('s3://my-private-datasets/imdb/train',fs=s3)  

Loading a processed dataset from s3¶

After you have saved your processed dataset to s3, you can load it using datasets.load_from_disk. You can only load datasets from s3 that were saved using datasets.Dataset.save_to_disk() or datasets.DatasetDict.save_to_disk().
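The same applies to a datasets.DatasetDict: saving uploads one sub-directory per split, and datasets.load_from_disk returns a DatasetDict again. A minimal sketch, assuming encoded_datasets is a DatasetDict with a train split and using placeholder bucket names and credentials:

>>> from datasets import load_from_disk
>>> from datasets.filesystems import S3FileSystem
>>>
>>> s3 = S3FileSystem(key=aws_access_key_id, secret=aws_secret_access_key)
>>>
>>> # saves every split of encoded_datasets under its own sub-directory
>>> encoded_datasets.save_to_disk('s3://my-private-datasets/imdb', fs=s3)
>>>
>>> # load_from_disk returns a DatasetDict again
>>> reloaded = load_from_disk('s3://my-private-datasets/imdb', fs=s3)
>>> reloaded['train']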

Loading encoded_dataset from a public s3 bucket.

>>> from datasets import load_from_disk
>>> from datasets.filesystems import S3FileSystem
>>>
>>> # create S3FileSystem without credentials
>>> s3 = S3FileSystem(anon=True)  
>>>
>>> # load encoded_dataset from s3 bucket
>>> dataset = load_from_disk('s3://a-public-datasets/imdb/train',fs=s3)  
>>>
>>> print(len(dataset))
25000

Loading encoded_dataset from a private s3 bucket using aws_access_key_id and aws_secret_access_key.

>>> from datasets import load_from_disk
>>> from datasets.filesystems import S3FileSystem
>>>
>>> # create S3FileSystem instance with aws_access_key_id and aws_secret_access_key
>>> s3 = S3FileSystem(key=aws_access_key_id, secret=aws_secret_access_key)  
>>>
>>> # load encoded_dataset from s3 bucket
>>> dataset = load_from_disk('s3://my-private-datasets/imdb/train',fs=s3)  
>>>
>>> print(len(dataset))
25000

Loading encoded_dataset from a private s3 bucket using botocore.session.Session and custom AWS profile.

>>> import botocore
>>> from datasets.filesystems import S3FileSystem
>>>
>>> # creates a botocore session with the provided AWS profile
>>> s3_session = botocore.session.Session(profile='my_profile_name')
>>>
>>> # create S3FileSystem instance with s3_session
>>> s3 = S3FileSystem(session=s3_session)
>>>
>>> # load encoded_dataset from s3 bucket
>>> dataset = load_from_disk('s3://my-private-datasets/imdb/train',fs=s3)  
>>>
>>> print(len(dataset))
25000
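Since state.json also stores the dataset format type, a dataset that was saved with a torch or tensorflow format is reloaded with that format. You can also set it after loading; a minimal sketch, assuming the (hypothetical) column names below exist in the loaded dataset:

>>> # makes __getitem__ return torch tensors for the listed columns
>>> dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])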