Use with PyTorch

This document is a quick introduction to using datasets with PyTorch, with a particular focus on how to get torch.Tensor objects out of our datasets, and how to use a PyTorch DataLoader and a Hugging Face Dataset with the best performance.

Dataset format

By default, datasets return regular python objects: integers, floats, strings, lists, etc.

To get PyTorch tensors instead, you can set the format of the dataset to pytorch using Dataset.with_format():

>>> from datasets import Dataset
>>> data = [[1, 2],[3, 4]]
>>> ds = Dataset.from_dict({"data": data})
>>> ds = ds.with_format("torch")
>>> ds[0]
{'data': tensor([1, 2])}
>>> ds[:2]
{'data': tensor([[1, 2],
         [3, 4]])}

A Dataset object is a wrapper of an Arrow table, which allows fast zero-copy reads from arrays in the dataset to PyTorch tensors.

To load the data as tensors on a GPU, specify the device argument:

>>> import torch
>>> device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
>>> ds = ds.with_format("torch", device=device)
>>> ds[0]
{'data': tensor([1, 2], device='cuda:0')}

N-dimensional arrays

If your dataset consists of N-dimensional arrays, you will see that by default they are considered as nested lists. In particular, a PyTorch formatted dataset outputs nested lists instead of a single tensor:

>>> from datasets import Dataset
>>> data = [[[1, 2],[3, 4]],[[5, 6],[7, 8]]]
>>> ds = Dataset.from_dict({"data": data})
>>> ds = ds.with_format("torch")
>>> ds[0]
{'data': [tensor([1, 2]), tensor([3, 4])]}

To get a single tensor, you must explicitly use the Array feature type and specify the shape of your tensors:

>>> from datasets import Dataset, Features, Array2D
>>> data = [[[1, 2],[3, 4]],[[5, 6],[7, 8]]]
>>> features = Features({"data": Array2D(shape=(2, 2), dtype='int32')})
>>> ds = Dataset.from_dict({"data": data}, features=features)
>>> ds = ds.with_format("torch")
>>> ds[0]
{'data': tensor([[1, 2],
         [3, 4]])}
>>> ds[:2]
{'data': tensor([[[1, 2],
          [3, 4]],
 
         [[5, 6],
          [7, 8]]])}

Other feature types

ClassLabel data are properly converted to tensors:

>>> from datasets import Dataset, Features, ClassLabel
>>> data = [0, 0, 1]
>>> features = Features({"data": ClassLabel(names=["negative", "positive"])})
>>> ds = Dataset.from_dict({"data": data}, features=features) 
>>> ds = ds.with_format("torch")  
>>> ds[:3]
{'data': tensor([0, 0, 1])}

However, since it’s not possible to convert text data to PyTorch tensors, you can’t format a string column to PyTorch. Instead, you can explicitly format certain columns and leave the other columns unformatted:

>>> from datasets import Dataset, Features 
>>> text = ["foo", "bar"]
>>> data = [0, 1] 
>>> ds = Dataset.from_dict({"text": text, "data": data})  
>>> ds = ds.with_format("torch", columns=["data"], output_all_columns=True) 
>>> ds[:2]                                                                                                                                                     
{'data': tensor([0, 1]), 'text': ['foo', 'bar']}

The Image and Audio feature types are not supported yet.

Data loading

Like torch.utils.data.Dataset objects, a Dataset can be passed directly to a PyTorch DataLoader:

>>> import numpy as np
>>> from datasets import Dataset 
>>> from torch.utils.data import DataLoader
>>> data = np.random.rand(16)
>>> label = np.random.randint(0, 2, size=16)
>>> ds = Dataset.from_dict({"data": data, "label": label}).with_format("torch")
>>> dataloader = DataLoader(ds, batch_size=4)
>>> for batch in dataloader:
...     print(batch)                                                                                            
{'data': tensor([0.0047, 0.4979, 0.6726, 0.8105]), 'label': tensor([0, 1, 0, 1])}
{'data': tensor([0.4832, 0.2723, 0.4259, 0.2224]), 'label': tensor([0, 0, 0, 0])}
{'data': tensor([0.5837, 0.3444, 0.4658, 0.6417]), 'label': tensor([0, 1, 0, 0])}
{'data': tensor([0.7022, 0.1225, 0.7228, 0.8259]), 'label': tensor([1, 1, 1, 1])}

Optimize data loading

There are several ways you can increase the speed your data is loaded which can save you time, especially if you are working with large datasets. PyTorch offers parallelized data loading, retrieving batches of indices instead of individually, and streaming to progressively download datasets.

Use multiple Workers

You can parallelize data loading with the num_workers argument of a PyTorch DataLoader and get a higher throughput.

Under the hood, the DataLoader starts num_workers processes. Each process reloads the dataset passed to the DataLoader and is used to query examples. Reloading the dataset inside a worker doesn’t fill up your RAM, since it simply memory-maps the dataset again from your disk.

>>> import numpy as np
>>> from datasets import Dataset, load_from_disk
>>> from torch.utils.data import DataLoader
>>> data = np.random.rand(10_000)
>>> Dataset.from_dict({"data": data}).save_to_disk("my_dataset")
>>> ds = load_from_disk("my_dataset").with_format("torch")
>>> dataloader = DataLoader(ds, batch_size=32, num_workers=4)

Use a BatchSampler

By default, the PyTorch DataLoader load batches of data from a dataset one by one like this:

batch = [dataset[idx] for idx in range(start, end)]

Unfortunately, this does numerous read operations on the dataset. It is more efficient to query batches of examples using a list:

batch = dataset[start:end]
# or
batch = dataset[list_of_indices]

For the PyTorch DataLoader to query batches using a list, you can use a BatchSampler:

>>> from torch.utils.data.sampler import BatchSampler, RandomSampler
>>> sampler = BatchSampler(RandomSampler(ds), batch_size=32, drop_last=False)
>>> dataloader = DataLoader(ds, sampler=sampler)

Moreover, this is particularly useful if you used set_transform to apply a transform on-the-fly when examples are accessed. You must use a BatchSampler if you want the transform to be given full batches instead of receiving batch_size times one single element.

Stream data

Loading a dataset in streaming mode is useful to progressively download the data you need while iterating over the dataset. Set the format of a streaming dataset to torch, and it inherits from torch.utils.data.IterableDataset so you can pass it to a DataLoader:

>>> import numpy as np
>>> from datasets import Dataset, load_dataset
>>> from torch.utils.data import DataLoader
>>> data = np.random.rand(10_000)
>>> Dataset.from_dict({"data": data}).push_to_hub("<username>/my_dataset")  # Upload to the Hugging Face Hub
>>> ds = load_dataset("<username>/my_dataset", streaming=True, split="train").with_format("torch")
>>> dataloader = DataLoader(ds, batch_size=32)

If the dataset is split in several shards (i.e. if the dataset consists of multiple data files), then you can stream in parallel using num_workers:

>>> ds = load_dataset("c4", "en", streaming=True, split="train").with_format("torch")
>>> ds.n_shards
1024
>>> dataloader = DataLoader(ds, batch_size=32, num_workers=4)

In this case each worker will be given a subset of the list of shards to stream from.

Datasets