Dataset streaming lets you get started with a dataset without waiting for the entire dataset to download. The data is downloaded progressively as you iterate over the dataset. This is especially helpful when:

  • You don’t want to wait for an extremely large dataset to download.

  • The dataset size exceeds the amount of disk space on your computer.


For example, the English split of the OSCAR dataset is 1.2 terabytes, but you can use it instantly with streaming. Stream a dataset by setting streaming=True in datasets.load_dataset() as shown below:

>>> from datasets import load_dataset
>>> dataset = load_dataset('oscar', "unshuffled_deduplicated_en", split='train', streaming=True)
>>> print(next(iter(dataset)))
{'text': 'Mtendere Village was inspired by the vision of Chief Napoleon Dzombe, which he shared with John Blanchard during his first visit to Malawi. Chief Napoleon conveyed the desperate need for a program to intervene and care for the orphans and vulnerable children (OVC) in Malawi, and John committed to help...

Loading a dataset in streaming mode returns a datasets.IterableDataset instead of the classic datasets.Dataset object. This dataset type has its own set of processing methods, described below.


A datasets.IterableDataset is useful for iterative jobs like training a model. You shouldn’t use a datasets.IterableDataset for jobs that require random access to examples, because the only way to reach an example is to iterate over the dataset with a for loop. Getting the last example in an iterable dataset, for instance, requires iterating over all the previous examples.
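
The cost of sequential access can be illustrated without the library. The sketch below uses a plain Python generator as a hypothetical stand-in for a streamed dataset (stream_examples and nth_example are illustrative names, not part of 🤗 Datasets) and shows that reaching one example means consuming every example before it:

```python
from itertools import islice


def stream_examples(n):
    """Hypothetical stand-in for a streamed dataset: yields one example at a time."""
    for i in range(n):
        yield {"id": i, "text": f"example {i}"}


def nth_example(stream, index):
    """Return the example at `index` by consuming every example before it."""
    return next(islice(stream, index, None))


# Reaching the last example walks through all 999 examples that precede it.
last = nth_example(stream_examples(1000), 999)
print(last)  # {'id': 999, 'text': 'example 999'}
```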


As with a regular datasets.Dataset object, you can shuffle a datasets.IterableDataset with datasets.IterableDataset.shuffle().

The buffer_size argument controls the size of the buffer to randomly sample examples from. Let’s say your dataset has one million examples, and you set the buffer_size to ten thousand. datasets.IterableDataset.shuffle() fills the buffer with the first ten thousand examples and randomly selects examples from it. Each selected example is replaced in the buffer with the next new example from the dataset.

>>> from datasets import load_dataset
>>> dataset = load_dataset('oscar', "unshuffled_deduplicated_en", split='train', streaming=True)
>>> shuffled_dataset = dataset.shuffle(buffer_size=10_000, seed=42)
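
To make the buffer mechanism concrete, here is a minimal pure-Python sketch of a shuffle buffer. It is an illustration of the idea described above, not the library's implementation; buffer_shuffle is a hypothetical helper, and flushing the remaining buffer in random order at the end is an assumption of this sketch:

```python
import random


def buffer_shuffle(examples, buffer_size, seed=None):
    """Approximately shuffle a stream using a fixed-size buffer (illustrative only)."""
    rng = random.Random(seed)
    buffer = []
    for example in examples:
        if len(buffer) < buffer_size:
            # Fill the buffer with the first `buffer_size` examples.
            buffer.append(example)
            continue
        # Pick a random slot, emit its example, and refill the slot
        # with the new example that just arrived from the stream.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = example
    # Stream exhausted: emit whatever is left in the buffer in random order.
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(buffer_shuffle(range(100), buffer_size=10, seed=42))
print(shuffled[:10])
```

Because an example can only be emitted once it has entered the buffer, the shuffle is local: a small buffer_size keeps examples close to their original position, while a larger one mixes the stream more thoroughly at the cost of memory.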


datasets.IterableDataset.shuffle() will also shuffle the order of the shards if the dataset is sharded into multiple sets.


Sometimes you may want to reshuffle the dataset after each epoch. This will require you to set a different seed for each epoch. Use datasets.IterableDataset.set_epoch() in between epochs to tell the dataset what epoch you’re on.

Your seed effectively becomes: initial seed + current epoch.

>>> for epoch in range(epochs):
...     shuffled_dataset.set_epoch(epoch)
...     for example in shuffled_dataset:
...         ...
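
The effect of deriving the seed from the epoch can be sketched with the standard library alone. The helper below (epoch_order, a hypothetical name) is not how the library shuffles; it only shows that seeding with initial seed + current epoch gives an order that changes between epochs yet stays reproducible:

```python
import random


def epoch_order(num_examples, seed, epoch):
    """Return a shuffled index order seeded with `seed + epoch` (illustrative only)."""
    rng = random.Random(seed + epoch)
    order = list(range(num_examples))
    rng.shuffle(order)
    return order


# One order per epoch, all derived from the same initial seed.
orders = [epoch_order(20, seed=42, epoch=epoch) for epoch in range(3)]
for epoch, order in enumerate(orders):
    print(epoch, order[:5])
```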

Split dataset

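You can split your dataset in one of two ways:

  • datasets.IterableDataset.take() returns the first n examples in a dataset.

  • datasets.IterableDataset.skip() omits the first n examples in a dataset and returns the remaining examples.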

>>> dataset = load_dataset('oscar', "unshuffled_deduplicated_en", split='train', streaming=True)
>>> dataset_head = dataset.take(2)
>>> list(dataset_head)
[{'id': 0, 'text': 'Mtendere Village was...'}, {'id': 1, 'text': 'Lily James cannot fight the music...'}]
>>> train_dataset = shuffled_dataset.skip(1000)


take() and skip() prevent future calls to shuffle() because they lock in the order of the shards. You should shuffle your dataset before splitting it.
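
Conceptually, both kinds of split are slices of a stream. The following sketch mimics them with itertools.islice on a plain generator; take and skip here are hypothetical helpers written for illustration, not the library methods:

```python
from itertools import islice


def stream_examples(n):
    """Hypothetical stand-in for a streamed dataset."""
    for i in range(n):
        yield {"id": i}


def take(stream, n):
    """Yield only the first `n` examples of the stream."""
    return islice(stream, n)


def skip(stream, n):
    """Consume the first `n` examples, then yield the remaining ones."""
    return islice(stream, n, None)


# A generator is single-pass, so each split starts from a fresh stream.
validation = list(take(stream_examples(10), 3))
train = list(skip(stream_examples(10), 3))
print([ex["id"] for ex in validation])  # [0, 1, 2]
print([ex["id"] for ex in train])       # [3, 4, 5, 6, 7, 8, 9]
```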


datasets.interleave_datasets() can combine a datasets.IterableDataset with other datasets. The combined dataset returns alternating examples from each of the original datasets.

>>> from datasets import interleave_datasets
>>> from itertools import islice
>>> en_dataset = load_dataset('oscar', "unshuffled_deduplicated_en", split='train', streaming=True)
>>> fr_dataset = load_dataset('oscar', "unshuffled_deduplicated_fr", split='train', streaming=True)

>>> multilingual_dataset = interleave_datasets([en_dataset, fr_dataset])
>>> print(list(islice(multilingual_dataset, 2)))
[{'text': 'Mtendere Village was inspired by the vision...'}, {'text': "Média de débat d'idées, de culture et de littérature...."}]

Define sampling probabilities for each of the original datasets for more control over how each of them is sampled and combined. Set the probabilities argument with your desired sampling probabilities:

>>> multilingual_dataset_with_oversampling = interleave_datasets([en_dataset, fr_dataset], probabilities=[0.8, 0.2], seed=42)
>>> print(list(islice(multilingual_dataset_with_oversampling, 2)))
[{'text': 'Mtendere Village was inspired by the vision...'}, {'text': 'Lily James cannot fight the music...'}]

Around 80% of the examples in the final dataset come from en_dataset, and 20% from fr_dataset.
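
The sampling behavior can be approximated in a few lines of plain Python. This is a sketch of probability-weighted interleaving under the assumption that iteration stops as soon as the chosen source runs out; interleave is a hypothetical helper, not the library function:

```python
import random
from itertools import islice, repeat


def interleave(sources, probabilities, seed=None):
    """Yield examples by picking a source at random for each step (illustrative only)."""
    rng = random.Random(seed)
    iterators = [iter(source) for source in sources]
    while True:
        # Choose which source provides the next example, weighted by probability.
        (chosen,) = rng.choices(iterators, weights=probabilities, k=1)
        try:
            yield next(chosen)
        except StopIteration:
            # Assumption of this sketch: stop once the chosen source is exhausted.
            return


# Two endless toy sources labelled by language.
mixed = interleave([repeat("en"), repeat("fr")], probabilities=[0.8, 0.2], seed=42)
sample = list(islice(mixed, 10_000))
en_share = sample.count("en") / len(sample)
print(f"share of 'en' examples: {en_share:.2f}")
```

With enough samples, the observed share of each source approaches its sampling probability, which is why the proportions above are approximate rather than exact.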


Remove columns on-the-fly with datasets.IterableDataset.remove_columns(). Specify the name of the column to remove:

>>> from datasets import load_dataset
>>> dataset = load_dataset('mc4', 'en', streaming=True, split='train')
>>> dataset = dataset.remove_columns('timestamp')


Similar to the datasets.Dataset.map() function for a regular datasets.Dataset, 🤗 Datasets features datasets.IterableDataset.map() for processing datasets.IterableDatasets. datasets.IterableDataset.map() applies processing on-the-fly when examples are streamed.

It allows you to apply a processing function to each example in a dataset, independently or in batches. This function can even create new rows and columns.

The following example demonstrates how to tokenize a datasets.IterableDataset. The function needs to accept and output a dict:

>>> from datasets import load_dataset
>>> from transformers import AutoTokenizer
>>> dataset = load_dataset("mc4", "en", streaming=True, split="train")
>>> tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
>>> def encode(examples):
...     return tokenizer(examples['text'], truncation=True, padding='max_length')
>>> dataset = dataset.map(encode, batched=True)
>>> next(iter(dataset))
{'input_ids': [101, 8466, 1018, 1010, 4029, 2475, 2062, 18558, 3100, 2061, ...,1106, 3739, 102],
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..., 1, 1]}
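
To see how a batched function can create new rows and columns, consider the pure-Python sketch below. batched_map and split_words are hypothetical helpers; unlike the library, this simplified version replaces the input columns with whatever the function returns. The function receives a dict of column lists and returns a dict of column lists, which may be longer than its input:

```python
from itertools import islice


def batched_map(examples, function, batch_size=1000):
    """Lazily apply `function` to batches of a stream (illustrative only)."""
    iterator = iter(examples)
    while True:
        rows = list(islice(iterator, batch_size))
        if not rows:
            return
        # Convert a list of row dicts into a dict of column lists.
        batch = {key: [row[key] for row in rows] for key in rows[0]}
        output = function(batch)
        # Convert the output columns back into rows; the row count may differ.
        columns = list(output)
        for values in zip(*(output[column] for column in columns)):
            yield dict(zip(columns, values))


def split_words(batch):
    """Turn each text into one row per word and add a `length` column."""
    words = [word for text in batch["text"] for word in text.split()]
    return {"word": words, "length": [len(word) for word in words]}


rows = [{"text": "hello streaming world"}, {"text": "lazy map"}]
result = list(batched_map(rows, split_words, batch_size=2))
print(len(result))  # 5
print(result[1])    # {'word': 'streaming', 'length': 9}
```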

Stream in a training loop

datasets.IterableDataset can be integrated into a training loop. First, shuffle the dataset:

>>> buffer_size, seed = 10_000, 42
>>> dataset = dataset.shuffle(buffer_size=buffer_size, seed=seed)

Lastly, create a simple training loop and start training:

>>> import torch
>>> from torch.utils.data import DataLoader
>>> from transformers import AutoModelForMaskedLM, DataCollatorForLanguageModeling
>>> from tqdm import tqdm
>>> dataset = dataset.with_format("torch")
>>> dataloader = DataLoader(dataset, collate_fn=DataCollatorForLanguageModeling(tokenizer))
>>> device = 'cuda' if torch.cuda.is_available() else 'cpu'
>>> model = AutoModelForMaskedLM.from_pretrained("distilbert-base-uncased")
>>> model.train().to(device)
>>> optimizer = torch.optim.AdamW(params=model.parameters(), lr=1e-5)
>>> for epoch in range(3):
...     dataset.set_epoch(epoch)
...     for i, batch in enumerate(tqdm(dataloader, total=5)):
...         if i == 5:
...             break
...         batch = {k: v.to(device) for k, v in batch.items()}
...         outputs = model(**batch)
...         loss = outputs[0]
...         loss.backward()
...         optimizer.step()
...         optimizer.zero_grad()
...         if i % 10 == 0:
...             print(f"loss: {loss}")