Datasets 🤝 Arrow

What is Arrow?

Arrow enables large amounts of data to be processed and moved quickly. It is a specific data format that stores data in a columnar memory layout. This provides several significant advantages:

Arrow’s standard format allows zero-copy reads which removes virtually all serialization overhead.
Arrow is language-agnostic so it supports different programming languages.
Arrow is column-oriented so it is faster at querying and processing slices or columns of data.
Arrow allows for copy-free hand-offs to standard machine learning tools such as NumPy, Pandas, PyTorch, and TensorFlow.
Arrow supports many, possibly nested, column types.

Memory-mapping

🤗 Datasets uses Arrow for its local caching system. It allows datasets to be backed by an on-disk cache, which is memory-mapped for fast lookup. This architecture allows for large datasets to be used on machines with relatively small device memory.

For example, loading the full English Wikipedia dataset only takes a few MB of RAM:

>>> import os; import psutil; import timeit
>>> from datasets import load_dataset

# Process.memory_info is expressed in bytes, so convert to megabytes 
>>> mem_before = psutil.Process(os.getpid()).memory_info().rss / (1024 * 1024)
>>> wiki = load_dataset("wikipedia", "20220301.en", split="train")
>>> mem_after = psutil.Process(os.getpid()).memory_info().rss / (1024 * 1024)

>>> print(f"RAM memory used: {(mem_after - mem_before)} MB")
RAM memory used: 50 MB

This is possible because the Arrow data is actually memory-mapped from disk, and not loaded in memory. Memory-mapping allows access to data on disk, and leverages virtual memory capabilities for fast lookups.

Performance

Iterating over a memory-mapped dataset using Arrow is fast. Iterating over Wikipedia on a laptop gives you speeds of 1-3 Gbit/s:

>>> s = """batch_size = 1000
... for i in range(0, len(wiki), batch_size):
...     batch = wiki[i:i + batch_size]
... """

>>> time = timeit.timeit(stmt=s, number=1, globals=globals())
>>> print(f"Time to iterate over the {wiki.dataset_size >> 30} GB dataset: {time:.1f} sec, "
...       f"ie. {float(wiki.dataset_size >> 27)/time:.1f} Gb/s")
Time to iterate over the 18 GB dataset: 70.5 sec, ie. 2.1 Gb/s

You can obtain the best performance by accessing slices of data (or “batches”), in order to reduce the amount of lookups on disk.