Datasets 🤝 Arrow

What is Arrow?

Arrow enables large amounts of data to be processed and moved quickly. It is a specific data format that stores data in a columnar memory layout. This provides several significant advantages:

  • Arrow’s standard format allows zero-copy reads which removes virtually all serialization overhead.

  • Arrow is language-agnostic so it supports different programming languages.

  • Arrow is column-oriented so it is faster at querying and processing slices or columns of data.

  • Arrow allows for copy-free hand-offs to standard machine learning tools such as NumPy, Pandas, PyTorch, and TensorFlow.

  • Arrow supports many, possibly nested, column types.

Memory-mapping

🤗 Datasets uses Arrow for its local caching system. It allows datasets to be backed by an on-disk cache, which is memory-mapped for fast lookup. This architecture allows for large datasets to be used on machines with relatively small device memory.

For example, loading the full English Wikipedia dataset only takes a few MB of RAM:

>>> import os; import psutil; import timeit
>>> from datasets import load_dataset

# Process.memory_info is expressed in bytes, so convert to megabytes 
>>> mem_before = psutil.Process(os.getpid()).memory_info().rss  / (1024 * 1024)
>>> wiki = load_dataset("wikipedia", "20200501.en", split='train')
>>> mem_after = psutil.Process(os.getpid()).memory_info().rss >> 20

>>> print(f"RAM memory used: {(mem_after - mem_before)} MB")
'RAM memory used: 9 MB'

This is possible because the Arrow data is actually memory-mapped from disk, and not loaded in memory. Memory-mapping allows access to data on disk, and leverages virtual memory capabilities for fast lookups.

Performance

Iterating over a memory-mapped dataset using Arrow is fast. Iterating over Wikipedia on a laptop gives you speeds of 1-3 Gbit/s:

>>> s = """batch_size = 1000
... for i in range(0, len(wiki), batch_size):
...     batch = wiki[i:i + batch_size]
... """

>>> time = timeit.timeit(stmt=s, number=1, globals=globals())
>>> print(f"Time to iterate over the {wiki.dataset_size >> 30} GB dataset: {time:.1f} sec, "
...       f"ie. {float(wiki.dataset_size >> 27)/time:.1f} Gb/s")
'Time to iterate over the 17 GB dataset: 85 sec, ie. 1.7 Gb/s'

You can obtain the best performance by accessing slices of data (or “batches”), in order to reduce the amount of lookups on disk.