# Datasets 🤝 Arrow

## What is Arrow?

[Arrow](https://arrow.apache.org/) enables large amounts of data to be processed and moved quickly. It is a specific data format that stores data in a columnar memory layout. This provides several significant advantages:

* Arrow's standard format allows [zero-copy reads](https://en.wikipedia.org/wiki/Zero-copy), which removes virtually all serialization overhead.
* Arrow is language-agnostic, so it supports different programming languages.
* Arrow is column-oriented, so it is faster at querying and processing slices or columns of data.
* Arrow allows for copy-free hand-offs to standard machine learning tools such as NumPy, Pandas, PyTorch, and TensorFlow.
* Arrow supports many, possibly nested, column types.

## Memory-mapping

🤗 Datasets uses Arrow for its local caching system. It allows datasets to be backed by an on-disk cache, which is memory-mapped for fast lookup. This architecture allows large datasets to be used on machines with relatively small device memory.

For example, loading the full English Wikipedia dataset only takes a few MB of RAM:

```python
>>> import os; import psutil; import timeit
>>> from datasets import load_dataset

# Process.memory_info is expressed in bytes, so right-shift by 20 bits to convert to megabytes
>>> mem_before = psutil.Process(os.getpid()).memory_info().rss >> 20
>>> wiki = load_dataset("wikipedia", "20200501.en", split='train')
>>> mem_after = psutil.Process(os.getpid()).memory_info().rss >> 20

>>> print(f"RAM memory used: {(mem_after - mem_before)} MB")
'RAM memory used: 9 MB'
```

This is possible because the Arrow data is actually memory-mapped from disk, and not loaded in memory. Memory-mapping allows access to data on disk, and leverages virtual memory capabilities for fast lookups.

## Performance

Iterating over a memory-mapped dataset using Arrow is fast. Iterating over Wikipedia on a laptop gives you speeds of 1-3 Gbit/s:

```python
>>> s = """batch_size = 1000
... for i in range(0, len(wiki), batch_size):
...     batch = wiki[i:i + batch_size]
... """
>>> time = timeit.timeit(stmt=s, number=1, globals=globals())
>>> print(f"Time to iterate over the {wiki.dataset_size >> 30} GB dataset: {time:.1f} sec, "
...       f"ie. {float(wiki.dataset_size >> 27)/time:.1f} Gb/s")
'Time to iterate over the 17 GB dataset: 85 sec, ie. 1.7 Gb/s'
```

You can obtain the best performance by accessing slices of data (or "batches"), which reduces the number of lookups on disk.
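
To see the effect of batching for yourself, you can time row-by-row access against slice access over the same examples. The snippet below is a minimal sketch that reuses the `wiki` dataset and the `timeit` import from the examples above; the sample size `n` and the batch size of 1000 are arbitrary choices, and the absolute timings depend on your hardware.

```python
>>> n = 10_000  # arbitrary number of examples to read for the comparison

# Row-by-row: one query per example
>>> per_row = """
... for i in range(n):
...     example = wiki[i]
... """

# Batched: one query per slice of 1000 examples
>>> per_batch = """
... for i in range(0, n, 1000):
...     batch = wiki[i:i + 1000]
... """

>>> row_time = timeit.timeit(stmt=per_row, number=1, globals=globals())
>>> batch_time = timeit.timeit(stmt=per_batch, number=1, globals=globals())
>>> print(f"row-by-row: {row_time:.1f} sec, batched: {batch_time:.1f} sec")
```

On most machines the batched loop finishes considerably faster, since each slice query reads a contiguous region of the memory-mapped Arrow file instead of issuing a separate lookup per row.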