Dask

Dask is a parallel and distributed computing library that scales the existing Python and PyData ecosystem. Since it uses fsspec to read and write remote data, you can use Hugging Face paths (hf://) to read and write data on the Hub:

First, log in with your Hugging Face account, for example using:

huggingface-cli login
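Alternatively, you can log in from Python with the login helper from huggingface_hub, which prompts for an access token if you don't pass one:

from huggingface_hub import login

login()  # prompts for your access token; you can also pass token="..." directly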

Then you can create a dataset repository, for example using:

from huggingface_hub import HfApi

HfApi().create_repo(repo_id="username/my_dataset", repo_type="dataset")
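The snippets below assume you already have Dask DataFrames to write (df, and optionally df_train, df_valid, df_test). If you want to try the workflow end to end, here is a minimal sketch that builds a small Dask DataFrame from pandas; the column names are placeholders:

import dask.dataframe as dd
import pandas as pd

# hypothetical example data; replace with your own
pdf = pd.DataFrame({"text": ["a", "b", "c", "d"], "label": [0, 1, 0, 1]})
df = dd.from_pandas(pdf, npartitions=2)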

Finally, you can use Hugging Face paths in Dask:

import dask.dataframe as dd

# write the Dask DataFrame to the repository as Parquet files
df.to_parquet("hf://datasets/username/my_dataset")

# or write in separate directories if the dataset has train/validation/test splits
df_train.to_parquet("hf://datasets/username/my_dataset/train")
df_valid.to_parquet("hf://datasets/username/my_dataset/validation")
df_test.to_parquet("hf://datasets/username/my_dataset/test")
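Each to_parquet call writes one Parquet file per Dask partition into the target directory (typically named part.0.parquet, part.1.parquet, and so on).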

This creates a dataset repository username/my_dataset containing your Dask dataset in Parquet format. You can reload it later:

import dask.dataframe as dd

# load the Parquet files back into a Dask DataFrame
df = dd.read_parquet("hf://datasets/username/my_dataset")

# or read from separate directories if the dataset has train/validation/test splits
df_train = dd.read_parquet("hf://datasets/username/my_dataset/train")
df_valid = dd.read_parquet("hf://datasets/username/my_dataset/validation")
df_test = dd.read_parquet("hf://datasets/username/my_dataset/test")
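Dask is lazy, so no data is downloaded until you actually compute. As a short sketch, assuming your dataset has a label column (a placeholder name here):

# head() only fetches enough data for a small preview
print(df_train.head())

# .compute() triggers the full, parallel computation
print(df_train["label"].value_counts().compute())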

For more information on Hugging Face paths and how they are implemented, please refer to the client library’s documentation on HfFileSystem.
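For example, since hf:// paths are backed by HfFileSystem, you can browse a repository with the same fsspec-style API; a minimal sketch:

from huggingface_hub import HfFileSystem

fs = HfFileSystem()

# list the Parquet files that Dask wrote to the repository
print(fs.ls("datasets/username/my_dataset", detail=False))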
