Pandas

Pandas is a widely used Python data analysis toolkit. Since it uses fsspec to read and write remote data, you can use the Hugging Face paths (hf://) to read and write data on the Hub:

First you need to Login with your Hugging Face account, for example using:

huggingface-cli login

Then you can Create a dataset repository, for example using:

from huggingface_hub import HfApi

HfApi().create_repo(repo_id="username/my_dataset", repo_type="dataset")

Finally, you can use Hugging Face paths in Pandas:

import pandas as pd

df.to_parquet("hf://datasets/username/my_dataset/data.parquet")

# or write in separate files if the dataset has train/validation/test splits
df_train.to_parquet("hf://datasets/username/my_dataset/train.parquet")
df_valid.to_parquet("hf://datasets/username/my_dataset/validation.parquet")
df_test .to_parquet("hf://datasets/username/my_dataset/test.parquet")

This creates a dataset repository username/my_dataset containing your Pandas dataset in Parquet format. You can reload it later:

import pandas as pd

df = pd.read_parquet("hf://datasets/username/my_dataset/data.parquet")

# or read from separate files if the dataset has train/validation/test splits
df_train = pd.read_parquet("hf://datasets/username/my_dataset/train.parquet")
df_valid = pd.read_parquet("hf://datasets/username/my_dataset/validation.parquet")
df_test  = pd.read_parquet("hf://datasets/username/my_dataset/test.parquet")

To have more information on the Hugging Face paths and how they are implemented, please refer to the the client library’s documentation on the HfFileSystem.

< > Update on GitHub