DuckDB

DuckDB is an in-process SQL OLAP database management system. Because it supports fsspec filesystems for reading and writing remote data, you can use Hugging Face paths (hf://) to read and write data on the Hub.

First, log in with your Hugging Face account, for example using:

huggingface-cli login

Then create a dataset repository, for example using:

from huggingface_hub import HfApi

HfApi().create_repo(repo_id="username/my_dataset", repo_type="dataset")

Finally, you can use Hugging Face paths (hf://) in DuckDB:

>>> from huggingface_hub import HfFileSystem
>>> import duckdb

>>> fs = HfFileSystem()
>>> duckdb.register_filesystem(fs)
>>> duckdb.sql("COPY tbl TO 'hf://datasets/username/my_dataset/data.parquet' (FORMAT PARQUET);")

This creates a file data.parquet in the dataset repository username/my_dataset containing your dataset in Parquet format. You can reload it later:

>>> from huggingface_hub import HfFileSystem
>>> import duckdb

>>> fs = HfFileSystem()
>>> duckdb.register_filesystem(fs)
>>> df = duckdb.query("SELECT * FROM 'hf://datasets/username/my_dataset/data.parquet' LIMIT 10;").df()
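The snippets above assume that a table named tbl already exists in DuckDB. As a minimal sketch of the full round trip, the following defines a small sample table, writes it to the Hub, and reads it back. The helper names hf_dataset_path and round_trip, and the sample table contents, are hypothetical and chosen only for illustration; the sketch also assumes you are logged in, that the repository exists, and that pandas is installed for .df().

```python
def hf_dataset_path(repo_id: str, filename: str) -> str:
    """Build an hf:// path for a file in a dataset repository (hypothetical helper)."""
    return f"hf://datasets/{repo_id}/{filename}"


def round_trip(repo_id: str, filename: str = "data.parquet"):
    """Write a sample table to the Hub as Parquet, then read it back (hypothetical helper)."""
    # Imported lazily so the path helper works without these packages installed.
    import duckdb
    from huggingface_hub import HfFileSystem

    # Register the Hub filesystem so DuckDB can resolve hf:// paths.
    duckdb.register_filesystem(HfFileSystem())

    path = hf_dataset_path(repo_id, filename)

    # Sample data for illustration only; replace with your own table.
    duckdb.sql(
        "CREATE OR REPLACE TABLE tbl AS "
        "SELECT i AS id, i * 2 AS doubled FROM range(5) AS t(i)"
    )

    # Upload the table as a Parquet file, then query the uploaded file.
    duckdb.sql(f"COPY tbl TO '{path}' (FORMAT PARQUET);")
    return duckdb.sql(f"SELECT * FROM '{path}' LIMIT 10;").df()
```

Calling round_trip("username/my_dataset") performs network requests against the Hub, so it needs valid credentials and write access to that repository.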

For more information on Hugging Face paths and how they are implemented, refer to the client library’s documentation on HfFileSystem.
