DuckDB
DuckDB is an in-process SQL OLAP database management system.
Because its Python client supports fsspec filesystems for reading and writing remote data, you can use Hugging Face paths (hf://) to read and write data on the Hub.
First, log in with your Hugging Face account, for example using:
huggingface-cli login
Then create a dataset repository, for example using:
from huggingface_hub import HfApi
HfApi().create_repo(repo_id="username/my_dataset", repo_type="dataset")
Finally, you can use Hugging Face paths in DuckDB:
>>> from huggingface_hub import HfFileSystem
>>> import duckdb
>>> fs = HfFileSystem()
>>> duckdb.register_filesystem(fs)
>>> duckdb.sql("COPY tbl TO 'hf://datasets/username/my_dataset/data.parquet' (FORMAT PARQUET);")
This creates a file data.parquet in the dataset repository username/my_dataset containing your dataset in Parquet format.
You can reload it later:
>>> from huggingface_hub import HfFileSystem
>>> import duckdb
>>> fs = HfFileSystem()
>>> duckdb.register_filesystem(fs)
>>> df = duckdb.query("SELECT * FROM 'hf://datasets/username/my_dataset/data.parquet' LIMIT 10;").df()
For more information on Hugging Face paths and how they are implemented, please refer to the client library’s documentation on the HfFileSystem.