Overview
The dataset viewer automatically converts and publishes public datasets smaller than 5GB on the Hub as Parquet files. If a dataset is already in Parquet format, it is published as is. Parquet files are column-based: readers can load just the columns they need instead of whole rows, which makes the format shine when you're working with big data.
For private datasets, the feature is available if the repository is owned by a PRO user or an Enterprise Hub organization.
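The dataset viewer exposes a `/parquet` endpoint that lists the converted Parquet files for a given dataset. The sketch below, using only Python's standard library, shows one way to query it; the dataset name and the `parquet_files` response field reflect the public API as documented, but treat the exact response shape as an assumption to verify against the API docs.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

API = "https://datasets-server.huggingface.co/parquet"

def parquet_listing_url(dataset: str) -> str:
    """Build the dataset viewer URL that lists a dataset's Parquet files."""
    return f"{API}?dataset={quote(dataset, safe='')}"

def list_parquet_files(dataset: str) -> list:
    """Fetch the Parquet file listing for a dataset (requires network access)."""
    with urlopen(parquet_listing_url(dataset)) as resp:
        return json.load(resp)["parquet_files"]

# Example (hits the network, so it is left as a comment):
# for f in list_parquet_files("ibm/duorc"):
#     print(f["config"], f["split"], f["url"])
```

Each entry in the listing includes a direct download URL, which any of the libraries below can read from.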
There are several different libraries you can use to work with the published Parquet files:
- ClickHouse, a column-oriented database management system for online analytical processing
- cuDF, a Python GPU DataFrame library
- DuckDB, a high-performance SQL database for analytical queries
- Pandas, a data analysis tool for working with data structures
- Polars, a Rust-based DataFrame library
- mlcroissant, a library for loading datasets from Croissant metadata
- PySpark, the Python API for Apache Spark