List Parquet files
Datasets can be published in any format (CSV, JSONL, directories of images, etc.) to the Hub, and they are easily accessed with the 🤗 Datasets library. For a more performant experience (especially when it comes to large datasets), Datasets Server automatically converts every public dataset to the Parquet format. The Parquet files are published to the Hub on a specific ref/convert/parquet
branch (like this amazon_polarity
branch for example).
This guide shows you how to use Datasets Server’s /parquet
endpoint to retrieve a list of a dataset’s files converted to Parquet. Feel free to also try it out with Postman, RapidAPI, or ReDoc.
The /parquet
endpoint accepts the dataset name as its query parameter:
import requests
headers = {"Authorization": f"Bearer {API_TOKEN}"}
API_URL = "https://datasets-server.huggingface.co/parquet?dataset=duorc"
def query():
response = requests.get(API_URL, headers=headers)
return response.json()
data = query()
The endpoint response is a JSON containing a list of the dataset’s files in the Parquet format. For example, the duorc
dataset has six Parquet files, which corresponds to the test
, train
and validation
splits of its two configurations, ParaphraseRC
and SelfRC
(see the List splits and configurations guide for more details about splits and configurations).
The endpoint also gives the filename and size of each file:
{
"parquet_files": [
{
"dataset": "duorc",
"config": "ParaphraseRC",
"split": "test",
"url": "https://huggingface.co/datasets/duorc/resolve/refs%2Fconvert%2Fparquet/ParaphraseRC/duorc-test.parquet",
"filename": "duorc-test.parquet",
"size": 6136590
},
{
"dataset": "duorc",
"config": "ParaphraseRC",
"split": "train",
"url": "https://huggingface.co/datasets/duorc/resolve/refs%2Fconvert%2Fparquet/ParaphraseRC/duorc-train.parquet",
"filename": "duorc-train.parquet",
"size": 26005667
},
{
"dataset": "duorc",
"config": "ParaphraseRC",
"split": "validation",
"url": "https://huggingface.co/datasets/duorc/resolve/refs%2Fconvert%2Fparquet/ParaphraseRC/duorc-validation.parquet",
"filename": "duorc-validation.parquet",
"size": 5566867
},
{
"dataset": "duorc",
"config": "SelfRC",
"split": "test",
"url": "https://huggingface.co/datasets/duorc/resolve/refs%2Fconvert%2Fparquet/SelfRC/duorc-test.parquet",
"filename": "duorc-test.parquet",
"size": 3035735
},
{
"dataset": "duorc",
"config": "SelfRC",
"split": "train",
"url": "https://huggingface.co/datasets/duorc/resolve/refs%2Fconvert%2Fparquet/SelfRC/duorc-train.parquet",
"filename": "duorc-train.parquet",
"size": 14851719
},
{
"dataset": "duorc",
"config": "SelfRC",
"split": "validation",
"url": "https://huggingface.co/datasets/duorc/resolve/refs%2Fconvert%2Fparquet/SelfRC/duorc-validation.parquet",
"filename": "duorc-validation.parquet",
"size": 3114389
}
]
}
Sharded Parquet files
Big datasets are partitioned into Parquet files (shards) of about 1GB each. The filename contains the name of the dataset, the split, the shard index, and the total number of shards (dataset-name-train-0000-of-0004.parquet
). For example, the train
split of the amazon_polarity
dataset is partitioned into 4 shards:
{
"parquet_files": [
{
"dataset": "amazon_polarity",
"config": "amazon_polarity",
"split": "test",
"url": "https://huggingface.co/datasets/amazon_polarity/resolve/refs%2Fconvert%2Fparquet/amazon_polarity/amazon_polarity-test.parquet",
"filename": "amazon_polarity-test.parquet",
"size": 117422359
},
{
"dataset": "amazon_polarity",
"config": "amazon_polarity",
"split": "train",
"url": "https://huggingface.co/datasets/amazon_polarity/resolve/refs%2Fconvert%2Fparquet/amazon_polarity/amazon_polarity-train-00000-of-00004.parquet",
"filename": "amazon_polarity-train-00000-of-00004.parquet",
"size": 320281121
},
{
"dataset": "amazon_polarity",
"config": "amazon_polarity",
"split": "train",
"url": "https://huggingface.co/datasets/amazon_polarity/resolve/refs%2Fconvert%2Fparquet/amazon_polarity/amazon_polarity-train-00001-of-00004.parquet",
"filename": "amazon_polarity-train-00001-of-00004.parquet",
"size": 320627716
},
{
"dataset": "amazon_polarity",
"config": "amazon_polarity",
"split": "train",
"url": "https://huggingface.co/datasets/amazon_polarity/resolve/refs%2Fconvert%2Fparquet/amazon_polarity/amazon_polarity-train-00002-of-00004.parquet",
"filename": "amazon_polarity-train-00002-of-00004.parquet",
"size": 320587882
},
{
"dataset": "amazon_polarity",
"config": "amazon_polarity",
"split": "train",
"url": "https://huggingface.co/datasets/amazon_polarity/resolve/refs%2Fconvert%2Fparquet/amazon_polarity/amazon_polarity-train-00003-of-00004.parquet",
"filename": "amazon_polarity-train-00003-of-00004.parquet",
"size": 66515954
}
],
"pending": [],
"failed": []}
To read and query the Parquet files, take a look at the Query datasets from Datasets Server guide.