List splits and configurations
Datasets typically have splits and may also have configurations. A split is a subset of the dataset, like train and test, that is used during a particular stage of training or evaluating a model. A configuration is a sub-dataset contained within a larger dataset. Configurations are especially common in multilingual speech datasets, where there may be a different configuration for each language. If you’re interested in learning more about splits and configurations, check out the Load a dataset from the Hub tutorial!
This guide shows you how to use Datasets Server’s /splits endpoint to retrieve a dataset’s splits and configurations programmatically. Feel free to also try it out with Postman, RapidAPI, or ReDoc.
The /splits endpoint accepts the dataset name as its query parameter:
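import requests

# Placeholder: set this to your own Hugging Face access token
API_TOKEN = "..."
headers = {"Authorization": f"Bearer {API_TOKEN}"}
API_URL = "https://datasets-server.huggingface.co/splits?dataset=duorc"

def query():
    # Send a GET request to the /splits endpoint and parse the JSON body
    response = requests.get(API_URL, headers=headers)
    return response.json()

data = query()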
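If you want to query datasets other than duorc, the URL can be built from the dataset name instead of being hard-coded. The following is a minimal sketch, not part of the Datasets Server API: the helper name build_splits_url is hypothetical, and it assumes the only query parameter needed is dataset, as described above. It uses the standard library's urlencode so that names containing a slash (for example a namespaced dataset) are escaped safely.

```python
from urllib.parse import urlencode

# Base URL taken from the example above
BASE_URL = "https://datasets-server.huggingface.co/splits"

def build_splits_url(dataset):
    """Return the /splits URL for a dataset name (hypothetical helper)."""
    # urlencode percent-escapes reserved characters such as "/"
    return f"{BASE_URL}?{urlencode({'dataset': dataset})}"

url = build_splits_url("duorc")
print(url)
```

The resulting string can be passed to requests.get in place of API_URL.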
The endpoint response is a JSON object containing a list of the dataset’s splits and configurations. For example, the duorc dataset has six splits and two configurations:
{
  "splits": [
    {
      "dataset": "duorc",
      "config": "SelfRC",
      "split": "train",
      "num_bytes": 239852925,
      "num_examples": 60721
    },
    {
      "dataset": "duorc",
      "config": "SelfRC",
      "split": "validation",
      "num_bytes": 51662575,
      "num_examples": 12961
    },
    {
      "dataset": "duorc",
      "config": "SelfRC",
      "split": "test",
      "num_bytes": 49142766,
      "num_examples": 12559
    },
    {
      "dataset": "duorc",
      "config": "ParaphraseRC",
      "split": "train",
      "num_bytes": 496683105,
      "num_examples": 69524
    },
    {
      "dataset": "duorc",
      "config": "ParaphraseRC",
      "split": "validation",
      "num_bytes": 106510545,
      "num_examples": 15591
    },
    {
      "dataset": "duorc",
      "config": "ParaphraseRC",
      "split": "test",
      "num_bytes": 115215816,
      "num_examples": 15857
    }
  ]
}
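Since every entry in the splits list carries its own config field, the flat list can be regrouped by configuration. The sketch below is one way to do that, assuming the response has the shape shown above. The function names splits_by_config and examples_per_config are hypothetical, and the sample dictionary is a trimmed copy of the duorc response (num_bytes omitted) so the example runs without a network call.

```python
from collections import defaultdict

def splits_by_config(response):
    """Map each configuration name to the list of its split names."""
    grouped = defaultdict(list)
    for entry in response["splits"]:
        grouped[entry["config"]].append(entry["split"])
    return dict(grouped)

def examples_per_config(response):
    """Sum num_examples over all splits of each configuration."""
    totals = defaultdict(int)
    for entry in response["splits"]:
        totals[entry["config"]] += entry["num_examples"]
    return dict(totals)

# Trimmed copy of the duorc response shown above (num_bytes omitted)
sample = {
    "splits": [
        {"dataset": "duorc", "config": "SelfRC", "split": "train", "num_examples": 60721},
        {"dataset": "duorc", "config": "SelfRC", "split": "validation", "num_examples": 12961},
        {"dataset": "duorc", "config": "SelfRC", "split": "test", "num_examples": 12559},
        {"dataset": "duorc", "config": "ParaphraseRC", "split": "train", "num_examples": 69524},
        {"dataset": "duorc", "config": "ParaphraseRC", "split": "validation", "num_examples": 15591},
        {"dataset": "duorc", "config": "ParaphraseRC", "split": "test", "num_examples": 15857},
    ]
}

grouped = splits_by_config(sample)
totals = examples_per_config(sample)
for config, splits in grouped.items():
    print(config, splits, totals[config])
```

In practice, the parsed JSON returned by the request would be passed in place of sample.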