Datasets-server documentation

List splits and configurations

Datasets typically have splits and may also have configurations. A split is a subset of the dataset, like train and test, that is used during a particular stage of training or evaluating a model. A configuration is a sub-dataset contained within a larger dataset. Configurations are especially common in multilingual speech datasets, where there may be a different configuration for each language. If you’re interested in learning more about splits and configurations, check out the conceptual guide on “Splits and configurations”!

This guide shows you how to use Datasets Server’s /splits endpoint to retrieve a dataset’s splits and configurations programmatically. Feel free to also try it out with Postman, RapidAPI, or ReDoc.

The /splits endpoint accepts the dataset name as its query parameter:

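import requests

# Replace the placeholder with your own Hugging Face access token.
API_TOKEN = "YOUR_API_TOKEN"
headers = {"Authorization": f"Bearer {API_TOKEN}"}
API_URL = "https://datasets-server.huggingface.co/splits?dataset=ibm/duorc"

def query():
    # Send a GET request to the /splits endpoint and parse the JSON body.
    response = requests.get(API_URL, headers=headers)
    return response.json()

data = query()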
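
Because the dataset name is passed as a query parameter, the request URL can also be assembled from a base URL and a parameter dictionary instead of being hard-coded. The following is a minimal sketch using only the Python standard library. The helper name build_splits_url is hypothetical and not part of Datasets Server, and the sketch assumes the server decodes percent-encoded query values in the usual way.

```python
from urllib.parse import urlencode

BASE_URL = "https://datasets-server.huggingface.co/splits"

def build_splits_url(dataset: str) -> str:
    # Hypothetical helper: percent-encodes the dataset name so that the
    # "/" in names like "ibm/duorc" is safe inside a query string.
    return f"{BASE_URL}?{urlencode({'dataset': dataset})}"

print(build_splits_url("ibm/duorc"))
```

The resulting string can be passed to requests.get in place of the hard-coded API_URL.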

The endpoint response is a JSON object containing a list of the dataset’s splits and configurations. For example, the ibm/duorc dataset has two configurations, each with train, validation, and test splits, for six entries in total:

{
  "splits": [
    { "dataset": "ibm/duorc", "config": "ParaphraseRC", "split": "train" },
    { "dataset": "ibm/duorc", "config": "ParaphraseRC", "split": "validation" },
    { "dataset": "ibm/duorc", "config": "ParaphraseRC", "split": "test" },
    { "dataset": "ibm/duorc", "config": "SelfRC", "split": "train" },
    { "dataset": "ibm/duorc", "config": "SelfRC", "split": "validation" },
    { "dataset": "ibm/duorc", "config": "SelfRC", "split": "test" }
  ],
  "pending": [],
  "failed": []
}
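
Since each entry in "splits" pairs one configuration with one split, it is often convenient to group the entries by configuration. The sketch below does this for the ibm/duorc response shown above. The function name splits_by_config is hypothetical, and the response is embedded as a literal so the example runs without a network call.

```python
from collections import defaultdict

# Sample response copied from the ibm/duorc example.
response = {
    "splits": [
        {"dataset": "ibm/duorc", "config": "ParaphraseRC", "split": "train"},
        {"dataset": "ibm/duorc", "config": "ParaphraseRC", "split": "validation"},
        {"dataset": "ibm/duorc", "config": "ParaphraseRC", "split": "test"},
        {"dataset": "ibm/duorc", "config": "SelfRC", "split": "train"},
        {"dataset": "ibm/duorc", "config": "SelfRC", "split": "validation"},
        {"dataset": "ibm/duorc", "config": "SelfRC", "split": "test"},
    ],
    "pending": [],
    "failed": [],
}

def splits_by_config(payload):
    # Hypothetical helper: map each configuration name to its list of
    # split names, preserving the order in which they appear.
    grouped = defaultdict(list)
    for entry in payload["splits"]:
        grouped[entry["config"]].append(entry["split"])
    return dict(grouped)

configs = splits_by_config(response)
for config, splits in configs.items():
    print(config, splits)
```

With a live request, the same function could be applied to the dictionary returned by query().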