Dataset viewer documentation

List splits and subsets

Datasets typically have splits and may also have subsets. A split is a portion of the dataset, such as train or test, that is used during a particular stage of training or evaluating a model. A subset (also called a configuration) is a sub-dataset contained within a larger dataset. Subsets are especially common in multilingual speech datasets, where there may be a different subset for each language. If you’re interested in learning more about splits and subsets, check out the conceptual guide on “Splits and subsets”!

This guide shows you how to use the dataset viewer’s /splits endpoint to retrieve a dataset’s splits and subsets programmatically. Feel free to also try it out with Postman, RapidAPI, or ReDoc.

The /splits endpoint accepts the dataset name as its query parameter:

Python
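import requests

# Placeholder: replace with your own Hugging Face access token.
API_TOKEN = "YOUR_HF_TOKEN"
headers = {"Authorization": f"Bearer {API_TOKEN}"}
API_URL = "https://datasets-server.huggingface.co/splits?dataset=ibm/duorc"

def query():
    # Send a GET request to the /splits endpoint and decode the JSON body.
    response = requests.get(API_URL, headers=headers)
    return response.json()

data = query()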
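
Because the dataset name is passed as a query parameter, it can help to build the URL from its parts rather than hardcoding it. The sketch below is one possible way to do that with the standard library. The names `BASE_URL` and `build_splits_url` are hypothetical helpers introduced here for illustration and are not part of the dataset viewer API.

```python
from urllib.parse import urlencode

# Base address of the /splits endpoint, taken from the example above.
BASE_URL = "https://datasets-server.huggingface.co/splits"


def build_splits_url(dataset):
    """Return the /splits URL for a dataset name (hypothetical helper)."""
    # urlencode percent-encodes reserved characters, so the "/" in
    # "ibm/duorc" becomes "%2F" in the query string.
    return f"{BASE_URL}?{urlencode({'dataset': dataset})}"


print(build_splits_url("ibm/duorc"))
```

The resulting string can be passed to `requests.get` in place of `API_URL`. Passing `params={"dataset": "ibm/duorc"}` to `requests.get` should have a similar effect, since `requests` encodes query parameters itself.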

The endpoint response is a JSON object containing a list of the dataset’s splits and subsets. For example, the ibm/duorc dataset has two subsets (ParaphraseRC and SelfRC), each with train, validation, and test splits, for six entries in total:

{
  "splits": [
    { "dataset": "ibm/duorc", "config": "ParaphraseRC", "split": "train" },
    { "dataset": "ibm/duorc", "config": "ParaphraseRC", "split": "validation" },
    { "dataset": "ibm/duorc", "config": "ParaphraseRC", "split": "test" },
    { "dataset": "ibm/duorc", "config": "SelfRC", "split": "train" },
    { "dataset": "ibm/duorc", "config": "SelfRC", "split": "validation" },
    { "dataset": "ibm/duorc", "config": "SelfRC", "split": "test" }
  ],
  "pending": [],
  "failed": []
}
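
Since each entry in the response pairs a subset (`config`) with a split, it is often convenient to group the split names by subset. The following is a minimal sketch of that idea, not an official client. The function name `group_splits_by_config` is hypothetical, and the sample payload is copied from the response shown above so the example runs without a network call.

```python
from collections import defaultdict


def group_splits_by_config(response):
    """Map each subset (config) name to the list of its split names."""
    grouped = defaultdict(list)
    for entry in response.get("splits", []):
        grouped[entry["config"]].append(entry["split"])
    return dict(grouped)


# Sample payload copied from the ibm/duorc response above.
sample = {
    "splits": [
        {"dataset": "ibm/duorc", "config": "ParaphraseRC", "split": "train"},
        {"dataset": "ibm/duorc", "config": "ParaphraseRC", "split": "validation"},
        {"dataset": "ibm/duorc", "config": "ParaphraseRC", "split": "test"},
        {"dataset": "ibm/duorc", "config": "SelfRC", "split": "train"},
        {"dataset": "ibm/duorc", "config": "SelfRC", "split": "validation"},
        {"dataset": "ibm/duorc", "config": "SelfRC", "split": "test"},
    ],
    "pending": [],
    "failed": [],
}

splits_by_config = group_splits_by_config(sample)
for config, splits in splits_by_config.items():
    print(config, splits)
```

In practice, the value returned by `query()` could be passed to the function instead of `sample`, assuming the request succeeded and returned a payload of the same shape.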