Hugging Face's logo
Join the Hugging Face community

and get access to the augmented documentation experience

to get started

Check dataset validity

Before you download a dataset from the Hub, it is helpful to know which datasets are available or if a specific dataset you’re interested in is available. Datasets Server provides two endpoints for verifying whether a dataset is valid or not:

  • /valid returns a list of all the datasets that work without any errors.
  • /is-valid checks if a specific dataset works without any errors.

The API endpoints will return an error for datasets that cannot be loaded with the 🤗 Datasets library, for example, because the data hasn’t been uploaded or the format is not supported.

Currently, only streamable datasets are supported so Datasets Server can extract the 100 first rows without downloading the whole dataset. This is especially useful for previewing large datasets where downloading the whole dataset may take hours!

This guide shows you how to check dataset validity programmatically, but free to try it out with Postman, RapidAPI, or ReDoc.

Get all valid datasets

The /valid endpoint returns a list of Hub datasets that are expected to load without any errors. This endpoint takes no query parameters:

Python
JavaScript
cURL
import requests
API_URL = "https://datasets-server.huggingface.co/valid"
def query():
    response = requests.get(API_URL)
    return response.json()
data = query()

The endpoint response is a JSON containing a list valid datasets nested under the valid key:

{
  "valid": [
    "0n1xus/codexglue",
    "0n1xus/pytorrent-standalone",
    "0x7194633/rupile",
    "51la5/keyword-extraction",
    ...
  ]
}

Check if a dataset is valid

On the other hand, /is-valid checks whether a specific dataset loads without any error. This endpoint’s query parameter requires you to specify the name of the dataset:

Python
JavaScript
cURL
import requests
headers = {"Authorization": f"Bearer {API_TOKEN}"}
API_URL = "https://datasets-server.huggingface.co/is-valid?dataset=rotten_tomatoes"
def query():
    response = requests.get(API_URL, headers=headers)
    return response.json()
data = query()

The response looks like this if a dataset is valid:

{"valid": true}

If a dataset is not valid, then the response looks like:

{"valid": false}

Some cases where a dataset is not valid are:

  • the dataset viewer is disabled
  • the dataset is gated but the access is not granted: no token is passed or the passed token is not authorized
  • the dataset is private
  • the dataset contains no data or the data format is not supported
Remember if a dataset is gated, you'll need to provide your user token to submit a successful query!