Check dataset validity
Before you download a dataset from the Hub, it is helpful to know which datasets are available or if a specific dataset you’re interested in is available. Datasets Server provides two endpoints for verifying whether a dataset is valid or not:
- /valid returns a list of all the datasets that work without any errors.
- /is-valid checks if a specific dataset works without any errors.
The API endpoints will return an error for datasets that cannot be loaded with the 🤗 Datasets library, for example, because the data hasn’t been uploaded or the format is not supported.
This guide shows you how to check dataset validity programmatically, but feel free to try it out with Postman, RapidAPI, or ReDoc.
Get all valid datasets
The /valid endpoint returns a list of Hub datasets that are expected to load without any errors. This endpoint takes no query parameters:
import requests

API_URL = "https://datasets-server.huggingface.co/valid"

def query():
    response = requests.get(API_URL)
    return response.json()

data = query()
The endpoint response is a JSON object containing a list of valid datasets nested under the valid key:
{
  "valid": [
    "0n1xus/codexglue",
    "0n1xus/pytorrent-standalone",
    "0x7194633/rupile",
    "51la5/keyword-extraction",
    ...
  ]
}
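Once the response is parsed, checking whether a particular dataset appears in the list is a plain membership test. The following is a minimal sketch, not part of the Datasets Server API: the helper name valid_dataset_names is hypothetical, and the sample payload simply reuses the names shown in the response above so that it runs without a network call.

```python
def valid_dataset_names(payload):
    """Return the dataset names listed under the "valid" key as a set.

    Hypothetical helper: `payload` is the parsed JSON returned by /valid.
    A missing "valid" key is treated as an empty list.
    """
    return set(payload.get("valid", []))


# Sample payload copied from the response shown above (truncated list).
sample = {
    "valid": [
        "0n1xus/codexglue",
        "0n1xus/pytorrent-standalone",
        "0x7194633/rupile",
        "51la5/keyword-extraction",
    ]
}

names = valid_dataset_names(sample)
print("0x7194633/rupile" in names)  # True
print("not/a-real-dataset" in names)  # False
```

With a live response, you would pass the result of query() instead of the sample dictionary. Converting the list to a set keeps repeated lookups cheap when the list is long.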
Check if a dataset is valid
On the other hand, /is-valid checks whether a specific dataset loads without any error. This endpoint requires the dataset query parameter, which specifies the name of the dataset:
import requests

# Placeholder: replace with your own Hub access token
API_TOKEN = "your_token_here"
headers = {"Authorization": f"Bearer {API_TOKEN}"}
API_URL = "https://datasets-server.huggingface.co/is-valid?dataset=rotten_tomatoes"

def query():
    response = requests.get(API_URL, headers=headers)
    return response.json()

data = query()
The response looks like this if a dataset is valid:
{"valid": true}
If a dataset is not valid, then the response looks like:
{"valid": false}
Some cases where a dataset is not valid are:
- the dataset viewer is disabled
- the dataset is gated but the access is not granted: no token is passed or the passed token is not authorized
- the dataset is private
- the dataset contains no data or the data format is not supported
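Because dataset names can contain a slash (for example 0n1xus/codexglue) and a token is only relevant for some datasets, it can help to build the request from its parts rather than hard-coding the URL. The sketch below is one possible way to do that using only the standard library. The helper names build_request and is_valid are hypothetical, and sending no Authorization header for public datasets is an assumption based on the cases listed above.

```python
from urllib.parse import urlencode

BASE_URL = "https://datasets-server.huggingface.co/is-valid"


def build_request(dataset, token=None):
    """Return (url, headers) for an /is-valid check.

    Hypothetical helper: the dataset name is URL-encoded, and the
    Authorization header is only added when a token is supplied.
    """
    url = f"{BASE_URL}?{urlencode({'dataset': dataset})}"
    headers = {"Authorization": f"Bearer {token}"} if token else {}
    return url, headers


def is_valid(payload):
    """Return True only when the parsed response is {"valid": true}."""
    return payload.get("valid") is True


url, headers = build_request("rotten_tomatoes")
print(url)  # https://datasets-server.huggingface.co/is-valid?dataset=rotten_tomatoes
print(is_valid({"valid": True}))  # True
print(is_valid({"valid": False}))  # False
```

The resulting url and headers can be passed to requests.get(url, headers=headers), and the parsed JSON to is_valid. Any response without a true valid value, including an error body, is treated as not valid here.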