Each dataset should have a dataset card to promote responsible usage and inform users of any potential biases within the dataset. This idea was inspired by the Model Cards proposed by Mitchell et al., 2018. Dataset cards help users understand a dataset’s contents, the context for using the dataset, how it was created, and any other considerations a user should be aware of.
This guide shows you how to create a dataset card.
Create a new dataset card by copying this template to a `README.md` file in your repository.
Generate structured tags to help users discover your dataset on the Hub. Create the tags with the online Datasets Tagging app.
Select the appropriate tags for your dataset from the dropdown menus.
Copy the YAML tags under Finalized tag set and paste them at the top of your `README.md` file.
Fill out the dataset card sections to the best of your ability. Take a look at the Dataset Card Creation Guide for more detailed information about what to include in each section of the card. For fields you are unable to complete, you can write [More Information Needed].
Once you’re done filling out the dataset card, commit the changes to the `README.md` file and you should see the completed dataset card on your repository.
Feel free to take a look at these dataset card examples to help you get started:
You can also check out the (similar) documentation about dataset cards on the Hub side.
You can use the `dataset_info` YAML fields to define additional metadata for the dataset. Here is an example for SQuAD:
```yaml
pretty_name: SQuAD
language:
- en
...
dataset_info:
  features:
  - name: id
    dtype: string
  - name: title
    dtype: string
  - name: context
    dtype: string
  - name: question
    dtype: string
  - name: answers
    sequence:
    - name: text
      dtype: string
    - name: answer_start
      dtype: int32
  splits:
  - name: train
    num_bytes: 79346360
    num_examples: 87599
  - name: validation
    num_bytes: 10473040
    num_examples: 10570
  download_size: 35142551
  dataset_size: 89819400
```
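As a quick sanity check (not something the library requires you to run), the split sizes in this metadata can be cross-checked with a few lines of Python: `dataset_size` should equal the sum of the splits' `num_bytes`. The dictionaries below simply copy the SQuAD values from the YAML above.

```python
# Split metadata copied from the SQuAD dataset card YAML above.
splits = [
    {"name": "train", "num_bytes": 79346360, "num_examples": 87599},
    {"name": "validation", "num_bytes": 10473040, "num_examples": 10570},
]

# dataset_size is the size of the dataset in Arrow format, which is
# the sum of the splits' num_bytes.
total = sum(s["num_bytes"] for s in splits)
print(total)  # 89819400, matching dataset_size in the YAML
```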
This metadata used to be included in the `dataset_infos.json` file, which is now deprecated.
In the `features` field you can explicitly define the feature types of your dataset.
This is especially useful when type inference is not obvious.
For example, if there is only one non-empty example in a 1TB dataset, type inference cannot determine the type of each column without going through the full dataset.
In this case, specifying the `features` field makes type inference much easier.
Specifying the split sizes with `num_examples` enables TQDM progress bars (otherwise the loader doesn’t know how many examples remain). It also enables integrity verifications: if a split doesn’t contain the expected number of examples, an error is returned.
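The integrity verification can be pictured with a minimal sketch; the `verify_split` helper below is hypothetical and only illustrates the idea, it is not the library's actual implementation.

```python
# Hypothetical sketch of the integrity check described above: compare the
# num_examples declared in the metadata against what was actually prepared,
# and raise on any mismatch.
def verify_split(name, expected_num_examples, actual_num_examples):
    if actual_num_examples != expected_num_examples:
        raise ValueError(
            f"Split {name!r}: expected {expected_num_examples} examples, "
            f"got {actual_num_examples}"
        )

verify_split("validation", 10570, 10570)  # matching counts pass silently
```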
Additionally, you can add `num_bytes` to specify how big each split is.
When load_dataset() is called, it first downloads the dataset raw data files, and then it prepares the dataset in Arrow format.
You can specify how many bytes are required to download the raw data files with `download_size`, and use `dataset_size` for the size of the dataset in Arrow format.
Certain datasets like `glue` have several configurations (`cola`, `sst2`, etc.) that can be loaded using `load_dataset("glue", "cola")`, for example.
Each configuration can have different features, splits and sizes. You can specify those fields per configuration using a YAML list:
```yaml
dataset_info:
- config_name: cola
  features: ...
  splits: ...
  download_size: ...
  dataset_size: ...
- config_name: sst2
  features: ...
  splits: ...
  download_size: ...
  dataset_size: ...
```
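To show how such a per-configuration metadata list might be consumed, here is a small sketch: the YAML list parses into a list of dictionaries, one per configuration, and a lookup picks the entry for a requested config name. The `info_for_config` helper is hypothetical, not part of the `datasets` library.

```python
# The dataset_info YAML list above parses into a list of dicts,
# one entry per configuration (fields abbreviated for illustration).
dataset_info = [
    {"config_name": "cola", "splits": [{"name": "train"}]},
    {"config_name": "sst2", "splits": [{"name": "train"}]},
]

def info_for_config(infos, name):
    """Return the metadata entry matching a config name (hypothetical helper)."""
    for info in infos:
        if info["config_name"] == name:
            return info
    raise KeyError(name)

print(info_for_config(dataset_info, "sst2")["config_name"])  # sst2
```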