Create a dataset card
Each dataset should have a dataset card to promote responsible usage and inform users of any potential biases within the dataset. This idea was inspired by the Model Cards proposed by Mitchell et al., 2018. Dataset cards help users understand a dataset's contents, the context for using the dataset, how it was created, and any other considerations a user should be aware of.
This guide shows you how to create a dataset card.
1. Create a new dataset card by copying this template to a `README.md` file in your repository.
2. Generate structured tags to help users discover your dataset on the Hub. Create the tags with the online Datasets Tagging app, and select the appropriate tags for your dataset from the dropdown menus.
3. Copy the YAML tags under Finalized tag set and paste them at the top of your `README.md` file.
4. Fill out the dataset card sections to the best of your ability. Take a look at the Dataset Card Creation Guide for more detailed information about what to include in each section of the card. For fields you are unable to complete, you can write `[More Information Needed]`.
5. Once you're done filling out the dataset card, commit the changes to the `README.md` file and you should see the completed dataset card on your repository.
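As an illustration, the pasted tags might look like the following block at the top of `README.md` (the field values here are hypothetical examples, not required values):

```yaml
---
language:
- en
license: mit
task_categories:
- question-answering
pretty_name: My Dataset
---
```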
Feel free to take a look at existing dataset card examples on the Hub to help you get started.
You can also check out the (similar) documentation about dataset cards on the Hub side.
More YAML tags
You can use the `dataset_info` YAML fields to define additional metadata for the dataset. Here is an example for SQuAD:
```yaml
pretty_name: SQuAD
language:
- en
...
dataset_info:
  features:
  - name: id
    dtype: string
  - name: title
    dtype: string
  - name: context
    dtype: string
  - name: question
    dtype: string
  - name: answers
    sequence:
    - name: text
      dtype: string
    - name: answer_start
      dtype: int32
  splits:
  - name: train
    num_bytes: 79346360
    num_examples: 87599
  - name: validation
    num_bytes: 10473040
    num_examples: 10570
  download_size: 35142551
  dataset_size: 89819400
```
This metadata used to be included in the `dataset_infos.json` file, which is now deprecated.
Feature types
Using the `features` field, you can explicitly define the feature types of your dataset.
This is especially useful when type inference is not obvious.
For example, if there is only one non-empty example in a 1TB dataset, type inference cannot determine the type of each column without going through the full dataset.
In this case, specifying the `features` field avoids that scan and makes loading much easier.
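To see why a sparse column is expensive to infer, consider this minimal sketch of naive type inference (a hypothetical illustration, not the library's actual code): the scan can only stop at the first non-empty value, which may sit anywhere in the column.

```python
# Hypothetical sketch of naive column type inference: scan until the first
# non-null value and report its type. With only one non-empty cell, every
# preceding row must be read before a type can be determined.
def infer_dtype(column):
    for value in column:
        if value is not None:
            return type(value).__name__
    # The whole column was scanned without finding a value to infer from.
    return None

# Only one non-empty cell: all preceding rows are read first.
print(infer_dtype([None, None, None, 42]))  # -> int
```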
Split sizes
Specifying the split sizes with `num_examples` enables TQDM progress bars (otherwise the loader doesn't know how many examples are left).
It also enables integrity verification: if a split doesn't contain the expected number of examples, an error is returned.
Additionally, you can add `num_bytes` to specify how big each split is.
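The integrity check behaves roughly like this minimal sketch (a hypothetical helper for illustration; the library's real verification logic lives elsewhere):

```python
# Hypothetical sketch of split-size verification: a mismatch between the
# declared num_examples and the actual count raises an error.
def verify_split_size(expected: int, actual: int, split: str = "train") -> None:
    if expected != actual:
        raise ValueError(
            f"split {split!r} expected {expected} examples but got {actual}"
        )

# Matches the SQuAD train split declared above: no error is raised.
verify_split_size(87599, 87599)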
Dataset size
When load_dataset() is called, it first downloads the dataset raw data files, and then it prepares the dataset in Arrow format.
Use `download_size` to specify how many bytes are required to download the raw data files, and `dataset_size` for the size of the dataset in Arrow format.
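Note that the two numbers can differ substantially, since the raw files are typically compressed. Using the SQuAD figures from the example above:

```python
# Sizes taken from the SQuAD example above (in bytes).
download_size = 35_142_551  # raw downloaded files
dataset_size = 89_819_400   # dataset prepared in Arrow format

ratio = dataset_size / download_size
print(f"Arrow data is {ratio:.1f}x the size of the download")  # ~2.6x
```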
Multiple configurations
Certain datasets like `glue` have several configurations (`cola`, `sst2`, etc.) that can be loaded using `load_dataset("glue", "cola")`, for example.
Each configuration can have different features, splits, and sizes. You can specify those fields per configuration using a YAML list:
```yaml
dataset_info:
- config_name: cola
  features:
    ...
  splits:
    ...
  download_size: ...
  dataset_size: ...
- config_name: sst2
  features:
    ...
  splits:
    ...
  download_size: ...
  dataset_size: ...
```