Datasets documentation

Create a dataset card

You are viewing v2.7.0 version. A newer version v3.2.0 is available.
Hugging Face's logo
Join the Hugging Face community

and get access to the augmented documentation experience

to get started

Create a dataset card

Each dataset should have a dataset card to promote responsible usage and inform users of any potential biases within the dataset. This idea was inspired by the Model Cards proposed by Mitchell, 2018. Dataset cards help users understand a dataset’s contents, the context for using the dataset, how it was created, and any other considerations a user should be aware of.

This guide shows you how to create a dataset card.

  1. Create a new dataset card by copying this template to a README.md file in your repository.

  2. Generate structured tags to help users discover your dataset on the Hub. Create the tags with the online Datasets Tagging app.

  3. Select the appropriate tags for your dataset from the dropdown menus.

  4. Copy the YAML tags under Finalized tag set and paste the tags at the top of your README.md file.

  5. Fill out the dataset card sections to the best of your ability. Take a look at the Dataset Card Creation Guide for more detailed information about what to include in each section of the card. For fields you are unable to complete, you can write [More Information Needed].

  6. Once you’re done filling out the dataset card, commit the changes to the README.md file and you should see the completed dataset card on your repository.

Feel free to take a look at these dataset card examples to help you get started:

You can also check out the (similar) documentation about dataset cards on the Hub side.

More YAML tags

You can use the dataset_info YAML fields to define additional metadata for the dataset. Here is an example for SQuAD:

pretty_name: SQuAD
language:
- en
...
dataset_info:
  features:
  - name: id
    dtype: string
  - name: title
    dtype: string
  - name: context
    dtype: string
  - name: question
    dtype: string
  - name: answers
    sequence:
    - name: text
      dtype: string
    - name: answer_start
      dtype: int32
  splits:
  - name: train
    num_bytes: 79346360
    num_examples: 87599
  - name: validation
    num_bytes: 10473040
    num_examples: 10570
  download_size: 35142551
  dataset_size: 89819400

These metadata used to be included in the dataset_infos.json file, which is now deprecated.

Feature types

Using the features field you can explicitly define the feature types of your dataset. This is especially useful when type inference is not obvious. For example if there’s only one non-empty example in a 1TB dataset, the type inference is not able to infer the type of each column without going through the full dataset. In this case, specifying the features field makes type inference much easier.

Split sizes

Specifying the split sizes with num_examples enables TQDM bars (otherwise it doesn’t know how many examples are left). It also enables integrity verifications: if the dataset doesn’t have the right number of num_examples, an error is returned.

Additionally you can add num_bytes to specify how big each split is.

Dataset size

When load_dataset() is called, it first downloads the dataset raw data files, and then it prepares the dataset in Arrow format.

You can specify how many bytes are required to download the raw data files with dataset_size, and use dataset_size for the size of the dataset in Arrow format.

Multiple configurations

Certain datasets like glue have several configurations (cola, sst2, etc.) that can be loaded using load_dataset("glue", "cola") for example.

Each configuration can have different features, splits and sizes. You can specify those fields per configuration using a YAML list:

dataset_info:
- config_name: cola
  features:
    ...
  splits:
    ...
  download_size: ...
  dataset_size: ...
- config_name: sst2
  features:
    ...
  splits:
    ...
  download_size: ...
  dataset_size: ...