The `Dataset` object

You are viewing v2.0.0 version. A newer version v2.20.0 is available.

In the previous tutorial, you learned how to successfully load a dataset. This section will familiarize you with the Dataset object. You will learn about the metadata stored inside a Dataset object, and the basics of querying a Dataset object to return rows and columns.

A Dataset object is returned when you load an instance of a dataset. This object behaves like a normal Python container.

>>> from datasets import load_dataset
>>> dataset = load_dataset('glue', 'mrpc', split='train')

Metadata

The Dataset object contains a lot of useful information about your dataset. For example, access dataset.info to return a short description of the dataset, the authors, and even the dataset size. This gives you a quick snapshot of the dataset's most important attributes.

>>> dataset.info
DatasetInfo(
    description='GLUE, the General Language Understanding Evaluation benchmark\n(https://gluebenchmark.com/) is a collection of resources for training,\nevaluating, and analyzing natural language understanding systems.\n\n', 
    citation='@inproceedings{dolan2005automatically,\n  title={Automatically constructing a corpus of sentential paraphrases},\n  author={Dolan, William B and Brockett, Chris},\n  booktitle={Proceedings of the Third International Workshop on Paraphrasing (IWP2005)},\n  year={2005}\n}\n@inproceedings{wang2019glue,\n  title={{GLUE}: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding},\n  author={Wang, Alex and Singh, Amanpreet and Michael, Julian and Hill, Felix and Levy, Omer and Bowman, Samuel R.},\n  note={In the Proceedings of ICLR.},\n  year={2019}\n}\n', homepage='https://www.microsoft.com/en-us/download/details.aspx?id=52398', 
    license='', 
    features={'sentence1': Value(dtype='string', id=None), 'sentence2': Value(dtype='string', id=None), 'label': ClassLabel(num_classes=2, names=['not_equivalent', 'equivalent'], names_file=None, id=None), 'idx': Value(dtype='int32', id=None)}, post_processed=None, supervised_keys=None, builder_name='glue', config_name='mrpc', version=1.0.0, splits={'train': SplitInfo(name='train', num_bytes=943851, num_examples=3668, dataset_name='glue'), 'validation': SplitInfo(name='validation', num_bytes=105887, num_examples=408, dataset_name='glue'), 'test': SplitInfo(name='test', num_bytes=442418, num_examples=1725, dataset_name='glue')}, 
    download_checksums={'https://dl.fbaipublicfiles.com/glue/data/mrpc_dev_ids.tsv': {'num_bytes': 6222, 'checksum': '971d7767d81b997fd9060ade0ec23c4fc31cbb226a55d1bd4a1bac474eb81dc7'}, 'https://dl.fbaipublicfiles.com/senteval/senteval_data/msr_paraphrase_train.txt': {'num_bytes': 1047044, 'checksum': '60a9b09084528f0673eedee2b69cb941920f0b8cd0eeccefc464a98768457f89'}, 'https://dl.fbaipublicfiles.com/senteval/senteval_data/msr_paraphrase_test.txt': {'num_bytes': 441275, 'checksum': 'a04e271090879aaba6423d65b94950c089298587d9c084bf9cd7439bd785f784'}}, 
    download_size=1494541, 
    post_processing_size=None, 
    dataset_size=1492156, 
    size_in_bytes=2986697
)
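
The size fields in this output appear to be related by simple sums. The sketch below copies the numbers from the output above into plain Python variables (it does not import datasets) and checks the arithmetic. Treat these relationships as an observation about this particular output, not as a documented guarantee.

```python
# Numbers copied by hand from the DatasetInfo output above.
split_num_bytes = {"train": 943851, "validation": 105887, "test": 442418}
split_num_examples = {"train": 3668, "validation": 408, "test": 1725}
download_num_bytes = [6222, 1047044, 441275]  # one entry per downloaded file

download_size = 1494541
dataset_size = 1492156
size_in_bytes = 2986697

# The per-split byte counts add up to dataset_size.
assert sum(split_num_bytes.values()) == dataset_size
# The per-file byte counts in download_checksums add up to download_size.
assert sum(download_num_bytes) == download_size
# size_in_bytes is the sum of download_size and dataset_size.
assert download_size + dataset_size == size_in_bytes

# Total number of examples across all three splits.
total_examples = sum(split_num_examples.values())
print(total_examples)  # 5801
```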

You can request specific attributes of the dataset, like description, citation, and homepage, by accessing them directly on the Dataset object. Take a look at datasets.DatasetInfo for a complete list of attributes you can return.

>>> dataset.split
NamedSplit('train')
>>> dataset.description
'GLUE, the General Language Understanding Evaluation benchmark\n(https://gluebenchmark.com/) is a collection of resources for training,\nevaluating, and analyzing natural language understanding systems.\n\n'
>>> dataset.citation
'@inproceedings{dolan2005automatically,\n  title={Automatically constructing a corpus of sentential paraphrases},\n  author={Dolan, William B and Brockett, Chris},\n  booktitle={Proceedings of the Third International Workshop on Paraphrasing (IWP2005)},\n  year={2005}\n}\n@inproceedings{wang2019glue,\n  title={{GLUE}: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding},\n  author={Wang, Alex and Singh, Amanpreet and Michael, Julian and Hill, Felix and Levy, Omer and Bowman, Samuel R.},\n  note={In the Proceedings of ICLR.},\n  year={2019}\n}\n\nNote that each GLUE dataset has its own citation. Please see the source to see\nthe correct citation for each contained dataset.'
>>> dataset.homepage
'https://www.microsoft.com/en-us/download/details.aspx?id=52398'

Features and columns

A dataset is a table of rows and typed columns. Querying a dataset returns a Python dictionary where the keys correspond to column names, and the values correspond to column values:

>>> dataset[0]
{'idx': 0,
'label': 1,
'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .'}

Return the number of rows and columns with the following standard attributes:

>>> dataset.shape
(3668, 4)
>>> dataset.num_columns
4
>>> dataset.num_rows
3668
>>> len(dataset)
3668

List the column names with the datasets.Dataset.column_names attribute:

>>> dataset.column_names
['idx', 'label', 'sentence1', 'sentence2']

Get detailed information about the columns with the datasets.Dataset.features attribute:

>>> dataset.features
{'idx': Value(dtype='int32', id=None),
    'label': ClassLabel(num_classes=2, names=['not_equivalent', 'equivalent'], names_file=None, id=None),
    'sentence1': Value(dtype='string', id=None),
    'sentence2': Value(dtype='string', id=None),
}

Return even more specific information about a feature like datasets.ClassLabel by accessing its attributes num_classes and names:

>>> dataset.features['label'].num_classes
2
>>> dataset.features['label'].names
['not_equivalent', 'equivalent']
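
The label column stores integers, and the names list gives the string for each integer by position. As a minimal sketch in plain Python (it reuses the names list above and the labels of the first three training rows shown elsewhere in this tutorial, and does not import datasets), decoding looks like this:

```python
# `names` as returned by dataset.features['label'].names above.
names = ['not_equivalent', 'equivalent']

# Labels of the first three training rows, copied from this tutorial.
labels = [1, 0, 1]

# Each integer label is an index into the names list.
decoded = [names[label] for label in labels]
print(decoded)  # ['equivalent', 'not_equivalent', 'equivalent']

# The reverse lookup maps a name back to its integer.
name_to_id = {name: i for i, name in enumerate(names)}
print(name_to_id['equivalent'])  # 1
```

The ClassLabel feature also provides int2str and str2int methods for this conversion, so a hand-written lookup like the one above is usually only needed for illustration.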

Rows, slices, batches, and columns

Get several rows of your dataset at a time with slice notation or a list of indices:

>>> dataset[:3]
{'idx': [0, 1, 2],
    'label': [1, 0, 1],
    'sentence1': ['Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .', "Yucaipa owned Dominick 's before selling the chain to Safeway in 1998 for $ 2.5 billion .", 'They had published an advertisement on the Internet on June 10 , offering the cargo for sale , he added .'],
    'sentence2': ['Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .', "Yucaipa bought Dominick 's in 1995 for $ 693 million and sold it to Safeway for $ 1.8 billion in 1998 .", "On June 10 , the ship 's owners had published an advertisement on the Internet , offering the explosives for sale ."]
}
>>> dataset[[1, 3, 5]]
{'idx': [1, 3, 5],
    'label': [0, 0, 1], 
    'sentence1': ["Yucaipa owned Dominick 's before selling the chain to Safeway in 1998 for $ 2.5 billion .", 'Around 0335 GMT , Tab shares were up 19 cents , or 4.4 % , at A $ 4.56 , having earlier set a record high of A $ 4.57 .', 'Revenue in the first quarter of the year dropped 15 percent from the same period a year earlier .'],
    'sentence2': ["Yucaipa bought Dominick 's in 1995 for $ 693 million and sold it to Safeway for $ 1.8 billion in 1998 .", 'Tab shares jumped 20 cents , or 4.6 % , to set a record closing high at A $ 4.57 .', "With the scandal hanging over Stewart 's company , revenue the first quarter of the year dropped 15 percent from the same period a year earlier ."]
}

Querying by column name returns that column's values. For example, to return only the first three values of the sentence1 column:

>>> dataset['sentence1'][:3]
['Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .', "Yucaipa owned Dominick 's before selling the chain to Safeway in 1998 for $ 2.5 billion .", 'They had published an advertisement on the Internet on June 10 , offering the cargo for sale , he added .']

Depending on how a Dataset object is queried, the format returned will be different:

  • A single row like dataset[0] returns a Python dictionary of values.
  • A batch like dataset[5:10] returns a Python dictionary of lists of values.
  • A column like dataset['sentence1'] returns a Python list of values.
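
The three return formats above can be imitated with a small column-wise table in plain Python. This is a toy stand-in, not the datasets implementation: the helper names (get_row, get_batch, get_column, batch_to_rows) are hypothetical and exist only for this illustration.

```python
# Toy three-row table stored column-wise as a dictionary of lists.
# NOT the datasets implementation; it only mirrors the return formats above.
columns = {
    'idx': [0, 1, 2],
    'label': [1, 0, 1],
}

def get_row(table, i):
    """Single row: a dictionary of values."""
    return {name: values[i] for name, values in table.items()}

def get_batch(table, rows):
    """Batch: a dictionary of lists. `rows` is a slice or a list of indices."""
    if isinstance(rows, slice):
        return {name: values[rows] for name, values in table.items()}
    return {name: [values[i] for i in rows] for name, values in table.items()}

def get_column(table, name):
    """Column: a list of values."""
    return list(table[name])

def batch_to_rows(batch):
    """Turn a dictionary of lists into a list of row dictionaries."""
    keys = list(batch)
    return [dict(zip(keys, vals)) for vals in zip(*batch.values())]

print(get_row(columns, 0))               # {'idx': 0, 'label': 1}
print(get_batch(columns, slice(0, 2)))   # {'idx': [0, 1], 'label': [1, 0]}
print(get_batch(columns, [0, 2]))        # {'idx': [0, 2], 'label': [1, 1]}
print(get_column(columns, 'label'))      # [1, 0, 1]
print(batch_to_rows(get_batch(columns, slice(0, 2))))
# [{'idx': 0, 'label': 1}, {'idx': 1, 'label': 0}]
```

The batch_to_rows helper is handy when code expects one dictionary per example, since a batch groups values by column and not by row.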