Explore statistics over split data

The dataset viewer provides a /statistics endpoint that returns basic statistics precomputed for a requested dataset split. These statistics give you a quick overview of how the data is distributed.

Currently, statistics are computed only for datasets with Parquet exports.

The /statistics endpoint requires three query parameters:

dataset: the dataset name, for example nyu-mll/glue
config: the subset name, for example cola
split: the split name, for example train

Let’s get the statistics for the train split of the cola subset of the nyu-mll/glue dataset:
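
As a minimal sketch of such a request, the snippet below builds the query string from the three required parameters and decodes the JSON response using only the Python standard library. The base URL https://datasets-server.huggingface.co and the helper names build_statistics_url and fetch_statistics are assumptions made for this illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: base URL of the public dataset viewer API.
API_BASE = "https://datasets-server.huggingface.co"


def build_statistics_url(dataset: str, config: str, split: str) -> str:
    """Return the /statistics URL for the given dataset, subset, and split."""
    # urlencode percent-encodes the values, so "/" in the dataset name becomes "%2F".
    query = urlencode({"dataset": dataset, "config": config, "split": split})
    return f"{API_BASE}/statistics?{query}"


def fetch_statistics(dataset: str, config: str, split: str) -> dict:
    """Send a GET request to /statistics and decode the JSON response."""
    with urlopen(build_statistics_url(dataset, config, split)) as response:
        return json.load(response)
```

Calling fetch_statistics("nyu-mll/glue", "cola", "train") performs the network request; build_statistics_url can be used on its own to inspect the URL first.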

The response JSON contains three keys:

num_examples - the number of samples in the split, or the number of samples in the first chunk of data if the dataset is larger than 5 GB (see the partial field below).
statistics - a list of dictionaries with statistics for each column. Each dictionary has three keys: column_name, column_type, and column_statistics. The content of column_statistics depends on the column type; see Response structure by data type for more details.
partial - true if the statistics are computed on the first 5 GB of data rather than on the full split, false otherwise.

{
  "num_examples": 8551,
  "statistics": [
    {
      "column_name": "idx",
      "column_type": "int",
      "column_statistics": {
        "nan_count": 0,
        "nan_proportion": 0,
        "min": 0,
        "max": 8550,
        "mean": 4275,
        "median": 4275,
        "std": 2468.60541,
        "histogram": {
          "hist": [
            856,
            856,
            856,
            856,
            856,
            856,
            856,
            856,
            856,
            847
          ],
          "bin_edges": [
            0,
            856,
            1712,
            2568,
            3424,
            4280,
            5136,
            5992,
            6848,
            7704,
            8550
          ]
        }
      }
    },
    {
      "column_name": "label",
      "column_type": "class_label",
      "column_statistics": {
        "nan_count": 0,
        "nan_proportion": 0,
        "no_label_count": 0,
        "no_label_proportion": 0,
        "n_unique": 2,
        "frequencies": {
          "unacceptable": 2528,
          "acceptable": 6023
        }
      }
    },
    {
      "column_name": "sentence",
      "column_type": "string_text",
      "column_statistics": {
        "nan_count": 0,
        "nan_proportion": 0,
        "min": 6,
        "max": 231,
        "mean": 40.70074,
        "median": 37,
        "std": 19.14431,
        "histogram": {
          "hist": [
            2260,
            4512,
            1262,
            380,
            102,
            26,
            6,
            1,
            1,
            1
          ],
          "bin_edges": [
            6,
            29,
            52,
            75,
            98,
            121,
            144,
            167,
            190,
            213,
            231
          ]
        }
      }
    }
  ],
  "partial": false
}
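
To illustrate how the three top-level keys fit together, here is a small sketch that walks over a response and prints one summary line per column. The response dictionary is an abridged copy of the example above (histograms omitted), and the helper names column_types and describe are hypothetical.

```python
# Abridged copy of the response shown above (histograms and some fields omitted).
response = {
    "num_examples": 8551,
    "statistics": [
        {"column_name": "idx", "column_type": "int",
         "column_statistics": {"nan_count": 0, "min": 0, "max": 8550}},
        {"column_name": "label", "column_type": "class_label",
         "column_statistics": {"nan_count": 0, "n_unique": 2}},
        {"column_name": "sentence", "column_type": "string_text",
         "column_statistics": {"nan_count": 0, "min": 6, "max": 231}},
    ],
    "partial": False,
}


def column_types(response: dict) -> dict:
    """Map each column name to its reported column_type."""
    return {col["column_name"]: col["column_type"] for col in response["statistics"]}


def describe(response: dict) -> list:
    """Build one human-readable line for the split and one per column."""
    scope = "first 5 GB" if response["partial"] else "full split"
    lines = [f"{response['num_examples']} examples ({scope})"]
    for col in response["statistics"]:
        nulls = col["column_statistics"]["nan_count"]
        lines.append(f"{col['column_name']}: {col['column_type']}, {nulls} nulls")
    return lines


print("\n".join(describe(response)))
```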

Response structure by data type

Currently, statistics are supported for strings, floats, integers, booleans, lists, datetimes, audio and image data, and the special datasets.ClassLabel feature type of the datasets library.

The column_type in the response can be one of the following values:

class_label - for datasets.ClassLabel feature which represents categorical data
float - for float data types
int - for integer data types
bool - for boolean data type
string_label - for string data types being treated as categories (see below)
string_text - for string data types if they do not represent categories (see below)
list - for lists of any other data types (including lists)
audio - for audio data
image - for image data
datetime - for datetime data

class_label

This type represents categorical data encoded as a ClassLabel feature. The following measures are computed:

number and proportion of null values
number and proportion of values with no label
number of unique values (excluding null and no label)
value counts for each label (excluding null and no label)

Example

{
  "column_name": "label",
  "column_type": "class_label",
  "column_statistics": {
    "nan_count": 0,
    "nan_proportion": 0,
    "no_label_count": 0,
    "no_label_proportion": 0,
    "n_unique": 2,
    "frequencies": {
      "unacceptable": 2528,
      "acceptable": 6023
    }
  }
}
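
The frequencies field holds raw counts. A quick way to check class balance is to turn them into proportions, as in the sketch below; label_proportions is a hypothetical helper, and the input is taken from the example above.

```python
def label_proportions(column_statistics: dict) -> dict:
    """Convert per-label counts into fractions of all labeled values."""
    frequencies = column_statistics["frequencies"]
    total = sum(frequencies.values())
    if total == 0:
        # No labeled values at all: avoid dividing by zero.
        return {label: 0.0 for label in frequencies}
    return {label: count / total for label, count in frequencies.items()}


stats = {"n_unique": 2, "frequencies": {"unacceptable": 2528, "acceptable": 6023}}
proportions = label_proportions(stats)
for label, share in proportions.items():
    print(f"{label}: {share:.1%}")
```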

float

The following measures are returned for float data types:

minimum, maximum, mean, median, and standard deviation values
number and proportion of null and NaN values (NaN values are treated as null)
histogram with 10 bins

Example

{
  "column_name": "clarity",
  "column_type": "float",
  "column_statistics": {
    "nan_count": 0,
    "nan_proportion": 0,
    "min": 0,
    "max": 2,
    "mean": 1.67206,
    "median": 1.8,
    "std": 0.38714,
    "histogram": {
      "hist": [
        17,
        12,
        48,
        52,
        135,
        188,
        814,
        15,
        1628,
        2048
      ],
      "bin_edges": [
        0,
        0.2,
        0.4,
        0.6,
        0.8,
        1,
        1.2,
        1.4,
        1.6,
        1.8,
        2
      ]
    }
  }
}
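
In the histogram, bin_edges has one more entry than hist: count i belongs to the interval between edge i and edge i + 1. The sketch below pairs them up using the values from the example above; histogram_bins is a hypothetical helper name.

```python
def histogram_bins(histogram: dict) -> list:
    """Pair each count with the left and right edge of its bin."""
    edges = histogram["bin_edges"]
    counts = histogram["hist"]
    if len(edges) != len(counts) + 1:
        raise ValueError("expected exactly one more bin edge than counts")
    return [(edges[i], edges[i + 1], counts[i]) for i in range(len(counts))]


# Histogram of the "clarity" column from the example above.
clarity_histogram = {
    "hist": [17, 12, 48, 52, 135, 188, 814, 15, 1628, 2048],
    "bin_edges": [0, 0.2, 0.4, 0.6, 0.8, 1, 1.2, 1.4, 1.6, 1.8, 2],
}

bins = histogram_bins(clarity_histogram)
# The bin holding the most values, as (left edge, right edge, count).
fullest = max(bins, key=lambda b: b[2])
print(fullest)
```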

int

The following measures are returned for integer data types:

minimum, maximum, mean, median, and standard deviation values
number and proportion of null values
histogram with up to 10 bins

Example

{
    "column_name": "direction",
    "column_type": "int",
    "column_statistics": {
        "nan_count": 0,
        "nan_proportion": 0.0,
        "min": 0,
        "max": 1,
        "mean": 0.49925,
        "median": 0.0,
        "std": 0.5,
        "histogram": {
            "hist": [
                50075,
                49925
            ],
            "bin_edges": [
                0,
                1,
                1
            ]
        }
    }
}
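
The integer examples on this page suggest why there can be fewer than 10 bins and why the last edge can repeat: the edges appear to advance by a whole-number bin size and end at the maximum. The sketch below reproduces the example edges under that assumption. The formula is inferred from the examples, not taken from the implementation, and int_bin_edges is a hypothetical helper.

```python
import math


def int_bin_edges(minimum: int, maximum: int, n_bins: int = 10) -> list:
    """Reconstruct integer bin edges (assumed rule, inferred from the examples)."""
    # Assumption: bin size is the value range (inclusive) divided by n_bins, rounded up.
    bin_size = math.ceil((maximum - minimum + 1) / n_bins)
    edges = list(range(minimum, maximum + 1, bin_size))
    # The maximum closes the last bin, so it can repeat the previous edge.
    edges.append(maximum)
    return edges


print(int_bin_edges(0, 1))     # the "direction" column above
print(int_bin_edges(0, 8550))  # the "idx" column of nyu-mll/glue
```

Under this reading, every bin covers the integers from its left edge up to but not including its right edge, except the last bin, which also includes the maximum.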

bool

The following measures are returned for the bool data type:

number and proportion of null values
value counts for 'True' and 'False' values

Example

{
  "column_name": "penalty",
  "column_type": "bool",
  "column_statistics": {
    "nan_count": 3,
    "nan_proportion": 0.15,
    "frequencies": {
      "False": 7,
      "True": 10
    }
  }
}
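
In this example, nan_proportion matches the null count divided by the total number of values (nulls plus the counted True and False values). A small sketch of that check, with a hypothetical helper name null_proportion:

```python
def null_proportion(column_statistics: dict) -> float:
    """Recompute the share of null values from nan_count and frequencies."""
    nulls = column_statistics["nan_count"]
    total = nulls + sum(column_statistics["frequencies"].values())
    return nulls / total if total else 0.0


# Statistics of the "penalty" column from the example above.
penalty = {
    "nan_count": 3,
    "nan_proportion": 0.15,
    "frequencies": {"False": 7, "True": 10},
}
print(null_proportion(penalty))
```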

string_label

A string column is considered a category if, within the requested split, the proportion of unique values is lower than or equal to 0.2 and the number of unique values is lower than 1000, or if the number of unique values is lower than or equal to 10 (regardless of the proportion). The following measures are returned:

number and proportion of null values
number of unique values (excluding null)
value counts for each label (excluding null)

Example

{
  "column_name": "answerKey",
  "column_type": "string_label",
  "column_statistics": {
    "nan_count": 0,
    "nan_proportion": 0,
    "n_unique": 4,
    "frequencies": {
      "D": 1221,
      "C": 1146,
      "A": 1378,
      "B": 1212
    }
  }
}
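
The rule that separates string_label from string_text can be written as a short predicate. This is a sketch of the conditions as described above, not the viewer's actual code. The function and constant names are hypothetical, and whether null values count toward n_samples is an assumption left to the caller.

```python
MAX_PROPORTION = 0.2          # unique values / samples must not exceed this
MAX_UNIQUE_VALUES = 1000      # and the number of unique values must stay below this
ALWAYS_CATEGORY_LIMIT = 10    # at or below this, the proportion is ignored


def is_string_label(n_unique: int, n_samples: int) -> bool:
    """Return True if a string column would be treated as categorical."""
    if n_unique <= ALWAYS_CATEGORY_LIMIT:
        return True
    proportion = n_unique / n_samples
    return proportion <= MAX_PROPORTION and n_unique < MAX_UNIQUE_VALUES


print(is_string_label(4, 4957))       # few unique values: category
print(is_string_label(500, 10000))    # proportion 0.05 and fewer than 1000: category
print(is_string_label(1000, 10000))   # 1000 unique values is not lower than 1000: text
```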

string_text

If a string column does not satisfy the conditions to be treated as a string_label, it is considered to be a column containing texts, and the response contains statistics over text lengths, measured in number of characters. The following measures are computed:

minimum, maximum, mean, median, and standard deviation of text lengths
number and proportion of null values
histogram of text lengths with 10 bins

Example

{
  "column_name": "sentence",
  "column_type": "string_text",
  "column_statistics": {
    "nan_count": 0,
    "nan_proportion": 0,
    "min": 6,
    "max": 231,
    "mean": 40.70074,
    "median": 37,
    "std": 19.14431,
    "histogram": {
      "hist": [
        2260,
        4512,
        1262,
        380,
        102,
        26,
        6,
        1,
        1,
        1
      ],
      "bin_edges": [
        6,
        29,
        52,
        75,
        98,
        121,
        144,
        167,
        190,
        213,
        231
      ]
    }
  }
}
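
To reproduce this kind of summary locally on a handful of texts, a sketch using the standard statistics module could look as follows. The choice of the sample standard deviation (n - 1 in the denominator) is an assumption, and text_length_statistics is a hypothetical helper.

```python
import statistics


def text_length_statistics(texts: list) -> dict:
    """Summarize character lengths of the non-null texts (needs at least two)."""
    lengths = [len(text) for text in texts if text is not None]
    return {
        "nan_count": len(texts) - len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
        # Assumption: sample standard deviation (divides by n - 1).
        "std": statistics.stdev(lengths),
    }


summary = text_length_statistics(["a", "bbb", None, "ccccc"])
print(summary)
```

Note that len counts Unicode code points, which may differ from what a reader perceives as characters for some scripts and emoji.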

list

For lists, the distribution of their lengths is computed. The following measures are returned:

minimum, maximum, mean, median, and standard deviation of list lengths
number and proportion of null values
histogram of list lengths with up to 10 bins

Example

{
    "column_name": "chat_history",
    "column_type": "list",
    "column_statistics": {
        "nan_count": 0,
        "nan_proportion": 0.0,
        "min": 1,
        "max": 3,
        "mean": 1.01741,
        "median": 1.0,
        "std": 0.13146,
        "histogram": {
            "hist": [
                11177,
                196,
                1
            ],
            "bin_edges": [
                1,
                2,
                3,
                3
            ]
        }
    }
}

Note that dictionaries of lists are not supported.

audio

For audio data, the distribution of audio file durations is computed. The following measures are returned:

minimum, maximum, mean, median, and standard deviation of audio file durations
number and proportion of null values
histogram of audio file durations with 10 bins

Example

{
    "column_name": "audio",
    "column_type": "audio",
    "column_statistics": {
        "nan_count": 0,
        "nan_proportion": 0,
        "min": 1.02,
        "max": 15,
        "mean": 13.93042,
        "median": 14.77,
        "std": 2.63734,
        "histogram": {
            "hist": [
                32,
                25,
                18,
                24,
                22,
                17,
                18,
                19,
                55,
                1770
            ],
            "bin_edges": [
                1.02,
                2.418,
                3.816,
                5.214,
                6.612,
                8.01,
                9.408,
                10.806,
                12.204,
                13.602,
                15
            ]
        }
    }
}

image

For image data, the distribution of image widths is computed. The following measures are returned:

minimum, maximum, mean, median, and standard deviation of image widths
number and proportion of null values
histogram of image widths with 10 bins

Example

{
    "column_name": "image",
    "column_type": "image",
    "column_statistics": {
        "nan_count": 0,
        "nan_proportion": 0.0,
        "min": 256,
        "max": 873,
        "mean": 327.99339,
        "median": 341.0,
        "std": 60.07286,
        "histogram": {
            "hist": [
                1734,
                1637,
                1326,
                121,
                10,
                3,
                1,
                3,
                1,
                2
            ],
            "bin_edges": [
                256,
                318,
                380,
                442,
                504,
                566,
                628,
                690,
                752,
                814,
                873
            ]
        }
    }
}

datetime

For datetime data, the distribution of datetime values is computed. The following measures are returned:

minimum, maximum, mean, median, and standard deviation of datetimes represented as strings with precision up to seconds
number and proportion of null values
histogram of datetimes with 10 bins

Example

{
    "column_name": "date",
    "column_type": "datetime",
    "column_statistics": {
        "nan_count": 0,
        "nan_proportion": 0.0,
        "min": "2013-05-18 04:54:11",
        "max": "2013-06-20 10:01:41",
        "mean": "2013-05-27 18:03:39",
        "median": "2013-05-23 11:55:50",
        "std": "11 days, 4:57:32.322450",
        "histogram": {
            "hist": [
                318776,
                393036,
                173904,
                0,
                0,
                0,
                0,
                0,
                0,
                206284
            ],
            "bin_edges": [
                "2013-05-18 04:54:11",
                "2013-05-21 12:36:57",
                "2013-05-24 20:19:43",
                "2013-05-28 04:02:29",
                "2013-05-31 11:45:15",
                "2013-06-03 19:28:01",
                "2013-06-07 03:10:47",
                "2013-06-10 10:53:33",
                "2013-06-13 18:36:19",
                "2013-06-17 02:19:05",
                "2013-06-20 10:01:41"
            ]
        }
    }
}
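
Because datetime statistics arrive as strings, they need to be parsed before doing arithmetic on them. The sketch below assumes the "YYYY-MM-DD HH:MM:SS" layout seen in the example above; parse_datetime and DATETIME_FORMAT are hypothetical names.

```python
from datetime import datetime, timedelta

# Assumption: the layout used by the example values above.
DATETIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_datetime(value: str) -> datetime:
    """Parse a datetime string from the statistics response."""
    return datetime.strptime(value, DATETIME_FORMAT)


# Values of the "date" column from the example above.
column_statistics = {
    "min": "2013-05-18 04:54:11",
    "max": "2013-06-20 10:01:41",
}

earliest = parse_datetime(column_statistics["min"])
latest = parse_datetime(column_statistics["max"])
time_span = latest - earliest
covers_more_than_a_month = time_span > timedelta(days=30)
print(time_span, covers_more_than_a_month)
```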
