Dataset viewer documentation
Explore statistics over split data
The dataset viewer provides a /statistics endpoint for fetching basic statistics precomputed for a requested dataset. It gives you quick insight into how the data is distributed.
The /statistics endpoint requires three query parameters:
- dataset: the dataset name, for example, nyu-mll/glue
- config: the subset name, for example, cola
- split: the split name, for example, train
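As a minimal sketch of how the three parameters combine into a request URL, the snippet below builds the query string with the standard library. The helper name build_statistics_url is hypothetical and not part of any library; the base URL is the one used on this page.

```python
from urllib.parse import urlencode

# Base URL of the endpoint, as used on this page
BASE_URL = "https://datasets-server.huggingface.co/statistics"


def build_statistics_url(dataset: str, config: str, split: str) -> str:
    """Build a /statistics URL (hypothetical helper, for illustration only)."""
    # urlencode percent-encodes reserved characters, so "/" becomes "%2F"
    query = urlencode({"dataset": dataset, "config": config, "split": split})
    return f"{BASE_URL}?{query}"


url = build_statistics_url("nyu-mll/glue", "cola", "train")
print(url)
```

Building the query string from a dictionary avoids hand-escaping dataset names that contain a slash.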
Let’s get some statistics for the nyu-mll/glue dataset, cola subset, train split:
import requests
# API_TOKEN is a placeholder: set it to your Hugging Face access token before running
headers = {"Authorization": f"Bearer {API_TOKEN}"}
API_URL = "https://datasets-server.huggingface.co/statistics?dataset=nyu-mll/glue&config=cola&split=train"
def query():
    response = requests.get(API_URL, headers=headers)
    return response.json()
data = query()
The response JSON contains three keys:
- num_examples: number of samples in the split, or in the first chunk of data if the dataset is larger than 5 GB (see the partial field below).
- statistics: list of dictionaries of statistics, one per column. Each dictionary has three keys: column_name, column_type, and column_statistics. The content of column_statistics depends on the column type; see Response structure by data type for more details.
- partial: true if statistics are computed on the first 5 GB of data, not on the full split; false otherwise.
{
  "num_examples": 8551,
  "statistics": [
    {
      "column_name": "idx",
      "column_type": "int",
      "column_statistics": {
        "nan_count": 0,
        "nan_proportion": 0,
        "min": 0,
        "max": 8550,
        "mean": 4275,
        "median": 4275,
        "std": 2468.60541,
        "histogram": {
          "hist": [
            856,
            856,
            856,
            856,
            856,
            856,
            856,
            856,
            856,
            847
          ],
          "bin_edges": [
            0,
            856,
            1712,
            2568,
            3424,
            4280,
            5136,
            5992,
            6848,
            7704,
            8550
          ]
        }
      }
    },
    {
      "column_name": "label",
      "column_type": "class_label",
      "column_statistics": {
        "nan_count": 0,
        "nan_proportion": 0,
        "no_label_count": 0,
        "no_label_proportion": 0,
        "n_unique": 2,
        "frequencies": {
          "unacceptable": 2528,
          "acceptable": 6023
        }
      }
    },
    {
      "column_name": "sentence",
      "column_type": "string_text",
      "column_statistics": {
        "nan_count": 0,
        "nan_proportion": 0,
        "min": 6,
        "max": 231,
        "mean": 40.70074,
        "median": 37,
        "std": 19.14431,
        "histogram": {
          "hist": [
            2260,
            4512,
            1262,
            380,
            102,
            26,
            6,
            1,
            1,
            1
          ],
          "bin_edges": [
            6,
            29,
            52,
            75,
            98,
            121,
            144,
            167,
            190,
            213,
            231
          ]
        }
      }
    }
  ],
  "partial": false
}
Response structure by data type
Currently, statistics are supported for strings, float and integer numbers, booleans, lists, datetimes, audio and image data, and the special datasets.ClassLabel feature type of the datasets library.
The column_type in the response can be one of the following values:
- class_label: for the datasets.ClassLabel feature, which represents categorical data
- float: for float data types
- int: for integer data types
- bool: for the boolean data type
- string_label: for string data types being treated as categories (see below)
- string_text: for string data types if they do not represent categories (see below)
- list: for lists of any other data types (including lists)
- audio: for audio data
- image: for image data
- datetime: for datetime data
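Because the content of column_statistics depends on column_type, client code usually looks at the type first. The following is a minimal sketch that groups column names by their reported type. The helper name columns_by_type is hypothetical, and the sample response is a trimmed-down version of the one shown above.

```python
from collections import defaultdict


def columns_by_type(response: dict) -> dict:
    """Group column names by column_type (hypothetical helper)."""
    grouped = defaultdict(list)
    for column in response["statistics"]:
        grouped[column["column_type"]].append(column["column_name"])
    return dict(grouped)


# Trimmed-down response in the shape shown above (column_statistics left empty for brevity)
sample_response = {
    "num_examples": 8551,
    "statistics": [
        {"column_name": "idx", "column_type": "int", "column_statistics": {}},
        {"column_name": "label", "column_type": "class_label", "column_statistics": {}},
        {"column_name": "sentence", "column_type": "string_text", "column_statistics": {}},
    ],
    "partial": False,
}

grouped = columns_by_type(sample_response)
print(grouped)
if sample_response["partial"]:
    print("Statistics were computed on the first 5 GB of data only")
```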
class_label
This type represents categorical data encoded as a ClassLabel feature. The following measures are computed:
- number and proportion of null values
- number and proportion of values with no label
- number of unique values (excluding null and no label)
- value counts for each label (excluding null and no label)
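The value counts can be turned into label proportions on the client side. This is a minimal sketch using the frequencies from the nyu-mll/glue response above; the helper name label_proportions is hypothetical, and the proportions are relative to labeled values only, since null and no label values are excluded from frequencies.

```python
def label_proportions(column_statistics: dict) -> dict:
    """Convert per-label counts into proportions (hypothetical helper)."""
    frequencies = column_statistics["frequencies"]
    total = sum(frequencies.values())
    if total == 0:
        # No labeled values: nothing to normalize
        return {}
    return {label: count / total for label, count in frequencies.items()}


column_statistics = {
    "n_unique": 2,
    "frequencies": {"unacceptable": 2528, "acceptable": 6023},
}

proportions = label_proportions(column_statistics)
for label, share in proportions.items():
    print(f"{label}: {share:.3f}")
```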
Example
{
  "column_name": "label",
  "column_type": "class_label",
  "column_statistics": {
    "nan_count": 0,
    "nan_proportion": 0,
    "no_label_count": 0,
    "no_label_proportion": 0,
    "n_unique": 2,
    "frequencies": {
      "unacceptable": 2528,
      "acceptable": 6023
    }
  }
}
float
The following measures are returned for float data types:
- minimum, maximum, mean, median, and standard deviation values
- number and proportion of null and NaN values (NaN values are treated as null)
- histogram with 10 bins
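In the examples on this page, a histogram holds N counts in hist and N + 1 values in bin_edges, so consecutive edges delimit each bin. A minimal sketch for pairing them up, where the helper name histogram_bins is hypothetical and the numbers are taken from the float example on this page:

```python
def histogram_bins(histogram: dict) -> list:
    """Return (left_edge, right_edge, count) tuples (hypothetical helper)."""
    hist = histogram["hist"]
    edges = histogram["bin_edges"]
    if len(edges) != len(hist) + 1:
        raise ValueError("expected one more bin edge than counts")
    # Consecutive edges delimit a bin; zip stops after the last count
    return list(zip(edges[:-1], edges[1:], hist))


histogram = {
    "hist": [17, 12, 48, 52, 135, 188, 814, 15, 1628, 2048],
    "bin_edges": [0, 0.2, 0.4, 0.6, 0.8, 1, 1.2, 1.4, 1.6, 1.8, 2],
}

bins = histogram_bins(histogram)
for left, right, count in bins:
    print(f"{left} to {right}: {count}")
```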
Example
{
  "column_name": "clarity",
  "column_type": "float",
  "column_statistics": {
    "nan_count": 0,
    "nan_proportion": 0,
    "min": 0,
    "max": 2,
    "mean": 1.67206,
    "median": 1.8,
    "std": 0.38714,
    "histogram": {
      "hist": [
        17,
        12,
        48,
        52,
        135,
        188,
        814,
        15,
        1628,
        2048
      ],
      "bin_edges": [
        0,
        0.2,
        0.4,
        0.6,
        0.8,
        1,
        1.2,
        1.4,
        1.6,
        1.8,
        2
      ]
    }
  }
}
int
The following measures are returned for integer data types:
- minimum, maximum, mean, median, and standard deviation values
- number and proportion of null values
- histogram with at most 10 bins
Example
{
    "column_name": "direction",
    "column_type": "int",
    "column_statistics": {
        "nan_count": 0,
        "nan_proportion": 0.0,
        "min": 0,
        "max": 1,
        "mean": 0.49925,
        "median": 0.0,
        "std": 0.5,
        "histogram": {
            "hist": [
                50075,
                49925
            ],
            "bin_edges": [
                0,
                1,
                1
            ]
        }
    }
}
bool
The following measures are returned for the bool data type:
- number and proportion of null values
- value counts for 'True' and 'False' values
Example
{
  "column_name": "penalty",
  "column_type": "bool",
  "column_statistics":
    {
        "nan_count": 3,
        "nan_proportion": 0.15,
        "frequencies": {
            "False": 7,
            "True": 10
        }
    }
}
string_label
If the proportion of unique values in a string column within the requested split is lower than or equal to 0.2 and the number of unique values is lower than 1000, or if the number of unique values is lower than or equal to 10 (regardless of the proportion), the column is considered to be a category. The following measures are returned:
- number and proportion of null values
- number of unique values (excluding null)
- value counts for each label (excluding null)
Example
{
  "column_name": "answerKey",
  "column_type": "string_label",
  "column_statistics": {
    "nan_count": 0,
    "nan_proportion": 0,
    "n_unique": 4,
    "frequencies": {
      "D": 1221,
      "C": 1146,
      "A": 1378,
      "B": 1212
    }
  }
}
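The rule that decides between string_label and string_text can be written as a small predicate. This is only a sketch of the rule as described above, not the server's actual code: the function name is_string_label is hypothetical, and using the total number of samples as the denominator of the proportion is an assumption.

```python
def is_string_label(n_unique: int, n_samples: int) -> bool:
    """Apply the category rule described above (hypothetical helper)."""
    # At most 10 unique values: a category regardless of the proportion
    if n_unique <= 10:
        return True
    if n_samples == 0:
        return False
    # Otherwise both conditions must hold: proportion <= 0.2 and fewer than 1000 unique values
    return n_unique / n_samples <= 0.2 and n_unique < 1000


print(is_string_label(4, 4957))      # few unique values
print(is_string_label(150, 1000))    # proportion 0.15, fewer than 1000 unique values
print(is_string_label(8000, 8551))   # almost every value is unique
```

With these inputs the first two calls return True and the last one returns False.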
string_text
If a string column does not satisfy the conditions to be treated as a string_label, it is considered to be a column containing texts, and the response contains statistics over text lengths, measured in number of characters. The following measures are computed:
- minimum, maximum, mean, median, and standard deviation of text lengths
- number and proportion of null values
- histogram of text lengths with 10 bins
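To make the idea of statistics over text lengths concrete, here is a minimal local sketch using the standard library. It is not the server's implementation: the helper name text_length_statistics is hypothetical, and using the sample standard deviation (statistics.stdev) is an assumption, although the std value of the idx column in the response above is consistent with the sample formula.

```python
import statistics


def text_length_statistics(texts: list) -> dict:
    """Compute length statistics over non-null strings (hypothetical helper)."""
    # len() counts characters (Unicode code points), not bytes
    lengths = [len(text) for text in texts if text is not None]
    return {
        "nan_count": sum(text is None for text in texts),
        "min": min(lengths),
        "max": max(lengths),
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
        "std": statistics.stdev(lengths),  # sample standard deviation (assumption)
    }


stats = text_length_statistics(["a", "bb", None, "cccc"])
print(stats)
```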
Example
{
  "column_name": "sentence",
  "column_type": "string_text",
  "column_statistics": {
    "nan_count": 0,
    "nan_proportion": 0,
    "min": 6,
    "max": 231,
    "mean": 40.70074,
    "median": 37,
    "std": 19.14431,
    "histogram": {
      "hist": [
        2260,
        4512,
        1262,
        380,
        102,
        26,
        6,
        1,
        1,
        1
      ],
      "bin_edges": [
        6,
        29,
        52,
        75,
        98,
        121,
        144,
        167,
        190,
        213,
        231
      ]
    }
  }
}
list
For lists, the distribution of their lengths is computed. The following measures are returned:
- minimum, maximum, mean, median, and standard deviation of list lengths
- number and proportion of null values
- histogram of list lengths with up to 10 bins
Example
{
    "column_name": "chat_history",
    "column_type": "list",
    "column_statistics": {
        "nan_count": 0,
        "nan_proportion": 0.0,
        "min": 1,
        "max": 3,
        "mean": 1.01741,
        "median": 1.0,
        "std": 0.13146,
        "histogram": {
            "hist": [
                11177,
                196,
                1
            ],
            "bin_edges": [
                1,
                2,
                3,
                3
            ]
        }
    }
}
Note that dictionaries of lists are not supported.
audio
For audio data, the distribution of audio file durations is computed. The following measures are returned:
- minimum, maximum, mean, median, and standard deviation of audio file durations
- number and proportion of null values
- histogram of audio file durations with 10 bins
Example
{
    "column_name": "audio",
    "column_type": "audio",
    "column_statistics": {
        "nan_count": 0,
        "nan_proportion": 0,
        "min": 1.02,
        "max": 15,
        "mean": 13.93042,
        "median": 14.77,
        "std": 2.63734,
        "histogram": {
            "hist": [
                32,
                25,
                18,
                24,
                22,
                17,
                18,
                19,
                55,
                1770
            ],
            "bin_edges": [
                1.02,
                2.418,
                3.816,
                5.214,
                6.612,
                8.01,
                9.408,
                10.806,
                12.204,
                13.602,
                15
            ]
        }
    }
}
image
For image data, the distribution of image widths is computed. The following measures are returned:
- minimum, maximum, mean, median, and standard deviation of widths of image files
- number and proportion of null values
- histogram of image widths with 10 bins
Example
{
    "column_name": "image",
    "column_type": "image",
    "column_statistics": {
        "nan_count": 0,
        "nan_proportion": 0.0,
        "min": 256,
        "max": 873,
        "mean": 327.99339,
        "median": 341.0,
        "std": 60.07286,
        "histogram": {
            "hist": [
                1734,
                1637,
                1326,
                121,
                10,
                3,
                1,
                3,
                1,
                2
            ],
            "bin_edges": [
                256,
                318,
                380,
                442,
                504,
                566,
                628,
                690,
                752,
                814,
                873
            ]
        }
    }
}
datetime
For datetime data, the distribution of datetime values is computed. The following measures are returned:
- minimum, maximum, mean, median, and standard deviation of datetimes represented as strings with precision up to seconds
- number and proportion of null values
- histogram of datetimes with 10 bins
Example
{
    "column_name": "date",
    "column_type": "datetime",
    "column_statistics": {
        "nan_count": 0,
        "nan_proportion": 0.0,
        "min": "2013-05-18 04:54:11",
        "max": "2013-06-20 10:01:41",
        "mean": "2013-05-27 18:03:39",
        "median": "2013-05-23 11:55:50",
        "std": "11 days, 4:57:32.322450",
        "histogram": {
            "hist": [
                318776,
                393036,
                173904,
                0,
                0,
                0,
                0,
                0,
                0,
                206284
            ],
            "bin_edges": [
                "2013-05-18 04:54:11",
                "2013-05-21 12:36:57",
                "2013-05-24 20:19:43",
                "2013-05-28 04:02:29",
                "2013-05-31 11:45:15",
                "2013-06-03 19:28:01",
                "2013-06-07 03:10:47",
                "2013-06-10 10:53:33",
                "2013-06-13 18:36:19",
                "2013-06-17 02:19:05",
                "2013-06-20 10:01:41"
            ]
        }
    }
}
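Since datetime statistics arrive as strings, they need to be parsed before doing arithmetic on them. The following is a minimal sketch; the format string is inferred from the example values above and is an assumption (values carrying a time zone may look different), and the helper name parse_stat_datetime is hypothetical.

```python
from datetime import datetime, timedelta

# Format inferred from the example values above (assumption)
DATETIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_stat_datetime(value: str) -> datetime:
    """Parse a datetime string from column_statistics (hypothetical helper)."""
    return datetime.strptime(value, DATETIME_FORMAT)


column_statistics = {
    "min": "2013-05-18 04:54:11",
    "max": "2013-06-20 10:01:41",
}

# Time span covered by the column: max minus min
span = parse_stat_datetime(column_statistics["max"]) - parse_stat_datetime(column_statistics["min"])
print(span)  # 33 days, 5:07:30
```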