---
title: DataMeasurementsTool
emoji: 🤗
colorFrom: indigo
colorTo: red
sdk: streamlit
sdk_version: 1.0.0
app_file: app.py
pinned: false
python_version: 3.9.6
---

# Data Measurements Tool

🚧 Under Construction 🚧

[![Generic badge](https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg)](https://huggingface.co/spaces/huggingface/data-measurements-tool)

For more information, check out our [blog post](https://huggingface.co/blog/data-measurements-tool)!

# How to run:

After cloning (and potentially setting up your virtual environment), run:

`pip install -r requirements.txt`

This installs all the requirements for the tool.

## Command Line Interface

From there, you can measure various aspects of a dataset by running `run_data_measurements.py` with the appropriate options.
The options specify the Hugging Face dataset, the dataset config, the dataset columns being measured, the measurements to use, and further details about caching and saving.

To see the full list of options, run:

`python3 run_data_measurements.py -h` or `python3 run_data_measurements.py --help`

Example for the hate_speech18 dataset:

`python3 run_data_measurements.py --dataset="hate_speech18" --config="default" --split="train" --feature="text"`

Example for getting *just* the nPMI measurement from hate_speech18:

`python3 run_data_measurements.py --dataset=hate_speech18 --config default --split train --feature text --calculation npmi`


Example for the IMDB dataset:

`python3 run_data_measurements.py --dataset="imdb" --config="plain_text" --split="train" --label_field="label" --feature="text"`
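If you want to run the tool over several datasets in one go, the commands above can be assembled programmatically. The following is a minimal sketch, not part of the tool itself: `build_command` and `RUNS` are hypothetical names introduced here for illustration, and only the flags shown in the examples above are used. It prints each command as a dry run; you could pass the same argument list to `subprocess.run` to actually execute it.

```python
import shlex


def build_command(dataset, config, split, feature, label_field=None, calculation=None):
    """Assemble the argument list for run_data_measurements.py.

    Hypothetical helper: it only uses flags that appear in the examples above.
    """
    cmd = [
        "python3",
        "run_data_measurements.py",
        f"--dataset={dataset}",
        f"--config={config}",
        f"--split={split}",
        f"--feature={feature}",
    ]
    # Optional flags are appended only when a value is given.
    if label_field is not None:
        cmd.append(f"--label_field={label_field}")
    if calculation is not None:
        cmd.append(f"--calculation={calculation}")
    return cmd


# The same dataset settings as in the examples above.
RUNS = [
    dict(dataset="hate_speech18", config="default", split="train", feature="text"),
    dict(dataset="hate_speech18", config="default", split="train", feature="text",
         calculation="npmi"),
    dict(dataset="imdb", config="plain_text", split="train", feature="text",
         label_field="label"),
]

# Dry run: print each command instead of executing it.
for run in RUNS:
    print(shlex.join(build_command(**run)))
```

Building the command as a list rather than a single string avoids shell quoting problems if a dataset or config name ever contains spaces.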


## User Interface

`gradio app.py`