import streamlit as st

from data_measurements_clusters import Clustering

title = "Dataset Exploration"
description = "Comparison of hate speech detection datasets"
date = "2022-01-26"
thumbnail = "images/books.png"

__COLLECT = """
In order to turn observations of the world into data, choices must be made 
about what counts as data, where to collect data, and how to collect data. 
When collecting language data, this often means selecting websites that allow 
for easily collecting samples of text, and hate speech data is frequently 
collected from social media platforms like Twitter or forums like Wikipedia. 
Each of these decisions results in a specific sample of all the possible 
observations.
"""

__ANNOTATE = """
Once the data is collected, further decisions must be made about how to 
label it if it is being used to train a classification system, as is common 
in hate speech detection. The labels must be clearly defined so that the 
dataset can be annotated consistently, which in turn helps the classification 
model produce more consistent output. This labeling process, called 
*annotation*, can be done by the data collectors, by a set of trained 
annotators with relevant expert knowledge, or by online crowdworkers. Who 
is doing the annotating has a significant effect on the resulting set of 
labels ([Sap et al., 2019](https://aclanthology.org/P19-1163.pdf)).
"""

__STANDARDIZE = """
Hate speech detection is a relatively new task in NLP, and the definitions 
used across different projects vary. Some projects target just hate speech, 
but others 
may label their data for ‘toxic’, ‘offensive’, or ‘abusive’ language. Still 
others may address related problems such as bullying and harassment. 
This variation makes it difficult to compare across datasets and their 
respective models. As these modeling paradigms become more established, 
definitions grounded in relevant sociological research will need to be 
agreed upon in order for datasets and models in automatic content moderation 
(ACM) to appropriately 
capture the problems in the world that they set out to address. For more 
on this discussion, see 
[Madukwe et al., 2020](https://aclanthology.org/2020.alw-1.18.pdf) and 
[Fortuna et al., 2020](https://aclanthology.org/2020.lrec-1.838.pdf).
"""

__HOW_TO = """
To use the tool, select a dataset. The tool will then show clusters of 
examples in the dataset that have been automatically determined to be similar 
to one another. Below that, you can see specific examples within the cluster, 
the labels for those examples, and the distribution of labels within the 
cluster. Note that cluster 0 will always be the full dataset.
"""

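# Available clustering configurations, nested as:
#   dataset_name -> config_name -> split_name -> input_field_path -> label_name
#   -> num_rows -> model_name -> file_name,
# ending in a leaf dict of keyword arguments that is passed to Clustering(**args).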
DSET_OPTIONS = {
    "classla/FRENK-hate-en": {
        "binary": {
            "train": {
                ("text",): {
                    "label": {
                        100000: {
                            "sentence-transformers/all-mpnet-base-v2": {
                                "tree": {
                                    "dataset_name": "classla/FRENK-hate-en",
                                    "config_name": "binary",
                                    "split_name": "train",
                                    "input_field_path": ("text",),
                                    "label_name": "label",
                                    "num_rows": 100000,
                                    "model_name": "sentence-transformers/all-mpnet-base-v2",
                                    "file_name": "tree",
                                }
                            }
                        }
                    }
                }
            }
        }
    },
    "tweets_hate_speech_detection": {
        "default": {
            "train": {
                ("tweet",): {
                    "label": {
                        100000: {
                            "sentence-transformers/all-mpnet-base-v2": {
                                "tree": {
                                    "dataset_name": "tweets_hate_speech_detection",
                                    "config_name": "default",
                                    "split_name": "train",
                                    "input_field_path": ("tweet",),
                                    "label_name": "label",
                                    "num_rows": 100000,
                                    "model_name": "sentence-transformers/all-mpnet-base-v2",
                                    "file_name": "tree",
                                }
                            }
                        }
                    }
                }
            }
        }
    },
    "ucberkeley-dlab/measuring-hate-speech": {
        "default": {
            "train": {
                ("text",): {
                    "hatespeech": {
                        100000: {
                            "sentence-transformers/all-mpnet-base-v2": {
                                "tree": {
                                    "dataset_name": "ucberkeley-dlab/measuring-hate-speech",
                                    "config_name": "default",
                                    "split_name": "train",
                                    "input_field_path": ("text",),
                                    "label_name": "hatespeech",
                                    "num_rows": 100000,
                                    "model_name": "sentence-transformers/all-mpnet-base-v2",
                                    "file_name": "tree",
                                }
                            }
                        }
                    }
                }
            }
        }
    },
}


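# Cache the Clustering object across Streamlit reruns. allow_output_mutation
# tells st.cache not to hash the returned object, so mutating it later does
# not trigger a cached-object-mutation warning.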
@st.cache(allow_output_mutation=True)
def download_tree(args):
    clusters = Clustering(**args)
    return clusters


def run_article():
    st.markdown("# Making a Hate Speech Dataset")
    st.markdown("## Collecting observations of the world")
    with st.expander("Collection"):
        st.markdown(__COLLECT, unsafe_allow_html=True)
    st.markdown("## Annotating observations with task labels")
    with st.expander("Annotation"):
        st.markdown(__ANNOTATE, unsafe_allow_html=True)
    st.markdown("## Standardizing the task")
    with st.expander("Standardization"):
        st.markdown(__STANDARDIZE, unsafe_allow_html=True)
    st.markdown("# Exploring datasets")
    with st.expander("How to use the tool"):
        st.markdown(__HOW_TO, unsafe_allow_html=True)

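    # st.selectbox iterates over the options; passing the DSET_OPTIONS dict
    # shows its keys (the dataset names) as the available choices.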
    choose_dset = st.selectbox(
        "Select dataset to visualize",
        DSET_OPTIONS,
    )

    pre_args = DSET_OPTIONS[choose_dset]
    # Walk down the nested options until we reach the leaf dict of
    # keyword arguments for the Clustering constructor.
    args = pre_args
    while "dataset_name" not in args:
        args = list(args.values())[0]

    clustering = download_tree(args)

    st.markdown("---\n")

    full_tree_fig = clustering.get_full_tree()
    st.plotly_chart(full_tree_fig, use_container_width=True)

    st.markdown("---\n")
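    # Let the user pick a cluster node to inspect; node 0 is the root and
    # covers the full dataset.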
    show_node = st.selectbox(
        "Visualize cluster node:",
        range(len(clustering.node_list)),
    )
    st.markdown(
        f"Node {show_node} has {clustering.node_list[show_node]['weight']} examples."
    )
    st.markdown(
        f"Node {show_node} was merged at {clustering.node_list[show_node]['merged_at']:.2f}."
    )
    examplars = clustering.get_node_examplars(show_node)
    st.markdown("---\n")

    label_fig = clustering.get_node_label_chart(show_node)
    examplars_col, labels_col = st.columns([2, 1])
    examplars_col.markdown("#### Node cluster exemplars")
    examplars_col.table(examplars)
    labels_col.markdown("#### Node cluster labels")
    labels_col.plotly_chart(label_fig, use_container_width=True)