A binary classifier based on the RoBERTa-base architecture, fine-tuned on [historical British newspaper articles](https://huggingface.co/datasets/npedrazzini/hist_suicide_incident) to discern whether news reports discuss (confirmed or speculated) suicide cases, investigations, or court cases related to suicides. It attempts to differentiate between texts where _suicide_(_s_) or _suicidal_ is used in the context of actual incidents and those where these terms appear figuratively or in broader, non-specific discussions (e.g., mentions of the number of suicides in vital statistics, or abstract philosophical discussions of the morality of suicide).
# Overview

- **Model Name:** HistoroBERTa-SuicideIncidentClassifier
- **Task:** Binary Classification
- **Labels:** ['Incident', 'Not Incident']
- **Base Model:** [RoBERTa (A Robustly Optimized BERT Pretraining Approach) base model](https://huggingface.co/FacebookAI/roberta-base)
- **Language:** Late Modern English (1780-1920)
- **Developed by:** [Nilo Pedrazzini](https://huggingface.co/npedrazzini), [Daniel CS Wilson](https://huggingface.co/dcsw2)

# Input Format

A `str`-type input.

# Output Format

The predicted label (`Incident` or `Not Incident`), with a confidence score for each label.

# Examples

### Example 1

**Input:**

```
On Wednesday evening an inquest was held at the Stag and Pheasant before Major Taylor, coroner, and a jury, of whom Mr. Joel Casson was foreman, on the body of John William Birks, grocer, of 23, Huddersfield Road, who cut his throat on Tuesday evening.
```

**Output:**

```
{
  'Incident': 0.974,
  'Not Incident': 0.026
}
```

### Example 2

**Input:**

```
The death-rate by accidents among colliers is, at least, from six to seven times as great as the death-rate from violence among the whole population, including suicides homicides, and the dangerous occupations.
```

**Output:**

```
{
  'Not Incident': 0.577,
  'Incident': 0.423
}
```
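The confidence scores in these examples are softmax-normalized model logits. As an illustrative sketch (the logit values below are made up, not taken from the model), the normalization works like this:

```python
import math

def softmax(logits):
    # subtract the max logit for numerical stability
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# hypothetical raw logits for the two labels
logits = [1.2, -2.4]
probs = softmax(logits)
print({label: round(p, 3) for label, p in zip(['Incident', 'Not Incident'], probs)})
```

The two scores always sum to 1, so reporting both (as in the examples above) is redundant but convenient for downstream filtering.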

# Uses

The classifier can be used, for instance, to build larger datasets of reports on cases of suicide from historical digitized newspapers, on which to then carry out larger-scale analyses of the language used in the reports.
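As a concrete sketch of that workflow, once per-article label scores have been obtained from the classifier, a corpus can be filtered down to likely incident reports with a simple confidence threshold. The predictions and article names below are invented for illustration:

```python
# hypothetical per-article predictions, as label -> confidence score
predictions = [
    {"Incident": 0.974, "Not Incident": 0.026},
    {"Incident": 0.423, "Not Incident": 0.577},
    {"Incident": 0.861, "Not Incident": 0.139},
]
articles = ["inquest report", "statistics column", "court case report"]

# keep only articles classified as incidents with high confidence
THRESHOLD = 0.8
incident_articles = [
    text
    for text, pred in zip(articles, predictions)
    if pred["Incident"] >= THRESHOLD
]
print(incident_articles)  # -> ['inquest report', 'court case report']
```

The threshold is a corpus-building choice, not a property of the model: a higher value trades recall for precision in the resulting dataset.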
 
# Bias, Risks, and Limitations

The classifier was trained on digitized newspaper data containing many OCR errors and, while text segmentation was meant to capture individual news articles, each labeled item in the training dataset very often spans multiple articles. This will necessarily have introduced some bias into the model because of the extra content unrelated to reporting on suicide.

⚠ **NB**: We did not carry out a systematic evaluation of the effect of poor news-article segmentation on the quality of the classifier.

# Training Details
 
 
Nilo Pedrazzini

npedrazzini@turing.ac.uk
# How to use the model

Use the code below to get started with the model.

Import and load the model:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "npedrazzini/HistoroBERTa-SuicideIncidentClassifier"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
```

Generate a prediction:

```python
input_text = "The death-rate by accidents among colliers is, at least, from six to seven times as great as the death-rate from violence among the whole population, including suicides homicides, and the dangerous occupations."
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model(**inputs)
logits = outputs.logits
# convert raw logits into probabilities over the two labels
probabilities = logits.softmax(dim=-1)
```

Print the predicted label:

```python
# index of the highest-probability label, mapped back to its name
predicted_label_id = probabilities.argmax().item()
predicted_label = model.config.id2label[predicted_label_id]
print(predicted_label)
```

Output:

```
NotIncident
```

Print the probability of each label:

```python
# pair each label name with its probability, sorted from most to least likely
label_probabilities = {
    label: prob
    for label, prob in zip(model.config.id2label.values(), probabilities.squeeze().tolist())
}
label_probabilities_sorted = dict(
    sorted(label_probabilities.items(), key=lambda item: item[1], reverse=True)
)
print(label_probabilities_sorted)
```

Output:

```
{'NotIncident': 0.5880260467529297, 'Incident': 0.4119739532470703}
```