npedrazzini committed on
Commit
6bd79c2
1 Parent(s): ab89c20

Update README.md

Files changed (1)
  1. README.md +7 -5
README.md CHANGED
@@ -29,7 +29,7 @@ widget:
29
 
30
  # HistoroBERTa-SuicideIncidentClassifier
31
 
32
- A binary classifier based on the RoBERTa-base architecture, fine-tuned on historical British newspaper articles to discern whether news reports discuss (confirmed or speculated) suicide cases, investigations, or court cases related to suicides. It attempts to differentiate between texts where "suicide" or "suicidal" is used literally in the context of actual incidents and those where these terms appear figuratively or in broader, non-specific discussions (e.g., mention of number of suicides in the context of vital statistics, philosophical discussions around the morality of suicide at an abstract level, etc.).
33
 
34
  - **Developed by:** Nilo Pedrazzini, Daniel CS Wilson
35
  - **Language(s) (NLP):** Late Modern English (1780-1920)
@@ -38,17 +38,19 @@ A binary classifier based on the RoBERTa-base architecture, fine-tuned on histor
38
 
39
  # Uses
40
 
41
- The classifier can be used to obtain larger datasets reporting on concrete cases of suicide in historical digitized newspapers to carry out larger-scale analyses on the language used in the reports.
42
 
43
  # Bias, Risks, and Limitations
44
 
45
- The classifier was trained on digitized newspaper data containing many OCR errors and, while text segmentation was meant to capture individual news articles, each labeled item in the training dataset very often spans multiple articles. This will necessarily have introduced bias in the model because of the extra content unrelated to reporting on suicide.
 
46
 
47
  # Training Details
48
 
49
- This model was released upon comparison with other runs, based on accuracy on the evaluation set. Models fine-tuned based on RoBERTa were also compared to those fine-tuned on [bert_1760_1900](https://huggingface.co/Livingwithmachines/bert_1760_1900).
 
50
 
51
- In the following report, the model in this repository corresponds to the one labeled roberta-7, specifically the output of epoch 4, which returned the highest accuracy (>0.96).
52
 
53
  <img src="https://cdn-uploads.huggingface.co/production/uploads/6342a31d5b97f509388807f3/KXqMD4Pchpmkee5CMFFYb.png" style="width: 90%;" />
54
 
 
29
 
30
  # HistoroBERTa-SuicideIncidentClassifier
31
 
32
+ A binary classifier based on the RoBERTa-base architecture, fine-tuned on [historical British newspaper articles](https://huggingface.co/datasets/npedrazzini/hist_suicide_incident) to discern whether news reports discuss (confirmed or speculated) suicide cases, investigations, or court cases related to suicides. It attempts to differentiate between texts where _suicide_(_s_) or _suicidal_ is used in the context of actual incidents and those where these terms appear figuratively or in broader, non-specific discussions (e.g., mention of the number of suicides in the context of vital statistics; philosophical discussions around the morality of suicide at an abstract level; etc.).
33
 
34
  - **Developed by:** Nilo Pedrazzini, Daniel CS Wilson
35
  - **Language(s) (NLP):** Late Modern English (1780-1920)
 
38
 
39
  # Uses
40
 
41
+ The classifier can be used, for instance, to obtain larger datasets reporting on cases of suicide in historical digitized newspapers, which can then support larger-scale analyses of the language used in the reports.
42
 
43
  # Bias, Risks, and Limitations
44
 
45
+ The classifier was trained on digitized newspaper data containing many OCR errors and, while text segmentation was meant to capture individual news articles, each labeled item in the training dataset very often spans multiple articles. This will necessarily have introduced bias in the model because of the extra content unrelated to reporting on suicide.
46
+ &#9888; **NB** We did not systematically evaluate the effect of imperfect news-article segmentation on the quality of the classifier.
47
 
48
  # Training Details
49
 
50
+ This model was selected for release after comparison with other runs, based on its accuracy on the evaluation set.
51
+ Models based on RoBERTa were also compared to those based on [bert_1760_1900](https://huggingface.co/Livingwithmachines/bert_1760_1900), which achieved slightly lower performance despite hyperparameter tuning.
52
 
53
+ In the following report, the model in this repository corresponds to the one labeled `roberta-7`, specifically the output of epoch 4, which returned the highest accuracy (>0.96).
54
 
55
  <img src="https://cdn-uploads.huggingface.co/production/uploads/6342a31d5b97f509388807f3/KXqMD4Pchpmkee5CMFFYb.png" style="width: 90%;" />
56
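
The use case described under "Uses" (collecting a larger set of reports on concrete suicide cases) could be sketched roughly as below. This is only an illustration, not the authors' pipeline: the function name `filter_incident_reports`, the positive label name `LABEL_1`, the score threshold, and the `keyword_stub` stand-in classifier are all assumptions introduced here, since the README shown in this diff does not state the label names or how the model is called.

```python
from typing import Callable, Dict, Iterable, List

# Any callable that maps a text to {"label": str, "score": float}.
Classifier = Callable[[str], Dict[str, object]]


def filter_incident_reports(
    articles: Iterable[str],
    classify: Classifier,
    positive_label: str = "LABEL_1",  # assumed label name, check the model config
    min_score: float = 0.5,           # assumed threshold, tune on held-out data
) -> List[str]:
    """Keep only the articles that the classifier marks as incident reports."""
    kept = []
    for text in articles:
        prediction = classify(text)
        is_positive = prediction["label"] == positive_label
        if is_positive and float(prediction["score"]) >= min_score:
            kept.append(text)
    return kept


def keyword_stub(text: str) -> Dict[str, object]:
    """Stand-in for demonstration only; this is NOT the fine-tuned model."""
    is_incident = "inquest" in text.lower()
    return {"label": "LABEL_1" if is_incident else "LABEL_0", "score": 0.9}


if __name__ == "__main__":
    sample = [
        "An inquest was held yesterday on the body of a labourer.",
        "The number of suicides in the vital statistics rose last year.",
    ]
    for article in filter_incident_reports(sample, keyword_stub):
        print(article)
```

To use the actual model instead of the stub, one option is to wrap a `transformers` text-classification pipeline, for example `clf = pipeline("text-classification", model=...)` followed by `classify = lambda t: clf(t, truncation=True)[0]`, with the model repository id filled in by the reader. Keeping the classifier as a plain callable means the filtering logic can be checked without downloading any weights.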