---
license: mit
datasets:
- npedrazzini/hist_suicide_incident
language:
- en
metrics:
- accuracy
pipeline_tag: text-classification
tags:
- roberta-based
- historical newspaper
- late modern english
- text classification
widget:
- src: "On Wednesday evening an inquest was held at the Stag and Pheasant before Major Taylor, coroner, and a jury, of whom Mr. Joel Casson was foreman, on the body of John William Birks, grocer, of 23, Huddersfield Road, who cut his throat on Tuesday evening"
  example_title: Incident example
- src: "The death-rate by accidents among colliers is, at least, from six to seven times as great as the death-rate from violence among the whole population, including suicides homicides, and the dangerous occupations"
  example_title: Not Incident example
---

# HistoroBERTa-SuicideIncidentClassifier

A binary classifier fine-tuned from RoBERTa-base on historical British newspaper articles. It predicts whether a news report discusses a specific suicide case (confirmed or speculated), including investigations and court cases related to suicides. It distinguishes texts in which "suicide" or "suicidal" refers to an actual incident from texts in which the terms are used figuratively or in broader, non-specific discussion, for example counts of suicides in vital statistics or abstract philosophical discussion of the morality of suicide.

- **Developed by:** Nilo Pedrazzini, Daniel CS Wilson
- **Language(s) (NLP):** Late Modern English (1780-1920)
- **License:** MIT
- **Parent Model:** [roberta-base](https://huggingface.co/FacebookAI/roberta-base)

# Uses

The classifier can be used to filter digitized historical newspapers for reports on concrete cases of suicide, yielding larger datasets for large-scale analysis of the language used in such reports.
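
The filtering step described above could be sketched as follows. This is a minimal illustration, not the authors' pipeline: `filter_incident_reports` and its parameters are hypothetical names, `classify` stands for any callable that returns one `{"label", "score"}` dict per text (as a `transformers` text-classification pipeline does), and the incident label string `"LABEL_1"` is an assumption that should be checked against the model's `id2label` configuration.

```python
from typing import Callable, Dict, Iterable, List

# ASSUMPTION: "LABEL_1" as the incident class is a placeholder; check the
# model's config.id2label for the actual label strings.
INCIDENT_LABEL = "LABEL_1"


def filter_incident_reports(
    articles: Iterable[str],
    classify: Callable[[List[str]], List[Dict]],
    incident_label: str = INCIDENT_LABEL,
    min_score: float = 0.5,
    batch_size: int = 32,
) -> List[str]:
    """Keep articles predicted as incident reports with score >= min_score."""
    kept: List[str] = []
    batch: List[str] = []

    def flush() -> None:
        # Classify the current batch and keep the texts that pass the filter.
        for text, pred in zip(batch, classify(batch)):
            if pred["label"] == incident_label and pred["score"] >= min_score:
                kept.append(text)
        batch.clear()

    for article in articles:
        batch.append(article)
        if len(batch) == batch_size:
            flush()
    if batch:  # classify the final, partially filled batch
        flush()
    return kept
```

Raising `min_score` trades recall for precision, which may matter given the noisy OCR input the model was trained on.
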

# Bias, Risks, and Limitations

The classifier was trained on digitized newspaper data containing many OCR errors. Text segmentation was meant to capture individual news articles, but each labeled item in the training dataset very often spans multiple articles. The extra content, which is unrelated to reporting on suicide, will have introduced bias into the model.

# Training Details

This model was selected for release after comparison with other runs, based on accuracy on the evaluation set. Models fine-tuned from RoBERTa were also compared with models fine-tuned from [bert_1760_1900](https://huggingface.co/Livingwithmachines/bert_1760_1900).

In the report below, the model in this repository is the run labeled roberta-7, specifically the output of epoch 4, which achieved the highest accuracy (>0.96).

<img src="https://cdn-uploads.huggingface.co/production/uploads/6342a31d5b97f509388807f3/KXqMD4Pchpmkee5CMFFYb.png" style="width: 90%;" />

## Training Data

https://huggingface.co/datasets/npedrazzini/hist_suicide_incident

# Model Card Authors

Nilo Pedrazzini

# Model Card Contact

npedrazzini@turing.ac.uk

# How to Get Started with the Model

Use the code below to get started with the model.

<details>
<summary> Click to expand </summary>

COMING SOON
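
Until an official snippet is provided, the following is a minimal sketch of how a model like this one is typically loaded with the `transformers` text-classification pipeline. The repository id in `MODEL_ID` is an assumption inferred from the card title and the dataset namespace, and the helper names `load_classifier` and `predict` are hypothetical. The label strings returned depend on the model's `id2label` configuration, which this card does not state.

```python
from typing import Callable, Dict, List, Optional, Tuple

# ASSUMPTION: repository id inferred from the card title and the dataset
# namespace; replace it with the actual id of this repository.
MODEL_ID = "npedrazzini/HistoroBERTa-SuicideIncidentClassifier"

Classifier = Callable[[List[str]], List[Dict]]


def load_classifier(model_id: str = MODEL_ID) -> Classifier:
    """Return a function mapping a list of texts to label/score dicts."""
    # Imported lazily so this module can be imported without transformers.
    from transformers import pipeline

    clf = pipeline("text-classification", model=model_id)
    # truncation=True cuts each input to the model's maximum sequence length.
    return lambda texts: clf(texts, truncation=True)


def predict(
    texts: List[str], classify: Optional[Classifier] = None
) -> List[Tuple[str, float]]:
    """Classify texts and return one (label, score) pair per input."""
    if classify is None:
        classify = load_classifier()  # downloads the weights on first use
    return [(r["label"], float(r["score"])) for r in classify(texts)]
```

Calling `predict(["On Wednesday evening an inquest was held ..."])` returns one `(label, score)` pair per input text. Because training items often span several articles, long inputs are truncated here rather than split.
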

</details>