---
license: cc-by-4.0
language:
- en
pipeline_tag: token-classification
---

# Byline Detection

## Model description

**byline_detection** is a fine-tuned DistilBERT token classification model that tags bylines and datelines in news articles. It is trained to be robust to OCR noise.


## Intended uses

You can use this model with the Transformers NER pipeline.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

# Load the fine-tuned model and its tokenizer from the Hub
tokenizer = AutoTokenizer.from_pretrained("dell-research-harvard/byline-detection")
model = AutoModelForTokenClassification.from_pretrained("dell-research-harvard/byline-detection")

# Run token classification on an OCR'd news snippet
nlp = pipeline("ner", model=model, tokenizer=tokenizer)
example = "NEW ORLEANS, (UP) — The Roman Catholic Church, through its leaders in the United States today appealed "

ner_results = nlp(example)
print(ner_results)
```

## Limitations and bias

This model was trained on historical news and may reflect biases from that period. It may also not generalise well to other settings.
Additionally, the model occasionally tags subword tokens as entities, so post-processing of the results may be necessary to handle those cases; one mitigation is sketched below.
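
A simple mitigation, if you want whole-entity spans rather than raw subword tags, is the `aggregation_strategy` argument of the Transformers `pipeline`, which merges adjacent subword pieces that share a label. This is a minimal sketch using the same example as above:

```python
from transformers import pipeline

# aggregation_strategy="simple" groups consecutive subword pieces with the
# same entity label into a single span with character start/end offsets.
nlp = pipeline(
    "ner",
    model="dell-research-harvard/byline-detection",
    aggregation_strategy="simple",
)

example = "NEW ORLEANS, (UP) — The Roman Catholic Church, through its leaders in the United States today appealed "

for entity in nlp(example):
    print(entity["entity_group"], entity["word"], entity["start"], entity["end"])
```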

## Training data

This model was fine-tuned on historical English-language news that had been OCR'd from American newspapers.

#### Number of examples per split
Dataset|Count
-|-
Train|1,392
Dev|464
Test|464


## Training procedure

The data was used to fine-tune a DistilBERT model at a learning rate of 2e−5 with a batch size of 16 for 25 epochs.
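
For reference, a minimal fine-tuning sketch with those hyperparameters is below. It is illustrative, not the authors' training code: the base checkpoint (`distilbert-base-uncased`), the label names, and the toy dataset are all assumptions.

```python
from datasets import Dataset
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    DataCollatorForTokenClassification,
    Trainer,
    TrainingArguments,
)

# Hypothetical label set: the actual tag names are not documented in this card.
label_list = ["O", "B-BYLINE", "I-BYLINE", "B-DATELINE", "I-DATELINE"]

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=len(label_list)
)

# Toy single-example dataset, purely for illustration.
raw = Dataset.from_dict({
    "tokens": [["NEW", "ORLEANS", ",", "(", "UP", ")", "The", "church", "said"]],
    "ner_tags": [[3, 4, 4, 4, 4, 4, 0, 0, 0]],
})

def tokenize_and_align(batch):
    # Re-tokenize pre-split words and copy each word's tag to its subwords;
    # special tokens get -100 so the loss ignores them.
    enc = tokenizer(batch["tokens"], truncation=True, is_split_into_words=True)
    enc["labels"] = [
        [-100 if w is None else tags[w] for w in enc.word_ids(batch_index=i)]
        for i, tags in enumerate(batch["ner_tags"])
    ]
    return enc

train = raw.map(tokenize_and_align, batched=True, remove_columns=raw.column_names)

# Hyperparameters reported in the card: lr 2e-5, batch size 16, 25 epochs.
args = TrainingArguments(
    output_dir="byline-detection",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=25,
)

Trainer(
    model=model,
    args=args,
    train_dataset=train,
    data_collator=DataCollatorForTokenClassification(tokenizer),
).train()
```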


## Eval results
Statistic|Result
-|-
F1 | 0.96
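
The card does not specify the evaluation script. A common way to compute entity-level F1 for token classification is the `seqeval` package, sketched here with made-up gold and predicted sequences (and the same hypothetical label names as above):

```python
from seqeval.metrics import f1_score

# Toy gold and predicted tag sequences, one list per sentence.
y_true = [["B-DATELINE", "I-DATELINE", "O", "B-BYLINE", "I-BYLINE"]]
y_pred = [["B-DATELINE", "I-DATELINE", "O", "B-BYLINE", "O"]]

print(f1_score(y_true, y_pred))  # entity-level F1
```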


## Notes

This model card was influenced by that of [dslim/bert-base-NER](https://huggingface.co/dslim/bert-base-NER).


## Citation

If you use this model, please cite the following paper:

```
@misc{silcock2024newswirelargescalestructureddatabase,
      title={Newswire: A Large-Scale Structured Database of a Century of Historical News}, 
      author={Emily Silcock and Abhishek Arora and Luca D'Amico-Wong and Melissa Dell},
      year={2024},
      eprint={2406.09490},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2406.09490}, 
}
```

## Applications

We applied this model to a century of historical news articles and georeferenced the bylines. You can see them all in the [NEWSWIRE dataset](https://huggingface.co/datasets/dell-research-harvard/newswire).