|
---
license: cc-by-4.0
language:
- en
pipeline_tag: token-classification
---
|
|
|
# Byline Detection |
|
|
|
## Model description |
|
|
|
**byline_detection** is a fine-tuned DistilBERT token-classification model that tags bylines and datelines in news articles.
|
|
|
It is trained to be robust to OCR noise.
|
|
|
|
|
## Intended uses |
|
|
|
You can use this model with the Transformers NER `pipeline`.
|
|
|
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load the fine-tuned model and its tokenizer from the Hub
tokenizer = AutoTokenizer.from_pretrained("dell-research-harvard/byline-detection")
model = AutoModelForTokenClassification.from_pretrained("dell-research-harvard/byline-detection")

# Run token classification on a snippet that opens with a dateline
nlp = pipeline("ner", model=model, tokenizer=tokenizer)
example = "NEW ORLEANS, (UP) — The Roman Catholic Church, through its leaders in the United States today appealed "

ner_results = nlp(example)
print(ner_results)
```
|
|
|
## Limitations and bias |
|
|
|
This model was trained on historical news and may reflect biases from a specific period of time. It may also not generalise well to other settings.

Additionally, the model occasionally tags subword tokens as entities, so post-processing of the results may be necessary; one option is sketched below.
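
A minimal sketch of such post-processing, using the pipeline's generic `aggregation_strategy` argument (a standard Transformers option, not something specific to this model) to merge subword pieces into word-level entities:

```python
from transformers import pipeline

# "simple" merges contiguous subword pieces into word-level entity spans,
# which removes most stray subword tags. This is a generic Transformers
# option, not a model-specific fix.
nlp = pipeline(
    "ner",
    model="dell-research-harvard/byline-detection",
    aggregation_strategy="simple",
)
print(nlp("NEW ORLEANS, (UP) — The Roman Catholic Church, through its leaders in the United States today appealed "))
```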
|
|
|
## Training data |
|
|
|
This model was fine-tuned on historical English-language news that had been OCR'd from American newspapers.
|
|
|
#### Number of examples per split

Dataset | Count
--- | ---
Train | 1,392
Dev | 464
Test | 464
|
|
|
|
|
## Training procedure |
|
|
|
The data was used to fine-tune a DistilBERT model at a learning rate of 2e-5 with a batch size of 16 for 25 epochs.
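
These hyperparameters map directly onto the standard `Trainer` API. The sketch below is illustrative only: the base checkpoint (`distilbert-base-uncased`), the `B-BYLINE`/`I-BYLINE` label scheme, and the one-example toy dataset are assumptions, not the authors' exact training setup.

```python
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

labels = ["O", "B-BYLINE", "I-BYLINE"]  # hypothetical label scheme
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")  # assumed base checkpoint
model = AutoModelForTokenClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=len(labels)
)

# Stand-in for the real training split: one tokenised line with all-"O" labels
enc = tokenizer("NEW ORLEANS, (UP) — The Roman Catholic Church", truncation=True)
toy = {
    "input_ids": enc["input_ids"],
    "attention_mask": enc["attention_mask"],
    "labels": [0] * len(enc["input_ids"]),
}

args = TrainingArguments(
    output_dir="byline-detection",
    learning_rate=2e-5,              # as reported above
    per_device_train_batch_size=16,  # as reported above
    num_train_epochs=25,             # as reported above
)

Trainer(model=model, args=args, train_dataset=[toy] * 16).train()
```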
|
|
|
|
|
## Eval results |
|
Statistic | Result
--- | ---
F1 | 0.96
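
Entity-level F1 of this kind is conventionally computed with `seqeval`; a minimal sketch follows, with made-up tag sequences (the `B-BYLINE`/`I-BYLINE` scheme here is an illustrative assumption):

```python
from seqeval.metrics import f1_score

# Made-up gold and predicted tag sequences, one list per sentence; the real
# evaluation uses the held-out test split described above.
y_true = [["B-BYLINE", "I-BYLINE", "O", "O"], ["O", "B-BYLINE", "O"]]
y_pred = [["B-BYLINE", "I-BYLINE", "O", "O"], ["O", "O", "O"]]
print(f1_score(y_true, y_pred))  # ≈ 0.67: one of the two gold spans is recovered
```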
|
|
|
|
|
## Notes |
|
|
|
This model card was influenced by that of [dslim/bert-base-NER](https://huggingface.co/dslim/bert-base-NER).
|
|
|
|
|
## Citation |
|
|
|
If you use this model, please cite the following paper:
|
|
|
```
@misc{silcock2024newswirelargescalestructureddatabase,
      title={Newswire: A Large-Scale Structured Database of a Century of Historical News},
      author={Emily Silcock and Abhishek Arora and Luca D'Amico-Wong and Melissa Dell},
      year={2024},
      eprint={2406.09490},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2406.09490},
}
```
|
|
|
## Applications
|
|
|
We applied this model to a century of historical news articles and georeferenced the bylines. You can see them all in the [NEWSWIRE dataset](https://huggingface.co/datasets/dell-research-harvard/newswire).
|
|
|
|