---
license: cc-by-4.0
language:
- en
pipeline_tag: token-classification
---

# Byline Detection

## Model description

**byline_detection** is a fine-tuned DistilBERT token classification model that tags bylines and datelines in news articles. It is trained to be robust to OCR noise.

## Intended uses

You can use this model with the Transformers pipeline for NER.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

# Load the tokenizer and the fine-tuned model from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("dell-research-harvard/byline-detection")
model = AutoModelForTokenClassification.from_pretrained("dell-research-harvard/byline-detection")

# Build a token-classification ("ner") pipeline around them.
nlp = pipeline("ner", model=model, tokenizer=tokenizer)
example = "NEW ORLEANS, (UP) — The Roman Catholic Church, through its leaders in the United States today appealed "

# Each result is a dict describing one tagged token.
ner_results = nlp(example)
print(ner_results)
```

## Limitations and bias

This model was trained on historical news and may reflect biases from a specific period of time. It may also not generalise well to other settings. Additionally, the model occasionally tags subword tokens as entities, so results may need post-processing to handle those cases.

## Training data

This model was fine-tuned on historical English-language news that had been OCR'd from American newspapers.

#### Number of examples per split

Dataset|Count
-|-
Train|1,392
Dev|464
Test|464

## Training procedure

The data was used to fine-tune a DistilBERT model at a learning rate of 2e-5 with a batch size of 16 for 25 epochs.

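
To give a sense of the scale of the training run described above, here is a rough sketch of the number of optimizer steps it implies. The three hyperparameters and the training set size come from this card; the dictionary keys simply mirror argument names of `transformers.TrainingArguments`, and the single-device, no-gradient-accumulation setup is an assumption, not something the card states.

```python
import math

# Values stated in this model card.
train_examples = 1392
hyperparams = {
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 16,
    "num_train_epochs": 25,
}

# Assumptions (not stated in the card): one device, no gradient
# accumulation, and the final partial batch (if any) is kept.
steps_per_epoch = math.ceil(
    train_examples / hyperparams["per_device_train_batch_size"]
)
total_steps = steps_per_epoch * hyperparams["num_train_epochs"]

print(steps_per_epoch, total_steps)  # 87 2175
```

Under those assumptions the run is short (about two thousand updates), which is consistent with fine-tuning a small model on a small labelled set.
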
## Eval results

Statistic|Result
-|-
F1|0.96

## Notes

This model card was influenced by that of [dslim/bert-base-NER](https://huggingface.co/dslim/bert-base-NER/edit/main/README.md).

## Citation

If you use this model, you can cite the following paper:

```
@misc{silcock2024newswirelargescalestructureddatabase,
      title={Newswire: A Large-Scale Structured Database of a Century of Historical News},
      author={Emily Silcock and Abhishek Arora and Luca D'Amico-Wong and Melissa Dell},
      year={2024},
      eprint={2406.09490},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2406.09490},
}
```

# Applications

We applied this model to a century of historical news articles and georeferenced the bylines. You can see them all in the [NEWSWIRE dataset](https://huggingface.co/datasets/dell-research-harvard/newswire).
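
When applying the model to your own text, the subword issue noted under "Limitations and bias" can be handled with a small post-processing step. The following is a minimal sketch, not the authors' method: it snaps each tagged token to the whitespace-delimited word that contains it and merges neighbouring words with the same label. The helper name `expand_to_words`, the label names `DATELINE` and `BYLINE`, and the sample predictions are all hypothetical; the dict keys (`entity`, `start`, `end`) follow the output format of the Transformers `ner` pipeline.

```python
def expand_to_words(text, tokens):
    """Snap token-level predictions to whole whitespace-delimited words."""
    spans = []
    for tok in sorted(tokens, key=lambda t: t["start"]):
        start, end = tok["start"], tok["end"]
        # Grow the span left and right until whitespace or the text edge.
        while start > 0 and not text[start - 1].isspace():
            start -= 1
        while end < len(text) and not text[end].isspace():
            end += 1
        # Drop a BIO prefix ("B-" or "I-") if the label has one.
        label = tok["entity"]
        if label[:2] in ("B-", "I-"):
            label = label[2:]
        # Merge with the previous span when the labels match and the two
        # spans overlap or are separated only by whitespace.
        prev = spans[-1] if spans else None
        if prev and prev["label"] == label and not text[prev["end"]:start].strip():
            prev["end"] = max(prev["end"], end)
        else:
            spans.append({"label": label, "start": start, "end": end})
    for span in spans:
        span["text"] = text[span["start"]:span["end"]]
    return spans


# Hypothetical predictions: the second one covers only "LEA" inside "ORLEANS,".
text = "NEW ORLEANS, (UP) The Roman Catholic Church"
tokens = [
    {"entity": "B-DATELINE", "start": 0, "end": 3},
    {"entity": "I-DATELINE", "start": 6, "end": 9},
    {"entity": "B-BYLINE", "start": 14, "end": 16},
]
spans = expand_to_words(text, tokens)
print([(s["label"], s["text"]) for s in spans])
# [('DATELINE', 'NEW ORLEANS,'), ('BYLINE', '(UP)')]
```

Because words are split on whitespace only, attached punctuation such as the trailing comma stays inside the span; strip it afterwards if you need clean strings. The pipeline's own `aggregation_strategy` argument (for example `aggregation_strategy="first"`) is another option worth trying before writing custom logic.
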