|
---
license: cc-by-4.0
language:
- en
pipeline_tag: token-classification
---
|
|
|
# Byline Detection |
|
|
|
## Model description |
|
|
|
**byline_detection** is a fine-tuned DistilBERT token-classification model that tags bylines and datelines in news articles.
|
|
|
It is trained to be robust to OCR noise.
|
|
|
|
|
## Intended uses |
|
|
|
You can use this model with the Transformers NER `pipeline`.
|
|
|
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load the fine-tuned model and its tokenizer from the Hub
tokenizer = AutoTokenizer.from_pretrained("dell-research-harvard/byline-detection")
model = AutoModelForTokenClassification.from_pretrained("dell-research-harvard/byline-detection")

# Run token classification on a snippet that opens with a dateline
nlp = pipeline("ner", model=model, tokenizer=tokenizer)
example = "NEW ORLEANS, (UP) — The Roman Catholic Church, through its leaders in the United States today appealed "

ner_results = nlp(example)
print(ner_results)
```
|
|
|
## Limitations and bias |
|
|
|
This model was trained on historical news and may reflect biases from a specific period of time. It may also not generalise well to other settings.

Additionally, the model occasionally tags subword tokens as entities, so post-processing of the results may be necessary; one option is sketched below.
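
A minimal sketch of such post-processing, using the pipeline's generic `aggregation_strategy` argument (a standard Transformers option, not something specific to this model) to merge subword pieces into word-level entities:

```python
from transformers import pipeline

# "simple" merges contiguous subword pieces into word-level entity spans,
# which removes most stray subword tags. This is a generic Transformers
# option, not a model-specific fix.
nlp = pipeline(
    "ner",
    model="dell-research-harvard/byline-detection",
    aggregation_strategy="simple",
)
print(nlp("NEW ORLEANS, (UP) — The Roman Catholic Church, through its leaders in the United States today appealed "))
```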
|
|
|
## Training data |
|
|
|
This model was fine-tuned on historical English-language news that had been OCR'd from American newspapers.
|
|
|
#### Number of examples per split

Dataset | Count
--- | ---
Train | 1,392
Dev | 464
Test | 464
|
|
|
|
|
## Training procedure |
|
|
|
The data was used to fine-tune a DistilBERT model at a learning rate of 2e-5 with a batch size of 16 for 25 epochs.
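
These hyperparameters map directly onto the standard `Trainer` API. The sketch below is illustrative only: the base checkpoint (`distilbert-base-uncased`), the `B-BYLINE`/`I-BYLINE` label scheme, and the one-example toy dataset are assumptions, not the authors' exact training setup.

```python
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

labels = ["O", "B-BYLINE", "I-BYLINE"]  # hypothetical label scheme
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")  # assumed base checkpoint
model = AutoModelForTokenClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=len(labels)
)

# Stand-in for the real training split: one tokenised line with all-"O" labels
enc = tokenizer("NEW ORLEANS, (UP) — The Roman Catholic Church", truncation=True)
toy = {
    "input_ids": enc["input_ids"],
    "attention_mask": enc["attention_mask"],
    "labels": [0] * len(enc["input_ids"]),
}

args = TrainingArguments(
    output_dir="byline-detection",
    learning_rate=2e-5,              # as reported above
    per_device_train_batch_size=16,  # as reported above
    num_train_epochs=25,             # as reported above
)

Trainer(model=model, args=args, train_dataset=[toy] * 16).train()
```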
|
|
|
|
|
## Eval results |
|
Statistic | Result
--- | ---
F1 | 0.96
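
Entity-level F1 of this kind is conventionally computed with `seqeval`; a minimal sketch follows, with made-up tag sequences (the `B-BYLINE`/`I-BYLINE` scheme here is an illustrative assumption):

```python
from seqeval.metrics import f1_score

# Made-up gold and predicted tag sequences, one list per sentence; the real
# evaluation uses the held-out test split described above.
y_true = [["B-BYLINE", "I-BYLINE", "O", "O"], ["O", "B-BYLINE", "O"]]
y_pred = [["B-BYLINE", "I-BYLINE", "O", "O"], ["O", "O", "O"]]
print(f1_score(y_true, y_pred))  # ≈ 0.67: one of the two gold spans is recovered
```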
|
|
|
|
|
## Notes |
|
|
|
This model card was influenced by that of [dslim/bert-base-NER](https://huggingface.co/dslim/bert-base-NER).
|
|
|
|
|
## Citation |
|
|
|
If you use this model, please cite the following paper:
|
|
|
```
@misc{silcock2024newswirelargescalestructureddatabase,
      title={Newswire: A Large-Scale Structured Database of a Century of Historical News},
      author={Emily Silcock and Abhishek Arora and Luca D'Amico-Wong and Melissa Dell},
      year={2024},
      eprint={2406.09490},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2406.09490},
}
```
|
|
|
## Applications
|
|
|
We applied this model to a century of historical news articles and georeferenced the bylines. You can see them all in the [NEWSWIRE dataset](https://huggingface.co/datasets/dell-research-harvard/newswire).
|
|
|
|