---
license: cc-by-2.0
language:
- en
pipeline_tag: token-classification
---

# Historical newspaper NER

## Model description

**historical_newspaper_ner** is a fine-tuned RoBERTa-large model for named entity recognition on text that may contain OCR errors.

It has been trained to recognize four types of entities: locations (LOC), organizations (ORG), persons (PER), and miscellaneous (MISC).

It was trained on a custom historical newspaper dataset with highly accurate labels: all data were double entered by two highly skilled Harvard undergraduates, and all discrepancies were resolved by hand.

## Intended uses

You can use this model with the Transformers pipeline for NER.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

tokenizer = AutoTokenizer.from_pretrained("dell-research-harvard/historical_newspaper_ner")
model = AutoModelForTokenClassification.from_pretrained("dell-research-harvard/historical_newspaper_ner")

nlp = pipeline("ner", model=model, tokenizer=tokenizer)
example = "My name is Wolfgang and I live in Berlin"

ner_results = nlp(example)
print(ner_results)
```

## Limitations and bias

This model was trained on historical news and may reflect biases from a specific period of time. It may also not generalise well to other settings.
Additionally, the model occasionally tags subword tokens as entities, and post-processing of the results may be necessary to handle those cases.
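
One common mitigation (not described in the original card, so treat it as an assumption to verify on your own data) is to let the pipeline merge subword pieces back into whole-word spans via its `aggregation_strategy` option:

```python
from transformers import pipeline

# "simple" groups contiguous subword pieces into whole-word entity spans;
# results on noisy OCR text may still need a manual post-processing pass.
nlp = pipeline(
    "ner",
    model="dell-research-harvard/historical_newspaper_ner",
    aggregation_strategy="simple",
)

print(nlp("My name is Wolfgang and I live in Berlin"))
```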

## Training data

The training dataset distinguishes between the beginning and continuation of an entity, so that when there are back-to-back entities of the same type the model can indicate where the second entity begins (an illustrative example follows the table below). Each token is classified as one of the following classes:

Abbreviation|Description
-|-
O|Outside of a named entity
B-MISC|Beginning of a miscellaneous entity
I-MISC|Miscellaneous entity
B-PER|Beginning of a person's name
I-PER|Person's name
B-ORG|Beginning of an organization
I-ORG|Organization
B-LOC|Beginning of a location
I-LOC|Location

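
To make the scheme concrete, here is a small invented example in which two person entities appear back to back; the second `B-PER` marks where the new entity starts:

```python
# Invented sentence with word-level IOB tags in the scheme described above.
tokens = ["Reporters", "interviewed", "John", "Smith", "Mary", "Jones", "yesterday", "."]
labels = ["O",         "O",           "B-PER", "I-PER", "B-PER", "I-PER", "O",        "O"]

# Without the B-/I- distinction, "John Smith Mary Jones" would collapse into
# a single PER span; the second B-PER keeps the entity boundary recoverable.
```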

The model was fine-tuned on historical English-language news that had been OCR'd from American newspapers.
Unlike many other NER datasets, this data has highly accurate labels: all data were double entered by two highly skilled Harvard undergraduates, and all discrepancies were resolved by hand.

#### Number of training examples per entity type

Dataset|Articles|PER|ORG|LOC|MISC
-|-|-|-|-|-
Train|227|1345|450|1191|1037
Dev|48|231|59|192|149
Test|48|261|83|199|181

## Training procedure

The data were used to fine-tune a RoBERTa-large model (Liu et al., 2019) with a learning rate of 4.7e-05 and a batch size of 128 for 184 epochs.
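
The training script itself is not part of this card; purely as a sketch of what an equivalent setup could look like with the Hugging Face `Trainer` API (dataset preparation and label alignment are omitted, and every argument not listed above is an assumption left at its default):

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer, TrainingArguments

# Label set from the table above: O plus B-/I- tags for PER, ORG, LOC and MISC.
label_list = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC"]

tokenizer = AutoTokenizer.from_pretrained("roberta-large")
model = AutoModelForTokenClassification.from_pretrained("roberta-large", num_labels=len(label_list))

# Hyperparameters reported in the card; everything else stays at the defaults.
training_args = TrainingArguments(
    output_dir="historical_newspaper_ner",
    learning_rate=4.7e-5,
    per_device_train_batch_size=128,
    num_train_epochs=184,
)
# A Trainer would then be constructed with these arguments plus a token-classification
# dataset whose word-level IOB labels have been aligned to RoBERTa's subword tokens.
```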

## Eval results

Entity|F1
-|-
PER|94.3
ORG|80.7
LOC|90.8
MISC|79.6
Overall (stringent)|86.5
Overall (ignoring entity type)|90.4

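
The card does not state which scorer produced these numbers. Span-level ("stringent") F1 is commonly computed with a library such as seqeval, so an evaluation in that style (the tool choice is an assumption, not something the card specifies) might look like:

```python
from seqeval.metrics import classification_report, f1_score

# Toy gold and predicted tag sequences, one inner list per sentence.
y_true = [["B-PER", "I-PER", "O", "B-LOC"]]
y_pred = [["B-PER", "I-PER", "O", "B-ORG"]]

# Entity-level F1: a prediction counts only if both the span and the type match.
print(f1_score(y_true, y_pred))
print(classification_report(y_true, y_pred))
```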

## Notes

This model card was influenced by that of [dslim/bert-base-NER](https://huggingface.co/dslim/bert-base-NER).