mmBERT Impresso Media Sources NER
Source model card for the future Hugging Face model repository impresso-project/mmbert-impresso-mediasources-ner.
The published model should include:
- model weights
- tokenizer files
- config.json
- requirements.txt
- this model card
The first publication is a standard Transformers token-classification model. A custom Impresso JSON pipeline will be added later, after the inference interface has been finalized.
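Because the first release is a standard token-classification model, it should load with the stock Transformers pipeline once the repository is published. The snippet below is a minimal sketch under that assumption: `load_ner`, `format_entities`, and `main` are hypothetical helper names, the sample sentence is only illustrative, and the label names printed depend on the published config.json.

```python
# Minimal sketch: load the model with the generic Transformers pipeline.
# Assumes the repository named in this card has been published.
MODEL_ID = "impresso-project/mmbert-impresso-mediasources-ner"


def load_ner(model_id: str = MODEL_ID):
    """Build a token-classification pipeline (downloads the model on first use)."""
    # Imported lazily so the helpers below can be used without transformers installed.
    from transformers import pipeline

    # aggregation_strategy="simple" merges sub-word tokens into entity spans.
    return pipeline(
        "token-classification",
        model=model_id,
        aggregation_strategy="simple",
    )


def format_entities(entities):
    """Reduce aggregated pipeline output to (label, text, start, end) tuples."""
    return [
        (ent["entity_group"], ent["word"], ent["start"], ent["end"])
        for ent in entities
    ]


def main():
    ner = load_ner()
    # Illustrative sentence only; not taken from the evaluation data.
    text = "Paris, 12 mars (Havas). Le ministre a quitté la capitale ce matin."
    for row in format_entities(ner(text)):
        print(row)

# Call main() to run the example; it needs network access to fetch the weights.
```

The `start` and `end` values are character offsets into the input string, so each predicted span can be recovered with `text[start:end]`.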
The v0.1 model predicts cited source mentions for news agencies and radio stations. The repository name uses "media sources" because the label space may later grow to cover other cited media-source families, such as newspaper citations.
This model is fine-tuned from impresso-project/mmbert-multilingual-impresso-continued-mlm, which continued pretraining from jhu-clsp/mmBERT-base. The upstream mmBERT base model is released under the MIT license.
Evaluation
Validation/dev exact entity metrics:
- Precision: 0.8561
- Recall: 0.9040
- F1: 0.8794
- Correct/gold/predicted entities: 113 / 125 / 132
Test exact entity metrics:
- Precision: 0.9111
- Recall: 0.9213
- F1: 0.9162
- Correct/gold/predicted entities: 164 / 178 / 180
These scores are from the legacy converted French/German HIPE-style evaluation splits. Manual disagreement curation is planned after the first publication.
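The reported precision, recall, and F1 values can be re-derived from the correct/gold/predicted counts listed above. The sketch below assumes the usual exact-match definitions (precision = correct / predicted, recall = correct / gold, F1 as their harmonic mean); the scorer actually used is not named in this card, and `prf` is a hypothetical helper.

```python
def prf(correct: int, gold: int, predicted: int):
    """Return (precision, recall, F1) from exact-match entity counts."""
    precision = correct / predicted if predicted else 0.0
    recall = correct / gold if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Counts copied from the lists above, in the order correct / gold / predicted.
dev = prf(correct=113, gold=125, predicted=132)
test = prf(correct=164, gold=178, predicted=180)

for name, scores in (("dev", dev), ("test", test)):
    p, r, f = scores
    print(f"{name}: P={p:.4f} R={r:.4f} F1={f:.4f}")
```

With these definitions the counts reproduce the listed scores to four decimal places.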