---
license: mit
---

# Model Card for named_entity_recognition.pt

This is a fine-tuned model checkpoint for the named entity recognition (NER) task used in the biodata resource inventory performed by the [Global Biodata Coalition](https://globalbiodata.org/) in collaboration with the [Chan Zuckerberg Initiative](https://chanzuckerberg.com/).

# Model Details

## Model Description

This model has been fine-tuned to detect resource names in scientific articles (title and abstract). It performs token classification, assigning predicted token labels following the [BIO scheme](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)). These labels are post-processed to determine the predicted "common name" (often an acronym) and "full name" of a resource present in an article.

- **Developed by:** Ana-Maria Istrate and Kenneth E. Schackart III
- **Shared by:** Kenneth E. Schackart III
- **Model type:** RoBERTa (BERT; Transformer)
- **Language(s) (NLP):** English
- **License:** MIT
- **Finetuned from model:** https://huggingface.co/allenai/dsp_roberta_base_dapt_biomed_tapt_rct_500

## Model Sources

- **Repository:** https://github.com/globalbiodata/inventory_2022/tree/inventory_2022_dev
- **Paper [optional]:** TBA
- **Demo [optional]:** TBA

# Uses

This model can be used to find predicted biodata resource names in an article's title and abstract.

## Direct Use

Direct use of the model has not been assessed or designed.

## Out-of-Scope Use

The model should not be used for anything other than the use described in [Uses](named_entity_recognition_modelcard.md#uses).

# Bias, Risks, and Limitations

Biases may have been introduced at several stages of the development and training of this model. First, the base model was pretrained on biomedical corpora as described in [Gururangan S., *et al.,* 2020](http://arxiv.org/abs/2004.10964). Second, the model was fine-tuned on scientific articles that were manually annotated by two curators.
Biases in the manual annotation may have affected model fine-tuning. Additionally, the manually annotated data were procured using a specific search query to Europe PMC, so generalizability may be limited when applying the model to articles from other sources.

## Recommendations

The model should only be used for identifying resource names in articles from Europe PMC retrieved using the [query](https://github.com/globalbiodata/inventory_2022/blob/inventory_2022_dev/config/query.txt) present in the GitHub repository. Additionally, only articles predicted or known to describe a biodata resource should be used.

## How to Get Started with the Model

Follow the directions in the [GitHub repository](https://github.com/globalbiodata/inventory_2022/tree/inventory_2022_dev).

# Training Details

## Training Data

The model was trained on the training split of the [labeled training data](https://github.com/globalbiodata/inventory_2022/blob/inventory_2022_dev/data/manual_ner_extraction.csv).

*Note*: The data can be split into consistent training, validation, and testing splits using the procedures detailed in the GitHub repository.

## Training Procedure

The model was trained for 10 epochs, and *F*1-score, precision, recall, and loss were computed after each epoch. The model checkpoint with the highest *F*1-score on the validation set was saved, regardless of epoch number.

### Preprocessing

To generate the input to the model, each article's title and abstract were concatenated into a contiguous string, separated by a single white space character. All XML tags were removed using a regular expression.

### Speeds, Sizes, Times

The model checkpoint is 496 MB. Speed has not been benchmarked.

# Evaluation

## Testing Data, Factors & Metrics

### Testing Data

The model was evaluated using the test split of the [labeled data](https://github.com/globalbiodata/inventory_2022/blob/inventory_2022_dev/data/manual_ner_extraction.csv).

### Metrics

The model was evaluated using *F*1-score, precision, and recall.
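For NER, these metrics are computed at the entity level: BIO tag sequences are converted to spans, and a predicted span counts as correct only if it exactly matches a gold span. The following is a minimal, self-contained sketch of such span-level scoring; the function names and the exact matching scheme are illustrative assumptions, not the repository's evaluation code.

```python
from typing import List, Set, Tuple


def bio_spans(tags: List[str]) -> Set[Tuple[int, int, str]]:
    """Collect (start, end, type) entity spans from a BIO tag sequence."""
    spans: Set[Tuple[int, int, str]] = set()
    start, etype = None, None
    for i, tag in enumerate(tags + ["O"]):  # trailing "O" flushes any open span
        inside = tag.startswith("I-") and start is not None and etype == tag[2:]
        if not inside and start is not None:
            spans.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, etype = i, tag[2:]  # open a new span (tolerates I- without B-)
    return spans


def entity_prf1(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Exact-match entity-level precision, recall, and F1-score."""
    g, p = bio_spans(gold), bio_spans(pred)
    tp = len(g & p)  # spans predicted with exactly the right boundaries and type
    precision = tp / len(p) if p else 0.0
    recall = tp / len(g) if g else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

For example, `entity_prf1(["B-COM", "I-COM", "O", "B-FUL"], ["B-COM", "I-COM", "O", "O"])` counts one true positive and one missed entity, giving precision 1.0 and recall 0.5. Here `COM` and `FUL` stand in for common-name and full-name labels; the actual label set used in the repository may differ.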
Precision was prioritized during fine-tuning and model selection.

## Results

- *F*1-score: 0.717
- Precision: 0.689
- Recall: 0.748

# Model Examination

The model works satisfactorily for identifying resource names from articles describing biodata resources in the literature.

## Model Architecture and Objective

The base model architecture is as described in [Gururangan S., *et al.,* 2020](http://arxiv.org/abs/2004.10964). Token classification is performed using a linear sequence classification layer initialized using [transformers.AutoModelForTokenClassification()](https://huggingface.co/docs/transformers/model_doc/auto).

## Compute Infrastructure

The model was fine-tuned on Google Colaboratory.

### Hardware

The model was fine-tuned using GPU acceleration provided by Google Colaboratory.

### Software

Training software was written in Python.

# Citation

TBA

**BibTeX:**

TBA

**APA:**

TBA

# Model Card Authors

This model card was written by Kenneth E. Schackart III.

# Model Card Contact

Ken Schackart:
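As an illustration of the post-processing step described under Model Description, the sketch below groups BIO-labeled tokens back into predicted resource names. It is a simplified assumption, not the repository's implementation: the function name and the `COM`/`FUL` labels are hypothetical, and tokens are treated as whole words joined by spaces, whereas the actual model operates on RoBERTa subword tokens.

```python
from typing import Dict, List


def extract_names(tokens: List[str], labels: List[str]) -> Dict[str, List[str]]:
    """Group BIO-labeled tokens into predicted names, keyed by entity type."""
    names: Dict[str, List[str]] = {}
    current: List[str] = []
    etype = None
    for tok, lab in zip(tokens + [""], labels + ["O"]):  # sentinel flushes last span
        if lab.startswith("I-") and current and etype == lab[2:]:
            current.append(tok)  # continue the current entity
            continue
        if current:  # entity ended: record the joined name under its type
            names.setdefault(etype, []).append(" ".join(current))
            current, etype = [], None
        if lab.startswith("B-"):
            current, etype = [tok], lab[2:]  # start a new entity
    return names
```

For example, tokens `["The", "Protein", "Data", "Bank", "(", "PDB", ")"]` with labels `["O", "B-FUL", "I-FUL", "I-FUL", "O", "B-COM", "O"]` yield `{"FUL": ["Protein Data Bank"], "COM": ["PDB"]}`, i.e. one full name and one common name for the article.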