---
language:
- en
thumbnail: >-
  https://www.fusion.uni-jena.de/fusionmedia/fusionpictures/fusion-service/fusion-transp.png?height=383&width=680
tags:
- bert-base-cased
- biodiversity
- token-classification
- sequence-classification
license: apache-2.0
citation: "Abdelmageed, N., Löffler, F., & König-Ries, B. (2023). BiodivBERT: a Pre-Trained Language Model for the Biodiversity Domain."
paper: https://ceur-ws.org/Vol-3415/paper-7.pdf
metrics:
- f1
- precision
- recall
- accuracy
evaluation datasets:
- url: https://doi.org/10.5281/zenodo.6554208
- named entity recognition:
  - COPIOUS
  - QEMP
  - BiodivNER
  - LINNAEUS
  - Species800
- relation extraction:
  - GAD
  - EU-ADR
  - BiodivRE
  - BioRelEx
training_data:
- crawling-keywords:
  - biodivers
  - genetic diversity
  - omic diversity
  - phylogenetic diversity
  - soil diversity
  - population diversity
  - species diversity
  - ecosystem diversity
  - functional diversity
  - microbial diversity
- corpora:
  - (+Abs) Springer and Elsevier abstracts from 1990 to 2020
  - (+Abs+Full) Springer and Elsevier abstracts and open-access full publication text from 1990 to 2020
pre-training-hyperparams:
- MAX_LEN = 512 # default of the BERT tokenizer
- MLM_PROP = 0.15 # masking probability for the data collator
- num_train_epochs = 3 # the minimum sufficient number of epochs reported in many articles; also the trainer default
- per_device_train_batch_size = 16 # the maximum that fits on a V100 on Ara with MAX_LEN = 512 (8 in the earlier run)
- per_device_eval_batch_size = 16 # same as the training batch size
- gradient_accumulation_steps = 4 # grants an effective batch size of 16 * 4 * nGPUs
---

# BiodivBERT

## Model description

* BiodivBERT is a domain-specific, cased BERT-based model for biodiversity literature.
* It uses the tokenizer of the BERT base cased model.
* BiodivBERT is pre-trained on abstracts and full text from biodiversity literature.
* BiodivBERT is fine-tuned on two downstream tasks in the biodiversity domain: Named Entity Recognition and Relation Extraction.
* Please visit our [GitHub Repo](https://github.com/fusion-jena/BiodivBERT) for more details.

## How to use

* You can use BiodivBERT via the Hugging Face `transformers` library as follows:

1. Masked Language Model

````
>>> from transformers import AutoTokenizer, AutoModelForMaskedLM
>>> tokenizer = AutoTokenizer.from_pretrained("NoYo25/BiodivBERT")
>>> model = AutoModelForMaskedLM.from_pretrained("NoYo25/BiodivBERT")
````

2. Token Classification - Named Entity Recognition

````
>>> from transformers import AutoTokenizer, AutoModelForTokenClassification
>>> tokenizer = AutoTokenizer.from_pretrained("NoYo25/BiodivBERT")
>>> model = AutoModelForTokenClassification.from_pretrained("NoYo25/BiodivBERT")
````

3. Sequence Classification - Relation Extraction

````
>>> from transformers import AutoTokenizer, AutoModelForSequenceClassification
>>> tokenizer = AutoTokenizer.from_pretrained("NoYo25/BiodivBERT")
>>> model = AutoModelForSequenceClassification.from_pretrained("NoYo25/BiodivBERT")
````

## Training data

* BiodivBERT is pre-trained on abstracts and full text from biodiversity-related publications.
* We used both the Elsevier and Springer APIs to crawl this data.
* We covered publications from 1990 to 2020.

## Evaluation results

BiodivBERT outperformed the baselines ``BERT_base_cased``, ``biobert_v1.1``, and a ``BiLSTM`` on the downstream tasks.
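
## Usage sketches

To complement the loading snippets in the *How to use* section, the masked-language-model head can be exercised end to end with the `fill-mask` pipeline. The sketch below is a minimal, illustrative example; the input sentence is made up and not taken from the training corpus.

````python
# Minimal sketch: masked-token prediction with BiodivBERT via the fill-mask pipeline.
# The example sentence is illustrative only.
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="NoYo25/BiodivBERT",
    tokenizer="NoYo25/BiodivBERT",
)

# BiodivBERT uses the BERT base cased tokenizer, so the mask token is [MASK].
for pred in fill_mask("Habitat loss is a major driver of [MASK] decline."):
    print(f"{pred['token_str']:>15}  score={pred['score']:.3f}")
````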
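
The token-classification and sequence-classification variants loaded above come with freshly initialized task heads and need to be fine-tuned before use. The following is a minimal sketch of fine-tuning the sequence-classification variant (e.g. for relation extraction) with the `Trainer` API; the in-line toy examples, binary label set, and hyperparameters are placeholders and not the configuration used in the BiodivBERT paper.

````python
# Minimal sketch: fine-tuning BiodivBERT for sequence classification (e.g. relation
# extraction). Toy data, labels, and hyperparameters are placeholders only.
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("NoYo25/BiodivBERT")
model = AutoModelForSequenceClassification.from_pretrained("NoYo25/BiodivBERT", num_labels=2)

# Toy dataset: sentences with a binary relation label (1 = relation present, 0 = absent).
train_data = Dataset.from_dict({
    "text": [
        "Soil microbial diversity increases with plant species richness.",
        "The weather station recorded temperatures every hour.",
    ],
    "label": [1, 0],
})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

train_data = train_data.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="biodivbert-re",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
)

trainer = Trainer(model=model, args=args, train_dataset=train_data)
trainer.train()
````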