metadata
language:
- en
thumbnail: >-
https://www.fusion.uni-jena.de/fusionmedia/fusionpictures/fusion-service/fusion-transp.png?height=383&width=680
tags:
- bert-base-cased
- biodiversity
- token-classification
- sequence-classification
license: apache-2.0
citation: >-
Abdelmageed, N., Löffler, F., & König-Ries, B. (2023). BiodivBERT: a
Pre-Trained Language Model for the Biodiversity Domain.
paper: https://ceur-ws.org/Vol-3415/paper-7.pdf
metrics:
- f1
- precision
- recall
- accuracy
evaluation datasets:
- url: https://doi.org/10.5281/zenodo.6554208
- named entity recognition:
- COPIOUS
- QEMP
- BiodivNER
- LINNAEUS
- Species800
- relation extraction:
- GAD
- EU-ADR
- BiodivRE
- BioRelEx
training_data:
- crawling-keywords:
- biodivers
- genetic diversity
- omic diversity
- phylogenetic diversity
- soil diversity
- population diversity
- species diversity
- ecosystem diversity
- functional diversity
- microbial diversity
- corpora:
- (+Abs) Springer and Elsevier abstracts published 1990-2020
- >-
  (+Abs+Full) Springer and Elsevier abstracts and open-access full
  publication text published 1990-2020
pre-training-hyperparams:
- MAX_LEN = 512
- MLM_PROP = 0.15
- num_train_epochs = 3
- per_device_train_batch_size = 16
- per_device_eval_batch_size = 16
- gradient_accumulation_steps = 4
BiodivBERT
Model description
- BiodivBERT is a domain-specific, cased BERT model for the biodiversity literature.
- It uses the tokenizer from the BERT base cased model.
- BiodivBERT is pre-trained on abstracts and full text from the biodiversity literature.
- BiodivBERT is fine-tuned on two downstream tasks, Named Entity Recognition and Relation Extraction, in the biodiversity domain.
- Please visit our GitHub Repo for more details.
How to use
- You can use BiodivBERT via the Hugging Face transformers library as follows:
- Masked Language Model
>>> from transformers import AutoTokenizer, AutoModelForMaskedLM
>>> tokenizer = AutoTokenizer.from_pretrained("NoYo25/BiodivBERT")
>>> model = AutoModelForMaskedLM.from_pretrained("NoYo25/BiodivBERT")
- Token Classification - Named Entity Recognition
>>> from transformers import AutoTokenizer, AutoModelForTokenClassification
>>> tokenizer = AutoTokenizer.from_pretrained("NoYo25/BiodivBERT")
>>> model = AutoModelForTokenClassification.from_pretrained("NoYo25/BiodivBERT")
- Sequence Classification - Relation Extraction
>>> from transformers import AutoTokenizer, AutoModelForSequenceClassification
>>> tokenizer = AutoTokenizer.from_pretrained("NoYo25/BiodivBERT")
>>> model = AutoModelForSequenceClassification.from_pretrained("NoYo25/BiodivBERT")
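For relation extraction as sequence classification, the input sentence is usually preprocessed so the classifier can locate the two candidate entities. The exact format used to fine-tune BiodivBERT is defined in the GitHub repo; the sketch below shows one common convention (hypothetical marker tokens, independent of the model):

```python
def mark_entities(text, e1, e2,
                  markers=(("[E1]", "[/E1]"), ("[E2]", "[/E2]"))):
    """Wrap two entity mentions in marker tokens so a sequence
    classifier can attend to the candidate pair.

    The [E1]/[E2] markers are hypothetical; check the BiodivBERT
    repo for the exact fine-tuning input format.
    """
    (o1, c1), (o2, c2) = markers
    # Mark only the first occurrence of each mention.
    text = text.replace(e1, f"{o1} {e1} {c1}", 1)
    text = text.replace(e2, f"{o2} {e2} {c2}", 1)
    return text

example = mark_entities(
    "Quercus robur supports high arthropod diversity.",
    "Quercus robur",
    "arthropod diversity",
)
```

The marked string is then tokenized and passed to the sequence-classification model, which predicts the relation label for the entity pair.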
Training data
- BiodivBERT is pre-trained on abstracts and full text from biodiversity-related publications.
- We used both the Elsevier and Springer APIs to crawl this data.
- We covered publications from 1990 to 2020.
Evaluation results
BiodivBERT outperformed BERT_base_cased, biobert_v1.1, and a BiLSTM baseline on both downstream tasks.