---
language:
  - en
thumbnail: >-
  https://www.fusion.uni-jena.de/fusionmedia/fusionpictures/fusion-service/fusion-transp.png?height=383&width=680
tags:
  - bert-base-cased
  - biodiversity
  - token-classification
  - sequence-classification
license: apache-2.0
citation: >-
  Abdelmageed, N., Löffler, F., & König-Ries, B. (2023). BiodivBERT: a
  Pre-Trained Language Model for the Biodiversity Domain.
paper: https://ceur-ws.org/Vol-3415/paper-7.pdf
metrics:
  - f1
  - precision
  - recall
  - accuracy
evaluation datasets:
  - url: https://doi.org/10.5281/zenodo.6554208
  - named entity recognition:
      - COPIOUS
      - QEMP
      - BiodivNER
      - LINNAEUS
      - Species800
  - relation extraction:
      - GAD
      - EU-ADR
      - BiodivRE
      - BioRelEx
training_data:
  - crawling-keywords:
      - biodivers
      - genetic diversity
      - omic diversity
      - phylogenetic diversity
      - soil diversity
      - population diversity
      - species diversity
      - ecosystem diversity
      - functional diversity
      - microbial diversity
  - corpora:
      - (+Abs) Springer and Elsevier abstracts covering 1990-2020
      - >-
        (+Abs+Full) Springer and Elsevier abstracts and open access full
        publication text covering 1990-2020
pre-training-hyperparams:
  - MAX_LEN = 512
  - MLM_PROP = 0.15
  - num_train_epochs = 3
  - per_device_train_batch_size = 16
  - per_device_eval_batch_size = 16
  - gradient_accumulation_steps = 4
---

# BiodivBERT

## Model description

- BiodivBERT is a domain-specific, cased BERT-based model for the biodiversity literature.
- It uses the tokenizer of the BERT base cased model.
- BiodivBERT is pre-trained on abstracts and full text from the biodiversity literature.
- BiodivBERT is fine-tuned on two downstream tasks, Named Entity Recognition and Relation Extraction, in the biodiversity domain.
- Please visit our GitHub repository for more details.

## How to use

- You can use BiodivBERT via the Hugging Face `transformers` library as follows:
1. Masked Language Model

```python
>>> from transformers import AutoTokenizer, AutoModelForMaskedLM
>>> tokenizer = AutoTokenizer.from_pretrained("NoYo25/BiodivBERT")
>>> model = AutoModelForMaskedLM.from_pretrained("NoYo25/BiodivBERT")
```
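
For a quick check of the masked-language-modeling head, you can wrap the model in a `fill-mask` pipeline; the sentence below is an illustrative example, not taken from the training corpus:

```python
>>> from transformers import pipeline

>>> # Fill-mask pipeline on top of BiodivBERT's pre-trained MLM head.
>>> fill_mask = pipeline("fill-mask", model="NoYo25/BiodivBERT")
>>> # [MASK] is the mask token of the cased BERT tokenizer; the sentence is illustrative.
>>> predictions = fill_mask("Habitat loss is a major driver of [MASK] decline.")
>>> print([p["token_str"] for p in predictions])
```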
2. Token Classification - Named Entity Recognition

```python
>>> from transformers import AutoTokenizer, AutoModelForTokenClassification
>>> tokenizer = AutoTokenizer.from_pretrained("NoYo25/BiodivBERT")
>>> model = AutoModelForTokenClassification.from_pretrained("NoYo25/BiodivBERT")
```
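
Note that the checkpoint loaded here is the pre-trained language model, so the token-classification head above is newly initialized and needs fine-tuning before it can predict entities. A minimal sketch with a hypothetical BIO label set (the labels below are illustrative, not the tag sets of the benchmark corpora):

```python
>>> from transformers import AutoTokenizer, AutoModelForTokenClassification

>>> # Hypothetical BIO label set; replace with the labels of your NER corpus
>>> # (e.g. COPIOUS, BiodivNER, LINNAEUS, Species800).
>>> labels = ["O", "B-Taxon", "I-Taxon", "B-Habitat", "I-Habitat"]
>>> tokenizer = AutoTokenizer.from_pretrained("NoYo25/BiodivBERT")
>>> model = AutoModelForTokenClassification.from_pretrained(
...     "NoYo25/BiodivBERT",
...     num_labels=len(labels),
...     id2label=dict(enumerate(labels)),
...     label2id={label: i for i, label in enumerate(labels)},
... )
>>> # Fine-tune `model` on a labeled NER dataset before running predictions.
```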
3. Sequence Classification - Relation Extraction

```python
>>> from transformers import AutoTokenizer, AutoModelForSequenceClassification
>>> tokenizer = AutoTokenizer.from_pretrained("NoYo25/BiodivBERT")
>>> model = AutoModelForSequenceClassification.from_pretrained("NoYo25/BiodivBERT")
```
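
Similarly, the sequence-classification head for relation extraction has to be fine-tuned. A minimal sketch with two labels for a binary related/unrelated task (the label count and the example sentence are assumptions for illustration):

```python
>>> import torch
>>> from transformers import AutoTokenizer, AutoModelForSequenceClassification

>>> tokenizer = AutoTokenizer.from_pretrained("NoYo25/BiodivBERT")
>>> model = AutoModelForSequenceClassification.from_pretrained(
...     "NoYo25/BiodivBERT",
...     num_labels=2,  # binary relation extraction; adjust for multi-class benchmarks
... )
>>> # Illustrative input; the scores are meaningless until the head is fine-tuned.
>>> inputs = tokenizer("Dactylis glomerata dominates nutrient-rich grassland plots.",
...                    return_tensors="pt", truncation=True, max_length=512)
>>> with torch.no_grad():
...     logits = model(**inputs).logits
>>> print(logits.softmax(dim=-1))
```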

## Training data

- BiodivBERT is pre-trained on abstracts and full text from biodiversity-related publications.
- We used both the Elsevier and Springer APIs to crawl this data.
- We covered publications from 1990 to 2020.
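
The pre-training hyperparameters listed in the metadata above map onto a standard masked-language-modeling setup in `transformers`. The following is a minimal sketch, assuming initialization from `bert-base-cased` (whose tokenizer BiodivBERT uses) and a tiny placeholder corpus in place of the crawled abstracts and full texts:

```python
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-cased")

# Placeholder corpus; in practice this is the crawled Springer/Elsevier text.
texts = ["Species diversity declines with increasing habitat fragmentation."]
encodings = tokenizer(texts, truncation=True, max_length=512)  # MAX_LEN = 512
train_dataset = [{"input_ids": ids, "attention_mask": mask}
                 for ids, mask in zip(encodings["input_ids"], encodings["attention_mask"])]

# Dynamic masking with 15% probability (MLM_PROP = 0.15).
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="biodivbert-pretraining",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=4,
)

trainer = Trainer(model=model, args=args, data_collator=collator, train_dataset=train_dataset)
trainer.train()
```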

## Evaluation results

BiodivBERT outperformed BERT_base_cased, biobert_v1.1, and a BiLSTM baseline on the downstream tasks.