---
language:
  - en
thumbnail: >-
  https://www.fusion.uni-jena.de/fusionmedia/fusionpictures/fusion-service/fusion-transp.png?height=383&width=680
tags:
  - bert-base-cased
  - biodiversity
  - token-classification
  - sequence-classification
license: apache-2.0
citation: >-
  Abdelmageed, N., Löffler, F., & König-Ries, B. (2023). BiodivBERT: a
  Pre-Trained Language Model for the Biodiversity Domain.
paper: https://ceur-ws.org/Vol-3415/paper-7.pdf
metrics:
  - f1
  - precision
  - recall
  - accuracy
evaluation datasets:
  - url: https://doi.org/10.5281/zenodo.6554208
  - named entity recognition:
      - COPIOUS
      - QEMP
      - BiodivNER
      - LINNAEUS
      - Species800
  - relation extraction:
      - GAD
      - EU-ADR
      - BiodivRE
      - BioRelEx
training_data:
  - crawling-keywords:
      - biodivers
      - genetic diversity
      - omic diversity
      - phylogenetic diversity
      - soil diversity
      - population diversity
      - species diversity
      - ecosystem diversity
      - functional diversity
      - microbial diversity
  - corpora:
      - (+Abs) Springer and Elsevier abstracts covering 1990-2020
      - >-
        (+Abs+Full) Springer and Elsevier abstracts and open access full
        publication text covering 1990-2020
pre-training-hyperparams:
  - MAX_LEN = 512
  - MLM_PROP = 0.15
  - num_train_epochs = 3
  - per_device_train_batch_size = 16
  - per_device_eval_batch_size = 16
  - gradient_accumulation_steps = 4
---

# BiodivBERT

## Model description

- BiodivBERT is a domain-specific, cased BERT-based model for the biodiversity literature.
- It uses the tokenizer of the BERT base cased model.
- BiodivBERT is pre-trained on abstracts and full text from the biodiversity literature.
- BiodivBERT is fine-tuned on two downstream tasks, Named Entity Recognition and Relation Extraction, in the biodiversity domain.
- Please visit our GitHub repository for more details.

## How to use

- You can use BiodivBERT via the Hugging Face `transformers` library as follows:
1. Masked Language Model

```python
>>> from transformers import AutoTokenizer, AutoModelForMaskedLM
>>> tokenizer = AutoTokenizer.from_pretrained("NoYo25/BiodivBERT")
>>> model = AutoModelForMaskedLM.from_pretrained("NoYo25/BiodivBERT")
```
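
For a quick check of the masked-language-modeling head, you can wrap the model in a `fill-mask` pipeline; the sentence below is an illustrative example, not taken from the training corpus:

```python
>>> from transformers import pipeline

>>> # Fill-mask pipeline on top of BiodivBERT's pre-trained MLM head.
>>> fill_mask = pipeline("fill-mask", model="NoYo25/BiodivBERT")
>>> # [MASK] is the mask token of the cased BERT tokenizer; the sentence is illustrative.
>>> predictions = fill_mask("Habitat loss is a major driver of [MASK] decline.")
>>> print([p["token_str"] for p in predictions])
```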
2. Token Classification - Named Entity Recognition

```python
>>> from transformers import AutoTokenizer, AutoModelForTokenClassification
>>> tokenizer = AutoTokenizer.from_pretrained("NoYo25/BiodivBERT")
>>> model = AutoModelForTokenClassification.from_pretrained("NoYo25/BiodivBERT")
```
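
Note that the checkpoint loaded here is the pre-trained language model, so the token-classification head above is newly initialized and needs fine-tuning before it can predict entities. A minimal sketch with a hypothetical BIO label set (the labels below are illustrative, not the tag sets of the benchmark corpora):

```python
>>> from transformers import AutoTokenizer, AutoModelForTokenClassification

>>> # Hypothetical BIO label set; replace with the labels of your NER corpus
>>> # (e.g. COPIOUS, BiodivNER, LINNAEUS, Species800).
>>> labels = ["O", "B-Taxon", "I-Taxon", "B-Habitat", "I-Habitat"]
>>> tokenizer = AutoTokenizer.from_pretrained("NoYo25/BiodivBERT")
>>> model = AutoModelForTokenClassification.from_pretrained(
...     "NoYo25/BiodivBERT",
...     num_labels=len(labels),
...     id2label=dict(enumerate(labels)),
...     label2id={label: i for i, label in enumerate(labels)},
... )
>>> # Fine-tune `model` on a labeled NER dataset before running predictions.
```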
3. Sequence Classification - Relation Extraction

```python
>>> from transformers import AutoTokenizer, AutoModelForSequenceClassification
>>> tokenizer = AutoTokenizer.from_pretrained("NoYo25/BiodivBERT")
>>> model = AutoModelForSequenceClassification.from_pretrained("NoYo25/BiodivBERT")
```
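
Similarly, the sequence-classification head for relation extraction has to be fine-tuned. A minimal sketch with two labels for a binary related/unrelated task (the label count and the example sentence are assumptions for illustration):

```python
>>> import torch
>>> from transformers import AutoTokenizer, AutoModelForSequenceClassification

>>> tokenizer = AutoTokenizer.from_pretrained("NoYo25/BiodivBERT")
>>> model = AutoModelForSequenceClassification.from_pretrained(
...     "NoYo25/BiodivBERT",
...     num_labels=2,  # binary relation extraction; adjust for multi-class benchmarks
... )
>>> # Illustrative input; the scores are meaningless until the head is fine-tuned.
>>> inputs = tokenizer("Dactylis glomerata dominates nutrient-rich grassland plots.",
...                    return_tensors="pt", truncation=True, max_length=512)
>>> with torch.no_grad():
...     logits = model(**inputs).logits
>>> print(logits.softmax(dim=-1))
```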

## Training data

- BiodivBERT is pre-trained on abstracts and full text from biodiversity-related publications.
- We used both the Elsevier and Springer APIs to crawl this data.
- We covered publications from 1990 to 2020.
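
The pre-training hyperparameters listed in the metadata above map onto a standard masked-language-modeling setup in `transformers`. The following is a minimal sketch, assuming initialization from `bert-base-cased` (whose tokenizer BiodivBERT uses) and a tiny placeholder corpus in place of the crawled abstracts and full texts:

```python
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-cased")

# Placeholder corpus; in practice this is the crawled Springer/Elsevier text.
texts = ["Species diversity declines with increasing habitat fragmentation."]
encodings = tokenizer(texts, truncation=True, max_length=512)  # MAX_LEN = 512
train_dataset = [{"input_ids": ids, "attention_mask": mask}
                 for ids, mask in zip(encodings["input_ids"], encodings["attention_mask"])]

# Dynamic masking with 15% probability (MLM_PROP = 0.15).
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="biodivbert-pretraining",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=4,
)

trainer = Trainer(model=model, args=args, data_collator=collator, train_dataset=train_dataset)
trainer.train()
```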

## Evaluation results

BiodivBERT outperformed BERT_base_cased, biobert_v1.1, and a BiLSTM baseline on the downstream tasks.