GENA-LM (gena-lm-bigbird-base-sparse-t2t)

GENA-LM is a Family of Open-Source Foundational Models for Long DNA Sequences.

GENA-LM models are transformer masked language models trained on human DNA sequence.

gena-lm-bigbird-base-sparse-t2t follows the BigBird architecture and uses sparse attention from DeepSpeed.

Differences between GENA-LM (gena-lm-bigbird-base-sparse-t2t) and DNABERT:

  • BPE tokenization instead of k-mers;
  • input sequence size is about 36000 nucleotides (4096 BPE tokens) compared to 512 nucleotides of DNABERT;
  • pre-training on T2T vs. GRCh38.p13 human genome assembly.

Source code and data: https://github.com/AIRI-Institute/GENA_LM

Paper: https://www.biorxiv.org/content/10.1101/2023.06.12.544594v1

Installation

gena-lm-bigbird-base-sparse-t2t sparse ops require DeepSpeed.

DeepSpeed

DeepSpeed installation is needed to work with SparseAttention versions of language models. DeepSpeed Sparse attention supports only GPUs with compute compatibility >= 7 (V100, T4, A100).

pip install triton==1.0.0
DS_BUILD_SPARSE_ATTN=1 pip install deepspeed==0.6.0 --global-option="build_ext" --global-option="-j8" --no-cache

and check installation with

ds_report

APEX for FP16

Install APEX https://github.com/NVIDIA/apex#quick-start

git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

Examples

How to load pre-trained model for Masked Language Modeling

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('AIRI-Institute/gena-lm-bigbird-base-sparse-t2t')
model = AutoModel.from_pretrained('AIRI-Institute/gena-lm-bigbird-base-sparse-t2t', trust_remote_code=True)

How to load pre-trained model to fine-tune it on classification task

Get model class from GENA-LM repository:

git clone https://github.com/AIRI-Institute/GENA_LM.git
from GENA_LM.src.gena_lm.modeling_bert import BertForSequenceClassification
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('AIRI-Institute/gena-lm-bigbird-base-sparse-t2t')
model = BertForSequenceClassification.from_pretrained('AIRI-Institute/gena-lm-bigbird-base-sparse-t2t')

or you can just download modeling_bert.py and put it close to your code.

OR you can get model class from HuggingFace AutoModel:

from transformers import AutoTokenizer, AutoModel
model = AutoModel.from_pretrained('AIRI-Institute/gena-lm-bigbird-base-sparse-t2t', trust_remote_code=True)
gena_module_name = model.__class__.__module__
print(gena_module_name)
import importlib
# available class names:
# - BertModel, BertForPreTraining, BertForMaskedLM, BertForNextSentencePrediction,
# - BertForSequenceClassification, BertForMultipleChoice, BertForTokenClassification,
# - BertForQuestionAnswering
# check https://huggingface.co/docs/transformers/model_doc/bert
cls = getattr(importlib.import_module(gena_module_name), 'BertForSequenceClassification')
print(cls)
model = cls.from_pretrained('AIRI-Institute/gena-lm-bigbird-base-sparse-t2t', num_labels=2)

Model description

GENA-LM (gena-lm-bigbird-base-sparse-t2t) model is trained in a masked language model (MLM) fashion, following the methods proposed in the BigBird paper by masking 15% of tokens. Model config for gena-lm-bigbird-base-sparse-t2t is similar to the google/bigbird-roberta-base:

  • 4096 Maximum sequence length
  • 12 Layers, 12 Attention heads
  • 768 Hidden size
  • sparse config:
    • block size: 64
    • random blocks: 3
    • global blocks: 2
    • sliding window blocks: 3
  • Rotary positional embeddings
  • 32k Vocabulary size, tokenizer trained on DNA data.

We pre-trained gena-lm-bigbird-base-sparse-t2t using the latest T2T human genome assembly (https://www.ncbi.nlm.nih.gov/assembly/GCA_009914755.3/). The data was augmented by sampling mutations from 1000-genome SNPs (gnomAD dataset). Pre-training was performed for 800,000 iterations with batch size 256. We modified Transformer with Pre-Layer normalization.

Evaluation

For evaluation results, see our paper: https://www.biorxiv.org/content/10.1101/2023.06.12.544594v1

Citation

@article{GENA_LM,
    author = {Veniamin Fishman and Yuri Kuratov and Maxim Petrov and Aleksei Shmelev and Denis Shepelin and Nikolay Chekanov and Olga Kardymon and Mikhail Burtsev},
    title = {GENA-LM: A Family of Open-Source Foundational Models for Long DNA Sequences},
    elocation-id = {2023.06.12.544594},
    year = {2023},
    doi = {10.1101/2023.06.12.544594},
    publisher = {Cold Spring Harbor Laboratory},
    URL = {https://www.biorxiv.org/content/early/2023/06/13/2023.06.12.544594},
    eprint = {https://www.biorxiv.org/content/early/2023/06/13/2023.06.12.544594.full.pdf},
    journal = {bioRxiv}
}
Downloads last month
13
Inference Examples
This model does not have enough activity to be deployed to Inference API (serverless) yet. Increase its social visibility and check back later, or deploy to Inference Endpoints (dedicated) instead.

Collection including AIRI-Institute/gena-lm-bigbird-base-sparse-t2t