---
tags:
- dna
- human_genome
---
# GENA-LM (gena-lm-bigbird-base-sparse)
GENA-LM is a family of open-source foundational models for long DNA sequences.
GENA-LM models are transformer masked language models trained on human DNA sequences.
`gena-lm-bigbird-base-sparse` follows the BigBird architecture and uses sparse attention from DeepSpeed.
Differences between GENA-LM (`gena-lm-bigbird-base-sparse`) and DNABERT:
- BPE tokenization instead of k-mers (see the short sketch after this list);
- input sequence size of about 36,000 nucleotides (4096 BPE tokens), compared to DNABERT's 512 nucleotides;
- pre-training on T2T vs. GRCh38.p13 human genome assembly.
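The effect of BPE tokenization can be illustrated with a short sketch (the DNA fragment below is an arbitrary example, not data from the paper): with roughly 36,000 nucleotides per 4096 tokens, one BPE token covers about 9 bp on average.
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('AIRI-Institute/gena-lm-bigbird-base-sparse')

# Arbitrary DNA fragment used only for illustration.
seq = 'ATGGTGAGCAAGGGCGAGGAGCTGTTCACCGGGGTGGTGCCCATCCTGGTCGAGCTGGACGGC'
tokens = tokenizer.tokenize(seq)
print(f'{len(seq)} nucleotides -> {len(tokens)} BPE tokens')
print(tokens)
```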
Source code and data: https://github.com/AIRI-Institute/GENA_LM
Paper: https://www.biorxiv.org/content/10.1101/2023.06.12.544594v1
## Installation
The sparse ops used by `gena-lm-bigbird-base-sparse` require DeepSpeed.
### DeepSpeed
A DeepSpeed installation is required to work with the SparseAttention versions of the language models. DeepSpeed sparse attention supports only GPUs with compute capability >= 7 (V100, T4, A100).
```bash
pip install triton==1.0.0
DS_BUILD_SPARSE_ATTN=1 pip install deepspeed==0.6.0 --global-option="build_ext" --global-option="-j8" --no-cache
```
Then check the installation with:
```bash
ds_report
```
### APEX for FP16
Install APEX following https://github.com/NVIDIA/apex#quick-start:
```bash
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
```
## Examples
### How to load pre-trained model for Masked Language Modeling
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('AIRI-Institute/gena-lm-bigbird-base-sparse')
model = AutoModel.from_pretrained('AIRI-Institute/gena-lm-bigbird-base-sparse', trust_remote_code=True)
```
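As a quick sanity check, the hedged sketch below reuses the `tokenizer` and `model` loaded above to embed a short, arbitrary DNA fragment and print the shape of the hidden states. Note that the sparse attention kernels assume a GPU with the DeepSpeed sparse ops built (see Installation), so this may not work on CPU.
```python
import torch

# Toy sequence for illustration only; long inputs (up to 4096 tokens) are the
# intended use case and require the DeepSpeed sparse attention ops on GPU.
seq = 'ATGGTGAGCAAGGGCGAGGAGCTGTTCACC'
inputs = tokenizer(seq, return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)
hidden = outputs[0]  # last hidden states: (batch, num_tokens, hidden_size=768)
print(hidden.shape)
```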
### How to load pre-trained model to fine-tune it on a classification task
Get the model class from the GENA-LM repository:
```bash
git clone https://github.com/AIRI-Institute/GENA_LM.git
```
```python
from GENA_LM.src.gena_lm.modeling_bert import BertForSequenceClassification
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('AIRI-Institute/gena-lm-bigbird-base-sparse')
model = BertForSequenceClassification.from_pretrained('AIRI-Institute/gena-lm-bigbird-base-sparse')
```
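Below is a minimal sketch of a single fine-tuning step with this classification head; the sequences, labels, and learning rate are placeholders rather than settings from the GENA-LM paper, and the GPU requirement from the Installation section still applies to the sparse attention ops.
```python
import torch

# Placeholder toy batch; real fine-tuning would iterate over a labelled dataset.
seqs = ['ATGGTGAGCAAGGGCGAGGAGCTG', 'TTACGCGTAGCTAGCTAGGCTAGC']
labels = torch.tensor([0, 1])

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # lr is an assumption

model.train()
batch = tokenizer(seqs, padding=True, return_tensors='pt')
outputs = model(**batch, labels=labels)
loss = outputs.loss  # or outputs[0] if the model returns a plain tuple
loss.backward()
optimizer.step()
optimizer.zero_grad()
print(float(loss))
```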
Alternatively, download [modeling_bert.py](https://github.com/AIRI-Institute/GENA_LM/tree/main/src/gena_lm) and place it next to your code.
You can also obtain the model class through the HuggingFace `AutoModel` machinery:
```python
from transformers import AutoTokenizer, AutoModel
model = AutoModel.from_pretrained('AIRI-Institute/gena-lm-bigbird-base-sparse', trust_remote_code=True)
gena_module_name = model.__class__.__module__
print(gena_module_name)
import importlib
# available class names:
# - BertModel, BertForPreTraining, BertForMaskedLM, BertForNextSentencePrediction,
# - BertForSequenceClassification, BertForMultipleChoice, BertForTokenClassification,
# - BertForQuestionAnswering
# check https://huggingface.co/docs/transformers/model_doc/bert
cls = getattr(importlib.import_module(gena_module_name), 'BertForSequenceClassification')
print(cls)
model = cls.from_pretrained('AIRI-Institute/gena-lm-bigbird-base-sparse', num_labels=2)
```
## Model description
The GENA-LM `gena-lm-bigbird-base-sparse` model was trained in a masked language modeling (MLM) fashion, following the method proposed in the BigBird paper, by masking 15% of tokens. The model config for `gena-lm-bigbird-base-sparse` is similar to `google/bigbird-roberta-base` (the key settings are listed below and can be cross-checked against the published config, as sketched after the list):
- Maximum sequence length: 4096 tokens
- Layers: 12; attention heads: 12
- Hidden size: 768
- Sparse attention config:
  - block size: 64
  - random blocks: 3
  - global blocks: 2
  - sliding window blocks: 3
- Rotary positional embeddings
- Vocabulary size: 32k; tokenizer trained on DNA data
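The settings above can be cross-checked against the published config. The sketch below assumes the config exposes the standard BERT-style field names, which the custom remote code may name differently, hence the defensive `getattr`.
```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained('AIRI-Institute/gena-lm-bigbird-base-sparse',
                                    trust_remote_code=True)
# Field names are an assumption based on the standard BERT config.
for field in ('max_position_embeddings', 'num_hidden_layers',
              'num_attention_heads', 'hidden_size', 'vocab_size'):
    print(field, getattr(config, field, 'n/a'))
```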
We pre-trained `gena-lm-bigbird-base-sparse` on the latest T2T human genome assembly (https://www.ncbi.nlm.nih.gov/assembly/GCA_009914755.3/). Pre-training was performed for 810,000 iterations with a batch size of 256. We modified the Transformer with [Pre-Layer normalization](https://arxiv.org/abs/2002.04745).
## Evaluation
For evaluation results, see our paper: https://www.biorxiv.org/content/10.1101/2023.06.12.544594v1
## Citation
```bibtex
@article{GENA_LM,
  author = {Veniamin Fishman and Yuri Kuratov and Maxim Petrov and Aleksei Shmelev and Denis Shepelin and Nikolay Chekanov and Olga Kardymon and Mikhail Burtsev},
  title = {GENA-LM: A Family of Open-Source Foundational Models for Long DNA Sequences},
  elocation-id = {2023.06.12.544594},
  year = {2023},
  doi = {10.1101/2023.06.12.544594},
  publisher = {Cold Spring Harbor Laboratory},
  URL = {https://www.biorxiv.org/content/early/2023/06/13/2023.06.12.544594},
  eprint = {https://www.biorxiv.org/content/early/2023/06/13/2023.06.12.544594.full.pdf},
  journal = {bioRxiv}
}
```