GENA-LM Athaliana 🌱 (gena-lm-bert-base-athaliana)

GENA-LM is a family of open-source foundational models for long DNA sequences.

gena-lm-bert-base-athaliana is trained on Arabidopsis thaliana genomes.

Model description

The GENA-LM (gena-lm-bert-base-athaliana) model is trained with a masked language modeling (MLM) objective, following the data preprocessing pipeline of the BigBird paper and masking 15% of tokens. The model config for gena-lm-bert-base-athaliana is similar to that of bert-base:

  • 512 Maximum sequence length
  • 12 Layers, 12 Attention heads
  • 768 Hidden size
  • 32k Vocabulary size
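
To make the MLM objective described above concrete, here is a minimal sketch of selecting 15% of token positions for masking. It is not the GENA-LM training code: the `mask_tokens` helper, the token ids, and the `-100` ignore label (a common PyTorch convention) are assumptions made for illustration, and BERT's usual 80/10/10 replacement scheme is left out for brevity.

```python
import random

MASK_FRACTION = 0.15  # masking rate stated in the model description
IGNORE_INDEX = -100   # assumption: label value that the loss skips


def mask_tokens(token_ids, mask_id, special_ids=frozenset(), seed=None):
    """Return (inputs, labels) with about 15% of non-special tokens masked."""
    rng = random.Random(seed)
    # Special tokens such as [CLS] and [SEP] are never selected for masking.
    candidates = [i for i, t in enumerate(token_ids) if t not in special_ids]
    n_mask = round(len(candidates) * MASK_FRACTION)
    chosen = set(rng.sample(candidates, n_mask))
    # Masked positions get the mask id in the input and keep the true id as label.
    inputs = [mask_id if i in chosen else t for i, t in enumerate(token_ids)]
    labels = [t if i in chosen else IGNORE_INDEX for i, t in enumerate(token_ids)]
    return inputs, labels


# Hypothetical ids: 1 = [CLS], 2 = [SEP], 4 = [MASK], 10..109 = sequence tokens.
token_ids = [1] + list(range(10, 110)) + [2]
inputs, labels = mask_tokens(token_ids, mask_id=4, special_ids={1, 2}, seed=0)
print(inputs.count(4))  # number of masked positions
```

The model is then trained to predict the original token at each masked position; all other positions carry the ignore label and do not contribute to the loss.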

We pre-trained gena-lm-bert-base-athaliana on data obtained from Kang et al. (using this download link), which contains chromosome-level genomes of 32 A. thaliana ecotypes. Pre-training was performed for 1,700,000 iterations with a batch size of 256 and a sequence length of 512 tokens. We modified the Transformer to use Pre-Layer normalization. We uploaded the checkpoint with the best validation loss (iteration 425000) to the main branch and the latest checkpoint to the step_1700000 branch.
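
As a rough worked example of the training budget implied by these numbers, the arithmetic below multiplies iterations, batch size, and sequence length. It assumes the batch size counts sequences and that every sequence is filled to the full 512 tokens, so the token totals are upper bounds rather than reported figures.

```python
ITERATIONS = 1_700_000          # total pre-training iterations
BATCH_SIZE = 256                # sequences per iteration (assumed unit)
SEQ_LEN = 512                   # tokens per sequence (upper bound)
BEST_CHECKPOINT_STEP = 425_000  # iteration with the best validation loss

sequences_seen = ITERATIONS * BATCH_SIZE
max_tokens_seen = sequences_seen * SEQ_LEN
max_tokens_at_best = BEST_CHECKPOINT_STEP * BATCH_SIZE * SEQ_LEN
best_fraction = BEST_CHECKPOINT_STEP / ITERATIONS

print(f"sequences processed:        {sequences_seen:,}")
print(f"tokens processed (max):     {max_tokens_seen:,}")
print(f"tokens at best checkpoint:  {max_tokens_at_best:,}")
print(f"best checkpoint reached at: {best_fraction:.0%} of training")
```

Under these assumptions, the checkpoint on the main branch corresponds to the first quarter of the full run, while the step_1700000 branch holds the weights after the whole budget.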

Source code and data: https://github.com/AIRI-Institute/GENA_LM

Paper: https://www.biorxiv.org/content/10.1101/2023.06.12.544594

Examples

How to load the pre-trained model for masked language modeling

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('AIRI-Institute/gena-lm-bert-base-athaliana')
model = AutoModel.from_pretrained('AIRI-Institute/gena-lm-bert-base-athaliana', trust_remote_code=True)
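
The model description mentions two checkpoints: the best-validation-loss one on the main branch and the final one on the step_1700000 branch. A minimal sketch of choosing between them uses the standard `revision` argument of `from_pretrained`. The `load_checkpoint` helper is a hypothetical name, and the sketch assumes the step_1700000 branch also contains the tokenizer files.

```python
REPO_ID = 'AIRI-Institute/gena-lm-bert-base-athaliana'
BEST_BRANCH = 'main'           # checkpoint with the best validation loss
FINAL_BRANCH = 'step_1700000'  # checkpoint after the last iteration


def load_checkpoint(revision=BEST_BRANCH):
    """Load tokenizer and model from the given branch of the model repository."""
    # Imported lazily so that defining this helper does not require transformers.
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(REPO_ID, revision=revision)
    model = AutoModel.from_pretrained(
        REPO_ID, revision=revision, trust_remote_code=True
    )
    return tokenizer, model


# Example (downloads weights from the Hugging Face Hub):
# tokenizer, model = load_checkpoint(FINAL_BRANCH)
```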

How to load the pre-trained model to fine-tune it on a classification task

Get the model class from the GENA-LM repository:

git clone https://github.com/AIRI-Institute/GENA_LM.git

from GENA_LM.src.gena_lm.modeling_bert import BertForSequenceClassification
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('AIRI-Institute/gena-lm-bert-base-athaliana')
model = BertForSequenceClassification.from_pretrained('AIRI-Institute/gena-lm-bert-base-athaliana')

Alternatively, download modeling_bert.py and place it next to your code.

You can also get the model class through Hugging Face AutoModel:

import importlib
from transformers import AutoTokenizer, AutoModel

# Load once with trust_remote_code=True so the GENA-LM modeling module is fetched
model = AutoModel.from_pretrained('AIRI-Institute/gena-lm-bert-base-athaliana', trust_remote_code=True)
# Name of the dynamically loaded module that defines the GENA-LM model classes
gena_module_name = model.__class__.__module__
print(gena_module_name)
# available class names:
# - BertModel, BertForPreTraining, BertForMaskedLM, BertForNextSentencePrediction,
# - BertForSequenceClassification, BertForMultipleChoice, BertForTokenClassification,
# - BertForQuestionAnswering
# check https://huggingface.co/docs/transformers/model_doc/bert
cls = getattr(importlib.import_module(gena_module_name), 'BertForSequenceClassification')
print(cls)
# Re-load the weights into the task-specific class with a two-label classification head
model = cls.from_pretrained('AIRI-Institute/gena-lm-bert-base-athaliana', num_labels=2)

Evaluation

For evaluation results, see our paper: https://www.biorxiv.org/content/10.1101/2023.06.12.544594

Citation

@article{GENA_LM,
    author = {Veniamin Fishman and Yuri Kuratov and Maxim Petrov and Aleksei Shmelev and Denis Shepelin and Nikolay Chekanov and Olga Kardymon and Mikhail Burtsev},
    title = {GENA-LM: A Family of Open-Source Foundational DNA Language Models for Long Sequences},
    elocation-id = {2023.06.12.544594},
    year = {2023},
    doi = {10.1101/2023.06.12.544594},
    publisher = {Cold Spring Harbor Laboratory},
    URL = {https://www.biorxiv.org/content/early/2023/11/01/2023.06.12.544594},
    eprint = {https://www.biorxiv.org/content/early/2023/11/01/2023.06.12.544594.full.pdf},
    journal = {bioRxiv}
}