AIDO.RNA-1.6B

AIDO.RNA-1.6B is a general-purpose RNA foundation model with 1.6 billion parameters, trained on 42 million non-coding RNA sequences at single-nucleotide resolution. It achieves state-of-the-art performance on a comprehensive set of tasks, including RNA secondary structure prediction, mRNA-related tasks, RNA function prediction, and RNA inverse folding. After domain adaptation, AIDO.RNA also excels at protein-level tasks, which highlights its potential to leverage the central dogma for enhancing biomolecular representations. For more detailed information, please refer to our paper.

Model architectural details

AIDO.RNA is an encoder-only transformer pre-trained with a masked language modeling (MLM) objective. The model architecture parameters are as follows:

| Hyperparameter | Value |
| --- | --- |
| num-layers | 32 |
| hidden-size | 2,048 |
| ffn-hidden-size | 5,440 |
| num-attn-heads | 32 |
| vocab-size | 16 |
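
As a rough sanity check, the hyperparameters above can be related to the stated 1.6 billion parameters. The sketch below is only an estimate: it ignores biases and normalization weights, and it assumes a gated feed-forward block with three projection matrices, which this README does not state.

```python
# Architecture values taken from the hyperparameter table.
num_layers = 32
hidden_size = 2048
ffn_hidden_size = 5440
num_attn_heads = 32
vocab_size = 16

# Per-head dimension follows directly from the table.
head_dim = hidden_size // num_attn_heads

# Attention: query, key, value, and output projections (four square matrices).
attn_params = 4 * hidden_size * hidden_size

# ASSUMPTION: gated feed-forward block with three projection matrices
# (gate, up, down). This is not confirmed by the README.
ffn_params = 3 * hidden_size * ffn_hidden_size

# Token embedding table; biases and normalization weights are ignored.
embedding_params = vocab_size * hidden_size

total_params = num_layers * (attn_params + ffn_params) + embedding_params

print(f"head_dim = {head_dim}")
print(f"approx. parameters = {total_params:,} (~{total_params / 1e9:.2f}B)")
```

Under these assumptions the estimate comes to roughly 1.61 billion weights, in line with the model name. With a conventional two-matrix feed-forward block the same arithmetic gives about 1.25 billion, which is why the gated variant is assumed here.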

Pre-training data

The pre-training data contains 42 million unique ncRNA sequences from RNAcentral version 24.0.
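
To illustrate what masked language modeling means for sequences like these, here is a simplified sketch of the masking step at single-nucleotide resolution. It is not the authors' training code: the 15% mask rate, the `[MASK]` string, and the `mask_sequence` helper are all assumptions made for illustration.

```python
import random

# Hypothetical placeholder; the real tokenizer's mask token may differ.
MASK_TOKEN = "[MASK]"


def mask_sequence(sequence, mask_rate=0.15, seed=0):
    """Return (masked_tokens, labels) for one sequence.

    labels[i] holds the original nucleotide at masked positions and None
    elsewhere, so a loss would only be computed where a token was hidden.
    The mask_rate default is an assumption, not a value from the README.
    """
    rng = random.Random(seed)
    tokens = list(sequence)  # one token per nucleotide
    # Mask at least one position, but never more than the sequence length.
    n_masked = min(len(tokens), max(1, round(mask_rate * len(tokens))))
    masked_positions = set(rng.sample(range(len(tokens)), n_masked))
    masked_tokens = [
        MASK_TOKEN if i in masked_positions else tok
        for i, tok in enumerate(tokens)
    ]
    labels = [
        tok if i in masked_positions else None
        for i, tok in enumerate(tokens)
    ]
    return masked_tokens, labels


masked, labels = mask_sequence("ACGUACGUACGUACGUACGU")
print(masked)
print(labels)
```

During pre-training the model would be asked to predict the original nucleotide at each masked position from the surrounding context.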

Downstream evaluation

[Figure: downstream evaluation results]

How to Use

Build any downstream models from this backbone with ModelGenerator

For more information, visit: Model Generator

mgen fit --model SequenceClassification --model.backbone aido_rna_1b600m --data SequenceClassificationDataModule --data.path <hf_or_local_path_to_your_dataset>
mgen test --model SequenceClassification --model.backbone aido_rna_1b600m --data SequenceClassificationDataModule --data.path <hf_or_local_path_to_your_dataset>

Or use directly in Python

Embedding

from modelgenerator.tasks import Embed
model = Embed.from_config({"model.backbone": "aido_rna_1b600m"}).eval()
collated_batch = model.collate({"sequences": ["ACGT", "AGCT"]})
embedding = model(collated_batch)
print(embedding.shape)
print(embedding)
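
If the embedding comes back as one vector per token, a common next step is to pool it into a single vector per sequence. The sketch below shows mean pooling over non-padding positions on plain Python lists. The `mean_pool` helper and the attention mask are hypothetical, and the per-token output layout is an assumption; check `embedding.shape` before relying on it.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average per-token vectors over non-padding positions.

    token_embeddings: list of seq_len vectors, each a list of floats
    attention_mask:   list of seq_len ints, 1 for real tokens, 0 for padding
    """
    kept = [vec for vec, keep in zip(token_embeddings, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask excludes every position")
    hidden = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(hidden)]


# Toy example: three positions, hidden size 2, last position is padding.
toy_embeddings = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
toy_mask = [1, 1, 0]
pooled = mean_pool(toy_embeddings, toy_mask)
print(pooled)  # [2.0, 3.0]
```

With real model output, the same reduction would typically be done on the tensor itself rather than on nested lists.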

Sequence-level Classification

import torch
from modelgenerator.tasks import SequenceClassification
model = SequenceClassification.from_config({"model.backbone": "aido_rna_1b600m", "model.n_classes": 2}).eval()
collated_batch = model.collate({"sequences": ["ACGT", "AGCT"]})
logits = model(collated_batch)
print(logits)
print(torch.argmax(logits, dim=-1))
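
The `argmax` above yields only the predicted class index. To read the logits as class probabilities, a softmax can be applied per sequence. The sketch below uses plain Python and made-up logit values purely to show the arithmetic; it does not call the model.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for two sequences and two classes.
example_logits = [[2.0, 0.5], [-1.0, 1.0]]

predictions = []
for row in example_logits:
    probs = softmax(row)
    # Index of the largest probability; same result as argmax on the logits.
    predicted = max(range(len(probs)), key=probs.__getitem__)
    predictions.append(predicted)
    print(predicted, [round(p, 3) for p in probs])
```

Because softmax preserves ordering, the predicted class is the same whether it is taken from the logits or from the probabilities.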

Token-level Classification

import torch
from modelgenerator.tasks import TokenClassification
model = TokenClassification.from_config({"model.backbone": "aido_rna_1b600m", "model.n_classes": 3}).eval()
collated_batch = model.collate({"sequences": ["ACGT", "AGCT"]})
logits = model(collated_batch)
print(logits)
print(torch.argmax(logits, dim=-1))

Sequence-level Regression

from modelgenerator.tasks import SequenceRegression
model = SequenceRegression.from_config({"model.backbone": "aido_rna_1b600m"}).eval()
collated_batch = model.collate({"sequences": ["ACGT", "AGCT"]})
predictions = model(collated_batch)
print(predictions)

Citation

Please cite AIDO.RNA using the following BibTeX code:

@misc{zou_large-scale_2024,
    title = {A Large-Scale Foundation Model for RNA Function and Structure Prediction},
    url = {https://www.biorxiv.org/content/10.1101/2024.11.28.625345v1},
    doi = {10.1101/2024.11.28.625345},
    publisher = {bioRxiv},
    author = {Zou, Shuxian and Tao, Tianhua and Mahbub, Sazan and Ellington, Caleb N. and Algayres, Robin and Li, Dian and Zhuang, Yonghao and Wang, Hongyi and Song, Le and Xing, Eric P.},
    year = {2024},
}