AIDO.RNA-1.6B
AIDO.RNA-1.6B is a general-purpose RNA foundation model with 1.6 billion parameters, trained on 42 million non-coding RNA sequences at single-nucleotide resolution. It achieves state-of-the-art performance on a comprehensive set of tasks, including RNA secondary structure prediction, mRNA-related tasks, RNA function prediction, and RNA inverse folding. After domain adaptation, AIDO.RNA also excels at protein-level tasks, highlighting its potential to leverage the central dogma to enhance biomolecular representations. For more details, please refer to our paper.
Model architectural details
AIDO.RNA is an encoder-only transformer pre-trained with a masked language modeling (MLM) objective. The model architecture parameters are as follows:
hyperparameter | value |
---|---|
num-layers | 32 |
hidden-size | 2,048 |
ffn-hidden-size | 5,440 |
num-attn-heads | 32 |
vocab-size | 16 |
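As a sanity check, the hyperparameters in the table can be turned into a rough weight count. The sketch below is a back-of-the-envelope estimate, not the authors' accounting: it ignores biases, layer norms, and any positional parameters, and the number of feed-forward matrices per layer is an assumption, since the table does not say whether the feed-forward block is gated. Under the gated (three-matrix) assumption the estimate lands close to the stated 1.6 billion parameters.

```python
# Hyperparameters copied from the table above.
NUM_LAYERS = 32
HIDDEN = 2048
FFN_HIDDEN = 5440
VOCAB = 16


def estimate_params(ffn_matrices: int) -> int:
    """Rough weight count, ignoring biases, layer norms and positional parameters.

    ffn_matrices is an assumption: 3 for a gated feed-forward block,
    2 for a plain two-layer feed-forward block.
    """
    attention = 4 * HIDDEN * HIDDEN               # Q, K, V and output projections
    ffn = ffn_matrices * HIDDEN * FFN_HIDDEN      # feed-forward weight matrices
    embedding = VOCAB * HIDDEN                    # token embedding table
    return NUM_LAYERS * (attention + ffn) + embedding


gated = estimate_params(3)
plain = estimate_params(2)
print(f"gated FFN (3 matrices): {gated / 1e9:.2f}B")  # 1.61B
print(f"plain FFN (2 matrices): {plain / 1e9:.2f}B")  # 1.25B
```

With a vocabulary of only 16 tokens, the embedding table contributes a negligible share of the total, so almost all parameters sit in the 32 transformer layers.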
Pre-training data
The pre-training data contains 42 million unique ncRNA sequences from RNAcentral version 24.0.
Downstream evaluation
How to Use
Build any downstream model from this backbone
Get RNA sequence embedding
from genbio_finetune.tasks import Embed
model = Embed.from_config({"model.backbone": "aido_rna_1b600m"}).eval()  # load the backbone in eval mode
collated_batch = model.collate({"sequences": ["ACGT", "ACGT"]})  # prepare a batch from raw sequences
embedding = model(collated_batch)  # forward pass through the backbone
print(embedding.shape)
print(embedding)
Sequence-level regression
from genbio_finetune.tasks import SequenceRegression
model = SequenceRegression.from_config({"model.backbone": "aido_rna_1b600m"}).eval()  # backbone plus regression head, eval mode
collated_batch = model.collate({"sequences": ["ACGT", "AGCT"]})  # prepare a batch from raw sequences
logits = model(collated_batch)  # regression outputs for the batch
print(logits)
Sequence-level classification
import torch
from genbio_finetune.tasks import SequenceClassification
model = SequenceClassification.from_config({"model.backbone": "aido_rna_1b600m", "model.n_classes": 2}).eval()  # two-class head, eval mode
collated_batch = model.collate({"sequences": ["ACGT", "AGCT"]})  # prepare a batch from raw sequences
logits = model(collated_batch)  # class scores for each sequence
print(logits)
print(torch.argmax(logits, dim=-1))  # index of the highest-scoring class
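The `argmax` call above returns only the winning class index. If class probabilities are also needed, the logits can be passed through a softmax. The sketch below illustrates the computation in plain Python on made-up logits (the values are hypothetical, not real model output). On actual tensors, `torch.softmax(logits, dim=-1)` performs the same operation.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    shift = max(row)  # subtracting the max avoids overflow in exp
    exps = [math.exp(x - shift) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for two sequences and two classes (not real model output).
example_logits = [[1.2, -0.3], [-0.8, 0.5]]

probs = [softmax(row) for row in example_logits]
# Index of the most probable class per sequence, equivalent to argmax over the logits.
predicted = [max(range(len(p)), key=p.__getitem__) for p in probs]

for p, label in zip(probs, predicted):
    print([round(v, 3) for v in p], "-> class", label)
```

Because softmax is monotonic, the predicted class is the same whether `argmax` is taken over the logits or over the probabilities.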
Token-level classification
import torch
from genbio_finetune.tasks import TokenClassification
model = TokenClassification.from_config({"model.backbone": "aido_rna_1b600m", "model.n_classes": 3}).eval()  # three-class per-token head, eval mode
collated_batch = model.collate({"sequences": ["ACGT", "AGCT"]})  # prepare a batch from raw sequences
logits = model(collated_batch)  # class scores for each token
print(logits)
print(torch.argmax(logits, dim=-1))  # highest-scoring class per token
Pairwise token-level classification
@Sazan TODO
RNA inverse folding
@Sazan
Or use our one-liner CLI to finetune or evaluate any of the above!
mgen fit --model SequenceClassification --model.backbone aido_rna_1b600m --data SequenceClassification --data.path <hf_or_local_path_to_your_dataset>
mgen test --model SequenceClassification --model.backbone aido_rna_1b600m --data SequenceClassification --data.path <hf_or_local_path_to_your_dataset>
For more information, visit: ModelGenerator
Citation
Please cite AIDO.RNA using the following BibTeX entry:
@inproceedings{
zou2024a,
title={A Large-Scale Foundation Model for {RNA} Function and Structure Prediction},
author={Shuxian Zou and Tianhua Tao and Sazan Mahbub and Caleb Ellington and Robin Jonathan Algayres and Dian Li and Yonghao Zhuang and Hongyi Wang and Le Song and Eric P. Xing},
booktitle={NeurIPS 2024 Workshop on AI for New Drug Modalities},
year={2024},
url={https://openreview.net/forum?id=Gzo3JMPY8w}
}
License
@Hongyi TODO