# AIDO.RNA 1.6B

AIDO.RNA is a 1.6B-parameter RNA foundation model trained on 42 million non-coding RNA sequences at single-nucleotide resolution. It achieves state-of-the-art performance on a comprehensive set of tasks, including RNA secondary structure prediction, mRNA-related tasks, RNA function prediction tasks, and RNA inverse folding.


## Model architectural details

AIDO.RNA is an encoder-only transformer pre-trained with a masked language modeling (MLM) objective. The model architecture parameters are as follows:

| hyperparameter | value |
| :---: | :----: |
| num-layers | 32 |
| hidden-size | 2,048 |
| ffn-hidden-size | 5,440 |
| num-attn-heads | 32 |
| vocab-size | 16 |

## Pre-training data

The pre-training data contains 42 million unique ncRNA sequences from RNAcentral version 24.0.
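
As a rough cross-check of the 1.6B figure against the hyperparameters listed under "Model architectural details", the weight count can be estimated with simple arithmetic. This is only a sketch under stated assumptions: the helper `estimate_params` is hypothetical, and the choice of three feed-forward matrices (a gated, SwiGLU-style block) is an assumption that the text above does not confirm. Biases, normalization layers, and any output head are ignored.

```python
# Hyperparameters copied from the architecture table.
NUM_LAYERS = 32
HIDDEN_SIZE = 2048
FFN_HIDDEN_SIZE = 5440
VOCAB_SIZE = 16


def estimate_params(num_layers, hidden, ffn_hidden, vocab, ffn_matrices=3):
    """Rough weight count for an encoder-only transformer.

    ASSUMPTION: ffn_matrices=3 models a gated feed-forward block
    (gate, up, and down projections). Biases, normalization layers,
    and the output head are not counted.
    """
    attention = 4 * hidden * hidden           # Q, K, V, and output projections
    ffn = ffn_matrices * hidden * ffn_hidden  # feed-forward projections
    embedding = vocab * hidden                # token embedding table
    return num_layers * (attention + ffn) + embedding


total = estimate_params(NUM_LAYERS, HIDDEN_SIZE, FFN_HIDDEN_SIZE, VOCAB_SIZE)
print(f"{total / 1e9:.2f}B parameters")  # prints "1.61B parameters"
```

With `ffn_matrices=2` (a plain two-matrix feed-forward block) the same estimate comes to about 1.25B, so the gated assumption is the one that lines up with the stated model size. Treat this as an arithmetic consistency check, not a description of the actual implementation.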


## Downstream evaluation


## How to Use

Build any downstream model from this backbone.

### Get RNA sequence embedding

```python
from genbio_finetune.tasks import Embed

# Load the backbone and switch to evaluation mode
model = Embed.from_config({"model.backbone": "rnafm"}).eval()
# Convert raw sequences into a model-ready batch
collated_batch = model.collate({"sequences": ["ACGT", "ACGT"]})
embedding = model(collated_batch)
print(embedding.shape)
print(embedding)
```

### Sequence-level classification

```python
import torch
from genbio_finetune.tasks import SequenceClassification

# Two-class classifier head on top of the backbone
model = SequenceClassification.from_config({"model.backbone": "rnafm", "model.n_classes": 2}).eval()
collated_batch = model.collate({"sequences": ["ACGT", "AGCT"]})
logits = model(collated_batch)
print(logits)
# Pick the highest-scoring class for each sequence
print(torch.argmax(logits, dim=-1))
```

### Token-level classification

```python
import torch
from genbio_finetune.tasks import TokenClassification

# Three-class classifier head applied at every token position
model = TokenClassification.from_config({"model.backbone": "rnafm", "model.n_classes": 3}).eval()
collated_batch = model.collate({"sequences": ["ACGT", "AGCT"]})
logits = model(collated_batch)
print(logits)
# Pick the highest-scoring class for each token
print(torch.argmax(logits, dim=-1))
```

### Pairwise token-level classification

@Sazan TODO

### Sequence-level regression

```python
from genbio_finetune.tasks import SequenceRegression

# Regression head on top of the backbone
model = SequenceRegression.from_config({"model.backbone": "rnafm"}).eval()
collated_batch = model.collate({"sequences": ["ACGT", "AGCT"]})
logits = model(collated_batch)
print(logits)
```

### RNA inverse folding

@Sazan TODO

Or use our one-liner CLI to fine-tune or evaluate any of the above!
```bash
gbft fit --model SequenceClassification --model.backbone rnafm --data SequenceClassification --data.path
gbft test --model SequenceClassification --model.backbone rnafm --data SequenceClassification --data.path
```

For more information, visit: [Model Generator](https://github.com/genbio-ai/modelgenerator)

## Citation

Please cite AIDO.RNA using the following BibTeX code:

```
@inproceedings{
zou2024a,
title={A Large-Scale Foundation Model for {RNA} Function and Structure Prediction},
author={Shuxian Zou and Tianhua Tao and Sazan Mahbub and Caleb Ellington and Robin Jonathan Algayres and Dian Li and Yonghao Zhuang and Hongyi Wang and Le Song and Eric P. Xing},
booktitle={NeurIPS 2024 Workshop on AI for New Drug Modalities},
year={2024},
url={https://openreview.net/forum?id=Gzo3JMPY8w}
}
```

## License

@Hongyi TODO