# AIDO.RNA-1.6B
AIDO.RNA-1.6B is a general-purpose RNA foundation model with 1.6 billion parameters, trained on 42 million non-coding RNA sequences at single-nucleotide resolution. It achieves state-of-the-art performance on a comprehensive set of tasks, including RNA secondary structure prediction, mRNA-related tasks, RNA function prediction, and RNA inverse folding. After domain adaptation, AIDO.RNA also performs well on protein-level tasks, highlighting its potential to leverage the central dogma for enhancing biomolecular representations. For details, see [our paper](https://www.biorxiv.org/content/10.1101/2024.11.28.625345v1).
<p align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/63008d4bc1e149ceaff724a3/mNqn5SKQFHxSby3E2dosE.png" alt="description" style="width:80%; height:auto;">
</p>
## Model architectural details
AIDO.RNA is an encoder-only transformer pre-trained with a masked language modeling (MLM) objective. The model architecture parameters are as follows:
| hyperparameter | value |
| :---: | :----: |
| num-layers | 32 |
| hidden-size | 2,048 |
| ffn-hidden-size | 5,440 |
| num-attn-heads | 32 |
| vocab-size | 16 |
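As a rough sanity check on the 1.6 billion figure, the parameter count can be estimated from the table above. The sketch below assumes a gated (SwiGLU-style) feed-forward block with three weight matrices per layer and ignores biases, normalization weights, and any output head. These are assumptions made for illustration, not details confirmed by this model card, and `estimate_parameters` is a hypothetical helper.

```python
# Hyperparameters taken from the table above.
NUM_LAYERS = 32
HIDDEN_SIZE = 2048
FFN_HIDDEN_SIZE = 5440
VOCAB_SIZE = 16


def estimate_parameters(num_layers, hidden, ffn_hidden, vocab, gated_ffn=True):
    """Rough weight count for an encoder-only transformer (no biases or norms)."""
    attention = 4 * hidden * hidden  # Q, K, V and output projections
    # Assumption: a gated FFN has three weight matrices, a plain FFN has two.
    ffn = (3 if gated_ffn else 2) * hidden * ffn_hidden
    embeddings = vocab * hidden
    return num_layers * (attention + ffn) + embeddings


total = estimate_parameters(NUM_LAYERS, HIDDEN_SIZE, FFN_HIDDEN_SIZE, VOCAB_SIZE)
print(f"approximately {total / 1e9:.2f}B parameters")
```

Under these assumptions the estimate lands close to 1.6 billion. With a plain two-matrix feed-forward block the same formula gives roughly 1.25 billion, which is why the gated variant is assumed here.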
## Pre-training data
The pre-training data contains 42 million unique ncRNA sequences from RNAcentral version 24.0.
<p align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/63008d4bc1e149ceaff724a3/EKvuUI9mBw5hkErzpXKm9.png" alt="description" style="width:90%; height:auto;">
</p>
## Downstream evaluation
<p align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/63008d4bc1e149ceaff724a3/uvII1Q_1vDe95WCP1RgUV.png" alt="description" style="width:90%; height:auto;">
</p>
## How to Use
### Build any downstream models from this backbone with ModelGenerator
For more information, visit [ModelGenerator](https://github.com/genbio-ai/modelgenerator).
```bash
mgen fit --model SequenceClassification --model.backbone aido_rna_1b600m --data SequenceClassificationDataModule --data.path <hf_or_local_path_to_your_dataset>
mgen test --model SequenceClassification --model.backbone aido_rna_1b600m --data SequenceClassificationDataModule --data.path <hf_or_local_path_to_your_dataset>
```
### Or use directly in Python
#### Embedding
```python
from modelgenerator.tasks import Embed
model = Embed.from_config({"model.backbone": "aido_rna_1b600m"}).eval()
collated_batch = model.collate({"sequences": ["ACGT", "AGCT"]})
embedding = model(collated_batch)
print(embedding.shape)
print(embedding)
```
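Input sequences often arrive with mixed case, stray whitespace, or `U` in place of `T`. The helper below is a minimal sketch of cleaning sequences before they are passed to `model.collate`. `normalize_sequence` is a hypothetical name, and mapping `U` to `T` is an assumption based only on the `ACGT` examples in this card, so check the tokenizer's expected alphabet before relying on it.

```python
def normalize_sequence(seq: str, u_to_t: bool = True) -> str:
    """Uppercase a nucleotide sequence, drop whitespace, optionally map U to T."""
    cleaned = "".join(seq.split()).upper()
    if u_to_t:
        # Assumption: the tokenizer expects the A/C/G/T alphabet used in the examples.
        cleaned = cleaned.replace("U", "T")
    return cleaned


raw_sequences = ["acgu", " AG CU\n"]
sequences = [normalize_sequence(s) for s in raw_sequences]
print(sequences)
# The cleaned list can then be passed on, e.g. model.collate({"sequences": sequences})
```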
#### Sequence-level Classification
```python
import torch
from modelgenerator.tasks import SequenceClassification
model = SequenceClassification.from_config({"model.backbone": "aido_rna_1b600m", "model.n_classes": 2}).eval()
collated_batch = model.collate({"sequences": ["ACGT", "AGCT"]})
logits = model(collated_batch)
print(logits)
print(torch.argmax(logits, dim=-1))
```
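The logits printed above are unnormalized scores. The following is a minimal sketch of turning one row of logits into class probabilities with a numerically stable softmax. The logit values are made up for illustration, and on a tensor the equivalent call is `torch.softmax(logits, dim=-1)`.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    shift = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


example_logits = [1.2, -0.3]  # hypothetical logits for one sequence, two classes
probabilities = softmax(example_logits)
# Index of the largest probability, matching argmax over the logits.
predicted_class = max(range(len(probabilities)), key=probabilities.__getitem__)
print(probabilities, predicted_class)
```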
#### Token-level Classification
```python
import torch
from modelgenerator.tasks import TokenClassification
model = TokenClassification.from_config({"model.backbone": "aido_rna_1b600m", "model.n_classes": 3}).eval()
collated_batch = model.collate({"sequences": ["ACGT", "AGCT"]})
logits = model(collated_batch)
print(logits)
print(torch.argmax(logits, dim=-1))
```
#### Sequence-level Regression
```python
from modelgenerator.tasks import SequenceRegression
model = SequenceRegression.from_config({"model.backbone": "aido_rna_1b600m"}).eval()
collated_batch = model.collate({"sequences": ["ACGT", "AGCT"]})
# A regression head returns continuous predictions rather than class logits.
predictions = model(collated_batch)
print(predictions)
```
### Get RNA sequence embedding
```python
from modelgenerator.tasks import Embed
model = Embed.from_config({"model.backbone": "aido_rna_1b600m"}).eval()
collated_batch = model.collate({"sequences": ["ACGT", "ACGT"]})
embedding = model(collated_batch)
print(embedding.shape)
print(embedding)
```
## Citation
Please cite AIDO.RNA using the following BibTeX code:
```bibtex
@misc{zou_large-scale_2024,
title = {A Large-Scale Foundation Model for RNA Function and Structure Prediction},
url = {https://www.biorxiv.org/content/10.1101/2024.11.28.625345v1},
doi = {10.1101/2024.11.28.625345},
publisher = {bioRxiv},
author = {Zou, Shuxian and Tao, Tianhua and Mahbub, Sazan and Ellington, Caleb N. and Algayres, Robin and Li, Dian and Zhuang, Yonghao and Wang, Hongyi and Song, Le and Xing, Eric P.},
year = {2024},
}
```