# AIDO.RNA-1.6B

AIDO.RNA-1.6B is a general-purpose RNA foundation model with 1.6 billion parameters, trained on 42 million non-coding RNA sequences at single-nucleotide resolution. It achieves state-of-the-art performance on a comprehensive set of tasks, including RNA secondary structure prediction, mRNA-related tasks, RNA function prediction, and RNA inverse folding. After domain adaptation, AIDO.RNA excels in modeling protein-level tasks, highlighting its potential to leverage the central dogma for enhancing biomolecular representations. For more detailed information, please refer to [our paper](https://www.biorxiv.org/content/10.1101/2024.11.28.625345v1).

<p align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/63008d4bc1e149ceaff724a3/mNqn5SKQFHxSby3E2dosE.png" alt="description" style="width:80%; height:auto;">
</p>

## Model architectural details
AIDO.RNA is an encoder-only transformer pre-trained with a masked language modeling (MLM) objective. The model architecture parameters are as follows:

|   hyperparameter  |  value     |
| :---:             |    :----:  |
| num-layers        | 32         |
| hidden-size       | 2,048      |
| ffn-hidden-size   | 5,440      |
| num-attn-heads    | 32         |
| vocab-size        | 16         |
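
As a rough sanity check, the hyperparameters above can be turned into a back-of-the-envelope parameter count. The sketch below counts only the attention projections, the feed-forward matrices, and the token embedding, and ignores biases and normalization weights. The number of feed-forward matrices is not stated in this card, so `ffn_matrices` is an assumption: 3 corresponds to a gated feed-forward block, 2 to a plain two-layer one. `estimate_params` is an illustrative helper, not part of any library.

```python
def estimate_params(num_layers, hidden, ffn_hidden, vocab, ffn_matrices):
    """Approximate weight count of an encoder-only transformer (no biases or norms)."""
    attn = 4 * hidden * hidden                 # query, key, value, and output projections
    ffn = ffn_matrices * hidden * ffn_hidden   # ASSUMPTION: 3 = gated FFN, 2 = plain FFN
    embed = vocab * hidden                     # token embedding table
    return num_layers * (attn + ffn) + embed

# Values taken from the hyperparameter table above.
gated = estimate_params(32, 2048, 5440, 16, ffn_matrices=3)
plain = estimate_params(32, 2048, 5440, 16, ffn_matrices=2)
print(f"gated FFN: {gated:,}")
print(f"plain FFN: {plain:,}")
```

Under these assumptions the gated variant comes to about 1.61 billion weights, which is consistent with the 1.6B in the model name, while the plain variant comes to about 1.25 billion.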


## Pre-training data
The pre-training data contains 42 million unique ncRNA sequences from RNAcentral version 24.0. 
<p align="center">
  <img src="https://cdn-uploads.huggingface.co/production/uploads/63008d4bc1e149ceaff724a3/EKvuUI9mBw5hkErzpXKm9.png" alt="description" style="width:90%; height:auto;">
</p>

## Downstream evaluation
<p align="center">
  <img src="https://cdn-uploads.huggingface.co/production/uploads/63008d4bc1e149ceaff724a3/uvII1Q_1vDe95WCP1RgUV.png" alt="description" style="width:90%; height:auto;">
</p>


## How to Use
### Build any downstream models from this backbone with ModelGenerator
For more information, visit [ModelGenerator](https://github.com/genbio-ai/modelgenerator).
```bash
mgen fit --model SequenceClassification --model.backbone aido_rna_1b600m --data SequenceClassificationDataModule --data.path <hf_or_local_path_to_your_dataset>
mgen test --model SequenceClassification --model.backbone aido_rna_1b600m --data SequenceClassificationDataModule --data.path <hf_or_local_path_to_your_dataset>
```

### Or use directly in Python
#### Embedding
```python
from modelgenerator.tasks import Embed
model = Embed.from_config({"model.backbone": "aido_rna_1b600m"}).eval()
collated_batch = model.collate({"sequences": ["ACGT", "AGCT"]})
embedding = model(collated_batch)
print(embedding.shape)
print(embedding)
```
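
Downstream code often needs a single vector per sequence rather than one per token. The following is a minimal sketch of masked mean pooling, assuming the embedding is laid out as (batch, length, hidden) and that padded positions are flagged by a 0/1 mask. Both are assumptions, so check `embedding.shape` from the example above first. Plain Python lists stand in for tensors, and `mean_pool` is a hypothetical helper, not part of ModelGenerator.

```python
def mean_pool(token_embeddings, mask):
    """Average token vectors per sequence, skipping masked-out (padding) positions."""
    pooled = []
    for seq_vectors, seq_mask in zip(token_embeddings, mask):
        kept = [vec for vec, keep in zip(seq_vectors, seq_mask) if keep]
        if not kept:
            raise ValueError("sequence has no unmasked tokens")
        hidden = len(kept[0])
        pooled.append([sum(vec[i] for vec in kept) / len(kept) for i in range(hidden)])
    return pooled

# Stand-in data: 2 sequences, 3 positions, hidden size 2 (toy values only).
toy_embeddings = [
    [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]],   # last position is padding
    [[2.0, 2.0], [4.0, 6.0], [6.0, 10.0]],
]
toy_mask = [[1, 1, 0], [1, 1, 1]]
print(mean_pool(toy_embeddings, toy_mask))
```

With real tensors the same idea is usually written as a masked sum divided by the mask count along the length dimension.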
#### Sequence-level Classification
```python
import torch
from modelgenerator.tasks import SequenceClassification
model = SequenceClassification.from_config({"model.backbone": "aido_rna_1b600m", "model.n_classes": 2}).eval()
collated_batch = model.collate({"sequences": ["ACGT", "AGCT"]})
logits = model(collated_batch)
print(logits)
print(torch.argmax(logits, dim=-1))
```
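
The logits printed above are unnormalized scores. A minimal sketch of turning one row of logits into class probabilities with a softmax, assuming the logits have shape (batch, n_classes) as the `argmax` over the last dimension suggests. The values below are toy numbers, not model output, and `softmax` is written in plain Python for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]   # subtract the max to avoid overflow
    total = sum(exps)
    return [e / total for e in exps]

# Toy logits for two sequences and two classes (illustrative values only).
toy_logits = [[2.0, 0.0], [-1.0, 1.0]]
for row in toy_logits:
    probs = softmax(row)
    predicted = max(range(len(probs)), key=probs.__getitem__)   # same choice as argmax
    print(predicted, [round(p, 3) for p in probs])
```

With tensors, `torch.softmax(logits, dim=-1)` performs the same computation for the whole batch.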
#### Token-level Classification
```python
import torch
from modelgenerator.tasks import TokenClassification
model = TokenClassification.from_config({"model.backbone": "aido_rna_1b600m", "model.n_classes": 3}).eval()
collated_batch = model.collate({"sequences": ["ACGT", "AGCT"]})
logits = model(collated_batch)
print(logits)
print(torch.argmax(logits, dim=-1))
```
#### Sequence-level Regression
```python
from modelgenerator.tasks import SequenceRegression
model = SequenceRegression.from_config({"model.backbone": "aido_rna_1b600m"}).eval()
collated_batch = model.collate({"sequences": ["ACGT", "AGCT"]})
predictions = model(collated_batch)  # regression outputs are continuous values, not logits
print(predictions)
```

### Get RNA sequence embedding
```python
from modelgenerator.tasks import Embed
model = Embed.from_config({"model.backbone": "aido_rna_1b600m"}).eval()
collated_batch = model.collate({"sequences": ["ACGT", "ACGT"]})
embedding = model(collated_batch)
print(embedding.shape)
print(embedding)
```

## Citation
Please cite AIDO.RNA using the following BibTeX code:
```bibtex
@misc{zou_large-scale_2024,
	title = {A Large-Scale Foundation Model for RNA Function and Structure Prediction},
	url = {https://www.biorxiv.org/content/10.1101/2024.11.28.625345v1},
	doi = {10.1101/2024.11.28.625345},
	publisher = {bioRxiv},
	author = {Zou, Shuxian and Tao, Tianhua and Mahbub, Sazan and Ellington, Caleb N. and Algayres, Robin and Li, Dian and Zhuang, Yonghao and Wang, Hongyi and Song, Le and Xing, Eric P.},
	year = {2024},
}
```