Model Overview
Orthrus is a foundation model for mature RNA, built for RNA property prediction. Orthrus is pre-trained using contrastive learning on 45M+ mature RNA transcripts to capture functional and evolutionary relationships across all mammalian organisms. Orthrus is built on a Mamba encoder backbone, which allows it to embed arbitrarily long RNA sequences. We offer two sizes of Orthrus: base has ~1M parameters, and large has ~10M parameters.
Two versions of Orthrus are available for use via HuggingFace (See collection):
- Orthrus base 4-track: Encodes the mRNA sequence with a simplified one-hot approach.
- Orthrus large 6-track: Adds biological context by including splice site indicators and coding sequence markers, which is crucial for accurate mRNA property prediction such as RNA half-life, ribosome load, and exon junction detection.
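To make the difference between the two track layouts concrete, here is a minimal NumPy sketch of a 4-track one-hot encoding and of how two annotation columns extend it to six tracks. The helper names and the A, C, G, T channel order are assumptions made for illustration; the model's own `seq_to_oh` method defines the encoding actually used.

```python
import numpy as np

# Assumed channel order; the model's own seq_to_oh defines the real one.
ALPHABET = "ACGT"


def one_hot_4track(sequence: str) -> np.ndarray:
    """Return an (L x 4) one-hot matrix; unknown bases (e.g. N) stay all-zero."""
    seq = sequence.upper().replace("U", "T")
    encoded = np.zeros((len(seq), len(ALPHABET)), dtype=np.float32)
    for position, base in enumerate(seq):
        channel = ALPHABET.find(base)
        if channel >= 0:
            encoded[position, channel] = 1.0
    return encoded


def add_annotation_tracks(seq_ohe, cds, splice) -> np.ndarray:
    """Append CDS and splice columns to an (L x 4) matrix, giving (L x 6)."""
    cds = np.asarray(cds, dtype=np.float32).reshape(-1, 1)
    splice = np.asarray(splice, dtype=np.float32).reshape(-1, 1)
    if not (len(seq_ohe) == len(cds) == len(splice)):
        raise ValueError("all tracks must have the same length")
    return np.hstack((seq_ohe, cds, splice))
```

The length check is there because a track that is off by one nucleotide would otherwise only surface as a shape error at stacking time.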
This HF repo contains the orthrus-large-6-track model.
Additional project files and the GitHub repository can be found at:
Using Orthrus (6-track)
To generate embeddings using Orthrus for spliced mature RNA sequences, follow the steps below:
NOTE: Orthrus was trained and built to model full mature RNA sequences, so incomplete pieces of spliced RNA are out of distribution as input. This differs from existing DNA / RNA foundation models, which model arbitrary genomic segments.
Create and Set Up the Environment
This environment setup was tested with PyTorch 2.2.2 and CUDA 12.1.
- Set up the conda environment
conda create --name orthrus
conda activate orthrus
- Install required dependencies
pip install torch==2.2.2 --index-url https://download.pytorch.org/whl/cu121
pip install causal_conv1d==1.2.0.post2
pip install mamba-ssm==1.2.0.post1
pip install transformers
- Install GenomeKit (Optional)
wget -O starter_build.sh https://raw.githubusercontent.com/deepgenomics/GenomeKit/v6.0.3/starter/build.sh
chmod +x starter_build.sh
./starter_build.sh
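After installation, a quick check that the required packages are visible to the active interpreter can save a confusing import error later. The sketch below uses only the standard library; the `check_environment` helper is hypothetical, and it only confirms that each module can be located, not that the pinned versions or CUDA are set up correctly.

```python
import importlib.util

# Import names for the packages installed above (pip names may differ,
# e.g. mamba-ssm is imported as mamba_ssm).
REQUIRED_MODULES = ("torch", "causal_conv1d", "mamba_ssm", "transformers")


def check_environment(modules=REQUIRED_MODULES):
    """Map each module name to True if it can be found, without importing it."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}


if __name__ == "__main__":
    for name, found in check_environment().items():
        print(f"{name}: {'ok' if found else 'MISSING'}")
```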
Load Orthrus from HuggingFace
import torch
from transformers import AutoModel
device = torch.device("cuda")
orthrus_6 = AutoModel.from_pretrained(
"quietflamingo/orthrus-large-6-track",
trust_remote_code=True
).to(device)
Get CDS / Splice Track
Orthrus 6-track requires CDS and splice track information. These are binary 1-D arrays with one entry per nucleotide: the CDS track marks the first nucleotide of each codon in the CDS, and the splice track marks the location of 5' splice sites. See the main GitHub repo for a guide on how to derive these tracks using GenomeKit. For this example, we will just use dummy values.
import numpy as np
# Fake cds / splice track for example purposes.
cds = np.array([0, 0, 0, 1, 0, 0, 0, 0, 0])
splice = np.array([0, 1, 0, 0, 0, 0, 0, 1, 0])
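For real transcripts, the two tracks can be derived from transcript-space coordinates. The following is a rough sketch rather than the project's own code: the helper names are hypothetical, `cds_start` and `cds_end` are assumed to be 0-based, end-exclusive offsets into the spliced transcript, and the splice marker is assumed to sit on the last nucleotide of every exon except the final one. The GenomeKit guide in the main repository is the authority on the exact convention.

```python
import numpy as np


def build_cds_track(transcript_len: int, cds_start: int, cds_end: int) -> np.ndarray:
    """Mark the first nucleotide of each codon in [cds_start, cds_end)."""
    track = np.zeros(transcript_len, dtype=np.int64)
    track[cds_start:cds_end:3] = 1
    return track


def build_splice_track(exon_lengths) -> np.ndarray:
    """Mark the last nucleotide of each exon except the final one (assumed convention)."""
    exon_lengths = np.asarray(exon_lengths, dtype=np.int64)
    track = np.zeros(int(exon_lengths.sum()), dtype=np.int64)
    # Cumulative exon ends in transcript space; drop the final exon,
    # which has no downstream 5' splice site.
    donor_positions = np.cumsum(exon_lengths)[:-1] - 1
    track[donor_positions] = 1
    return track


# Usage: a 9-nt transcript made of exons of length 3, 4 and 2,
# with a CDS covering the whole transcript.
cds = build_cds_track(9, cds_start=0, cds_end=9)
splice = build_splice_track([3, 4, 2])
```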
Get Sequence Embeddings
sequence = "ATGATGATG"
seq_ohe = orthrus_6.seq_to_oh(sequence).numpy()
# Combine input tracks
model_input = np.hstack((
seq_ohe,
cds.reshape(-1, 1),
splice.reshape(-1, 1)
))
model_input_tt = torch.Tensor(model_input).to(device)
model_input_tt = model_input_tt.unsqueeze(0)
lengths = torch.Tensor([model_input_tt.shape[1]]).to(device)
embedding = orthrus_6.representation(
model_input_tt, # (1 x L x 6)
lengths, # (1,)
channel_last=True
)
print(embedding.shape) # (1 x 512)
An example of sequence embedding using Orthrus is shown in this Colab notebook.
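To embed several transcripts of different lengths in one call, the per-transcript input matrices need to be padded to a common length. The sketch below shows one way to do that with NumPy; `pad_batch` is a hypothetical helper, and it assumes that `representation` accepts a zero-padded `(B x L_max x 6)` batch together with the true lengths, which the `(1 x L x 6)` and `(1,)` shapes above suggest but which is not confirmed here.

```python
import numpy as np


def pad_batch(inputs):
    """Right-pad a list of (L_i x C) arrays with zeros to (B x L_max x C).

    Returns the padded batch and the original length of each sequence.
    """
    lengths = np.array([x.shape[0] for x in inputs], dtype=np.int64)
    channels = inputs[0].shape[1]
    batch = np.zeros((len(inputs), int(lengths.max()), channels), dtype=np.float32)
    for i, x in enumerate(inputs):
        batch[i, : x.shape[0]] = x
    return batch, lengths
```

The returned arrays can then be converted to tensors and moved to the device in the same way as `model_input_tt` and `lengths` in the single-sequence example.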
Citation
@article{orthrus_fradkin_shi_2024,
title = {Orthrus: Towards Evolutionary and Functional RNA Foundation Models},
url = {http://dx.doi.org/10.1101/2024.10.10.617658},
DOI = {10.1101/2024.10.10.617658},
publisher = {Cold Spring Harbor Laboratory},
author = {Fradkin, Philip and Shi, Ruian and Isaev, Keren and Frey, Brendan J and Morris, Quaid and Lee, Leo J and Wang, Bo},
year = {2024},
month = oct
}