Model Overview

Orthrus is a mature RNA foundation model for RNA property prediction. It is pre-trained using contrastive learning on 45M+ mature RNA transcripts to capture functional and evolutionary relationships across mammalian organisms. Orthrus is built on a Mamba encoder backbone, which enables the embedding of arbitrarily long RNA sequences. We offer two sizes of Orthrus: base has ~1M parameters, and large has ~10M parameters.

Two versions of Orthrus are available via HuggingFace (see collection):

  • Orthrus base 4-track: Encodes the mRNA sequence with a simplified one-hot approach.
  • Orthrus large 6-track: Adds biological context by including splice site indicators and coding sequence markers, which are crucial for mRNA property prediction tasks such as RNA half-life, ribosome load, and exon junction detection.

This HF repo contains the orthrus-large-6-track model.

Additional project files and the GitHub repository can be found at:

Using Orthrus (6-track)

To generate embeddings using Orthrus for spliced mature RNA sequences, follow the steps below:

NOTE: Orthrus was trained and built to model full mature RNA sequences, so incomplete pieces of spliced RNA will be out of distribution as input. This differs from existing DNA / RNA foundation models, which model arbitrary genomic segments.

Create and Set Up the Environment

This environment setup was tested with PyTorch 2.2.2 and CUDA 12.1.

  1. Set up the conda environment
conda create --name orthrus
conda activate orthrus
  2. Install required dependencies
pip install torch==2.2.2 --index-url https://download.pytorch.org/whl/cu121
pip install causal_conv1d==1.2.0.post2
pip install mamba-ssm==1.2.0.post1
pip install transformers
  3. Install GenomeKit (Optional)
wget -O starter_build.sh https://raw.githubusercontent.com/deepgenomics/GenomeKit/v6.0.3/starter/build.sh
chmod +x starter_build.sh
./starter_build.sh
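
Before loading the model, it can help to confirm that the pinned versions above are the ones actually installed. The following is a minimal sketch using only the Python standard library; `check_pins` and `matches` are hypothetical helper names, not part of Orthrus. It assumes the packages are registered under the same names used in the pip commands, and it strips the local version suffix (for example `+cu121`) that CUDA-specific PyTorch wheels report.

```python
from importlib import metadata

# Pins copied from the install commands in this section.
PINS = {
    "torch": "2.2.2",
    "causal_conv1d": "1.2.0.post2",
    "mamba-ssm": "1.2.0.post1",
}


def check_pins(pins, get_version=metadata.version):
    """Return {package: (expected, found)}; found is None if not installed."""
    report = {}
    for name, expected in pins.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None
        report[name] = (expected, found)
    return report


def matches(expected, found):
    """Compare versions, ignoring a local suffix such as '+cu121'."""
    return found is not None and found.split("+")[0] == expected


if __name__ == "__main__":
    for name, (expected, found) in check_pins(PINS).items():
        status = "ok" if matches(expected, found) else "MISMATCH"
        print(f"{name}: expected {expected}, found {found} [{status}]")
```

A mismatch does not necessarily mean the model will fail to load; it only indicates that the environment differs from the tested one.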

Load Orthrus from HuggingFace

import torch
from transformers import AutoModel

device = torch.device("cuda")

orthrus_6 = AutoModel.from_pretrained(
    "quietflamingo/orthrus-large-6-track",
    trust_remote_code=True
).to(device)

Get CDS / Splice Track

Orthrus 6-track requires CDS and splice track information. These are binary 1-D arrays, each the same length as the sequence, that denote the position of the first nucleotide of each codon in the CDS and the location of 5' splice sites, respectively. See the main GitHub repo for a guide on how to derive these tracks using GenomeKit. For this example, we will just use dummy values.

import numpy as np

# Fake cds / splice track for example purposes.
cds = np.array([0, 0, 0, 1, 0, 0, 0, 0, 0])
splice = np.array([0, 1, 0, 0, 0, 0, 0, 1, 0])
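
For real transcripts, the two tracks can be derived from annotation coordinates rather than typed by hand. The sketch below is illustrative only: `build_tracks` is a hypothetical helper, not part of Orthrus or GenomeKit. It assumes 0-based transcript coordinates, a half-open CDS interval, and exon lengths given in transcript order. It also assumes that a 5' splice site is marked on the last nucleotide of every exon except the final one; that convention is a guess and should be checked against the guide in the GitHub repo.

```python
def build_tracks(seq_len, cds_start, cds_end, exon_lengths):
    """Build binary CDS and splice tracks for one spliced transcript.

    seq_len      : length of the mature (spliced) RNA sequence
    cds_start    : 0-based index of the first nucleotide of the start codon
    cds_end      : 0-based index one past the last CDS nucleotide
    exon_lengths : exon lengths in transcript order (must sum to seq_len)
    """
    if any(length <= 0 for length in exon_lengths):
        raise ValueError("exon lengths must be positive")
    if sum(exon_lengths) != seq_len:
        raise ValueError("exon lengths must sum to the sequence length")
    if not 0 <= cds_start <= cds_end <= seq_len:
        raise ValueError("CDS interval must lie within the sequence")

    # Mark the first nucleotide of each codon in the CDS.
    cds = [0] * seq_len
    for pos in range(cds_start, cds_end, 3):
        cds[pos] = 1

    # ASSUMPTION: mark the last nucleotide of each exon except the final one.
    splice = [0] * seq_len
    exon_end = 0
    for length in exon_lengths[:-1]:
        exon_end += length
        splice[exon_end - 1] = 1

    return cds, splice


# Example: a 9-nt transcript made of two exons (4 nt + 5 nt), fully coding.
cds_list, splice_list = build_tracks(9, 0, 9, [4, 5])
print(cds_list)     # [1, 0, 0, 1, 0, 0, 1, 0, 0]
print(splice_list)  # [0, 0, 0, 1, 0, 0, 0, 0, 0]
```

The returned lists can be wrapped with `np.array(...)` to obtain arrays of the same form as the dummy `cds` and `splice` values.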

Get Sequence Embeddings

sequence = "ATGATGATG"

seq_ohe = orthrus_6.seq_to_oh(sequence).numpy()

# Combine input tracks
model_input = np.hstack((
    seq_ohe,
    cds.reshape(-1, 1),
    splice.reshape(-1, 1)
))

model_input_tt = torch.Tensor(model_input).to(device)
model_input_tt = model_input_tt.unsqueeze(0)

lengths = torch.Tensor([model_input_tt.shape[1]]).to(device)

embedding = orthrus_6.representation(
    model_input_tt,  # (1 x L x 6)
    lengths,  # (1,)
    channel_last=True
)

print(embedding.shape)  # (1 x 512)
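
The `lengths` argument suggests that several transcripts of different lengths can be embedded in one call. The sketch below shows one way to prepare such a batch with NumPy; `pad_batch` is a hypothetical helper, not part of Orthrus. It assumes each transcript has already been assembled into an (L x 6) array as above, and that `representation` uses `lengths` to ignore zero-padded positions, which this card does not state explicitly.

```python
import numpy as np


def pad_batch(inputs):
    """Zero-pad a list of (L_i, C) arrays into one (B, L_max, C) array.

    Returns the padded batch and the original lengths, in input order.
    """
    if not inputs:
        raise ValueError("need at least one input array")
    n_tracks = inputs[0].shape[1]
    lengths = np.array([x.shape[0] for x in inputs], dtype=np.int64)
    batch = np.zeros((len(inputs), int(lengths.max()), n_tracks), dtype=np.float32)
    for i, x in enumerate(inputs):
        if x.ndim != 2 or x.shape[1] != n_tracks:
            raise ValueError("all inputs must be 2-D with the same track count")
        # Copy the real positions; everything after them stays zero.
        batch[i, : x.shape[0]] = x
    return batch, lengths


# Example with two placeholder transcripts of length 9 and 4.
a = np.ones((9, 6), dtype=np.float32)
b = np.ones((4, 6), dtype=np.float32)
batch, lengths = pad_batch([a, b])
print(batch.shape)  # (2, 9, 6)
print(lengths)      # [9 4]
```

The padded batch and lengths would then be converted to tensors and passed to `representation` in the same way as the single-sequence example, with `channel_last=True`.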

An example of sequence embedding using Orthrus is shown in this Colab notebook.

Citation

@article{orthrus_fradkin_shi_2024,
  title = {Orthrus: Towards Evolutionary and Functional RNA Foundation Models},
  url = {http://dx.doi.org/10.1101/2024.10.10.617658},
  DOI = {10.1101/2024.10.10.617658},
  publisher = {Cold Spring Harbor Laboratory},
  author = {Fradkin, Philip and Shi, Ruian and Isaev, Keren and Frey, Brendan J and Morris, Quaid and Lee, Leo J and Wang, Bo},
  year = {2024},
  month = oct
}