
RNAMamba-14M

This model is a small Mamba-based model trained from scratch on 1.96 million sequences (1.56 billion bases) extracted from the RNAcentral active-sequences FASTA file for release 24 (March 2024).

This is intended to be a sequence-embedding model for downstream processing of ncRNA sequences. It was trained with a masked language modelling objective and a context size of 8,192 nucleotides. The MLM head has been stripped off, so the model is essentially ready to use for embedding as-is. The training sequences range in length from 10 to 8,192 nucleotides, so the model should handle sequences in that range well. It is deliberately small, with only 14.1 million parameters (8 hidden layers, hidden dimension 512, intermediate size 1024), so that it runs quickly without a GPU. We may train something larger if these embeddings turn out not to be good enough.
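
A minimal sketch of how the model might be used to produce an embedding, assuming the repository ships a compatible tokenizer. The repository ID, the toy sequence, and the mean-pooling step are illustrative choices and not prescribed by this card:

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Hypothetical repository ID -- substitute the actual model path.
model_id = "user/RNAMamba-14M"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)
model.eval()

sequence = "AUGGCUACGUUAGC"  # toy ncRNA sequence for illustration

inputs = tokenizer(sequence, return_tensors="pt")
with torch.no_grad():
    outputs = model(input_ids=inputs["input_ids"])

# Mean-pool the final hidden states into one fixed-length vector per sequence.
# Mean pooling is just one common choice; the card does not prescribe a pooling strategy.
embedding = outputs.last_hidden_state.mean(dim=1)  # shape: (1, hidden_size)
print(embedding.shape)
```

Because the MLM head has been removed, the model returns raw hidden states; the pooling step above is simply one way to collapse them into a per-sequence embedding.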

Training hyperparameters

The following hyperparameters were used during training (a configuration sketch follows the list):

  • learning_rate: 2e-05
  • train_batch_size: 32
  • eval_batch_size: 8
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • num_epochs: 1.0
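
A minimal sketch of how these values map onto Hugging Face `TrainingArguments`. The output directory is hypothetical, and dataset preparation, data collator, and `Trainer` wiring are omitted because they are not described in this card:

```python
from transformers import TrainingArguments

# Configuration matching the reported hyperparameters.
# Adam betas=(0.9, 0.999) and epsilon=1e-08 are the library defaults.
training_args = TrainingArguments(
    output_dir="rnamamba-14m",        # hypothetical output directory
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=8,
    seed=42,
    lr_scheduler_type="linear",
    num_train_epochs=1.0,
)
```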

Framework versions

  • Transformers 4.39.3
  • Pytorch 2.2.2+cu118
  • Datasets 2.18.0
  • Tokenizers 0.15.2