SpliceBERT-1024nt

SpliceBERT is a BERT-based RNA language model pre-trained on over 2 million vertebrate primary RNA sequences using a masked language modeling (MLM) objective. The 1024nt variant is trained on variable-length fragments (64-1024 nt) from 72 vertebrates.

Architecture

| Parameter | Value |
|---|---|
| Layers | 6 |
| Attention heads | 16 |
| Embedding dimension | 512 |
| Intermediate dimension | 2048 |
| Vocabulary size | 10 |
| Positional encoding | Learned absolute |
| Architecture | BERT encoder |
| Max sequence length | 1024 |
| Parameters | ~19.4M |

Vocabulary: [PAD]=0, [UNK]=1, [CLS]=2, [SEP]=3, [MASK]=4, N=5, A=6, C=7, G=8, T/U=9
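
To make the ID mapping concrete, here is a minimal sketch that encodes a raw sequence by hand. The `encode` helper is hypothetical and is not part of the released tokenizer. It assumes the sequence is wrapped in [CLS] and [SEP], as in the usage examples, and that any character outside N/A/C/G/T/U maps to [UNK], which may differ from the real tokenizer's behavior.

```python
# Token IDs copied from the vocabulary line above (U shares the ID of T).
VOCAB = {
    "[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "[MASK]": 4,
    "N": 5, "A": 6, "C": 7, "G": 8, "T": 9,
}


def encode(seq: str) -> list[int]:
    """Hypothetical helper: map a raw RNA/DNA string to token IDs."""
    ids = [VOCAB["[CLS]"]]
    # Upper-case and convert U to T before the lookup.
    for base in seq.upper().replace("U", "T"):
        # Assumption: unrecognised characters fall back to [UNK].
        ids.append(VOCAB.get(base, VOCAB["[UNK]"]))
    ids.append(VOCAB["[SEP]"])
    return ids


print(encode("ACGU"))  # [2, 6, 7, 8, 9, 3]
```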

Pretraining

  • Objective: Masked language modeling (MLM)
  • Data: >2 million vertebrate primary RNA sequences from 72 species
  • Sequence format: Single-nucleotide tokenization with spaces; U converted to T
  • Source checkpoint: SpliceBERT.1024nt/pytorch_model.bin (from zenodo:7995778)
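
As an illustration of the MLM input format, the sketch below turns a raw sequence into space-separated tokens and replaces a random subset with [MASK]. The `mask_sequence` helper and its parameters are hypothetical. The masking rate and replacement scheme used for SpliceBERT are not stated here, so the default of 0.15 is only the common BERT choice, and BERT's random and keep replacements are left out.

```python
import random


def mask_sequence(seq: str, mask_prob: float = 0.15, seed: int = 0) -> tuple[str, list[int]]:
    """Return (space-separated masked text, list of masked positions).

    Assumption: mask_prob=0.15 is the usual BERT default, not a value
    confirmed for SpliceBERT pretraining.
    """
    rng = random.Random(seed)  # local RNG so results are reproducible
    # Single-nucleotide tokens, with U converted to T.
    tokens = list(seq.upper().replace("U", "T"))
    masked_positions = [i for i in range(len(tokens)) if rng.random() < mask_prob]
    for i in masked_positions:
        tokens[i] = "[MASK]"
    return " ".join(tokens), masked_positions


text, positions = mask_sequence("ACGUACGUACGUACGU")
print(text)
print(positions)
```

The returned text has the same shape as the `"A C G [MASK] A C G T"` input used in the MLM example under Usage.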

Checkpoint selection

The 1024nt variant is the primary SpliceBERT model, trained on variable-length vertebrate sequences (64-1024 nt). Use it for general-purpose RNA embedding. The 510nt variants are trained on fixed-length fragments and require input of exactly 510 nt.
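
Transcripts longer than the model's context have to be split before embedding. The sketch below shows one possible way to do that. The `split_into_windows` helper is hypothetical, and the default of 1022 nt assumes that the 1024-token limit includes the [CLS] and [SEP] tokens, which is not confirmed here.

```python
def split_into_windows(seq: str, max_nt: int = 1022, overlap: int = 0) -> list[str]:
    """Split a sequence into windows of at most max_nt nucleotides.

    Assumption: max_nt=1022 leaves room for [CLS] and [SEP] within a
    1024-token limit. Consecutive windows share `overlap` nucleotides.
    """
    if max_nt <= 0 or not 0 <= overlap < max_nt:
        raise ValueError("need max_nt > 0 and 0 <= overlap < max_nt")
    step = max_nt - overlap
    windows = []
    for start in range(0, len(seq), step):
        windows.append(seq[start:start + max_nt])
        if start + max_nt >= len(seq):
            break  # this window already reaches the end of the sequence
    return windows


windows = split_into_windows("ACGU" * 625)  # 2500 nt
print([len(w) for w in windows])  # [1022, 1022, 456]
```

Since the model was trained on fragments of 64-1024 nt, a very short final window may fall outside the training distribution; merging it into the previous window with overlap is one option.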

Parity Verification

Hidden-state representations were verified against the original checkpoint (maximum absolute difference < 1e-5) at all 7 representation levels (embedding output plus 6 transformer layers), for both the eager and sdpa attention backends. Verification was run on GPU with PyTorch 2.7 / CUDA 11.8.
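
A check of this kind can be sketched as follows. This is not the script used for the verification above; the function names are hypothetical. It assumes both models were run on the same input with `output_hidden_states=True`, so each returns a tuple of 7 tensors.

```python
import torch


def max_abs_diff_per_level(ref_states, new_states):
    """Maximum absolute difference for each pair of hidden-state tensors."""
    if len(ref_states) != len(new_states):
        raise ValueError("hidden-state tuples differ in length")
    return [
        (ref.float() - new.float()).abs().max().item()
        for ref, new in zip(ref_states, new_states)
    ]


def check_parity(ref_states, new_states, tol: float = 1e-5):
    """Return (passed, per-level diffs); passed means every diff is below tol."""
    diffs = max_abs_diff_per_level(ref_states, new_states)
    return all(d < tol for d in diffs), diffs


# Toy stand-ins for the 7 levels (embedding output + 6 layers).
ref = [torch.randn(1, 18, 512) for _ in range(7)]
new = [t.clone() for t in ref]
passed, diffs = check_parity(ref, new)
print(passed, len(diffs))
```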

Related Models

See the full SpliceBERT collection.

| Model | Context | Training data | Notes |
|---|---|---|---|
| SpliceBERT-1024nt | 1024 nt | 72 vertebrates | This model |
| SpliceBERT-510nt | 510 nt (fixed) | 72 vertebrates | Fixed-length; requires exact 510 nt input |
| SpliceBERT-human-510nt | 510 nt (fixed) | Human only | Human-specific; requires exact 510 nt input |

Usage

Embedding generation

The tokenizer automatically handles U->T conversion and single-nucleotide spacing. Pass raw sequences directly.

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/SpliceBERT-1024nt", trust_remote_code=True)
model = AutoModel.from_pretrained("Taykhoom/SpliceBERT-1024nt", trust_remote_code=True)
model.eval()

seq = "ACGUACGUACGUACGU"  # U->T handled automatically
enc = tokenizer(seq, return_tensors="pt")

with torch.no_grad():
    out = model(**enc, output_hidden_states=True)

# Mean pooling over non-special tokens
hidden = out.last_hidden_state[0]  # (seq_len+2, 512)
token_emb = hidden[1:-1]           # strip [CLS] and [SEP]
mean_emb = token_emb.mean(dim=0)   # (512,)

# Intermediate layers
layer3_emb = out.hidden_states[3]  # (1, seq_len+2, 512)

MLM logits

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/SpliceBERT-1024nt", trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained("Taykhoom/SpliceBERT-1024nt", trust_remote_code=True)
model.eval()

seq = "A C G [MASK] A C G T"
enc = tokenizer(seq, return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits  # (1, seq_len+2, 10), including [CLS] and [SEP]
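
To read a prediction out of these logits, one option is to renormalize over the five nucleotide tokens at each [MASK] position. The sketch below assumes the token IDs from the Vocabulary line. The `predict_masked` helper is hypothetical, and restricting the softmax to N/A/C/G/T is a choice made for this example rather than part of the model.

```python
import torch

# IDs taken from the vocabulary listing: N=5, A=6, C=7, G=8, T/U=9, [MASK]=4.
ID_TO_TOKEN = {5: "N", 6: "A", 7: "C", 8: "G", 9: "T"}
MASK_ID = 4


def predict_masked(input_ids: torch.Tensor, logits: torch.Tensor) -> list[tuple[int, str, float]]:
    """For one sequence, return (position, nucleotide, probability) per [MASK].

    input_ids has shape (L,) and logits has shape (L, vocab_size).
    """
    nuc_ids = torch.tensor(sorted(ID_TO_TOKEN))
    results = []
    for pos in (input_ids == MASK_ID).nonzero(as_tuple=True)[0].tolist():
        # Softmax over nucleotide tokens only, ignoring special tokens.
        probs = torch.softmax(logits[pos, nuc_ids], dim=-1)
        best = int(probs.argmax())
        results.append((pos, ID_TO_TOKEN[int(nuc_ids[best])], float(probs[best])))
    return results


# Toy example: position 4 is masked and the logits favour C.
toy_ids = torch.tensor([2, 6, 7, 8, 4, 6, 3])
toy_logits = torch.zeros(7, 10)
toy_logits[4, 7] = 10.0
print(predict_masked(toy_ids, toy_logits))
```

With the model output above, the call would be `predict_masked(enc["input_ids"][0], logits[0])`.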

Fine-tuning

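Fine-tuning follows standard Hugging Face conventions. For sequence-level tasks, mean-pool the hidden states of the non-special tokens (for a single unpadded sequence, the slice hidden[1:-1], which drops [CLS] and [SEP]) and feed the result to a prediction head.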
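
For padded batches, slicing off the first and last positions is not enough, because [PAD] tokens would enter the average. The sketch below shows one way to pool over non-special tokens only, followed by a small linear head. `masked_mean_pool` and `SequenceHead` are hypothetical names, and the special-token IDs are taken from the Vocabulary line.

```python
import torch
from torch import nn

SPECIAL_IDS = (0, 2, 3)  # [PAD], [CLS], [SEP] per the vocabulary listing


def masked_mean_pool(hidden: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
    """Average hidden states over non-special tokens: (B, L, H) -> (B, H)."""
    keep = torch.ones_like(input_ids, dtype=torch.bool)
    for sid in SPECIAL_IDS:
        keep &= input_ids != sid
    weights = keep.unsqueeze(-1).to(hidden.dtype)  # (B, L, 1)
    # clamp avoids division by zero for rows with no real tokens
    return (hidden * weights).sum(dim=1) / weights.sum(dim=1).clamp(min=1.0)


class SequenceHead(nn.Module):
    """Hypothetical prediction head: mean-pooled embedding -> label logits."""

    def __init__(self, hidden_size: int = 512, num_labels: int = 2):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, hidden: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
        return self.classifier(masked_mean_pool(hidden, input_ids))


# Toy batch: the first row is padded by one position.
ids = torch.tensor([[2, 6, 7, 3, 0], [2, 6, 7, 8, 3]])
hidden = torch.randn(2, 5, 512)
print(SequenceHead()(hidden, ids).shape)  # torch.Size([2, 2])
```

In a fine-tuning loop, `hidden` would be `out.last_hidden_state` from the encoder and `ids` would be `enc["input_ids"]` from a tokenizer call with padding enabled.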

Implementation Notes

The original checkpoint was saved as BertForMaskedLM with transformers==4.24.0. This port uses BERT-updated, which adds attn_implementation="sdpa" and attn_implementation="flash_attention_2" support not present in the original codebase.

model = AutoModel.from_pretrained("Taykhoom/SpliceBERT-1024nt",
                                  trust_remote_code=True,
                                  attn_implementation="sdpa")
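
When the same script runs on machines with different setups, the backend can be chosen at runtime. The sketch below is one possible approach and `pick_attn_implementation` is a hypothetical helper. It only checks whether the `flash_attn` package is importable; FlashAttention-2 generally also needs a CUDA GPU and fp16 or bf16 weights, which this check does not cover.

```python
import importlib.util


def pick_attn_implementation(prefer_flash: bool = True) -> str:
    """Return an attn_implementation string for from_pretrained.

    Assumption: the presence of the flash_attn package is treated as
    sufficient for "flash_attention_2"; hardware and dtype are not checked.
    """
    if prefer_flash and importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"


print(pick_attn_implementation())
```

The returned string can be passed as `attn_implementation=pick_attn_implementation()` in the `from_pretrained` call above.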

Citation

@article{chen2024_splicebert,
  title   = {Self-supervised learning on millions of primary {RNA} sequences from 72 vertebrates improves sequence-based {RNA} splicing prediction},
  author  = {Chen, Ken and Zhou, Yue and Ding, Maolin and Wang, Yu and Ren, Zhixiang and Yang, Yuedong},
  journal = {Briefings in Bioinformatics},
  volume  = {25},
  number  = {3},
  pages   = {bbae163},
  year    = {2024},
  doi     = {10.1093/bib/bbae163}
}

Credits

Original model and code by Chen et al. Source: GitHub. The HF conversion code was authored primarily by Claude Code and reviewed manually by Taykhoom Dalal.

License

MIT, following the original repository.
