SpliceBERT-human-510nt

SpliceBERT is a BERT-based RNA language model pre-trained on primary RNA sequences using a masked language modeling (MLM) objective. This human-specific 510 nt variant is trained exclusively on fixed-length 510 nt fragments of human primary RNA sequences.

WARNING: This model expects exactly 510 nt of input (excluding [CLS] and [SEP]). Without fine-tuning, sequences shorter or longer than 510 nt may produce unreliable outputs. For general-purpose RNA embedding, use SpliceBERT-1024nt instead.

Architecture

| Parameter | Value |
|---|---|
| Layers | 6 |
| Attention heads | 16 |
| Embedding dimension | 512 |
| Intermediate dimension | 2048 |
| Vocabulary size | 10 |
| Positional encoding | Learned absolute |
| Architecture | BERT encoder |
| Max sequence length | 510 (fixed-length training) |
| Parameters | ~19.2M |
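
As a sanity check on model size, the parameter count can be worked out from the dimensions listed above. The sketch below is rough arithmetic, not output from the checkpoint. It assumes 512 position embeddings (510 nt plus [CLS] and [SEP]), 2 token-type embeddings, biases on every linear layer, and no pooler or MLM head; none of these are stated in this card.

```python
# Rough parameter count for a BERT encoder with the dimensions listed above.
# Assumptions (not stated in the card): 512 position embeddings, 2 token-type
# embeddings, biases on all linear layers, no pooler and no MLM head.
hidden, intermediate, layers, vocab = 512, 2048, 6, 10
positions, token_types = 512, 2


def linear(n_in, n_out):
    """Parameters in a dense layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out


layer_norm = 2 * hidden  # scale and shift vectors

# Word, position, and token-type embedding tables plus one LayerNorm.
embeddings = (vocab + positions + token_types) * hidden + layer_norm
# Query, key, value, and output projections plus one LayerNorm.
attention = 4 * linear(hidden, hidden) + layer_norm
# Two dense layers (up and down projection) plus one LayerNorm.
feed_forward = linear(hidden, intermediate) + linear(intermediate, hidden) + layer_norm

total = embeddings + layers * (attention + feed_forward)
print(f"{total:,} parameters (~{total / 1e6:.1f}M)")  # 19,183,616 parameters (~19.2M)
```

Under these assumptions the encoder comes to roughly 19.2M parameters; adding a pooler or MLM head would raise the figure slightly.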

Vocabulary: [PAD]=0, [UNK]=1, [CLS]=2, [SEP]=3, [MASK]=4, N=5, A=6, C=7, G=8, T/U=9
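
To make the id mapping concrete, here is a minimal sketch that turns a nucleotide string into input ids using only the vocabulary above. The `encode` helper is hypothetical and written for illustration; it is not the tokenizer shipped with the model, and sending unrecognized characters to [UNK] is an assumption.

```python
# Token ids copied from the vocabulary line above.
VOCAB = {
    "[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "[MASK]": 4,
    "N": 5, "A": 6, "C": 7, "G": 8, "T": 9,
}


def encode(seq):
    """Hypothetical helper: map a nucleotide string to a list of input ids."""
    seq = seq.upper().replace("U", "T")  # T and U share id 9
    # Assumption: any character outside the vocabulary becomes [UNK].
    body = [VOCAB.get(base, VOCAB["[UNK]"]) for base in seq]
    return [VOCAB["[CLS]"]] + body + [VOCAB["[SEP]"]]


print(encode("ACGUN"))  # [2, 6, 7, 8, 9, 5, 3]
```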

Pretraining

  • Objective: Masked language modeling (MLM)
  • Data: Human primary RNA sequences
  • Sequence format: Single-nucleotide tokenization with spaces; U converted to T; fixed 510 nt fragments
  • Source checkpoint: SpliceBERT-human.510nt/pytorch_model.bin (from zenodo:7995778)
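
The sequence format described above can be sketched as follows. The `to_fragments` helper is hypothetical, and cutting a transcript into non-overlapping windows while dropping the short tail is an assumption; the card does not say how the original fragments were sampled.

```python
def to_fragments(seq, size=510):
    """Hypothetical helper: format a transcript as fixed-length training fragments.

    Assumption: non-overlapping windows, with any tail shorter than `size`
    discarded.
    """
    seq = seq.upper().replace("U", "T")  # U converted to T
    n_full = len(seq) // size
    # Single-nucleotide tokens separated by spaces.
    return [" ".join(seq[i * size:(i + 1) * size]) for i in range(n_full)]


fragments = to_fragments("ACGU" * 300)  # 1200 nt: two full fragments, 180 nt dropped
print(len(fragments), len(fragments[0].split()))  # 2 510
```

The Usage example passes an unspaced string to the released tokenizer, so the spacing step may only matter when reproducing the original pretraining format.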

Checkpoint selection

This human-only variant may outperform the multi-species 510nt model on human-specific splicing tasks. For cross-species generalization or variable-length sequences, use SpliceBERT-1024nt.

Parity Verification

Hidden-state representations were verified against the original checkpoint (max abs diff < 1e-5) at all 7 representation levels (embedding output plus 6 transformer layers), with both the eager and sdpa attention backends. Verification was run on GPU with PyTorch 2.7 / CUDA 11.8.
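
A check of this kind might look like the sketch below. It is framework-agnostic and operates on nested Python lists (for example, hidden states converted with `tensor.tolist()`). The helper names are hypothetical, and this is not the script used for the verification reported above.

```python
def max_abs_diff(a, b):
    """Largest absolute element-wise difference between two nested lists."""
    if isinstance(a, (list, tuple)):
        if not isinstance(b, (list, tuple)) or len(a) != len(b):
            raise ValueError("shape mismatch between reference and ported outputs")
        return max((max_abs_diff(x, y) for x, y in zip(a, b)), default=0.0)
    return abs(a - b)


def check_parity(ref_levels, new_levels, tol=1e-5, expected_levels=7):
    """Compare hidden states level by level (embedding + 6 transformer layers).

    Returns the per-level max abs diff and whether every level is below `tol`.
    """
    if not (len(ref_levels) == len(new_levels) == expected_levels):
        raise ValueError(f"expected {expected_levels} representation levels")
    diffs = [max_abs_diff(r, n) for r, n in zip(ref_levels, new_levels)]
    return diffs, all(d < tol for d in diffs)


# Toy data: 7 levels, each one sequence of two 2-dimensional token vectors.
ref = [[[0.1, 0.2], [0.3, 0.4]] for _ in range(7)]
new = [[[0.1 + 1e-6, 0.2], [0.3, 0.4]] for _ in range(7)]
diffs, ok = check_parity(ref, new)
print(ok)
```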

Related Models

See the full SpliceBERT collection.

| Model | Context | Training data | Notes |
|---|---|---|---|
| SpliceBERT-1024nt | 1024 nt | 72 vertebrates | Variable-length; general purpose |
| SpliceBERT-510nt | 510 nt (fixed) | 72 vertebrates | Multi-species 510 nt |
| SpliceBERT-human-510nt | 510 nt (fixed) | Human only | This model |

Usage

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/SpliceBERT-human-510nt", trust_remote_code=True)
model = AutoModel.from_pretrained("Taykhoom/SpliceBERT-human-510nt", trust_remote_code=True)
model.eval()

# Sequence must be exactly 510 nt; tokenizer handles U->T automatically
seq = ("ATCGATCG" * 64)[:510]  # exactly 510 nt
enc = tokenizer(seq, return_tensors="pt")

with torch.no_grad():
    out = model(**enc, output_hidden_states=True)

hidden = out.last_hidden_state[0]  # (512, 512)
token_emb = hidden[1:-1]           # strip [CLS] and [SEP] -> (510, 512)
mean_emb = token_emb.mean(dim=0)   # (512,)

Fine-tuning

Fine-tuning follows standard Hugging Face conventions. For splice site prediction, the typical setup is token-level classification over all 510 nucleotide positions, excluding the [CLS] and [SEP] special tokens.
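
One way to exclude the special tokens during token-level fine-tuning is to pad the per-nucleotide labels with an ignore value at the [CLS] and [SEP] positions. The sketch below assumes a loss that skips label -100 (the default `ignore_index` of PyTorch's `CrossEntropyLoss`). The `align_labels` helper and the 0/1/2 label scheme are hypothetical.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def align_labels(per_nt_labels, seq_len=510):
    """Hypothetical helper: align per-nucleotide labels with the 512 model tokens.

    The [CLS] and [SEP] positions receive IGNORE_INDEX so they do not
    contribute to the loss.
    """
    if len(per_nt_labels) != seq_len:
        raise ValueError(f"expected {seq_len} labels, got {len(per_nt_labels)}")
    return [IGNORE_INDEX] + list(per_nt_labels) + [IGNORE_INDEX]


# Hypothetical label scheme: 0 = no site, 1 = donor, 2 = acceptor.
site_labels = [0] * 510
site_labels[100] = 1
site_labels[350] = 2
labels = align_labels(site_labels)
print(len(labels), labels[0], labels[101], labels[351], labels[-1])  # 512 -100 1 2 -100
```

Note the one-position shift: nucleotide index 100 lands at token index 101 because [CLS] occupies position 0.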

Implementation Notes

The original checkpoint was saved as BertForMaskedLM with transformers==4.18.0. This port uses BERT-updated, which adds attn_implementation="sdpa" and attn_implementation="flash_attention_2" support not present in the original codebase.
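
Selecting an attention backend at load time might look like the sketch below. It passes the `attn_implementation` argument of `from_pretrained` using the repository id from the Usage section. The `load_splicebert` wrapper is hypothetical, and whether each backend works in a given environment is an assumption (flash_attention_2, for instance, generally needs the flash-attn package and half-precision weights).

```python
SUPPORTED_BACKENDS = ("eager", "sdpa", "flash_attention_2")


def load_splicebert(attn_implementation="sdpa"):
    """Hypothetical wrapper: load the model with a chosen attention backend."""
    if attn_implementation not in SUPPORTED_BACKENDS:
        raise ValueError(
            f"unknown backend {attn_implementation!r}; expected one of {SUPPORTED_BACKENDS}"
        )
    # Imported lazily so the argument check above runs without transformers installed.
    from transformers import AutoModel

    return AutoModel.from_pretrained(
        "Taykhoom/SpliceBERT-human-510nt",
        trust_remote_code=True,
        attn_implementation=attn_implementation,
    )
```

Calling `load_splicebert("sdpa")` would download the checkpoint and return the model with the sdpa backend selected.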

Citation

@article{chen2024_splicebert,
  title   = {Self-supervised learning on millions of primary {RNA} sequences from 72 vertebrates improves sequence-based {RNA} splicing prediction},
  author  = {Chen, Ken and Zhou, Yue and Ding, Maolin and Wang, Yu and Ren, Zhixiang and Yang, Yuedong},
  journal = {Briefings in Bioinformatics},
  volume  = {25},
  number  = {3},
  pages   = {bbae163},
  year    = {2024},
  doi     = {10.1093/bib/bbae163}
}

Credits

Original model and code by Chen et al. Source: GitHub. The HF conversion code was authored primarily by Claude Code and reviewed manually by Taykhoom Dalal.

License

MIT, following the original repository.
