SpliceBERT-1024nt
SpliceBERT is a BERT-based RNA language model pre-trained on over 2 million vertebrate primary RNA sequences using a masked language modeling (MLM) objective. The 1024nt variant is trained on variable-length fragments (64-1024 nt) from 72 vertebrates.
Architecture
| Parameter | Value |
|---|---|
| Layers | 6 |
| Attention heads | 16 |
| Embedding dimension | 512 |
| Intermediate dimension | 2048 |
| Vocabulary size | 10 |
| Positional encoding | Learned absolute |
| Architecture | BERT encoder |
| Max sequence length | 1024 |
| Parameters | ~19.4M |
Vocabulary: [PAD]=0, [UNK]=1, [CLS]=2, [SEP]=3, [MASK]=4, N=5, A=6, C=7, G=8, T/U=9
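To make the ID mapping concrete, here is a minimal sketch of how a raw sequence could be turned into token IDs using the vocabulary above. `encode_rna` is a hypothetical helper written for illustration, not the tokenizer shipped with the model; upper-casing the input and sending unrecognized symbols to [UNK] are assumptions.

```python
# Token IDs as listed in the vocabulary line above (T and U share ID 9).
VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "[MASK]": 4,
         "N": 5, "A": 6, "C": 7, "G": 8, "T": 9}


def encode_rna(seq: str) -> list[int]:
    """Hypothetical helper: map a raw RNA/DNA string to token IDs."""
    # Assumption: input is upper-cased and U is folded into T.
    normalized = seq.upper().replace("U", "T")
    # Assumption: any symbol outside N/A/C/G/T falls back to [UNK].
    body = [VOCAB.get(base, VOCAB["[UNK]"]) for base in normalized]
    # One [CLS] in front and one [SEP] at the end, as in the usage examples.
    return [VOCAB["[CLS]"], *body, VOCAB["[SEP]"]]


ids = encode_rna("ACGUN")
print(ids)  # [2, 6, 7, 8, 9, 5, 3]
```

In practice the model's own tokenizer should be used; the sketch only shows what the IDs mean.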
Pretraining
- Objective: Masked language modeling (MLM)
- Data: >2 million vertebrate primary RNA sequences from 72 species
- Sequence format: Single-nucleotide tokenization with spaces; U converted to T
- Source checkpoint: SpliceBERT.1024nt/pytorch_model.bin (from zenodo:7995778)
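For readers unfamiliar with the MLM objective, the following sketch shows how one training example might be corrupted. The 15% selection rate and the 80/10/10 split are the conventional BERT defaults and are assumed here, not taken from the SpliceBERT training code. `mask_tokens` is a hypothetical helper, and restricting random replacements to A/C/G/T is likewise an assumption.

```python
import random

MASK_ID = 4                    # [MASK] in the vocabulary above
REPLACEMENT_IDS = [6, 7, 8, 9]  # A, C, G, T/U (assumed pool for random swaps)
IGNORE = -100                  # default ignore_index of PyTorch's cross-entropy loss


def mask_tokens(ids, mask_prob=0.15, seed=0):
    """Hypothetical helper: return (corrupted_ids, labels) for one sequence."""
    rng = random.Random(seed)
    corrupted, labels = list(ids), [IGNORE] * len(ids)
    for i, tok in enumerate(ids):
        # Special tokens (IDs 0-4) are never selected for prediction.
        if tok < 5 or rng.random() >= mask_prob:
            continue
        labels[i] = tok  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK_ID                      # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(REPLACEMENT_IDS)  # 10%: random nucleotide
        # remaining 10%: leave the token unchanged
    return corrupted, labels
```

Positions whose label is `IGNORE` do not contribute to the loss, so only the selected nucleotides are scored.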
Checkpoint selection
The 1024nt variant is the primary SpliceBERT model trained on variable-length vertebrate sequences. Use this variant for general-purpose RNA embedding. The 510nt variants are trained on fixed-length fragments and require exact 510nt input.
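Sequences longer than the context window have to be split before embedding. The sketch below cuts a sequence into windows of at most `max_nt` bases. `split_into_windows` and its `overlap` parameter are hypothetical, and whether [CLS] and [SEP] count toward the 1024 limit is not stated here, so `max_nt` may need to be lowered (for example to 1022) if they do.

```python
def split_into_windows(seq: str, max_nt: int = 1024, overlap: int = 0) -> list[str]:
    """Hypothetical helper: cut a sequence into windows of at most max_nt bases."""
    if not 0 <= overlap < max_nt:
        raise ValueError("overlap must satisfy 0 <= overlap < max_nt")
    step = max_nt - overlap
    windows = []
    for start in range(0, len(seq), step):
        windows.append(seq[start:start + max_nt])
        if start + max_nt >= len(seq):
            break  # this window already reaches the end of the sequence
    return windows


lengths = [len(w) for w in split_into_windows("A" * 2500)]
print(lengths)  # [1024, 1024, 452]
```

Because the 1024nt variant accepts variable-length input, the shorter final window needs no padding to a fixed size.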
Parity Verification
Hidden-state representations verified (max abs diff < 1e-5) against the original
checkpoint at all 7 representation levels (embedding + 6 transformer layers),
for both eager and sdpa attention backends.
Verified on GPU with PyTorch 2.7 / CUDA 11.8.
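The parity criterion can be illustrated with a small dependency-free sketch. It treats each representation level as a flat list of floats; an actual check would compare tensors from the two models. `max_abs_diff` and `check_parity` are hypothetical names, and the toy values are made up.

```python
def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two flat sequences."""
    if len(a) != len(b):
        raise ValueError("representations must have the same number of elements")
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)


def check_parity(reference_levels, ported_levels, tol=1e-5):
    """Hypothetical helper: per-level max abs diff plus an overall pass flag."""
    if len(reference_levels) != len(ported_levels):
        raise ValueError("both models must expose the same number of levels")
    diffs = [max_abs_diff(r, p) for r, p in zip(reference_levels, ported_levels)]
    return diffs, all(d < tol for d in diffs)


# Toy data: 7 levels (embedding + 6 layers), each with two made-up values.
reference = [[0.1 * i, 0.2 * i] for i in range(7)]
ported = [[x + 1e-7 for x in level] for level in reference]
diffs, ok = check_parity(reference, ported)
print(len(diffs), ok)  # 7 True
```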
Related Models
See the full SpliceBERT collection.
| Model | Context | Training data | Notes |
|---|---|---|---|
| SpliceBERT-1024nt | 1024 nt | 72 vertebrates | This model |
| SpliceBERT-510nt | 510 nt (fixed) | 72 vertebrates | Fixed-length; requires exact 510 nt input |
| SpliceBERT-human-510nt | 510 nt (fixed) | Human only | Human-specific; requires exact 510 nt input |
Usage
Embedding generation
The tokenizer automatically handles U->T conversion and single-nucleotide spacing. Pass raw sequences directly.
```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/SpliceBERT-1024nt", trust_remote_code=True)
model = AutoModel.from_pretrained("Taykhoom/SpliceBERT-1024nt", trust_remote_code=True)
model.eval()

seq = "ACGUACGUACGUACGU"  # U->T handled automatically
enc = tokenizer(seq, return_tensors="pt")
with torch.no_grad():
    out = model(**enc, output_hidden_states=True)

# Mean pooling over non-special tokens
hidden = out.last_hidden_state[0]  # (seq_len+2, 512)
token_emb = hidden[1:-1]           # strip [CLS] and [SEP]
mean_emb = token_emb.mean(dim=0)   # (512,)

# Intermediate layers
layer3_emb = out.hidden_states[3]  # (1, seq_len+2, 512)
```
MLM logits
```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/SpliceBERT-1024nt", trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained("Taykhoom/SpliceBERT-1024nt", trust_remote_code=True)
model.eval()

seq = "A C G [MASK] A C G T"
enc = tokenizer(seq, return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits  # (1, seq_len, 10)
```
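The 10 logits at each position correspond to the vocabulary entries listed under Architecture. As a rough sketch of how one position could be interpreted, the helper below applies a softmax and labels each probability with its token. `nucleotide_probs` is a hypothetical name and the logits are made-up numbers, not model output.

```python
import math

# Index i holds the token with ID i, following the vocabulary listed earlier.
ID_TO_TOKEN = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]", "N", "A", "C", "G", "T"]


def nucleotide_probs(logits_row):
    """Hypothetical helper: softmax over one position's logits, keyed by token."""
    if len(logits_row) != len(ID_TO_TOKEN):
        raise ValueError("expected one logit per vocabulary entry")
    peak = max(logits_row)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits_row]
    total = sum(exps)
    return {tok: e / total for tok, e in zip(ID_TO_TOKEN, exps)}


# Toy logits for a single masked position (illustrative values only).
row = [-5.0, -5.0, -5.0, -5.0, -5.0, -1.0, 0.5, 0.2, 0.1, 3.0]
probs = nucleotide_probs(row)
best = max(probs, key=probs.get)
print(best)  # T
```

With real output, the row would come from the logits at the index of the [MASK] token, for example `logits[0, pos].tolist()` once `pos` has been located in `enc["input_ids"]`.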
Fine-tuning
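Fine-tuning follows standard Hugging Face conventions. For sequence-level tasks, mean-pool the hidden states at the non-special token positions (everything between [CLS] and [SEP], i.e. `hidden[1:-1]` for a single unpadded sequence) and feed the result to a prediction head.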
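Slicing off the first and last position only works for a single unpadded sequence; in a padded batch the [PAD] positions must be excluded as well. The sketch below selects positions by token ID instead, using plain Python lists so the logic is easy to follow (a tensor version would apply a boolean mask). `mean_pool` is a hypothetical helper, and treating every ID below 5 as special, including [UNK], is an assumption.

```python
FIRST_NUCLEOTIDE_ID = 5  # IDs 0-4 are [PAD], [UNK], [CLS], [SEP], [MASK]


def mean_pool(hidden, input_ids):
    """Hypothetical helper: average the vectors whose token ID is a nucleotide."""
    kept = [vec for vec, tok in zip(hidden, input_ids) if tok >= FIRST_NUCLEOTIDE_ID]
    if not kept:
        raise ValueError("sequence contains no nucleotide tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Toy example: [CLS] A C [SEP] [PAD] [PAD] with 2-dimensional hidden states.
input_ids = [2, 6, 7, 3, 0, 0]
hidden = [[9.0, 9.0], [1.0, 2.0], [3.0, 4.0], [9.0, 9.0], [0.0, 0.0], [0.0, 0.0]]
print(mean_pool(hidden, input_ids))  # [2.0, 3.0]
```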
Implementation Notes
The original checkpoint was saved as `BertForMaskedLM` with `transformers==4.24.0`. This port uses BERT-updated, which adds `attn_implementation="sdpa"` and `attn_implementation="flash_attention_2"` support not present in the original codebase.

```python
model = AutoModel.from_pretrained(
    "Taykhoom/SpliceBERT-1024nt",
    trust_remote_code=True,
    attn_implementation="sdpa",
)
```
Citation
@article{chen2024_splicebert,
title = {Self-supervised learning on millions of primary {RNA} sequences from 72 vertebrates improves sequence-based {RNA} splicing prediction},
author = {Chen, Ken and Zhou, Yue and Ding, Maolin and Wang, Yu and Ren, Zhixiang and Yang, Yuedong},
journal = {Briefings in Bioinformatics},
volume = {25},
number = {3},
pages = {bbae163},
year = {2024},
doi = {10.1093/bib/bbae163}
}
Credits
Original model and code by Chen et al. Source: GitHub. The HF conversion code was authored primarily by Claude Code and reviewed manually by Taykhoom Dalal.
License
MIT, following the original repository.