SpliceBERT-human-510nt
SpliceBERT is a BERT-based RNA language model pre-trained on primary RNA sequences using a masked language modeling (MLM) objective. This human-specific 510nt variant is trained exclusively on fixed-length 510 nt fragments from human mRNA sequences.
WARNING: This model requires exactly 510 nt of input (excluding [CLS] and [SEP]). Sequences shorter or longer than 510 nt may produce incorrect outputs without fine-tuning. For general-purpose RNA embedding, use SpliceBERT-1024nt instead.
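Because the model expects exactly 510 nt, a small guard before tokenization can catch length mistakes early. The following is a minimal sketch only: `check_length` and `REQUIRED_LENGTH` are hypothetical names introduced here, not part of the model's code.

```python
REQUIRED_LENGTH = 510  # nucleotides, excluding [CLS] and [SEP]


def check_length(seq: str) -> str:
    """Return the sequence with surrounding whitespace removed, or raise if it is not 510 nt."""
    cleaned = seq.strip()
    if len(cleaned) != REQUIRED_LENGTH:
        raise ValueError(f"expected {REQUIRED_LENGTH} nt, got {len(cleaned)}")
    return cleaned
```

Raising an error rather than padding or truncating silently keeps the caller in control of how out-of-range sequences are handled.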
Architecture
| Parameter | Value |
|---|---|
| Layers | 6 |
| Attention heads | 16 |
| Embedding dimension | 512 |
| Intermediate dimension | 2048 |
| Vocabulary size | 10 |
| Positional encoding | Learned absolute |
| Architecture | BERT encoder |
| Max sequence length | 510 (fixed-length training) |
| Parameters | ~44M |
Vocabulary: [PAD]=0, [UNK]=1, [CLS]=2, [SEP]=3, [MASK]=4, N=5, A=6, C=7, G=8, T/U=9
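To make the ID layout concrete, the mapping above can be written out as a plain dictionary. This is an illustrative sketch, not the tokenizer shipped with the model; `VOCAB` and `encode` are hypothetical names, and sending unrecognized characters to `[UNK]` is an assumption.

```python
VOCAB = {
    "[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "[MASK]": 4,
    "N": 5, "A": 6, "C": 7, "G": 8, "T": 9,
}


def encode(seq: str) -> list[int]:
    """Map nucleotides to IDs and wrap them in [CLS] ... [SEP]."""
    # U shares ID 9 with T, so convert it before lookup.
    nucleotides = seq.upper().replace("U", "T")
    # Assumption: characters outside the vocabulary become [UNK].
    ids = [VOCAB.get(nt, VOCAB["[UNK]"]) for nt in nucleotides]
    return [VOCAB["[CLS]"]] + ids + [VOCAB["[SEP]"]]
```

For real inputs, load the tokenizer with `AutoTokenizer.from_pretrained` as shown in the Usage section.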
Pretraining
- Objective: Masked language modeling (MLM)
- Data: Human primary RNA sequences
- Sequence format: Single-nucleotide tokenization with spaces; U converted to T; fixed 510 nt fragments
- Source checkpoint: SpliceBERT-human.510nt/pytorch_model.bin (from zenodo:7995778)
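The sequence format described above can be sketched in a few lines. This is not the authors' preprocessing code: the function names are hypothetical, and cutting non-overlapping windows (`stride=510`) is an assumption, since the card does not say how the fragments were sampled.

```python
def to_pretraining_format(seq: str) -> str:
    """Convert U to T and separate single nucleotides with spaces."""
    return " ".join(seq.upper().replace("U", "T"))


def fixed_fragments(seq: str, length: int = 510, stride: int = 510) -> list[str]:
    """Cut a sequence into fixed-length fragments, dropping any shorter tail."""
    # Assumption: non-overlapping windows; lower the stride to get overlap.
    return [seq[i:i + length] for i in range(0, len(seq) - length + 1, stride)]
```

For example, `to_pretraining_format("acgu")` gives `"A C G T"`, and a 1100 nt sequence yields two 510 nt fragments with the remaining 80 nt discarded.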
Checkpoint selection
This human-only variant may outperform the multi-species 510nt model on human-specific splicing tasks. For cross-species generalization or variable-length sequences, use SpliceBERT-1024nt.
Parity Verification
Hidden-state representations were verified against the original checkpoint
(max abs diff < 1e-5) at all 7 representation levels (embedding output + 6 transformer layers),
for both the eager and sdpa attention backends.
Verification was run on GPU with PyTorch 2.7 / CUDA 11.8.
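A check of this kind can be sketched as a per-layer comparison of two sets of hidden states. This is an assumed reconstruction, not the script that was actually run; `max_abs_diff_per_layer` and `check_parity` are hypothetical helper names.

```python
import torch


def max_abs_diff_per_layer(ref_states, new_states):
    """Return the maximum absolute difference for each pair of hidden-state tensors."""
    if len(ref_states) != len(new_states):
        raise ValueError("the two models returned a different number of layers")
    return [
        (ref.float() - new.float()).abs().max().item()
        for ref, new in zip(ref_states, new_states)
    ]


def check_parity(ref_states, new_states, tol: float = 1e-5) -> bool:
    """True if every layer differs by less than tol."""
    return all(diff < tol for diff in max_abs_diff_per_layer(ref_states, new_states))
```

With `output_hidden_states=True`, a Hugging Face BERT encoder returns `hidden_states` as a tuple of the embedding output plus one tensor per layer, which for this model gives the 7 levels compared here.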
Related Models
See the full SpliceBERT collection.
| Model | Context | Training data | Notes |
|---|---|---|---|
| SpliceBERT-1024nt | 1024 nt | 72 vertebrates | Variable-length; general purpose |
| SpliceBERT-510nt | 510 nt (fixed) | 72 vertebrates | Multi-species 510 nt |
| SpliceBERT-human-510nt | 510 nt (fixed) | Human only | This model |
Usage
import torch
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("Taykhoom/SpliceBERT-human-510nt", trust_remote_code=True)
model = AutoModel.from_pretrained("Taykhoom/SpliceBERT-human-510nt", trust_remote_code=True)
model.eval()
# Sequence must be exactly 510 nt; tokenizer handles U->T automatically
seq = ("ATCGATCG" * 64)[:510] # exactly 510 nt
enc = tokenizer(seq, return_tensors="pt")
with torch.no_grad():
    out = model(**enc, output_hidden_states=True)
hidden = out.last_hidden_state[0] # (512, 512)
token_emb = hidden[1:-1] # strip [CLS] and [SEP] -> (510, 512)
mean_emb = token_emb.mean(dim=0) # (512,)
Fine-tuning
Fine-tuning follows standard Hugging Face conventions. For splice site prediction, the typical setup is token-level classification over all 510 nucleotide positions (excluding the [CLS] and [SEP] special tokens).
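A token-level head for this setup might look like the sketch below. It is an illustration under assumptions, not the authors' fine-tuning code: `SpliceSiteHead` is a hypothetical class, and `num_labels=3` (for example donor, acceptor, neither) and the dropout rate are placeholder choices.

```python
import torch
from torch import nn


class SpliceSiteHead(nn.Module):
    """Per-nucleotide classifier applied to the encoder's last hidden state."""

    def __init__(self, hidden_size: int = 512, num_labels: int = 3, dropout: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, 512, hidden_size), including [CLS] and [SEP].
        # Drop the first and last positions so only the 510 nucleotides remain.
        nucleotide_states = hidden_states[:, 1:-1, :]
        return self.classifier(self.dropout(nucleotide_states))
```

Passing `out.last_hidden_state` from the Usage example through this head produces logits of shape `(batch, 510, num_labels)`, which can be trained with a standard cross-entropy loss over positions.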
Implementation Notes
The original checkpoint was saved as BertForMaskedLM with transformers==4.18.0.
This port uses BERT-updated, which
adds attn_implementation="sdpa" and attn_implementation="flash_attention_2" support
not present in the original codebase.
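Selecting an attention backend at load time could look like the following sketch. `load_model` and `SUPPORTED_BACKENDS` are hypothetical names introduced here; `attn_implementation` is the standard `from_pretrained` argument in recent transformers releases, and this sketch assumes the port honors it as described above.

```python
SUPPORTED_BACKENDS = ("eager", "sdpa", "flash_attention_2")


def load_model(attn_implementation: str = "sdpa"):
    """Load the encoder with the requested attention backend."""
    if attn_implementation not in SUPPORTED_BACKENDS:
        raise ValueError(f"unknown attention backend: {attn_implementation!r}")
    # Imported lazily so the argument check above works without transformers installed.
    from transformers import AutoModel

    return AutoModel.from_pretrained(
        "Taykhoom/SpliceBERT-human-510nt",
        trust_remote_code=True,
        attn_implementation=attn_implementation,
    )
```

Note that `flash_attention_2` generally requires the separate flash-attn package, a supported GPU, and half-precision weights, so `sdpa` is usually the simpler choice.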
Citation
@article{chen2024_splicebert,
title = {Self-supervised learning on millions of primary {RNA} sequences from 72 vertebrates improves sequence-based {RNA} splicing prediction},
author = {Chen, Ken and Zhou, Yue and Ding, Maolin and Wang, Yu and Ren, Zhixiang and Yang, Yuedong},
journal = {Briefings in Bioinformatics},
volume = {25},
number = {3},
pages = {bbae163},
year = {2024},
doi = {10.1093/bib/bbae163}
}
Credits
Original model and code by Chen et al. Source: GitHub. The HF conversion code was authored primarily by Claude Code and reviewed manually by Taykhoom Dalal.
License
MIT, following the original repository.