DNABERT-2
Weights and tokenizer for DNABERT-2 (Zhou et al., arXiv 2023), loaded with the shared MosaicBERT implementation from Taykhoom/MosaicBERT-updated.
DNABERT-2 is a foundation model trained on large-scale multi-species genome data. It replaces k-mer tokenization with Byte Pair Encoding (BPE), uses ALiBi positional biases instead of learned embeddings, and incorporates a GLU-based FFN for improved efficiency.
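To make the ALiBi idea concrete, the sketch below builds the per-head distance penalty in plain Python. It assumes the slope schedule from the ALiBi reference code and a symmetric |i - j| distance, as is common for bidirectional encoders. `alibi_slopes` and `alibi_bias` are illustrative names, not functions from this repo, and the actual model code may differ in detail.

```python
import math


def alibi_slopes(n_heads: int) -> list[float]:
    """Per-head slopes, following the schedule in the ALiBi reference code."""

    def power_of_two_slopes(n: int) -> list[float]:
        # Geometric sequence starting at 2^(-8/n) with the same ratio.
        start = 2.0 ** (-8.0 / n)
        return [start ** (i + 1) for i in range(n)]

    if math.log2(n_heads).is_integer():
        return power_of_two_slopes(n_heads)

    # For head counts that are not a power of two (e.g. 12), take the slopes
    # for the closest lower power of two and fill the remainder by taking
    # every other slope from the next power of two.
    closest = 2 ** math.floor(math.log2(n_heads))
    extra = power_of_two_slopes(2 * closest)[0::2][: n_heads - closest]
    return power_of_two_slopes(closest) + extra


def alibi_bias(seq_len: int, slope: float) -> list[list[float]]:
    """Additive attention bias for one head: -slope * |i - j| (assumed symmetric)."""
    return [[-slope * abs(i - j) for j in range(seq_len)] for i in range(seq_len)]


slopes = alibi_slopes(12)          # one slope per attention head
bias = alibi_bias(4, slopes[0])    # 4x4 penalty matrix for the first head
```

Because the bias depends only on the distance between tokens, it can be rebuilt for any sequence length, which is why no learned position table is needed.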
This repo contains only weights and tokenizer files. The model code is loaded
automatically from Taykhoom/MosaicBERT-updated via trust_remote_code=True.
Architecture
| Parameter | Value |
|---|---|
| Layers | 12 |
| Attention heads | 12 |
| Embedding dimension | 768 |
| Intermediate size | 3072 |
| Vocabulary size | 4096 (BPE) |
| Positional encoding | ALiBi (no hard length limit) |
| Max sequence length | ~10000 nt (practical; ALiBi resizes dynamically) |
| Parameters | ~117M |
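As a sanity check on the table, the parameter count can be approximated from the other rows. This is a back-of-envelope sketch, not a readout of the model: it assumes a fused QKV projection with bias, a GLU feed-forward whose input projection produces both gate and value halves without bias, two LayerNorms per layer, and a BERT-style pooler. `estimate_params` is a hypothetical helper.

```python
def estimate_params(layers=12, hidden=768, intermediate=3072, vocab=4096):
    """Rough parameter count under the assumed layer layout described above."""
    embeddings = vocab * hidden                  # token embedding table
    qkv = hidden * 3 * hidden + 3 * hidden       # fused Q, K, V projection + bias
    attn_out = hidden * hidden + hidden          # attention output projection + bias
    glu_in = hidden * 2 * intermediate           # gate + value halves, no bias (assumed)
    glu_out = intermediate * hidden + hidden     # feed-forward output projection + bias
    layer_norms = 2 * 2 * hidden                 # two LayerNorms, weight + bias each
    per_layer = qkv + attn_out + glu_in + glu_out + layer_norms
    pooler = hidden * hidden + hidden            # assumed BERT-style pooler
    return embeddings + layers * per_layer + pooler


total = estimate_params()
print(f"{total / 1e6:.1f}M parameters")
```

Under these assumptions the estimate lands close to the ~117M listed in the table, with the GLU feed-forward blocks accounting for most of the weights.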
Tokenization
Uses Byte Pair Encoding (BPE) tokenization via PreTrainedTokenizerFast.
No k-mer pre-processing required.
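For contrast, the sketch below shows the overlapping k-mer split used by the earlier k-mer DNABERT models next to a toy BPE merge pass. The merge list is made up for illustration and is not the learned DNABERT-2 vocabulary; `kmer_tokens` and `apply_bpe` are hypothetical helpers, not part of the tokenizer.

```python
def kmer_tokens(seq: str, k: int) -> list[str]:
    """Overlapping k-mers: the pre-processing step that BPE makes unnecessary."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def apply_bpe(seq: str, merges: list[tuple[str, str]]) -> list[str]:
    """Apply an ordered list of BPE merges to a sequence, starting from characters."""
    tokens = list(seq)
    for left, right in merges:
        merged, i = [], 0
        while i < len(tokens):
            # Merge each adjacent (left, right) pair, scanning left to right.
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens


kmers = kmer_tokens("ACGTAG", 3)
toy_merges = [("A", "T"), ("AT", "AT"), ("C", "G")]  # illustrative only
bpe = apply_bpe("ATATCG", toy_merges)
```

The k-mer split yields one token per nucleotide position, while BPE produces variable-length, non-overlapping tokens, so the same sequence is covered by fewer tokens.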
Pretraining
- Objective: Masked Language Modeling
- Data: Large-scale multi-species genome (GRCh38 and others)
- Source checkpoint: pytorch_model.bin from zhihan1996/DNABERT-2-117M
Parity Verification
Hidden-state representations verified identical (max abs diff = 0.00) to the original implementation at all 13 representation levels (embedding + 12 transformer layers). SDPA verified (max abs diff < 1e-4). Verified on GPU with PyTorch 2.7 / CUDA 12.9.
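A layer-by-layer comparison of this kind can be sketched as follows. How the original check was run is not described here, so this is only one plausible way to compute a max abs diff; `max_abs_diff_per_layer` is a hypothetical helper, and the tensors are random stand-ins rather than real model outputs.

```python
import torch


def max_abs_diff_per_layer(ref_states, new_states):
    """Return the max absolute difference for each pair of hidden-state tensors."""
    assert len(ref_states) == len(new_states), "layer counts differ"
    return [
        (r.float() - n.float()).abs().max().item()
        for r, n in zip(ref_states, new_states)
    ]


# Stand-in tensors: 13 levels (embedding + 12 layers) of shape (batch, seq, 768).
torch.manual_seed(0)
ref = [torch.randn(2, 5, 768) for _ in range(13)]
same = [t.clone() for t in ref]      # mimics an exact match
noisy = [t + 1e-5 for t in ref]      # mimics small numerical drift

exact_diffs = max_abs_diff_per_layer(ref, same)
noisy_diffs = max_abs_diff_per_layer(ref, noisy)
```

In a real check, `ref_states` and `new_states` would be the `hidden_states` tuples returned by the two implementations for the same input with `output_hidden_states=True`.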
Related Models
See the full DNABERT collection.
| Model | Architecture | Notes |
|---|---|---|
| DNABERT-3mer | BERT + k-mer | k=3 |
| DNABERT-4mer | BERT + k-mer | k=4 |
| DNABERT-5mer | BERT + k-mer | k=5 |
| DNABERT-6mer | BERT + k-mer | k=6 |
| DNABERT-2 | MosaicBERT + BPE + ALiBi | This model |
| DNABERT-S | MosaicBERT + BPE + ALiBi | Species-aware contrastive fine-tune |
Usage
Embedding generation
import torch
from transformers import AutoTokenizer, AutoModel

# trust_remote_code=True pulls the model code from Taykhoom/MosaicBERT-updated
tokenizer = AutoTokenizer.from_pretrained("Taykhoom/DNABERT2", trust_remote_code=True)
model = AutoModel.from_pretrained("Taykhoom/DNABERT2", trust_remote_code=True)
model.eval()

sequences = ["ACGTAGCATCGGATCTATCTATCGACACTTGG", "ATCGATCGATCGATCG"]
enc = tokenizer(sequences, return_tensors="pt", padding=True)

with torch.no_grad():
    out = model(**enc)
    cls_emb = out.last_hidden_state[:, 0, :]      # (batch, 768)
    # Mean over all positions; in a padded batch this includes padding tokens
    mean_emb = out.last_hidden_state.mean(dim=1)  # (batch, 768) -- mean pooling

    # Intermediate layers: 13 entries (embedding output + 12 transformer layers)
    out_all = model(**enc, output_hidden_states=True)
    layer6_emb = out_all.hidden_states[6]
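The plain `.mean(dim=1)` above averages over every position, including padding when the sequences in a batch differ in length. A mask-aware mean is a common alternative; the sketch below is one way to write it and is not part of this repo. `masked_mean` is a hypothetical helper, and the tensors are small stand-ins so the example runs without downloading the model.

```python
import torch


def masked_mean(last_hidden_state, attention_mask):
    """Average token embeddings over real (non-padding) positions only."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (batch, seq, 1)
    summed = (last_hidden_state * mask).sum(dim=1)                   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1.0)                          # avoid divide-by-zero
    return summed / counts


# Stand-in batch: one sequence, three positions, the last one is padding.
hidden = torch.tensor([[[1.0, 1.0], [3.0, 3.0], [100.0, 100.0]]])
mask = torch.tensor([[1, 1, 0]])
pooled = masked_mean(hidden, mask)  # the padded position is ignored
```

With the model loaded as above, this would be called as `masked_mean(out.last_hidden_state, enc["attention_mask"])`. Special tokens such as [CLS] and [SEP] still count as real positions under this mask.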
MLM logits
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained("Taykhoom/DNABERT2", trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained("Taykhoom/DNABERT2", trust_remote_code=True)
model.eval()
enc = tokenizer(["ACGTAGCAT[MASK]GGATCTATC"], return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits  # (1, seq_len, 4096)
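To turn the logits into predictions, one option is to select the masked positions and take the top-k tokens there. The sketch below uses stand-in tensors so it runs without the model; `top_k_at_mask` is a hypothetical helper and `MASK_ID` is a made-up value (the real id should be read from `tokenizer.mask_token_id`).

```python
import torch


def top_k_at_mask(logits, input_ids, mask_token_id, k=5):
    """Return (token_ids, probabilities) for the top-k predictions at each [MASK]."""
    mask_positions = (input_ids == mask_token_id).nonzero(as_tuple=True)
    mask_logits = logits[mask_positions]            # (num_masks, vocab)
    probs = torch.softmax(mask_logits, dim=-1)
    top = probs.topk(k, dim=-1)
    return top.indices, top.values


# Stand-in tensors; in practice use model(**enc).logits and enc["input_ids"].
MASK_ID = 4  # hypothetical id, not the real DNABERT-2 mask token id
input_ids = torch.tensor([[1, 10, MASK_ID, 11, 2]])
logits = torch.zeros(1, 5, 4096)
logits[0, 2, 42] = 10.0  # make token 42 the clear winner at the masked position
ids, probs = top_k_at_mask(logits, input_ids, MASK_ID, k=3)
```

The returned ids can be mapped back to BPE tokens with `tokenizer.convert_ids_to_tokens`.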
Attention implementation
import torch
from transformers import AutoModel

# SDPA (default on PyTorch >= 2.0)
model = AutoModel.from_pretrained("Taykhoom/DNABERT2", trust_remote_code=True,
                                  attn_implementation="sdpa")

# Flash Attention 2 (requires the flash-attn package; run in half precision)
model = AutoModel.from_pretrained("Taykhoom/DNABERT2", trust_remote_code=True,
                                  attn_implementation="flash_attention_2",
                                  torch_dtype=torch.bfloat16)
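When the same script has to run on machines with and without flash-attn installed, the backend can be chosen at runtime. The sketch below is one possible approach using only the standard library; `pick_attn_implementation` is a hypothetical helper, not part of this repo or of Transformers.

```python
import importlib.util


def pick_attn_implementation(cuda_available: bool) -> str:
    """Prefer Flash Attention 2 when flash-attn is importable and a GPU is present."""
    has_flash_attn = importlib.util.find_spec("flash_attn") is not None
    if cuda_available and has_flash_attn:
        return "flash_attention_2"
    return "sdpa"


# On a CPU-only machine this falls back to SDPA.
attn = pick_attn_implementation(cuda_available=False)
```

In practice, pass `torch.cuda.is_available()` as the argument and hand the result to `from_pretrained` as `attn_implementation=...`, adding `torch_dtype=torch.bfloat16` when Flash Attention 2 is selected.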
Implementation Notes
The original DNABERT-2 codebase uses a Triton-based flash attention implementation (flash_attn_triton.py). This HF port uses Taykhoom/MosaicBERT-updated, which replaces it with the standard flash-attn package and adds attn_implementation="sdpa" support. Neither was part of the original codebase.
Citation
@misc{zhou2023_dnabert2,
title = {{DNABERT}-2: Efficient Foundation Model and Benchmark For Multi-Species Genome},
author = {Zhou, Zhihan and Ji, Yanrong and Li, Weijian and Dutta, Pratik and
Davuluri, Ramana and Liu, Han},
year = {2023},
eprint = {2306.15006},
archivePrefix = {arXiv},
primaryClass = {q-bio.GN}
}
Credits
Original DNABERT-2 model and code by Zhou et al. Source: GitHub. The HF conversion code was authored primarily by Claude Code and reviewed manually by Taykhoom Dalal.
License
MIT, following the original repository.