DNABERT-2

Weights and tokenizer for DNABERT-2 (Zhou et al., arXiv 2023), loaded with the shared MosaicBERT implementation from Taykhoom/MosaicBERT-updated.

DNABERT-2 is a foundation model trained on large-scale multi-species genome data. It replaces k-mer tokenization with Byte Pair Encoding (BPE), uses ALiBi positional biases instead of learned positional embeddings, and incorporates a GLU-based feed-forward network (FFN) for improved efficiency.
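To make the ALiBi idea concrete, here is a minimal, framework-free sketch of how per-head slopes and the resulting attention bias can be computed. It follows the slope recipe from the ALiBi paper and uses a symmetric distance penalty, which is an assumption about how a bidirectional encoder would apply it; the function names are illustrative and the remote model code may differ in detail.

```python
import math

def alibi_slopes(n_heads):
    """Per-head slopes following the recipe in the ALiBi paper."""
    def power_of_2_slopes(n):
        start = 2 ** (-(2 ** -(math.log2(n) - 3)))
        return [start ** (i + 1) for i in range(n)]

    if math.log2(n_heads).is_integer():
        return power_of_2_slopes(n_heads)
    # Non-power-of-2 head counts (such as 12): take the slopes for the
    # closest lower power of 2, then fill in from the next power of 2.
    closest = 2 ** math.floor(math.log2(n_heads))
    extra = power_of_2_slopes(2 * closest)[0::2][: n_heads - closest]
    return power_of_2_slopes(closest) + extra

def alibi_bias(n_heads, seq_len):
    """bias[h][i][j] = -slope_h * |i - j|, added to attention scores (assumed symmetric)."""
    return [
        [[-slope * abs(i - j) for j in range(seq_len)] for i in range(seq_len)]
        for slope in alibi_slopes(n_heads)
    ]

slopes = alibi_slopes(12)      # one slope per attention head
bias = alibi_bias(12, 5)       # shape: 12 heads x 5 x 5
```

Because the bias depends only on the distance between positions, it can be recomputed for any sequence length, which is why no learned position table bounds the input size.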

This repo contains only weights and tokenizer files. The model code is loaded automatically from Taykhoom/MosaicBERT-updated via trust_remote_code=True.

Architecture

| Parameter | Value |
|---|---|
| Layers | 12 |
| Attention heads | 12 |
| Embedding dimension | 768 |
| Intermediate size | 3072 |
| Vocabulary size | 4096 (BPE) |
| Positional encoding | ALiBi (no hard length limit) |
| Max sequence length | ~10000 nt (practical; ALiBi resizes dynamically) |
| Parameters | ~117M |
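As a rough sanity check, the ~117M figure can be approximated from the dimensions listed above. The sketch below counts only the large weight matrices and assumes a GLU feed-forward layout with two input projections and one output projection; that layout is an assumption about the MosaicBERT implementation, and biases, LayerNorm weights, and any pooler are ignored.

```python
def approx_weight_count(vocab=4096, d_model=768, d_ff=3072, n_layers=12):
    """Approximate parameter count from the main weight matrices only."""
    embeddings = vocab * d_model
    attention = 4 * d_model * d_model   # Q, K, V, and output projections
    glu_ffn = 3 * d_model * d_ff        # assumed: gate + value projections in, one projection out
    return embeddings + n_layers * (attention + glu_ffn)

total = approx_weight_count()
print(f"{total / 1e6:.1f}M")  # prints 116.4M
```

The remaining gap to ~117M is consistent with the small terms left out of this estimate.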

Tokenization

Uses Byte Pair Encoding (BPE) tokenization via PreTrainedTokenizerFast. No k-mer pre-processing required.
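The difference from k-mer pre-processing can be illustrated with a toy example. The sketch below contrasts overlapping k-mer splitting with BPE-style merging; the merge list is hypothetical and chosen for illustration, not taken from the DNABERT-2 vocabulary, and the helper names are not part of any library.

```python
def kmer_tokens(seq, k):
    """Overlapping k-mers, as used by k-mer based tokenization."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def bpe_tokens(seq, merges):
    """Apply an ordered list of BPE merges to a sequence of single nucleotides."""
    tokens = list(seq)
    for left, right in merges:          # merges are applied in priority order
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

toy_merges = [("A", "T"), ("C", "G"), ("AT", "CG")]  # hypothetical merge table
print(kmer_tokens("ATCGATCG", 3))          # ['ATC', 'TCG', 'CGA', 'GAT', 'ATC', 'TCG']
print(bpe_tokens("ATCGATCG", toy_merges))  # ['ATCG', 'ATCG']
```

In this toy case BPE yields fewer, non-overlapping tokens of variable length. In practice the real tokenizer loaded through AutoTokenizer handles all of this.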

Pretraining

  • Objective: Masked Language Modeling
  • Data: Large-scale multi-species genome data (GRCh38 and others)
  • Source checkpoint: pytorch_model.bin from zhihan1996/DNABERT-2-117M
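For readers unfamiliar with the objective, the following is a simplified sketch of how masked-language-modeling inputs and labels are typically built. The 15% masking rate is the common BERT default and is assumed here, not confirmed by this card; `mask_id` and `special_ids` are placeholder values, and BERT's random-replacement and keep-as-is variants are left out for brevity.

```python
import random

IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def mask_tokens(token_ids, mask_id, special_ids=(), mask_prob=0.15, seed=None):
    """Replace a random subset of tokens with mask_id and build MLM labels."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in token_ids:
        if tok not in special_ids and rng.random() < mask_prob:
            inputs.append(mask_id)       # the model must reconstruct this position
            labels.append(tok)
        else:
            inputs.append(tok)           # unchanged position, excluded from the loss
            labels.append(IGNORE_INDEX)
    return inputs, labels

# Hypothetical token ids; 1 and 2 stand in for special tokens such as [CLS] and [SEP].
ids = [1, 57, 903, 12, 4001, 77, 2]
masked_inputs, labels = mask_tokens(ids, mask_id=4, special_ids={1, 2}, seed=0)
```

Only the masked positions contribute to the loss, so the model learns to predict tokens from their surrounding sequence context.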

Parity Verification

Hidden-state representations were verified to be identical (max abs diff = 0.00) to those of the original implementation at all 13 representation levels (embedding output + 12 transformer layers). The SDPA path was verified to within max abs diff < 1e-4. Verification was run on GPU with PyTorch 2.7 / CUDA 12.9.
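A check of this kind can be sketched as a per-layer comparison of the largest elementwise difference. The helper below is a framework-free illustration operating on nested lists; the function names are hypothetical, and it is not the script that produced the numbers above.

```python
def max_abs_diff(a, b):
    """Largest elementwise |a - b| over two equally shaped nested lists."""
    if isinstance(a, (int, float)):
        return abs(a - b)
    if len(a) != len(b):
        raise ValueError("shape mismatch")
    return max((max_abs_diff(x, y) for x, y in zip(a, b)), default=0.0)

def parity_report(reference_layers, ported_layers, tol=0.0):
    """Return (level_index, max_abs_diff, passed) for each representation level."""
    if len(reference_layers) != len(ported_layers):
        raise ValueError("different number of representation levels")
    report = []
    for idx, (ref, new) in enumerate(zip(reference_layers, ported_layers)):
        diff = max_abs_diff(ref, new)
        report.append((idx, diff, diff <= tol))
    return report

# Toy data standing in for two levels of hidden states.
reference = [[[0.1, 0.2], [0.3, 0.4]], [[1.0, 2.0], [3.0, 4.0]]]
ported = [[[0.1, 0.2], [0.3, 0.4]], [[1.0, 2.0], [3.0, 4.0]]]
print(parity_report(reference, ported))  # exact match at both levels
```

With real tensors, the same comparison is usually written as `(a - b).abs().max()` per layer, with `tol=0.0` for an exact-match check and a small tolerance such as `1e-4` for a numerically different attention kernel.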

Related Models

See the full DNABERT collection.

| Model | Architecture | Notes |
|---|---|---|
| DNABERT-3mer | BERT + k-mer | k=3 |
| DNABERT-4mer | BERT + k-mer | k=4 |
| DNABERT-5mer | BERT + k-mer | k=5 |
| DNABERT-6mer | BERT + k-mer | k=6 |
| DNABERT-2 | MosaicBERT + BPE + ALiBi | This model |
| DNABERT-S | MosaicBERT + BPE + ALiBi | Species-aware contrastive fine-tune |

Usage

Embedding generation

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/DNABERT2", trust_remote_code=True)
model = AutoModel.from_pretrained("Taykhoom/DNABERT2", trust_remote_code=True)
model.eval()

sequences = ["ACGTAGCATCGGATCTATCTATCGACACTTGG", "ATCGATCGATCGATCG"]
enc = tokenizer(sequences, return_tensors="pt", padding=True)

with torch.no_grad():
    out = model(**enc)

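cls_emb = out.last_hidden_state[:, 0, :]   # (batch, 768)

# Mean pooling over real tokens only; padding positions are masked out
mask = enc["attention_mask"].unsqueeze(-1).to(out.last_hidden_state.dtype)
mean_emb = (out.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)  # (batch, 768)

# Intermediate layers (index 0 is the embedding output, 1..12 are the transformer layers)
with torch.no_grad():
    out_all = model(**enc, output_hidden_states=True)
layer6_emb = out_all.hidden_states[6]      # output of transformer layer 6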

MLM logits

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/DNABERT2", trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained("Taykhoom/DNABERT2", trust_remote_code=True)
model.eval()

enc = tokenizer(["ACGTAGCAT[MASK]GGATCTATC"], return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits   # (1, seq_len, 4096)

Attention implementation

import torch
from transformers import AutoModel

# SDPA (default on PyTorch >= 2.0)
model = AutoModel.from_pretrained("Taykhoom/DNABERT2", trust_remote_code=True,
                                   attn_implementation="sdpa")

# Flash Attention 2 (requires the flash-attn package and a CUDA GPU)
model = AutoModel.from_pretrained("Taykhoom/DNABERT2", trust_remote_code=True,
                                   attn_implementation="flash_attention_2",
                                   torch_dtype=torch.bfloat16)

Implementation Notes

The original DNABERT-2 codebase uses a Triton-based flash attention implementation (flash_attn_triton.py). This HF port uses Taykhoom/MosaicBERT-updated, which replaces it with the standard flash-attn package and adds attn_implementation="sdpa" support. Neither the flash-attn backend nor SDPA support was part of the original codebase.

Citation

@misc{zhou2023_dnabert2,
  title   = {{DNABERT}-2: Efficient Foundation Model and Benchmark For Multi-Species Genome},
  author  = {Zhou, Zhihan and Ji, Yanrong and Li, Weijian and Dutta, Pratik and
             Davuluri, Ramana and Liu, Han},
  year    = {2023},
  eprint  = {2306.15006},
  archivePrefix = {arXiv},
  primaryClass  = {q-bio.GN}
}

Credits

Original DNABERT-2 model and code by Zhou et al. Source: GitHub. The HF conversion code was authored primarily by Claude Code and reviewed manually by Taykhoom Dalal.

License

MIT, following the original repository.
