AIDO.DNA-300M

300M-parameter DNA foundation model from the AIDO (Artificial Intelligence-Driven Observatory) suite. This is a standalone HuggingFace port that loads without the ModelGenerator package.

Architecture

Parameter	Value
Layers	24
Attention heads	16
Embedding dimension	1024
Intermediate (MLP) size	2688
Vocabulary size	16
Positional encoding	RoPE (rotary_percent=1.0)
Normalization	LayerNorm
MLP activation	SwiGLU
Architecture	Pre-LN Transformer (BERT-style encoder)
Max sequence length	4000 (training context; RoPE has no hard limit)
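
As a rough sanity check, the values in the table can be combined into a back-of-the-envelope parameter count. The sketch below assumes a layout the table does not spell out: four square attention projections (Q, K, V, output) and a three-matrix SwiGLU MLP (gate, up, down). It ignores biases, LayerNorm weights, and the MLM head, and the function name is illustrative only.

```python
def estimate_params(layers=24, d_model=1024, d_mlp=2688, vocab=16):
    """Approximate weight count, ignoring biases, LayerNorm and the MLM head."""
    attention = 4 * d_model * d_model   # Q, K, V and output projections (assumed layout)
    swiglu_mlp = 3 * d_model * d_mlp    # gate, up and down projections (assumed layout)
    embeddings = vocab * d_model        # token embedding table; RoPE adds no weights
    return layers * (attention + swiglu_mlp) + embeddings


total = estimate_params()
print(f"{total / 1e6:.1f}M parameters")
```

Under these assumptions the estimate lands just under 300M, consistent with the model's name.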

Vocabulary: [PAD], [MASK], [CLS], [SEP], [UNK], A, G, C, T, U, N, [BOS], [EOS], [UNUSED1], [UNUSED2], [UNUSED3]

DNA sequences use single-nucleotide tokenization over A, C, G, T, N. Each sequence is wrapped as [CLS] ... [SEP].
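
To make the tokenization concrete, here is a minimal pure-Python sketch of the encoding described above. It assumes that token ids follow the order of the vocabulary list and that any character outside the vocabulary maps to [UNK]; `encode_dna` is a hypothetical helper, not part of this port. In practice, use the tokenizer loaded through AutoTokenizer.

```python
VOCAB = ["[PAD]", "[MASK]", "[CLS]", "[SEP]", "[UNK]", "A", "G", "C", "T", "U", "N",
         "[BOS]", "[EOS]", "[UNUSED1]", "[UNUSED2]", "[UNUSED3]"]

# Assumption: ids are assigned in list order, so [PAD]=0, [MASK]=1, ..., A=5.
TOKEN_TO_ID = {token: idx for idx, token in enumerate(VOCAB)}


def encode_dna(seq):
    """Map each nucleotide to one token id and wrap the result as [CLS] ... [SEP]."""
    unk = TOKEN_TO_ID["[UNK]"]
    body = [TOKEN_TO_ID.get(base, unk) for base in seq]
    return [TOKEN_TO_ID["[CLS]"]] + body + [TOKEN_TO_ID["[SEP]"]]


print(encode_dna("ACGTN"))
```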

Note on U: the vocabulary is the shared AIDO RNABert vocabulary, so a U token exists (id 9) and the tokenizer will accept it. However, AIDO.DNA was pretrained on DNA (A, C, G, T, N) and never saw U during training, so its embedding row is effectively untrained (embedding norm ~1.77, in line with the unused special tokens, versus ~0.69-0.97 for the trained nucleotides A/G/C/T). Do not feed U to this model; use T for thymine. The token is retained only to keep vocab_size=16 consistent with the original weights.
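
One way to enforce this before tokenization is a small preprocessing step. The helper below is a hypothetical sketch, not part of this port: it upper-cases the input, converts U to T, and rejects any character outside A, C, G, T, N. Silently converting U is a design choice; a stricter pipeline could raise an error instead.

```python
ALLOWED_BASES = set("ACGTN")


def to_dna(seq):
    """Return an upper-case DNA string with U replaced by T, or raise on other symbols."""
    dna = seq.upper().replace("U", "T")
    unexpected = sorted(set(dna) - ALLOWED_BASES)
    if unexpected:
        raise ValueError(f"unsupported characters in sequence: {unexpected}")
    return dna


print(to_dna("acguacgu"))
```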

Pretraining

Objective: Masked language modeling (MLM) on genomic DNA
Data: Genomes from the Nucleotide Transformer dataset (single-nucleotide tokens)
Source checkpoint: genbio-ai/AIDO.DNA-300M

Checkpoint selection

The 300M model is the smaller of the two released AIDO.DNA checkpoints, suitable for fast embedding generation and fine-tuning on modest hardware. For maximum accuracy use Taykhoom/AIDO.DNA-7B.

Parity Verification

Hidden-state representations were compared against the original genbio-ai/AIDO.DNA-300M weights (loaded into the genbio RNABertForMaskedLM reference) at all 25 representation levels (embedding + 24 transformer layers). The embedding layer matches exactly, and the final post-LayerNorm hidden state and MLM logits match to within 4e-6. Intermediate-layer differences (up to ~2e-4) are floating-point accumulation noise in the un-normalized residual stream and are normalized away by the final LayerNorm. Verified on PyTorch 2.7 / CUDA 12.
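
A check of this kind can be sketched as follows. This is not the script used for the verification above; it is a minimal illustration that assumes both models were run with output_hidden_states=True on the same input. `max_abs_diff_per_layer` is a hypothetical helper, and the tensors here are synthetic stand-ins rather than real model outputs.

```python
import torch


def max_abs_diff_per_layer(hidden_a, hidden_b):
    """Return the largest absolute difference for each pair of hidden-state tensors."""
    if len(hidden_a) != len(hidden_b):
        raise ValueError("the two models returned a different number of layers")
    return [(a.float() - b.float()).abs().max().item()
            for a, b in zip(hidden_a, hidden_b)]


# Synthetic stand-in for 25 representation levels (embedding + 24 layers).
torch.manual_seed(0)
reference = [torch.randn(1, 8, 1024) for _ in range(25)]
ported = [h + 1e-6 * torch.randn_like(h) for h in reference]
ported[0] = reference[0].clone()  # pretend the embedding layer matches exactly

diffs = max_abs_diff_per_layer(reference, ported)
for level, diff in enumerate(diffs[:3]):
    print(f"level {level}: max abs diff = {diff:.2e}")
```

With real models, `hidden_a` and `hidden_b` would be the `hidden_states` tuples returned by the reference and ported models.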

Related Models

See the full AIDO.DNA collection.

Model	Parameters	Notes
Taykhoom/AIDO.DNA-300M	300M	This model
Taykhoom/AIDO.DNA-7B	7B	Largest DNA variant

Usage

Embedding generation

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/AIDO.DNA-300M", trust_remote_code=True)
model = AutoModel.from_pretrained("Taykhoom/AIDO.DNA-300M", trust_remote_code=True)
model.eval()

sequences = ["ACGTACGTACGTACGT", "TTGCAACGTAGCTAGC"]
enc = tokenizer(sequences, return_tensors="pt", padding=True)

with torch.no_grad():
    out = model(**enc)

cls_emb   = out.last_hidden_state[:, 0, :]   # (batch, 1024) -- CLS token
token_emb = out.last_hidden_state             # (batch, seq_len, 1024)

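# Intermediate layers: hidden_states[0] is the embedding output,
# hidden_states[i] is the output of transformer layer i
with torch.no_grad():
    out_all = model(**enc, output_hidden_states=True)
layer3_emb = out_all.hidden_states[3]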
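
When a batch is padded, a sequence-level summary built by averaging token embeddings should skip [PAD] positions. The sketch below shows masked mean pooling as an alternative to the CLS embedding. It is a common general technique rather than something this card prescribes, and `masked_mean_pool` is a hypothetical helper. It expects tensors shaped like out.last_hidden_state and enc["attention_mask"], assuming the tokenizer returns an attention mask as HF tokenizers do by default. Note that [CLS] and [SEP] positions are included in the average.

```python
import torch


def masked_mean_pool(hidden, attention_mask):
    """Average token embeddings over positions where attention_mask == 1."""
    mask = attention_mask.unsqueeze(-1).to(hidden.dtype)  # (batch, seq_len, 1)
    summed = (hidden * mask).sum(dim=1)                   # (batch, hidden_size)
    counts = mask.sum(dim=1).clamp(min=1.0)               # avoid division by zero
    return summed / counts


# Synthetic stand-ins for out.last_hidden_state and enc["attention_mask"].
hidden = torch.randn(2, 6, 1024)
attention_mask = torch.tensor([[1, 1, 1, 1, 1, 1],
                               [1, 1, 1, 1, 0, 0]])
pooled = masked_mean_pool(hidden, attention_mask)
print(pooled.shape)
```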

MLM logits

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/AIDO.DNA-300M", trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained("Taykhoom/AIDO.DNA-300M", trust_remote_code=True)
model.eval()

enc = tokenizer(["ACGT[MASK]CGTA"], return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits   # (1, seq_len, 16)
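
To turn these logits into a prediction for the masked position, one option is to restrict the softmax to the four trained nucleotides. The sketch below assumes token ids follow the vocabulary order listed earlier ([MASK]=1, A=5, G=6, C=7, T=8); with the real tokenizer, prefer tokenizer.mask_token_id and tokenizer.convert_tokens_to_ids. `predict_masked_base` is a hypothetical helper, shown here on synthetic logits.

```python
import torch

# Assumption: ids follow the order of the vocabulary list in this card.
NUCLEOTIDE_IDS = {"A": 5, "G": 6, "C": 7, "T": 8}
MASK_ID = 1


def predict_masked_base(logits, input_ids, mask_id=MASK_ID):
    """Return {base: probability} for the first [MASK] of the first sequence."""
    positions = (input_ids[0] == mask_id).nonzero(as_tuple=True)[0]
    if positions.numel() == 0:
        raise ValueError("no [MASK] token found in input_ids")
    pos = positions[0].item()
    ids = torch.tensor(list(NUCLEOTIDE_IDS.values()))
    probs = torch.softmax(logits[0, pos, ids], dim=-1)  # renormalize over A/G/C/T only
    return dict(zip(NUCLEOTIDE_IDS.keys(), probs.tolist()))


# Synthetic example: [CLS] A C G T [MASK] C G T A [SEP]
input_ids = torch.tensor([[2, 5, 7, 6, 8, 1, 7, 6, 8, 5, 3]])
logits = torch.zeros(1, 11, 16)
logits[0, 5, 8] = 5.0  # make T the most likely base at the masked position
probs = predict_masked_base(logits, input_ids)
print(max(probs, key=probs.get))
```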

Fine-tuning

Fine-tuning follows standard HF conventions. For sequence-level tasks, use the CLS-token embedding, cls_emb = out.last_hidden_state[:, 0, :], as input to a task-specific head.
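
A task-specific head on the CLS embedding might look like the sketch below. `SequenceClassifier` and `StandInEncoder` are hypothetical names: the first is a minimal linear head, and the second is a tiny placeholder that mimics the encoder's output shape so the example runs without downloading weights. In real use, the encoder would be the model loaded with AutoModel.

```python
from types import SimpleNamespace

import torch
from torch import nn


class SequenceClassifier(nn.Module):
    """Linear classification head on top of the CLS-token embedding."""

    def __init__(self, encoder, hidden_size=1024, num_labels=2):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(hidden_size, num_labels)

    def forward(self, input_ids, attention_mask=None):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls_emb = out.last_hidden_state[:, 0, :]  # (batch, hidden_size)
        return self.head(cls_emb)                 # (batch, num_labels)


class StandInEncoder(nn.Module):
    """Placeholder for the real encoder; only mimics the output shape."""

    def __init__(self, vocab_size=16, hidden_size=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)

    def forward(self, input_ids, attention_mask=None):
        return SimpleNamespace(last_hidden_state=self.embed(input_ids))


classifier = SequenceClassifier(StandInEncoder(), num_labels=3)
task_logits = classifier(torch.tensor([[2, 5, 7, 6, 8, 3]]))
print(task_logits.shape)
```

The head's output can be trained with a standard loss such as nn.CrossEntropyLoss; whether to freeze the encoder or fine-tune it end to end depends on the task and available hardware.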

Implementation Notes

The original genbio-ai/AIDO.DNA-300M checkpoint requires the ModelGenerator package to load. This port is a clean standalone re-implementation:

All model logic is contained in modeling_aidodna.py and configuration_aidodna.py.
attn_implementation="sdpa" and attn_implementation="flash_attention_2" are supported; neither is available in the original genbio-ai implementation.
Architecture: pre-LN Transformer with SwiGLU MLP and RoPE positional embeddings, identical to the AIDO.RNA family (RNABertForMaskedLM).
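
Selecting an attention backend at load time could be sketched as below. `attention_load_kwargs` is a hypothetical helper that only builds keyword arguments for from_pretrained. The choice of bfloat16 for FlashAttention-2 is an assumption made here because those kernels require half precision; they also need the flash-attn package and a supported CUDA GPU.

```python
import torch


def attention_load_kwargs(backend="sdpa"):
    """Build from_pretrained keyword arguments for one of the added attention backends."""
    if backend not in {"sdpa", "flash_attention_2"}:
        raise ValueError(f"unexpected attention backend: {backend!r}")
    kwargs = {"trust_remote_code": True, "attn_implementation": backend}
    if backend == "flash_attention_2":
        # FlashAttention-2 kernels only run in fp16/bf16 (bf16 is an assumed default here).
        # Newer transformers releases may prefer the name `dtype` for this argument.
        kwargs["torch_dtype"] = torch.bfloat16
    return kwargs


print(attention_load_kwargs("sdpa"))

# Intended use (downloads weights, so it is left commented out):
# from transformers import AutoModel
# model = AutoModel.from_pretrained("Taykhoom/AIDO.DNA-300M", **attention_load_kwargs("sdpa"))
```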

Citation

@inproceedings{ellington2024_aidodna,
  title   = {Accurate and General {DNA} Representations Emerge from Genome Foundation Models at Scale},
  author  = {Ellington, Caleb N. and Sun, Ning and Ho, Nicholas and Tao, Tianhua and Mahbub, Sazan and Li, Dian and Zhuang, Yonghao and Wang, Hongyi and Song, Le and Xing, Eric P.},
  booktitle = {NeurIPS 2024 Workshop on AI for New Drug Modalities},
  year    = {2024},
  doi     = {10.1101/2024.12.01.625444}
}

Credits

Original model and code by Ellington et al. Source: GitHub. The HF conversion code was authored primarily by Claude and reviewed manually by Taykhoom Dalal.

License

GenBio AI Community License, following the original repository. See LICENSE for details.
