AIDO.DNA-7B
7B-parameter DNA foundation model from the AIDO (Artificial Intelligence-Driven Observatory) suite, trained on 10.6 billion nucleotides from 796 species. This is a standalone HuggingFace port that loads without the ModelGenerator package.
Architecture
| Parameter | Value |
|---|---|
| Layers | 32 |
| Attention heads | 32 |
| Embedding dimension | 4352 |
| Intermediate (MLP) size | 11584 |
| Vocabulary size | 16 |
| Positional encoding | RoPE (rotary_percent=1.0) |
| Normalization | LayerNorm |
| MLP activation | SwiGLU |
| Architecture | Pre-LN Transformer (BERT-style encoder) |
| Max sequence length | 4000 (training context; RoPE has no hard limit) |
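As a rough sanity check on the "7B" figure, the dimensions in the table can be turned into a parameter estimate. The sketch below assumes four square attention projections per layer (query, key, value, output) and three SwiGLU matrices (gate, up, down), and it ignores biases, normalization weights, and the MLM head. These layout details are assumptions for illustration and were not read from the checkpoint.

```python
# Back-of-the-envelope parameter estimate from the architecture table.
# Assumed layout (not confirmed against the checkpoint): Q/K/V/output
# projections of size d x d, and a SwiGLU MLP with gate, up, and down matrices.
LAYERS = 32
HEADS = 32
D_MODEL = 4352
D_FF = 11584
VOCAB = 16

head_dim = D_MODEL // HEADS              # width of each attention head
attn_params = 4 * D_MODEL * D_MODEL      # Q, K, V, and output projections
mlp_params = 3 * D_MODEL * D_FF          # gate, up, and down matrices
per_layer = attn_params + mlp_params
embedding_params = VOCAB * D_MODEL       # token embedding table
total = LAYERS * per_layer + embedding_params

print(f"head_dim      = {head_dim}")
print(f"per layer     = {per_layer:,}")
print(f"approx. total = {total:,} ({total / 1e9:.2f}B)")
```

Under these assumptions the estimate comes to roughly 7.26B parameters, consistent with the model's name. The token embedding table is negligible because the vocabulary has only 16 entries.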
Vocabulary: [PAD], [MASK], [CLS], [SEP], [UNK], A, G, C, T, U, N,
[BOS], [EOS], [UNUSED1], [UNUSED2], [UNUSED3]
DNA sequences use single-nucleotide tokenization over A, C, G, T, N. Each sequence is
wrapped as [CLS] ... [SEP].
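To make the tokenization concrete, here is a minimal pure-Python sketch of the mapping. It assumes token ids follow the order of the vocabulary listing (consistent with the id of 9 this card gives for U) and that characters outside A, C, G, T, N map to [UNK]. Uppercasing the input is a choice of this sketch. encode_dna is a hypothetical helper, not part of the port; use the real tokenizer in practice.

```python
# Vocabulary in the order listed on this card; ids are assumed to be list positions.
VOCAB = ["[PAD]", "[MASK]", "[CLS]", "[SEP]", "[UNK]", "A", "G", "C", "T", "U", "N",
         "[BOS]", "[EOS]", "[UNUSED1]", "[UNUSED2]", "[UNUSED3]"]
TOKEN_TO_ID = {token: index for index, token in enumerate(VOCAB)}
DNA_ALPHABET = frozenset("ACGTN")


def encode_dna(sequence: str) -> list[int]:
    """Hypothetical helper: one token per nucleotide, wrapped as [CLS] ... [SEP]."""
    unk = TOKEN_TO_ID["[UNK]"]
    body = [TOKEN_TO_ID[base] if base in DNA_ALPHABET else unk
            for base in sequence.upper()]
    return [TOKEN_TO_ID["[CLS]"], *body, TOKEN_TO_ID["[SEP]"]]


print(encode_dna("ACGTN"))
```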
Note on U: the vocabulary is the shared AIDO RNABert vocabulary, so a U token exists (id 9) and the tokenizer will accept it. However, AIDO.DNA was pretrained on DNA (A, C, G, T, N) and never saw U during training; its embedding row is effectively untrained (embedding norm ~1.77, in line with the unused special tokens, versus ~0.69-0.97 for the trained nucleotides A/G/C/T). Do not feed U to this model; use T for thymine. The token is retained only to keep vocab_size=16 consistent with the original weights.
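One way to guard against U reaching the model is to normalize sequences before tokenization. The helper below is a hypothetical sketch (to_dna_alphabet is not part of the port): it uppercases the sequence, rewrites U as T, and replaces any other character outside A, C, G, T, N with N. Mapping unexpected characters to N is an assumption of this sketch; rejecting such sequences may be preferable in some pipelines.

```python
DNA_ALPHABET = frozenset("ACGTN")


def to_dna_alphabet(sequence: str) -> str:
    """Hypothetical preprocessing step: return a sequence over A, C, G, T, N only."""
    cleaned = sequence.upper().replace("U", "T")  # uracil is written as T for this model
    # Anything still outside the DNA alphabet becomes N (assumption of this sketch).
    return "".join(base if base in DNA_ALPHABET else "N" for base in cleaned)


print(to_dna_alphabet("acgu"))
print(to_dna_alphabet("ACRT"))
```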
Pretraining
- Objective: Masked language modeling (MLM) on genomic DNA
- Data: 10.6B nucleotides from 796 genomes (Nucleotide Transformer dataset), single-nucleotide tokenization, 4000-nucleotide context
- Source checkpoint: genbio-ai/AIDO.DNA-7B
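For readers unfamiliar with the objective, the sketch below shows the general shape of MLM input corruption on token ids. The masking rate, the rule of always substituting [MASK], and the -100 ignore label are common conventions assumed here for illustration; this card does not state the settings used to pretrain AIDO.DNA. mask_tokens is a hypothetical helper, and the ids assume the vocabulary order listed above.

```python
import random

MASK_ID = 1                                        # [MASK], by vocabulary position
SPECIAL_IDS = {0, 1, 2, 3, 4, 11, 12, 13, 14, 15}  # non-nucleotide tokens, never masked
IGNORE_INDEX = -100                                # label conventionally skipped by the loss


def mask_tokens(input_ids, mask_prob=0.15, seed=0):
    """Hypothetical helper: return (masked_ids, labels) for one sequence."""
    rng = random.Random(seed)
    masked, labels = [], []
    for token in input_ids:
        if token not in SPECIAL_IDS and rng.random() < mask_prob:
            masked.append(MASK_ID)       # hide the nucleotide from the model
            labels.append(token)         # the model must recover the original id
        else:
            masked.append(token)
            labels.append(IGNORE_INDEX)  # position does not contribute to the loss
    return masked, labels


ids = [2, 5, 7, 6, 8, 5, 7, 6, 8, 3]     # [CLS] A C G T A C G T [SEP]
masked_ids, labels = mask_tokens(ids, mask_prob=0.5)
print(masked_ids)
print(labels)
```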
Checkpoint selection
The 7B model is the largest released AIDO.DNA checkpoint and the most accurate for functional genomics, genome mining, and unsupervised annotation. For lighter-weight deployment use Taykhoom/AIDO.DNA-300M.
Parity Verification
Hidden-state representations were compared against the original genbio-ai/AIDO.DNA-7B
weights (loaded into the genbio RNABertForMaskedLM reference) at all 33 representation
levels (embedding + 32 transformer layers). The embedding layer matches exactly, and the
final post-LayerNorm hidden state and MLM logits match within tight tolerance.
Intermediate layer differences are floating-point accumulation noise in the un-normalized
residual stream (relative error < 1e-6), normalized away by the final layer norm.
Attention weights over valid positions sum to 1, and padded keys receive zero probability.
Verified on PyTorch 2.7 / CUDA 12.
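The two kinds of check described above can be sketched in plain Python. relative_error and masked_softmax are illustrative helpers, not the verification script used for this port. The relative error is taken here as the L2 norm of the difference divided by the L2 norm of the reference, which is one common definition and an assumption of this sketch.

```python
import math


def relative_error(candidate, reference):
    """L2 norm of the difference, relative to the L2 norm of the reference."""
    diff = math.sqrt(sum((c - r) ** 2 for c, r in zip(candidate, reference)))
    norm = math.sqrt(sum(r ** 2 for r in reference))
    return diff / norm


def masked_softmax(scores, key_is_valid):
    """Softmax over one row of attention scores; padded keys get probability 0.

    Requires at least one valid key in the row.
    """
    top = max(s for s, ok in zip(scores, key_is_valid) if ok)  # for numerical stability
    exps = [math.exp(s - top) if ok else 0.0 for s, ok in zip(scores, key_is_valid)]
    total = sum(exps)
    return [e / total for e in exps]


# Toy stand-ins for a hidden-state vector from the reference and from the port.
reference = [0.5, -1.25, 2.0, 0.75]
candidate = [x * (1 + 1e-7) for x in reference]
print(relative_error(candidate, reference) < 1e-6)

# The last key is padding, so it must receive zero attention.
probs = masked_softmax([0.2, 1.5, -0.3, 9.9], [True, True, True, False])
print(probs)
```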
Related Models
See the full AIDO.DNA collection.
| Model | Parameters | Notes |
|---|---|---|
| Taykhoom/AIDO.DNA-300M | 300M | Smaller DNA variant |
| Taykhoom/AIDO.DNA-7B | 7B | This model |
Usage
Embedding generation
import torch
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("Taykhoom/AIDO.DNA-7B", trust_remote_code=True)
model = AutoModel.from_pretrained("Taykhoom/AIDO.DNA-7B", trust_remote_code=True)
model.eval()
sequences = ["ACGTACGTACGTACGT", "TTGCAACGTAGCTAGC"]
enc = tokenizer(sequences, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**enc)
cls_emb = out.last_hidden_state[:, 0, :]  # (batch, 4352) -- CLS token
token_emb = out.last_hidden_state  # (batch, seq_len, 4352)
# Intermediate layers: by HF convention hidden_states[0] is the embedding output,
# so hidden_states[i] is the output of transformer layer i
with torch.no_grad():
    out_all = model(**enc, output_hidden_states=True)
layer3_emb = out_all.hidden_states[3]
MLM logits
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained("Taykhoom/AIDO.DNA-7B", trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained("Taykhoom/AIDO.DNA-7B", trust_remote_code=True)
model.eval()
enc = tokenizer(["ACGT[MASK]CGTA"], return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits  # (1, seq_len, 16)
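To turn the logits at a masked position into a nucleotide call, one option is to restrict the softmax to the four trained bases. The sketch below works on a plain list of 16 logits (one row of the logits tensor) and assumes token ids follow the vocabulary order given on this card (A=5, G=6, C=7, T=8). predict_base is a hypothetical helper, and the example logits are made up.

```python
import math

BASE_IDS = {"A": 5, "G": 6, "C": 7, "T": 8}  # assumed from the vocabulary order


def predict_base(logit_row):
    """Return (base, probability) using a softmax restricted to A/G/C/T."""
    scores = {base: logit_row[index] for base, index in BASE_IDS.items()}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {base: math.exp(score - top) for base, score in scores.items()}
    total = sum(exps.values())
    best = max(exps, key=exps.get)
    return best, exps[best] / total


# Made-up logits for one position: only the C entry is raised.
example_row = [0.0] * 16
example_row[BASE_IDS["C"]] = 3.0
base, prob = predict_base(example_row)
print(base, round(prob, 3))
```

With the real model, the row would be taken from the logits tensor at the index where the input ids equal the tokenizer's mask token id, converted to a Python list.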
Fine-tuning
Standard HF conventions. Use cls_emb = out.last_hidden_state[:, 0, :] (CLS token) as
input to a task-specific head for sequence-level tasks.
Implementation Notes
The original genbio-ai/AIDO.DNA-7B checkpoint requires the
ModelGenerator package to load.
This port is a clean standalone re-implementation:
- All model logic is contained in modeling_aidodna.py and configuration_aidodna.py.
- attn_implementation="sdpa" and attn_implementation="flash_attention_2" are added (not present in the original genbio-ai implementation).
- Architecture: pre-LN Transformer with SwiGLU MLP and RoPE positional embeddings, identical to the AIDO.RNA family (RNABertForMaskedLM).
- Weights are stored as 6 sharded model-0000X-of-00006.safetensors files.
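An attention backend can be selected at load time through the standard attn_implementation argument of from_pretrained. The helper below is a hypothetical convenience wrapper that validates the choice against the two backends named on this card and builds the keyword arguments; passing None leaves the port's default in place. Whether flash_attention_2 actually works also depends on the installed flash-attn package, the GPU, and the dtype, none of which this sketch checks.

```python
# Backends this card says were added; extend the tuple if the port accepts others.
ADDED_BACKENDS = ("sdpa", "flash_attention_2")


def build_load_kwargs(attn_implementation=None):
    """Hypothetical helper: keyword arguments for from_pretrained."""
    kwargs = {"trust_remote_code": True}  # required because the port ships custom code
    if attn_implementation is not None:
        if attn_implementation not in ADDED_BACKENDS:
            raise ValueError(
                f"expected one of {ADDED_BACKENDS}, got {attn_implementation!r}"
            )
        kwargs["attn_implementation"] = attn_implementation
    return kwargs


print(build_load_kwargs("sdpa"))

# Usage (downloads the full checkpoint, so it is left commented out):
# from transformers import AutoModel
# model = AutoModel.from_pretrained("Taykhoom/AIDO.DNA-7B", **build_load_kwargs("sdpa"))
```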
Citation
@inproceedings{ellington2024_aidodna,
title = {Accurate and General {DNA} Representations Emerge from Genome Foundation Models at Scale},
author = {Ellington, Caleb N. and Sun, Ning and Ho, Nicholas and Tao, Tianhua and Mahbub, Sazan and Li, Dian and Zhuang, Yonghao and Wang, Hongyi and Song, Le and Xing, Eric P.},
booktitle = {NeurIPS 2024 Workshop on AI for New Drug Modalities},
year = {2024},
doi = {10.1101/2024.12.01.625444}
}
Credits
Original model and code by Ellington et al. Source: GitHub. The HF conversion code was authored primarily by Claude and reviewed manually by Taykhoom Dalal.
License
GenBio AI Community License, following the original repository. See LICENSE for details.