🧬 DNA-Bio-LM-Codon-TPU

A Biologically Conditioned DNA Language Model (3-mer Codon Transformer)


🔍 Overview

DNA-Bio-LM-Codon-TPU is a transformer-based language model trained from scratch on real genomic sequences using biologically meaningful representations.

Compared with character-level DNA models, this model combines:

  • 3-mer (codon) tokenization
  • Task-conditioned sequence generation
  • Biological validation metrics

The model is trained on curated datasets from InstaDeepAI genomics benchmarks and focuses on learning real biological distributions rather than synthetic approximations.


🧠 Key Innovations

🧬 1. Codon-Level Tokenization

  • Uses 3-mer tokens (64 codons) instead of single nucleotides

  • Captures:

    • Start codons (ATG)
    • Stop codons (TAA, TAG, TGA)
    • Regulatory patterns

👉 Treats DNA as a biological language, not raw characters
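
A minimal sketch of the tokenization described above, assuming non-overlapping 3-mers read from the first base, with a trailing remainder of one or two bases dropped. The model's actual preprocessing may differ, for example in how `N` bases or leftovers are handled. `codon_tokenize` is a hypothetical helper name.

```python
from itertools import product

# All 64 codons over the alphabet A, C, G, T.
CODONS = ["".join(p) for p in product("ACGT", repeat=3)]

def codon_tokenize(seq: str) -> list[str]:
    """Split a DNA string into non-overlapping 3-mers (assumed framing)."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3  # drop a trailing partial codon
    return [seq[i:i + 3] for i in range(0, usable, 3)]

tokens = codon_tokenize("ATGGCCTAAGT")
print(tokens)
```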


🎯 2. Biological Task Conditioning

Sequences are conditioned with task tokens:

| Token | Meaning |
| --- | --- |
| <ENHANCER> | Enhancer regions |
| <PROMOTER> | Promoter regions |
| <SPLICE> | Splice sites |
| <HISTONE> | Histone modification regions |

👉 Enables controlled DNA generation
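
As an illustration of how a task token might prefix a sequence, the sketch below places the task id first and `<bos>` second, the same order used in the Usage section. The helper names and the toy vocabulary ids are assumptions for illustration; the real ids come from `codon_vocab.json`.

```python
def build_prompt(task: str, vocab: dict[str, int]) -> list[int]:
    """Return [task token id, <bos> id] to start a conditioned sequence."""
    task_token = f"<{task.upper()}>"
    if task_token not in vocab:
        raise KeyError(f"unknown task token: {task_token}")
    return [vocab[task_token], vocab["<bos>"]]

def encode_conditioned(task: str, codons: list[str], vocab: dict[str, int]) -> list[int]:
    """Prefix the codon ids with the task-conditioning prompt."""
    return build_prompt(task, vocab) + [vocab[c] for c in codons]

# Toy vocabulary with made-up ids, for illustration only.
toy_vocab = {"<bos>": 0, "<ENHANCER>": 1, "<PROMOTER>": 2,
             "<SPLICE>": 3, "<HISTONE>": 4, "ATG": 5, "GCC": 6}
ids = encode_conditioned("promoter", ["ATG", "GCC"], toy_vocab)
print(ids)
```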


🚫 3. No Synthetic Data

  • Only real genomic sequences are used

  • Strict filtering:

    • Valid bases: A, T, G, C, N
    • Minimum length constraints

👉 Prevents learning fake biological patterns
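
The filtering rules above could look roughly like this. The base alphabet comes from the list above; the minimum length of 30 is a placeholder, since the card does not state the real threshold, and `keep_sequence` is a hypothetical name.

```python
VALID_BASES = set("ATGCN")
MIN_LENGTH = 30  # hypothetical threshold; the card does not state the real one

def keep_sequence(seq: str, min_length: int = MIN_LENGTH) -> bool:
    """Keep a sequence only if it is long enough and uses valid bases."""
    seq = seq.upper()
    return len(seq) >= min_length and set(seq) <= VALID_BASES

raw = ["ATGC" * 10, "ATGX" * 10, "ATG"]
clean = [s for s in raw if keep_sequence(s)]
print(len(clean))
```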


📊 4. Biological Evaluation Metrics

The model is evaluated using:

  • GC content distribution
  • ATG (start codon) frequency
  • Stop codon frequency
  • TATA box motif detection
  • CpG island frequency
  • Dinucleotide KL divergence

👉 Measures biological realism, not just loss
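
A rough sketch of how some of these statistics could be computed on a single sequence. It makes several simplifying assumptions that may not match the author's evaluation code: codon frequencies are counted in reading frame 0 only, the TATA box is approximated by the literal motif `TATAAA`, and CpG content is approximated by counting `CG` dinucleotides rather than detecting CpG islands.

```python
def gc_content(seq: str) -> float:
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0

def codon_frequency(seq: str, codons: set[str]) -> float:
    """Fraction of frame-0 codons that belong to `codons`."""
    seq = seq.upper()
    frames = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    return sum(c in codons for c in frames) / len(frames) if frames else 0.0

def motif_count(seq: str, motif: str) -> int:
    """Count (possibly overlapping) occurrences of `motif`."""
    seq = seq.upper()
    return sum(seq.startswith(motif, i) for i in range(len(seq) - len(motif) + 1))

seq = "ATGTATAAAGCGCGGTAA"
stats = {
    "gc": gc_content(seq),
    "start": codon_frequency(seq, {"ATG"}),
    "stop": codon_frequency(seq, {"TAA", "TAG", "TGA"}),
    "tata": motif_count(seq, "TATAAA"),
    "cpg": motif_count(seq, "CG"),
}
print(stats)
```

Comparing these values between generated and real sequences gives a quick plausibility check that complements the training loss.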


🏗️ Model Architecture

  • Base Architecture: GPT-style Transformer
  • Framework: PyTorch
  • Library: Hugging Face Transformers

🔧 Configuration

| Parameter | Value |
| --- | --- |
| Layers | 12 |
| Hidden Size | 512 |
| Attention Heads | 8 |
| FFN Size | 2048 |
| Max Sequence Length | 256 codons |
| Vocab Size | 73 tokens |
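
As a sanity check on this configuration, the parameter count can be estimated by hand. The sketch below assumes a GPT-2-style layout (learned position embeddings, biases on every linear layer, two layer norms per block plus a final one, and an output head tied to the token embedding). The card only says "GPT-style", so these details are assumptions.

```python
def gpt2_style_param_count(vocab_size, n_positions, d_model, n_layers, d_ffn):
    """Estimate parameters for an assumed GPT-2-style decoder."""
    embeddings = vocab_size * d_model + n_positions * d_model
    # Fused QKV projection plus output projection, each with bias.
    attention = (3 * d_model * d_model + 3 * d_model) + (d_model * d_model + d_model)
    # Two feed-forward linear layers, each with bias.
    mlp = (d_model * d_ffn + d_ffn) + (d_ffn * d_model + d_model)
    layer_norms = 2 * (2 * d_model)  # two norms per block, weight and bias
    final_norm = 2 * d_model
    return embeddings + n_layers * (attention + mlp + layer_norms) + final_norm

total = gpt2_style_param_count(
    vocab_size=73, n_positions=256, d_model=512, n_layers=12, d_ffn=2048
)
print(f"{total / 1e6:.1f}M parameters")
```

Under these assumptions the estimate comes out at roughly 38 million parameters, almost all of them in the transformer blocks rather than the small 73-token embedding.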

📂 Training Data

Source

  • InstaDeepAI nucleotide transformer datasets
  • Genomics long-range benchmarks

Processing

  • Sequences converted to codon tokens
  • Task labels mapped to conditioning tokens
  • Chunked into fixed-length sequences
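
The chunking step might be sketched as follows, assuming a chunk length of 256 tokens (the maximum sequence length) and that a short final chunk is dropped. Whether the task token is repeated at the start of every chunk is not stated, so this sketch leaves it out. `chunk_tokens` is a hypothetical helper.

```python
def chunk_tokens(token_ids: list[int], chunk_len: int = 256,
                 drop_last: bool = True) -> list[list[int]]:
    """Split a token list into fixed-length chunks."""
    chunks = [token_ids[i:i + chunk_len]
              for i in range(0, len(token_ids), chunk_len)]
    if drop_last and chunks and len(chunks[-1]) < chunk_len:
        chunks.pop()  # discard the incomplete tail (assumed behaviour)
    return chunks

chunks = chunk_tokens(list(range(600)))
print(len(chunks), len(chunks[0]))
```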

⚙️ Training Setup

| Component | Value |
| --- | --- |
| Hardware | TPU v5e-8 (Kaggle) |
| Precision | Mixed precision |
| Optimizer | AdamW |
| Scheduler | Cosine decay |
| Learning Rate | 3e-4 |
| Batch Size | 16 |
| Gradient Accumulation | 8 |
| Epochs | 5 |
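
To make the schedule concrete, here is a plain-Python sketch of cosine decay from the peak learning rate of 3e-4. It assumes no warmup and decay to zero, neither of which the card specifies. It also shows the effective batch size if 16 is the batch per optimizer micro-step and gradients are accumulated over 8 steps (whether 16 is per core or global is not stated).

```python
import math

def cosine_lr(step: int, total_steps: int,
              peak_lr: float = 3e-4, min_lr: float = 0.0) -> float:
    """Cosine decay from peak_lr to min_lr over total_steps (no warmup assumed)."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

effective_batch = 16 * 8  # batch size x gradient accumulation steps

for step in (0, 500, 1000):
    print(step, cosine_lr(step, total_steps=1000))
```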

📈 Training Behavior

  • Initial loss ≈ ln(73) ≈ 4.3, the cross-entropy of a uniform guess over the 73-token vocabulary

  • Target:

    • Loss < 2.0 by epoch 3
  • Perplexity used as auxiliary metric
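
The numbers above follow from the relation between cross-entropy loss (in nats) and perplexity, perplexity = exp(loss). A short worked check:

```python
import math

vocab_size = 73
initial_loss = math.log(vocab_size)   # cross-entropy of a uniform guess
initial_ppl = math.exp(initial_loss)  # equals the vocabulary size
target_ppl = math.exp(2.0)            # perplexity at the loss target of 2.0

print(f"{initial_loss:.2f} {initial_ppl:.0f} {target_ppl:.2f}")
```

A loss below 2.0 therefore corresponds to a perplexity below about 7.4, down from 73 at initialization.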


🧪 Evaluation

Metrics

| Metric | Purpose |
| --- | --- |
| Loss | Training convergence |
| Perplexity | Sequence prediction quality |
| GC content | Biological plausibility |
| Dinucleotide KL | Distribution similarity |

Biological Validation

Generated sequences are compared against real DNA:

  • Lower KL divergence → better biological realism
  • Motif frequencies compared to ground truth
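
One way the dinucleotide KL divergence could be computed is sketched below. The direction KL(real || generated), the add-one smoothing, and the choice to ignore dinucleotides containing `N` are all assumptions, not details confirmed by the card.

```python
import math
from collections import Counter
from itertools import product

DINUCS = ["".join(p) for p in product("ACGT", repeat=2)]

def dinucleotide_dist(seqs: list[str], pseudocount: float = 1.0) -> dict[str, float]:
    """Smoothed distribution over the 16 ACGT dinucleotides."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        counts.update(seq[i:i + 2] for i in range(len(seq) - 1))
    total = sum(counts[d] + pseudocount for d in DINUCS)
    return {d: (counts[d] + pseudocount) / total for d in DINUCS}

def kl_divergence(p: dict[str, float], q: dict[str, float]) -> float:
    """KL(p || q) in nats over the dinucleotide alphabet."""
    return sum(p[d] * math.log(p[d] / q[d]) for d in DINUCS)

real = dinucleotide_dist(["ATGCGCGTATAAAGC", "GGCATGCCGTA"])
generated = dinucleotide_dist(["ATATATATATATATA", "TTTTAAAATTTT"])
print(kl_divergence(real, generated))
```

A value near zero means the generated set reproduces the dinucleotide statistics of the reference set; larger values indicate a mismatch.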

🚀 Usage

🔹 Load Model

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "<your-username>/dna-bio-lm-codon-tpu"
)

🔹 Load Vocabulary

import json

with open("codon_vocab.json") as f:
    vocab = json.load(f)

🔹 Generate DNA

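import torch

# Example: generate a promoter sequence
task_token = vocab["<PROMOTER>"]
input_ids = torch.tensor([[task_token, vocab["<bos>"]]])

# Sample up to 256 tokens (the model's maximum sequence length)
output_ids = model.generate(input_ids, max_length=256, do_sample=True)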
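
To turn generated ids back into a DNA string, something like the following could be used. It assumes `codon_vocab.json` maps token strings to integer ids (as the lookups above suggest) and that every non-codon token is wrapped in angle brackets. The toy vocabulary and `decode_ids` are illustrative only.

```python
def decode_ids(ids: list[int], vocab: dict[str, int]) -> str:
    """Map ids back to tokens and join the codons, skipping <...> tokens."""
    id_to_token = {i: t for t, i in vocab.items()}
    tokens = [id_to_token[i] for i in ids]
    return "".join(t for t in tokens if not t.startswith("<"))

# Toy vocabulary with made-up ids, for illustration only.
toy_vocab = {"<bos>": 0, "<eos>": 1, "<PROMOTER>": 2, "ATG": 5, "GCC": 6, "TAA": 7}
dna = decode_ids([2, 0, 5, 6, 7, 1], toy_vocab)
print(dna)
```

With a tensor returned by `model.generate`, the ids for the first sequence can be obtained with `tensor[0].tolist()` before decoding.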

🧬 Capabilities

  • Generate biologically realistic DNA sequences
  • Learn codon-level dependencies
  • Capture motif patterns
  • Condition generation on biological tasks

⚠️ Limitations

  • Not a full gene expression predictor
  • No protein translation modeling
  • Limited to sequence-level patterns
  • Requires biological validation for real-world use

⚠️ Risks & Ethical Considerations

  • Generated DNA may resemble real sequences

  • Not suitable for:

    • clinical decisions
    • genetic engineering
  • Must be used for research purposes only


🌍 Environmental Impact

  • Hardware: TPU v5e-8
  • Platform: Kaggle
  • Training duration: Several hours
  • Mixed precision reduces energy usage

🔬 Technical Insights

  • Codon tokenization reduces sequence length by 3×
  • Improves attention efficiency
  • Enables larger context modeling
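
A quick arithmetic check of these points, assuming standard self-attention whose cost grows with the square of the sequence length:

```python
codons = 256                      # maximum sequence length in tokens
nucleotides = codons * 3          # bases covered by one context window
pairs_codon = codons ** 2         # attention score entries at codon level
pairs_nucleotide = nucleotides ** 2  # entries for the same span, per base

print(nucleotides, pairs_nucleotide // pairs_codon)
```

So a 256-token window spans 768 bases, and covering the same span one nucleotide at a time would need about 9 times as many attention score entries per head.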

📚 Citation

@misc{dna_bio_lm_2026,
  title={DNA-Bio-LM-Codon-TPU},
  author={<your-name>},
  year={2026},
  note={Biologically conditioned DNA language model with codon tokenization}
}

📬 Contact


🙏 Acknowledgements

  • InstaDeepAI
  • Hugging Face
  • Open genomics research community

🔥 Summary

This model represents a shift from:

"DNA as text" → "DNA as structured biological language"

and introduces a more biologically grounded approach to genomic language modeling.

Model size: 38M params (Safetensors, tensor type F32)