🧬 DNA-Bio-LM-Codon-TPU

A Biologically Conditioned DNA Language Model (3-mer Codon Transformer)

🔍 Overview

DNA-Bio-LM-Codon-TPU is a transformer-based language model trained from scratch on real genomic sequences using biologically meaningful representations.

Unlike DNA language models that tokenize single nucleotides, this model combines:

3-mer (codon) tokenization
Task-conditioned sequence generation
Biological validation metrics

The model is trained on curated datasets from InstaDeepAI genomics benchmarks and focuses on learning real biological distributions rather than synthetic approximations.

🧠 Key Innovations

🧬 1. Codon-Level Tokenization

Uses 3-mer tokens (64 codons) instead of single nucleotides
Captures:
- Start codons (ATG)
- Stop codons (TAA, TAG, TGA)
- Regulatory patterns

👉 Treats DNA as a biological language, not raw characters
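
A minimal sketch of the tokenization described above, assuming non-overlapping 3-mers read from the first base. The vocabulary here is hypothetical (64 codons in lexicographic order plus an `<unk>` entry for triplets containing N); the released `codon_vocab.json` may assign different ids and handle N differently.

```python
from itertools import product

# Hypothetical vocabulary: the 64 codons in lexicographic order, then <unk>.
# The ids in the released codon_vocab.json may differ.
CODONS = ["".join(p) for p in product("ACGT", repeat=3)]
VOCAB = {codon: i for i, codon in enumerate(CODONS)}
VOCAB["<unk>"] = len(VOCAB)


def codon_tokenize(seq: str) -> list[int]:
    """Split a DNA string into non-overlapping 3-mers and map them to ids."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3  # drop a trailing partial codon
    codons = (seq[i:i + 3] for i in range(0, usable, 3))
    # Triplets outside the 64 codons (e.g. containing N) map to <unk>.
    return [VOCAB.get(codon, VOCAB["<unk>"]) for codon in codons]


ids = codon_tokenize("ATGGCCTAANN")  # ATG, GCC, TAA; the trailing "NN" is dropped
```

Dropping the trailing partial codon is one possible choice; padding it would be another.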

🎯 2. Biological Task Conditioning

Sequences are conditioned with task tokens:

| Token | Meaning |
| --- | --- |
| `<ENHANCER>` | Enhancer regions |
| `<PROMOTER>` | Promoter regions |
| `<SPLICE>` | Splice sites |
| `<HISTONE>` | Histone modification regions |

👉 Enables controlled DNA generation
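
As an illustration of how a conditioned input could be assembled, the sketch below prepends a task token and a beginning-of-sequence token to a list of codon ids. The numeric ids and the helper name `build_conditioned_input` are hypothetical; real ids come from `codon_vocab.json`.

```python
# Hypothetical ids for illustration; real ids come from codon_vocab.json.
TASK_TOKENS = {"<ENHANCER>": 0, "<PROMOTER>": 1, "<SPLICE>": 2, "<HISTONE>": 3}
BOS_ID = 4  # assumed id of the <bos> token


def build_conditioned_input(task: str, codon_ids: list[int]) -> list[int]:
    """Return [task token, <bos>, codon ids...] for one sequence."""
    if task not in TASK_TOKENS:
        raise ValueError(f"unknown task token: {task}")
    return [TASK_TOKENS[task], BOS_ID, *codon_ids]


example = build_conditioned_input("<PROMOTER>", [10, 11])
```

The task-then-`<bos>` order mirrors the prompt used in the Usage section.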

🚫 3. No Synthetic Data

Only real genomic sequences are used
Strict filtering:
- Valid bases: A, T, G, C, N
- Minimum length constraints

👉 Prevents learning fake biological patterns
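
The filter above might look like the following sketch. The minimum length of 30 bases is a placeholder, since the actual threshold is not stated here, and `is_valid_sequence` is a hypothetical helper name.

```python
VALID_BASES = set("ATGCN")
MIN_LENGTH = 30  # placeholder; the real minimum length is not documented here


def is_valid_sequence(seq: str, min_length: int = MIN_LENGTH) -> bool:
    """Keep a sequence only if it is long enough and uses valid bases."""
    seq = seq.upper()
    return len(seq) >= min_length and set(seq) <= VALID_BASES


raw = ["ATGC" * 10, "ATGC" * 5, "ATGX" * 10]
kept = [s for s in raw if is_valid_sequence(s)]  # only the first passes
```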

📊 4. Biological Evaluation Metrics

The model is evaluated using:

GC content distribution
ATG (start codon) frequency
Stop codon frequency
TATA box motif detection
CpG island frequency
Dinucleotide KL divergence

👉 Measures biological realism, not just loss
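
Several of these metrics reduce to simple counting. The sketch below is one possible reading, not the evaluation code used for this model: it assumes in-frame codon frequencies, uses the `TATAAA` consensus as a stand-in for TATA box detection, and counts `CG` dinucleotides as a simplified proxy for CpG island frequency.

```python
def gc_content(seq: str) -> float:
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def codon_frequency(seq: str, codons: set[str]) -> float:
    """Fraction of in-frame (non-overlapping) triplets that are in `codons`."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3
    triplets = [seq[i:i + 3] for i in range(0, usable, 3)]
    if not triplets:
        return 0.0
    return sum(t in codons for t in triplets) / len(triplets)


def count_motif(seq: str, motif: str) -> int:
    """Number of (possibly overlapping) occurrences of `motif`."""
    seq = seq.upper()
    return sum(seq.startswith(motif, i) for i in range(len(seq) - len(motif) + 1))


sample = "ATGGCGTATAAATAA"
start_freq = codon_frequency(sample, {"ATG"})
stop_freq = codon_frequency(sample, {"TAA", "TAG", "TGA"})
tata_hits = count_motif(sample, "TATAAA")  # assumed TATA box consensus
cpg_hits = count_motif(sample, "CG")       # simplified CpG proxy
```

Computing the same statistics on real and generated sequences gives the paired values the comparison needs.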

🏗️ Model Architecture

Base Architecture: GPT-style Transformer
Framework: PyTorch
Library: Hugging Face Transformers

🔧 Configuration

| Parameter | Value |
| --- | --- |
| Layers | 12 |
| Hidden Size | 512 |
| Attention Heads | 8 |
| FFN Size | 2048 |
| Max Sequence Length | 256 codons |
| Vocab Size | 73 tokens |
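
As a rough sanity check, the configuration can be turned into a parameter estimate. The sketch assumes a GPT-2-style layout (biased linear layers, two layer norms per block, learned position embeddings for 256 positions, and an output head tied to the token embedding); the actual implementation may differ in these details.

```python
def estimate_gpt_params(n_layers: int, d_model: int, d_ffn: int,
                        vocab_size: int, max_positions: int) -> int:
    """Approximate parameter count of a GPT-2-style decoder (tied output head)."""
    attention = 4 * d_model * d_model + 4 * d_model   # Q, K, V, output weights + biases
    ffn = 2 * d_model * d_ffn + d_ffn + d_model       # two linear layers + biases
    layer_norms = 2 * 2 * d_model                     # two norms per block (scale + shift)
    per_layer = attention + ffn + layer_norms
    embeddings = (vocab_size + max_positions) * d_model
    final_norm = 2 * d_model
    return n_layers * per_layer + embeddings + final_norm


total = estimate_gpt_params(n_layers=12, d_model=512, d_ffn=2048,
                            vocab_size=73, max_positions=256)
print(f"{total / 1e6:.1f}M parameters")
```

Under these assumptions the estimate is about 38 million parameters, almost all of it in the transformer blocks rather than the small 73-token embedding table.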

📂 Training Data

Source

InstaDeepAI nucleotide transformer datasets
Genomics long-range benchmarks

Processing

Sequences converted to codon tokens
Task labels mapped to conditioning tokens
Chunked into fixed-length sequences

⚙️ Training Setup

| Component | Value |
| --- | --- |
| Hardware | TPU v5e-8 (Kaggle) |
| Precision | Mixed precision |
| Optimizer | AdamW |
| Scheduler | Cosine decay |
| Learning Rate | 3e-4 |
| Batch Size | 16 |
| Gradient Accumulation | 8 |
| Epochs | 5 |
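
Two consequences of this setup can be worked out directly. The sketch assumes the batch size of 16 is the global batch per forward pass and that the cosine schedule decays from the peak learning rate to zero without warmup; neither detail is stated above.

```python
import math


def cosine_lr(step: int, total_steps: int,
              peak_lr: float = 3e-4, min_lr: float = 0.0) -> float:
    """Cosine decay from peak_lr at step 0 to min_lr at total_steps (no warmup)."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))


# Sequences contributing to each optimizer step, if 16 is the global batch size.
effective_batch = 16 * 8

halfway_lr = cosine_lr(500, 1000)  # half the peak rate at the midpoint
```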

📈 Training Behavior

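Expected initial loss ≈ ln(73) ≈ 4.29, the cross-entropy of a uniform guess over the 73-token vocabulary
Target:
- Loss < 2.0 by epoch 3
Perplexity (the exponential of the loss) is used as an auxiliary metric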
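
The relationship between these numbers is plain arithmetic, sketched here under the usual convention that the loss is mean cross-entropy in nats:

```python
import math

VOCAB_SIZE = 73

# Cross-entropy of a model that assigns equal probability to every token.
uniform_loss = math.log(VOCAB_SIZE)


def perplexity(loss: float) -> float:
    """Perplexity corresponding to a mean cross-entropy loss in nats."""
    return math.exp(loss)


target_ppl = perplexity(2.0)  # perplexity at the loss target
```

A loss of 2.0 corresponds to a perplexity of about 7.4, compared with 73 for the uniform baseline.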

🧪 Evaluation

Metrics

| Metric | Purpose |
| --- | --- |
| Loss | Training convergence |
| Perplexity | Sequence prediction quality |
| GC content | Biological plausibility |
| Dinucleotide KL | Distribution similarity |

Biological Validation

Generated sequences are compared against real DNA:

Lower KL divergence → better biological realism
Motif frequencies compared to ground truth
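
One way to compute a dinucleotide KL divergence is sketched below. It is not necessarily the evaluation code used for this model: the direction KL(real || generated), the overlapping-pair counting, and the add-one smoothing are all assumptions.

```python
import math
from collections import Counter
from itertools import product

DINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=2)]


def dinucleotide_distribution(seqs, pseudocount=1.0):
    """Smoothed frequencies of the 16 overlapping A/C/G/T dinucleotides."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        counts.update(seq[i:i + 2] for i in range(len(seq) - 1))
    # Pairs containing N are ignored because only the 16 keys are read.
    total = sum(counts[d] + pseudocount for d in DINUCLEOTIDES)
    return {d: (counts[d] + pseudocount) / total for d in DINUCLEOTIDES}


def kl_divergence(p, q):
    """KL(p || q) in nats over the 16 dinucleotides."""
    return sum(p[d] * math.log(p[d] / q[d]) for d in DINUCLEOTIDES)


real = dinucleotide_distribution(["ATGCGCGTATAAACGT", "GGCCATATGCGC"])
generated = dinucleotide_distribution(["ATATATATATATATAT"])
score = kl_divergence(real, generated)  # 0 only when the distributions match
```

The pseudocount keeps every probability non-zero, so the logarithm stays defined even when a dinucleotide never appears in the generated set.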

🚀 Usage

🔹 Load Model

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "<your-username>/dna-bio-lm-codon-tpu"
)

🔹 Load Vocabulary

import json

with open("codon_vocab.json") as f:
    vocab = json.load(f)

🔹 Generate DNA

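import torch

# Example: generate a promoter sequence.
# The prompt is the task token followed by the beginning-of-sequence token.
task_token = vocab["<PROMOTER>"]
input_ids = torch.tensor([[task_token, vocab["<bos>"]]])

# Sample up to the 256-token context; the sampling settings are illustrative.
output_ids = model.generate(input_ids, max_length=256, do_sample=True, top_k=50)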
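
The ids returned by `model.generate()` still have to be mapped back to nucleotides. A possible decoding step is sketched below with a toy vocabulary; the `<eos>` token and the numeric ids are hypothetical, and the helper assumes every special token is wrapped in angle brackets, as the task tokens are.

```python
# Toy vocabulary for illustration only; load the real one from codon_vocab.json.
vocab = {"<PROMOTER>": 0, "<bos>": 1, "<eos>": 2, "ATG": 3, "GCC": 4, "TAA": 5}
id_to_token = {token_id: token for token, token_id in vocab.items()}


def decode_to_dna(token_ids) -> str:
    """Join codon tokens into a DNA string, skipping <...> special tokens."""
    tokens = (id_to_token[i] for i in token_ids)
    return "".join(t for t in tokens if not t.startswith("<"))


dna = decode_to_dna([0, 1, 3, 4, 5, 2])
```

With a tensor output, pass `output_ids[0].tolist()` so that the ids are plain integers.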

🧬 Capabilities

Generate biologically realistic DNA sequences
Learn codon-level dependencies
Capture motif patterns
Condition generation on biological tasks

⚠️ Limitations

Not a full gene expression predictor
No protein translation modeling
Limited to sequence-level patterns
Requires biological validation for real-world use

⚠️ Risks & Ethical Considerations

Generated DNA may resemble real sequences
Not suitable for:
- clinical decisions
- genetic engineering
Must be used for research purposes only

🌍 Environmental Impact

Hardware: TPU v5e-8
Platform: Kaggle
Training duration: Several hours
Mixed precision reduces energy usage

🔬 Technical Insights

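Codon tokenization reduces sequence length by 3× relative to single-nucleotide tokenization
Shorter sequences reduce self-attention cost, which grows quadratically with sequence length
The 256-token context therefore covers 768 bp of DNA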
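
The efficiency gain can be made concrete with a small calculation. This sketch counts query-key pairs in full self-attention for one head and one layer, ignoring constant factors and the non-attention parts of the network:

```python
def attention_pairs(n_tokens: int) -> int:
    """Query-key pairs scored by full self-attention over n_tokens."""
    return n_tokens * n_tokens


span_bp = 768                   # DNA span covered by one 256-codon context
nucleotide_tokens = span_bp     # one token per base
codon_tokens = span_bp // 3     # one token per 3-mer

ratio = attention_pairs(nucleotide_tokens) / attention_pairs(codon_tokens)
```

For the same 768 bp span, codon tokens need 9× fewer attention scores than single-nucleotide tokens.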

📚 Citation

@misc{dna_bio_lm_2026,
  title={DNA-Bio-LM-Codon-TPU},
  author={<your-name>},
  year={2026},
  note={Biologically conditioned DNA language model with codon tokenization}
}

📬 Contact

Author: praveen
Hugging Face: https://huggingface.co/prav-974

🙏 Acknowledgements

InstaDeepAI
Hugging Face
Open genomics research community

🔥 Summary

This model represents a shift from:

"DNA as text" → "DNA as structured biological language"

and introduces a more biologically grounded approach to genomic language modeling.

Model size: 38M params (Safetensors, tensor type F32)