DNA-Bio-LM-Codon-TPU
A Biologically Conditioned DNA Language Model (3-mer Codon Transformer)
Overview
DNA-Bio-LM-Codon-TPU is a transformer-based language model trained from scratch on real genomic sequences using biologically meaningful representations.
Unlike DNA language models that tokenize single nucleotides, this model combines:
- 3-mer (codon) tokenization
- Task-conditioned sequence generation
- Biological validation metrics
The model is trained on curated datasets from InstaDeepAI genomics benchmarks and focuses on learning real biological distributions rather than synthetic approximations.
Key Innovations
1. Codon-Level Tokenization
Uses 3-mer tokens (64 codons) instead of single nucleotides
Captures:
- Start codons (`ATG`)
- Stop codons (`TAA`, `TAG`, `TGA`)
- Regulatory patterns
Treats DNA as a biological language, not raw characters
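A minimal sketch of codon-level tokenization, assuming non-overlapping 3-mers read from the first base. The lexicographic id assignment in `CODON_TO_ID` and the helper `split_codons` are hypothetical; the released `codon_vocab.json` may order tokens differently, and this sketch does not handle 3-mers that contain `N`.

```python
from itertools import product

# Hypothetical vocabulary: the 64 codons in lexicographic order, ids 0..63.
# The ids in the released codon_vocab.json may differ.
CODONS = ["".join(p) for p in product("ACGT", repeat=3)]
CODON_TO_ID = {codon: i for i, codon in enumerate(CODONS)}


def split_codons(seq: str) -> list[str]:
    """Split a sequence into non-overlapping 3-mers, dropping a trailing partial codon."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


codons = split_codons("ATGGCCTAAGT")  # the trailing "GT" is dropped
ids = [CODON_TO_ID[c] for c in codons]
print(codons, ids)
```

An 11-base input yields three tokens here, which is where the roughly 3× reduction in sequence length comes from.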
2. Biological Task Conditioning
Sequences are conditioned with task tokens:
| Token | Meaning |
|---|---|
| `<ENHANCER>` | Enhancer regions |
| `<PROMOTER>` | Promoter regions |
| `<SPLICE>` | Splice sites |
| `<HISTONE>` | Histone modification regions |
Enables controlled DNA generation
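As an illustration of how a task token might be attached to a sequence, the sketch below prepends the task token and `<bos>` to a list of codons, following the token order used in the Usage section. The helper name `build_conditioned_tokens` is hypothetical.

```python
TASK_TOKENS = ["<ENHANCER>", "<PROMOTER>", "<SPLICE>", "<HISTONE>"]


def build_conditioned_tokens(task: str, codons: list[str]) -> list[str]:
    """Prefix a codon sequence with its task token followed by <bos>."""
    token = f"<{task.upper()}>"
    if token not in TASK_TOKENS:
        raise ValueError(f"unknown task: {task!r}")
    return [token, "<bos>", *codons]


print(build_conditioned_tokens("promoter", ["ATG", "GCC"]))
```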
3. No Synthetic Data
Only real genomic sequences used
Strict filtering:
- Valid bases: A, T, G, C, N
- Minimum length constraints
Prevents learning fake biological patterns
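The filtering rules could look roughly like the sketch below. The alphabet comes from the list above; the threshold `MIN_LENGTH = 30` is a placeholder assumption, since the actual minimum length is not stated here.

```python
VALID_BASES = set("ATGCN")
MIN_LENGTH = 30  # hypothetical threshold; the real value is not documented


def is_valid_sequence(seq: str, min_length: int = MIN_LENGTH) -> bool:
    """Keep a sequence only if it is long enough and uses the allowed alphabet."""
    seq = seq.upper()
    return len(seq) >= min_length and set(seq) <= VALID_BASES


raw = ["ATGC" * 10, "ATGC", "ATGX" * 10]
kept = [s for s in raw if is_valid_sequence(s)]
print(len(kept))
```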
4. Biological Evaluation Metrics
Model is evaluated using:
- GC content distribution
- ATG (start codon) frequency
- Stop codon frequency
- TATA box motif detection
- CpG island frequency
- Dinucleotide KL divergence
Measures biological realism, not just loss
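Several of these metrics reduce to simple counting. The sketch below shows one plausible way to compute GC content, in-frame codon frequency, and motif counts; the function names are hypothetical, and the exact definitions used for this model (for example, how CpG islands are detected) may differ.

```python
def gc_content(seq: str) -> float:
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def codon_frequency(seq: str, targets: set[str]) -> float:
    """Fraction of in-frame codons (frame 0) that belong to `targets`."""
    seq = seq.upper()
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    if not codons:
        return 0.0
    return sum(c in targets for c in codons) / len(codons)


def motif_count(seq: str, motif: str) -> int:
    """Count occurrences of `motif`, including overlapping ones."""
    seq = seq.upper()
    return sum(seq.startswith(motif, i) for i in range(len(seq) - len(motif) + 1))


sample = "ATGTAAGCC"
print(gc_content(sample))
print(codon_frequency(sample, {"ATG"}))
print(codon_frequency(sample, {"TAA", "TAG", "TGA"}))
print(motif_count("ACGCG", "CG"))  # CpG dinucleotide count
```

Computing the same statistics on generated and real sequences gives paired values that can then be compared.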
Model Architecture
- Base Architecture: GPT-style Transformer
- Framework: PyTorch
- Library: Hugging Face Transformers
Configuration
| Parameter | Value |
|---|---|
| Layers | 12 |
| Hidden Size | 512 |
| Attention Heads | 8 |
| FFN Size | 2048 |
| Max Sequence Length | 256 codons |
| Vocab Size | 73 tokens |
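For a sense of scale, the configuration above can be turned into a rough parameter count. This sketch assumes a GPT-2-style layout (learned positional embeddings, biases on all linear layers, two layer norms per block plus a final one, output head tied to the token embedding); none of these details are confirmed for this model, and `gpt_param_count` is a hypothetical helper.

```python
def gpt_param_count(n_layer: int, d_model: int, d_ff: int, n_ctx: int, vocab_size: int) -> int:
    """Approximate parameter count for a GPT-2-style decoder with a tied output head."""
    embeddings = vocab_size * d_model + n_ctx * d_model   # token + position tables
    attention = 4 * d_model * d_model + 4 * d_model       # Q, K, V, output projections
    ffn = 2 * d_model * d_ff + d_ff + d_model             # two linear layers with biases
    layer_norms = 2 * 2 * d_model                         # two norms per block (scale + shift)
    final_norm = 2 * d_model
    return embeddings + n_layer * (attention + ffn + layer_norms) + final_norm


total = gpt_param_count(n_layer=12, d_model=512, d_ff=2048, n_ctx=256, vocab_size=73)
head_dim = 512 // 8  # hidden size divided by attention heads
print(f"{total:,} parameters, head dimension {head_dim}")
```

Under these assumptions the model has roughly 38 million parameters, almost all of them in the transformer blocks, because the 73-token vocabulary keeps the embedding table tiny.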
Training Data
Source
- InstaDeepAI nucleotide transformer datasets
- Genomics long-range benchmarks
Processing
- Sequences converted to codon tokens
- Task labels mapped to conditioning tokens
- Chunked into fixed-length sequences
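The chunking step might be sketched as follows. It assumes that each chunk repeats the task token and `<bos>` prefix, that the 256-token limit includes that prefix, and that short final chunks are padded with a hypothetical `<pad>` token; the actual pipeline may instead drop or pack short chunks.

```python
def chunk_codons(codons: list[str], task_token: str, max_len: int = 256,
                 pad_token: str = "<pad>") -> list[list[str]]:
    """Split codons into fixed-length chunks, each prefixed with the task token and <bos>."""
    body_len = max_len - 2  # reserve two positions for the prefix
    chunks = []
    for start in range(0, len(codons), body_len):
        chunk = [task_token, "<bos>", *codons[start:start + body_len]]
        chunk += [pad_token] * (max_len - len(chunk))  # pad the final short chunk
        chunks.append(chunk)
    return chunks


chunks = chunk_codons(["ATG"] * 300, "<PROMOTER>")
print(len(chunks), [len(c) for c in chunks])
```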
Training Setup
| Component | Value |
|---|---|
| Hardware | TPU v5e-8 (Kaggle) |
| Precision | Mixed precision |
| Optimizer | AdamW |
| Scheduler | Cosine decay |
| Learning Rate | 3e-4 |
| Batch Size | 16 |
| Gradient Accumulation | 8 |
| Epochs | 5 |
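Two quantities follow from this table: the effective batch size per optimizer step and the learning rate at a given step. The sketch below assumes the batch size of 16 is the total per forward pass (whether it is per device or global is not stated), that there is no warmup, and that the cosine schedule decays to zero; `cosine_lr` is a hypothetical helper.

```python
import math

PEAK_LR = 3e-4
BATCH_SIZE = 16
GRAD_ACCUM = 8
EFFECTIVE_BATCH = BATCH_SIZE * GRAD_ACCUM  # sequences per optimizer step


def cosine_lr(step: int, total_steps: int, peak_lr: float = PEAK_LR, min_lr: float = 0.0) -> float:
    """Cosine decay from peak_lr at step 0 to min_lr at total_steps (no warmup)."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))


print(EFFECTIVE_BATCH)
for step in (0, 500, 1000):
    print(step, cosine_lr(step, total_steps=1000))
```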
Training Behavior
Initial loss ≈ log(73) ≈ 4.3
Target:
- Loss < 2.0 by epoch 3
Perplexity used as auxiliary metric
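The initial loss figure is what a model that predicts uniformly over the 73-token vocabulary would score, and perplexity is the exponential of the cross-entropy loss (natural log). A short check of both numbers:

```python
import math

VOCAB_SIZE = 73

# Cross-entropy of a uniform prediction over the vocabulary.
initial_loss = math.log(VOCAB_SIZE)


def perplexity(loss: float) -> float:
    """Perplexity for a cross-entropy loss measured in nats."""
    return math.exp(loss)


print(round(initial_loss, 2))       # about 4.29
print(round(perplexity(2.0), 2))    # about 7.39
```

A loss of 2.0 therefore corresponds to a perplexity of about 7.4, meaning the model is on average as uncertain as a uniform choice among roughly seven tokens.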
Evaluation
Metrics
| Metric | Purpose |
|---|---|
| Loss | Training convergence |
| Perplexity | Sequence prediction quality |
| GC content | Biological plausibility |
| Dinucleotide KL | Distribution similarity |
Biological Validation
The model compares generated sequences against real DNA:
- Lower KL divergence → better biological realism
- Motif frequencies compared to ground truth
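One way to compute a dinucleotide KL divergence is sketched below. It assumes overlapping dinucleotides over `ACGT`, add-one smoothing so that unseen pairs do not produce infinite values, and the direction KL(real || generated); the smoothing and direction used for this model are not documented, and the function names are hypothetical.

```python
import math
from collections import Counter
from itertools import product

DINUCS = ["".join(p) for p in product("ACGT", repeat=2)]


def dinucleotide_dist(seqs: list[str], pseudocount: float = 1.0) -> dict[str, float]:
    """Smoothed distribution over the 16 overlapping dinucleotides."""
    valid = set(DINUCS)
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        for i in range(len(seq) - 1):
            pair = seq[i:i + 2]
            if pair in valid:  # skips pairs that contain N
                counts[pair] += 1
    total = sum(counts.values()) + pseudocount * len(DINUCS)
    return {d: (counts[d] + pseudocount) / total for d in DINUCS}


def kl_divergence(p: dict[str, float], q: dict[str, float]) -> float:
    """KL(p || q) in nats over the dinucleotide alphabet."""
    return sum(p[d] * math.log(p[d] / q[d]) for d in DINUCS)


real = dinucleotide_dist(["ACGTACGTACGT"])
generated = dinucleotide_dist(["AAAAAAAAAAAA"])
print(kl_divergence(real, real), kl_divergence(real, generated))
```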
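Usage
Load Model
```python
from transformers import AutoModelForCausalLM

# Replace <your-username> with the account that hosts the checkpoint.
model = AutoModelForCausalLM.from_pretrained(
    "<your-username>/dna-bio-lm-codon-tpu"
)
```
Load Vocabulary
```python
import json

# codon_vocab.json maps each token string to its integer id.
with open("codon_vocab.json") as f:
    vocab = json.load(f)
```
Generate DNA
```python
import torch

# Example: generate a promoter sequence
task_token = vocab["<PROMOTER>"]
input_ids = torch.tensor([[task_token, vocab["<bos>"]]])  # shape (1, 2)
# Sampling settings are illustrative, not tuned values
output_ids = model.generate(input_ids, max_new_tokens=64, do_sample=True)
```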
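Generated ids still need to be mapped back to nucleotides. The sketch below inverts a token-to-id vocabulary and joins the codon tokens, skipping special tokens. It assumes special tokens are written in angle brackets and that an `<eos>` token marks the end of a sequence; both are assumptions about `codon_vocab.json`, and the toy vocabulary here is made up for illustration.

```python
def decode_ids(token_ids: list[int], vocab: dict[str, int]) -> str:
    """Convert token ids back to a nucleotide string, ignoring special tokens."""
    id_to_token = {idx: tok for tok, idx in vocab.items()}
    bases = []
    for tid in token_ids:
        token = id_to_token[tid]
        if token == "<eos>":  # assumed end-of-sequence marker
            break
        if token.startswith("<"):  # task tokens, <bos>, and other specials
            continue
        bases.append(token)
    return "".join(bases)


# Toy vocabulary for illustration only; real ids come from codon_vocab.json.
toy_vocab = {"<PROMOTER>": 0, "<bos>": 1, "<eos>": 2, "ATG": 3, "GCC": 4, "TAA": 5}
print(decode_ids([0, 1, 3, 4, 5, 2, 3], toy_vocab))
```

With a real checkpoint, the ids would come from `model.generate(...)`, for example `output_ids[0].tolist()`.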
Capabilities
- Generate biologically realistic DNA sequences
- Learn codon-level dependencies
- Capture motif patterns
- Condition generation on biological tasks
Limitations
- Not a full gene expression predictor
- No protein translation modeling
- Limited to sequence-level patterns
- Requires biological validation for real-world use
Risks & Ethical Considerations
Generated DNA may resemble real sequences
Not suitable for:
- clinical decisions
- genetic engineering
Must be used for research purposes only
Environmental Impact
- Hardware: TPU v5e-8
- Platform: Kaggle
- Training duration: Several hours
- Mixed precision reduces energy usage
Technical Insights
- Codon tokenization reduces sequence length by 3×
- Improves attention efficiency
- Enables larger context modeling
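The arithmetic behind these points, assuming standard full self-attention whose cost grows with the square of the number of tokens:

```python
MAX_CODONS = 256
BASES_PER_CODON = 3

# Nucleotides covered by one full context window.
nucleotides_covered = MAX_CODONS * BASES_PER_CODON


def attention_pairs(n_tokens: int) -> int:
    """Number of query-key pairs scored by full self-attention."""
    return n_tokens * n_tokens


# Cost of covering the same span with single-nucleotide tokens vs codon tokens.
ratio = attention_pairs(nucleotides_covered) / attention_pairs(MAX_CODONS)
print(nucleotides_covered, ratio)
```

A 256-codon window spans 768 nucleotides, and covering the same span one nucleotide at a time would need nine times as many attention scores per layer.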
Citation
@misc{dna_bio_lm_2026,
title={DNA-Bio-LM-Codon-TPU},
author={<your-name>},
year={2026},
note={Biologically conditioned DNA language model with codon tokenization}
}
Contact
- Author: praveen
- Hugging Face: https://huggingface.co/prav-974
Acknowledgements
- InstaDeepAI
- Hugging Face
- Open genomics research community
Summary
This model represents a shift from:
"DNA as text" → "DNA as structured biological language"
and introduces a more biologically grounded approach to genomic language modeling.