metadata
title: BPE DNA Tokenizer
emoji: 🧬
colorFrom: green
colorTo: blue
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: false
license: mit

🧬 BPE DNA Tokenizer

An interactive demo of a Byte Pair Encoding (BPE) tokenizer trained on the E. coli K-12 genome.

🎯 Key Results

  • Vocabulary Size: 5,000 tokens
  • Compression Ratio: 5.208x (62.8% above requirement)
  • Dataset: E. coli K-12 genome (4.6M base pairs)
  • Lossless: 100% perfect reconstruction

✨ Features

  • 🧬 DNA-Optimized: Specifically designed for genomic sequences
  • 🚀 High Compression: Achieves 5.2x compression
  • 🔬 Biological Discovery: Learned tokens include codons, a TATA-box-like motif, and more
  • ✅ Lossless: Perfect encode-decode reconstruction
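
To make the BPE idea and the lossless property concrete, here is a minimal sketch in plain Python. It is not the code behind this Space: the function names (`train_bpe`, `apply_merge`, `encode`, `decode`) and the toy sequence are assumptions made for illustration, and the greedy most-frequent-pair rule shown is the textbook BPE procedure rather than a confirmed description of how the 5,000-token vocabulary was trained.

```python
from collections import Counter


def apply_merge(tokens, left, right):
    """Replace every adjacent (left, right) pair with the fused token left+right."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
            out.append(left + right)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def train_bpe(sequence, num_merges):
    """Learn up to num_merges merge rules from a DNA string (toy trainer)."""
    tokens = list(sequence)  # start from single bases
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (left, right), count = pairs.most_common(1)[0]
        if count < 2:  # nothing repeats any more, so stop early
            break
        merges.append((left, right))
        tokens = apply_merge(tokens, left, right)
    return merges


def encode(sequence, merges):
    """Tokenize by replaying the learned merges in training order."""
    tokens = list(sequence)
    for left, right in merges:
        tokens = apply_merge(tokens, left, right)
    return tokens


def decode(tokens):
    """Tokens are plain substrings, so decoding is concatenation."""
    return "".join(tokens)


toy = "ATGATGATGTAAATGATG"  # hypothetical toy input, not E. coli data
merges = train_bpe(toy, num_merges=10)
tokens = encode(toy, merges)
print(tokens, len(toy) / len(tokens))
```

Because every token is a literal substring of the input, `decode(encode(s, merges))` returns `s` for any input, which is what "lossless" means here.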

🔬 Discovered Patterns

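The learned vocabulary contains tokens that match known sequence motifs, even though training used no biological annotation:

  • Start Codon: ATG
  • Stop Codons: TAA, TAG
  • TATA Box: TATAA (in bacteria, compare the Pribnow box consensus TATAAT)
  • Shine-Dalgarno: AGGAGG
  • CpG-Containing Motif: GCGC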
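
One way to check a vocabulary for the motifs listed above is an exact-match lookup. The sketch below assumes the vocabulary is available as a list of token strings; `find_motifs`, `MOTIFS`, and the example vocabulary are hypothetical names and data for illustration, not part of this Space.

```python
# Motif strings taken from the list above; the labels are only dictionary keys.
MOTIFS = {
    "Start codon": ["ATG"],
    "Stop codons": ["TAA", "TAG"],
    "TATA box": ["TATAA"],
    "Shine-Dalgarno": ["AGGAGG"],
    "CpG": ["GCGC"],
}


def find_motifs(vocab, motifs=MOTIFS):
    """Return, per motif label, the motif strings that appear as whole tokens."""
    vocab_set = set(vocab)
    return {
        label: [seq for seq in seqs if seq in vocab_set]
        for label, seqs in motifs.items()
    }


# Hypothetical vocabulary fragment, NOT the real 5,000-token vocabulary.
example_vocab = ["A", "C", "G", "T", "ATG", "TAA", "GCGC", "CCTGG"]
print(find_motifs(example_vocab))
```

A match only shows that the string was frequent enough to earn its own token. It does not show that each occurrence plays the corresponding biological role.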

🚀 Try It Out

  1. Enter any DNA sequence (A, C, G, T, N)
  2. Click "Tokenize Sequence"
  3. See the compression statistics and token breakdown
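
If you prepare sequences in a script before pasting them in, a small cleaning step helps keep them within the accepted alphabet (A, C, G, T, N). This is a sketch under assumptions: `clean_sequence` is a hypothetical helper, and whether the app itself uppercases input or strips whitespace is not documented here.

```python
ALLOWED = set("ACGTN")


def clean_sequence(raw):
    """Strip whitespace, uppercase, and reject characters outside A, C, G, T, N."""
    seq = "".join(raw.split()).upper()
    bad = sorted(set(seq) - ALLOWED)
    if bad:
        raise ValueError(f"unsupported characters: {', '.join(bad)}")
    return seq


# Line breaks (as in FASTA bodies) and lowercase bases are normalized.
print(clean_sequence("atgaaa\ncgtnn"))
```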

📊 Model Details

  • Training Data: 4,641,652 base pairs
  • Compressed Size: 891,316 tokens
  • Training Time: 88 minutes
  • Longest Token: 26 bases
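
The reported compression ratio follows directly from the two sizes above: bases in, tokens out. The helper name `compression_ratio` is an assumption for illustration.

```python
def compression_ratio(n_bases, n_tokens):
    """Average number of bases represented by one token."""
    if n_tokens <= 0:
        raise ValueError("n_tokens must be positive")
    return n_bases / n_tokens


ratio = compression_ratio(4_641_652, 891_316)
print(f"{ratio:.3f}x")  # prints 5.208x
```

The same number can be read as the mean token length: on the training genome, each token covers a little over five bases on average.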

🔗 Links


Built for genomics and machine learning 🧬🤖