GenoLite Hybrid: DNA Pattern Classifier

Model Overview

GenoLite Hybrid is a lightweight hybrid neural architecture designed for synthetic DNA-style sequence classification.

The model classifies 64-token nucleotide sequences into 3 categories:

| Label | Meaning |
| --- | --- |
| OK | Natural / balanced legal pair distribution |
| MHAP | Highly repetitive or motif-dominant structure |
| PROBLEM | Contains illegal or anomalous pair structures |
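As a rough illustration of the input and output format, the sketch below encodes a 64-token sequence and the three labels as integers. The four-letter A/C/G/T alphabet, the integer IDs, and the names `VOCAB`, `LABELS`, and `encode` are assumptions made for this example; the model's actual tokenizer is not documented here.

```python
SEQ_LEN = 64  # fixed sequence length stated in this model card

# Assumed four-letter alphabet; the real tokenizer may differ.
VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3}

# Label names come from the model card; the integer IDs are an assumption.
LABELS = {"OK": 0, "MHAP": 1, "PROBLEM": 2}


def encode(sequence: str) -> list[int]:
    """Map a nucleotide string to token IDs, enforcing the fixed length."""
    if len(sequence) != SEQ_LEN:
        raise ValueError(f"expected {SEQ_LEN} tokens, got {len(sequence)}")
    try:
        return [VOCAB[base] for base in sequence.upper()]
    except KeyError as err:
        raise ValueError(f"unknown token: {err.args[0]!r}") from None


ids = encode("ACGT" * 16)
print(len(ids), ids[:8])  # 64 [0, 1, 2, 3, 0, 1, 2, 3]
```

Rejecting sequences of the wrong length up front matches the fixed 64-token input; variable-length support is listed only as a future idea.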

The project focuses on:

  • sequence pattern learning,
  • anomaly detection,
  • repetition sensitivity,
  • hidden illegal-pair detection,
  • hybrid expert behavior.

Architecture

The model uses a hybrid expert-style architecture:

| Component | Role |
| --- | --- |
| CNN | Local motif / repetition detection |
| GRU | Sequential pattern understanding |
| Transformer | Global token relationships |
| Mamba-style block | Long-range sequence dynamics |
| Fusion Layer | Expert aggregation |
| Classifier | Final prediction |

During testing, the experts showed signs of specialization:

  • CNN became highly active on repetitive sequences,
  • Transformer/Mamba contributed more strongly during hidden anomaly detection tasks.

Training Setup

| Parameter | Value |
| --- | --- |
| Sequence Length | 64 |
| Classes | 3 |
| Dataset Size | 9,000 |
| Epochs | 3 |
| Batch Size | 3 |
| Learning Rate | 1e-4 |
| Optimizer | AdamW |
| Device | CPU |
| Hardware | Intel i7-4700MQ / 8GB RAM |
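To make the scale of this run concrete, the values above can be collected into a small config object and used to derive the number of optimizer steps. This is only a sketch: the `TrainConfig` class and its field names are hypothetical, and the step count assumes all 9,000 samples are used for training with no held-out split.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    # Values copied from the training setup table.
    seq_len: int = 64
    num_classes: int = 3
    dataset_size: int = 9_000
    epochs: int = 3
    batch_size: int = 3
    learning_rate: float = 1e-4
    optimizer: str = "AdamW"
    device: str = "cpu"

    @property
    def steps_per_epoch(self) -> int:
        # Assumes every sample is used for training (no held-out split).
        return math.ceil(self.dataset_size / self.batch_size)

    @property
    def total_steps(self) -> int:
        return self.steps_per_epoch * self.epochs


cfg = TrainConfig()
print(cfg.steps_per_epoch, cfg.total_steps)  # 3000 9000
```

Under those assumptions the run amounts to about 9,000 small updates, which is consistent with training on a laptop-class CPU.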

Dataset Design

The dataset was fully synthetic and generated procedurally.

Each class included Easy, Medium, and Hard difficulty variants.

Key Dataset Features

  • controlled entropy variation,
  • repetition overlap between classes,
  • hidden illegal-pair injection,
  • motif dominance variation,
  • duplicate prevention,
  • partial sequence shuffling,
  • adversarial-style hard samples.

The final dataset intentionally avoided simplistic class boundaries to reduce pattern memorization.
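The generator itself is not published in this card, so the following is only a sketch of how procedural generation with hidden illegal-pair injection and duplicate prevention could look. It assumes that a "pair" means two adjacent tokens and that `ILLEGAL_PAIRS` is the set of forbidden pairs; that set, the A/C/G/T alphabet, and every function name are hypothetical. Difficulty variants, entropy control, and partial shuffling are not modeled.

```python
import random

ALPHABET = "ACGT"             # assumed alphabet
SEQ_LEN = 64                  # from the model card
ILLEGAL_PAIRS = {"CG", "TA"}  # hypothetical rule: forbidden adjacent pairs


def has_illegal_pair(seq: str) -> bool:
    return any(seq[i:i + 2] in ILLEGAL_PAIRS for i in range(len(seq) - 1))


def legal_sequence(rng: random.Random) -> str:
    """Random walk that never emits an illegal adjacent pair (OK class)."""
    seq = rng.choice(ALPHABET)
    while len(seq) < SEQ_LEN:
        options = [b for b in ALPHABET if seq[-1] + b not in ILLEGAL_PAIRS]
        seq += rng.choice(options)
    return seq


def repetitive_sequence(rng: random.Random) -> str:
    """Tile a short motif across the full length, retrying if illegal (MHAP)."""
    while True:
        motif = "".join(rng.choice(ALPHABET) for _ in range(rng.randint(2, 4)))
        seq = (motif * SEQ_LEN)[:SEQ_LEN]
        if not has_illegal_pair(seq):
            return seq


def problem_sequence(rng: random.Random) -> str:
    """Overwrite two tokens of a legal sequence with an illegal pair (PROBLEM)."""
    seq = legal_sequence(rng)
    pos = rng.randrange(SEQ_LEN - 1)
    pair = rng.choice(sorted(ILLEGAL_PAIRS))  # sorted for reproducibility
    return seq[:pos] + pair + seq[pos + 2:]


def build_dataset(per_class: int, seed: int = 0) -> list[tuple[str, str]]:
    rng = random.Random(seed)
    makers = {
        "OK": legal_sequence,
        "MHAP": repetitive_sequence,
        "PROBLEM": problem_sequence,
    }
    seen: set[str] = set()  # duplicate prevention across all classes
    rows: list[tuple[str, str]] = []
    for label, make in makers.items():
        count, attempts = 0, 0
        while count < per_class:
            attempts += 1
            if attempts > 100 * per_class:
                # The MHAP pool in this sketch is small; fail instead of looping.
                raise RuntimeError(f"could not find enough unique {label} samples")
            seq = make(rng)
            if seq in seen:
                continue
            seen.add(seq)
            rows.append((seq, label))
            count += 1
    return rows


rows = build_dataset(per_class=5, seed=42)
print(len(rows), rows[0][1], rows[-1][1])  # 15 OK PROBLEM
```

Because the tiled motifs here are only 2 to 4 tokens long, the pool of unique MHAP samples is limited to a few hundred; a real generator would need richer repetition patterns to fill 3,000 samples per class.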


Evaluation

The model was evaluated using:

  • unseen generated samples,
  • adversarial handcrafted sequences,
  • hidden illegal-pair tests,
  • repetition traps,
  • entropy-chaos tests,
  • human typo injections.

Observed Behavior

Strengths

  • Strong illegal-pair detection
  • Robust hidden anomaly detection
  • Good repetition awareness
  • Reduced false positives
  • Natural confidence calibration
  • Mixed-confidence outputs on borderline sequences

Example Behaviors

| Scenario | Model Behavior |
| --- | --- |
| Hidden illegal pair inside repetitive sequence | Detected successfully |
| Fully legal chaotic sequence | Usually classified as OK |
| Extremely repetitive but legal sequence | Classified as MHAP |
| Borderline sequences | Produced mixed confidence outputs |
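One way to read "mixed confidence outputs" is to look at the gap between the two most probable classes. The sketch below applies a softmax to three logits and flags a prediction as borderline when that gap is small. The label order, the 0.2 margin, and the example logits are all assumptions chosen for illustration, not values produced by the model.

```python
import math

LABELS = ["OK", "MHAP", "PROBLEM"]  # output order is an assumption


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def interpret(logits: list[float], margin: float = 0.2) -> tuple[str, float, bool]:
    """Return (label, confidence, is_borderline) for one prediction."""
    probs = softmax(logits)
    ranked = sorted(zip(probs, LABELS), reverse=True)
    (top_p, top_label), (second_p, _) = ranked[0], ranked[1]
    return top_label, top_p, (top_p - second_p) < margin


print(interpret([4.0, 0.5, 0.2]))   # clear-cut: one class dominates
print(interpret([1.1, 1.0, -0.5]))  # near-tie between the top two classes
```

The first call yields a confident OK prediction, while the second is flagged as borderline because the top two probabilities are nearly equal.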

Approximate Performance

The final model achieved roughly 98% or higher accuracy across custom adversarial tests and synthetic benchmark suites.

Note: This is not a formal biological benchmark and should not be interpreted as real-world genomic validation performance.


Important Disclaimer

This project is:

  • experimental,
  • educational,
  • synthetic-data based.

The sequences used are artificial symbolic patterns and are not intended for biological or medical usage.

This model should not be used for:

  • genomic research,
  • medical analysis,
  • biological decision-making,
  • real DNA interpretation.

Future Ideas

Possible future improvements:

  • variable-length sequence support,
  • true Mixture-of-Experts routing,
  • larger context windows,
  • contrastive representation learning,
  • real biological pretraining,
  • confidence-aware calibration,
  • visualization tools for expert activity.

Author Notes

This project was trained entirely on consumer hardware and evolved through iterative dataset engineering, adversarial testing, and architecture experimentation.

One of the most interesting observations was the emergence of:

  • hidden anomaly sensitivity,
  • feature competition,
  • and borderline confidence behavior,

despite the fully synthetic nature of the dataset.
