HantaBERT

Multi-Task Orthohantavirus classification by fine-tuning DNABERT-2.
One forward pass → species, host, and geographic origin, plus a 768-d phylogenetic embedding.

🌐 Web App  ·  ⚡ API Docs  ·  🤗 Model  ·  📊 Dataset  ·  💻 GitHub


What is HantaBERT?

Hantaviruses (genus Orthohantavirus) are segmented, negative-sense single-stranded RNA viruses that cause hemorrhagic fever with renal syndrome (HFRS) in Eurasia and hantavirus cardiopulmonary syndrome (HCPS) in the Americas, with mortality reaching ~40% in HCPS cases. Rapidly identifying the species, reservoir host, and geographic origin of a sequence is essential for surveillance, but BLAST searches and classical phylogenetic analysis are slow and address each attribute separately rather than jointly.

HantaBERT fine-tunes DNABERT-2 (117M parameters) into a multi-task model that emits class probabilities for all three tasks in a single forward pass. A shared 768-d bottleneck feeds three independent classification heads. Training uses a weighted combined loss, class balancing, automatic mixed precision (AMP, fp16), gradient accumulation, and different learning rates for the encoder and the heads.
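To make the "shared bottleneck, three heads" idea concrete, here is a minimal standard-library sketch. It is not the project's code: the linear heads, the random weights, and the names `make_head` and `classify` are assumptions for illustration, and a random vector stands in for the DNABERT-2 encoder output. Only the 768-d width and the class counts (23 / 3 / 7) come from this card.

```python
import math
import random

EMBED_DIM = 768                                    # bottleneck width (from the text)
HEAD_SIZES = {"species": 23, "host": 3, "geo": 7}  # class counts (from the results table)


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def make_head(n_classes, rng):
    """Hypothetical linear head: one randomly initialised weight row per class."""
    return [[rng.gauss(0.0, 0.02) for _ in range(EMBED_DIM)] for _ in range(n_classes)]


def classify(embedding, heads):
    """Apply every head to the same shared embedding; return probabilities per task."""
    out = {}
    for task, weights in heads.items():
        logits = [sum(w * x for w, x in zip(row, embedding)) for row in weights]
        out[task] = softmax(logits)
    return out


rng = random.Random(0)
heads = {task: make_head(n, rng) for task, n in HEAD_SIZES.items()}
embedding = [rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)]  # stand-in for the encoder output
probs = classify(embedding, heads)
```

The point of the sketch is that the expensive step (the encoder) runs once, and each task only adds a small head on top of the same vector.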

📈 Headline results

Held-out test set (883 sequences), neural classification heads:

| Task | Classes | Test accuracy |
| --- | --- | --- |
| 🧬 Species / lineage | 23 | 96.7% |
| 🐀 Host (Rodent / Human / Others) | 3 | 91.4% |
| 🌍 Geographic origin | 7 | 80.5% |

[Figure: Training curves] Training progression: validation accuracy climbs to 96.7% / 91.4% / 80.5% over 10 epochs.
[Figure: Species confusion matrix] Clean diagonal across 23 lineages.

Emergent phylogenetic structure

A UMAP projection of all 8,822 bottleneck embeddings reveals clean per-lineage clusters, with further substructure per genome segment (S, M, L), even though the segment is never explicitly supervised. The S/M/L separation is consistent with differences in selective pressure (conserved N protein on S, antigenic positive selection on Gn/Gc in M, active RdRp motifs on L).

[Figure: UMAP of all species] All 8,822 sequences by lineage.
[Figure: UMAP, Seoul virus] Seoul virus (1,391 sequences).
[Figure: UMAP, Puumala virus] Puumala virus (2,709 sequences).

🗂️ Project components

The HantaBERT stack spans data collection, modeling, a public API, and a web interface, each in its own repository.

📊 Data pipeline & dataset

Automated extraction, cleaning, multi-task labeling, and geocoding of Orthohantavirus genomic records from NCBI GenBank (Biopython + Nominatim). Produces the ready-to-train dataset of S/M/L RNA segments with standardized host, species, and geography labels.

🧠 Model: training & fine-tuning

The multi-task fine-tuning code: MultiTaskHantaBERT (DNABERT-2 encoder → shared bottleneck → 3 heads), weighted loss 1.0·L_species + 0.5·L_host + 0.3·L_geo, full train / evaluate / visualize scripts, and an SVM-on-embeddings baseline.
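The weighted loss above can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the repository's training code: it assumes each per-task term is an ordinary cross-entropy (the card does not say how class balancing enters the loss), and the names `cross_entropy` and `combined_loss` are hypothetical. The task weights are the ones quoted above.

```python
import math

# Task weights quoted in the text: 1.0*L_species + 0.5*L_host + 0.3*L_geo
TASK_WEIGHTS = {"species": 1.0, "host": 0.5, "geo": 0.3}


def cross_entropy(logits, target):
    """Cross-entropy of one example: -log softmax(logits)[target]."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[target]


def combined_loss(logits_by_task, targets_by_task, weights=TASK_WEIGHTS):
    """Weighted sum of the per-task cross-entropy losses."""
    return sum(
        weights[task] * cross_entropy(logits_by_task[task], targets_by_task[task])
        for task in weights
    )


# Toy logits with 3 classes per task, purely for illustration.
logits = {"species": [2.0, 0.0, 0.0], "host": [0.0, 0.0, 0.0], "geo": [0.0, 1.0, 0.0]}
targets = {"species": 0, "host": 1, "geo": 1}
loss = combined_loss(logits, targets)
```

With these weights, a mistake on the species task moves the total loss about three times as much as an equally large mistake on the geography task, which matches the ordering of the reported accuracies.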

⚡ Inference API

FastAPI + Uvicorn service, packaged with Docker. Accepts raw DNA/RNA or FASTA, auto-converts U→T, and returns top-N probabilistic predictions per task.
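The input handling and top-N ranking described above could look roughly like the following. This is a hedged sketch, not the service's actual code: `normalize_sequence` and `top_n` are hypothetical helper names, and the example probabilities are made up. The host labels are the three classes listed in the results.

```python
def normalize_sequence(text):
    """Accept raw DNA/RNA or FASTA text; return an uppercase DNA string (U -> T)."""
    lines = [ln.strip() for ln in text.splitlines()]
    # Drop FASTA header lines (those starting with '>') and blank lines.
    body = "".join(ln for ln in lines if ln and not ln.startswith(">"))
    return body.upper().replace("U", "T")


def top_n(probabilities, labels, n=3):
    """Return the n most probable (label, probability) pairs, best first."""
    ranked = sorted(zip(labels, probabilities), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]


fasta = ">example_record\nauggcuacga\nGGCUAA\n"
dna = normalize_sequence(fasta)  # "ATGGCTACGAGGCTAA"
ranked = top_n([0.1, 0.7, 0.2], ["Rodent", "Human", "Others"], n=2)
```

Note that this sketch concatenates every record in a multi-record FASTA file into one sequence. A real service would more likely split on header lines and classify each record separately.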

🌐 Web interface

Pure static HTML/CSS/JS frontend with an interactive world map (D3 + TopoJSON). Paste a sequence or upload a FASTA file and explore ranked predictions across all three tasks.

[Figure: HantaBERT web interface]

📄 Paper

HantaBERT: Multi-Task Hantavirus Classification with DNABERT-2 Fine-Tuning, an IEEE-style conference paper (English & Indonesian), written for the IF3211 Domain-Specific Computation (Bioinformatics) course at STEI ITB.


👥 Authors

Muhammad Faiz Atharrahman · Muhammad Rafi Dhiyaulhaq · Lydia Gracia, School of Electrical Engineering and Informatics (STEI), Institut Teknologi Bandung.

Developed as the final project for the IF3211 Domain-Specific Computation (Bioinformatics) course at STEI ITB.

Released under the Apache-2.0 license, consistent with the DNABERT-2 backbone. If you use HantaBERT, please also cite DNABERT-2 (Zhou et al., 2023).