Multi-Task Orthohantavirus classification by fine-tuning DNABERT-2.
One forward pass → species, host, and geographic origin, plus a 768-d phylogenetic embedding.
Web App · API Docs · Model · Dataset · GitHub
What is HantaBERT?
Hantaviruses (genus Orthohantavirus) are segmented negative-sense ssRNA viruses that cause hemorrhagic fever with renal syndrome (HFRS) in Eurasia and hantavirus cardiopulmonary syndrome (HCPS) in the Americas, with mortality reaching ~40% in HCPS cases. Rapidly identifying the species, reservoir host, and geographic origin of a sequence is essential for surveillance, but BLAST and classical phylogeny are slow and do not predict these attributes jointly.
HantaBERT fine-tunes DNABERT-2 (117M) into a multi-task model that emits probabilities for all three tasks in a single forward pass. A shared 768-d bottleneck feeds three independent classification heads, trained with a weighted combined loss, balanced classes, AMP fp16, gradient accumulation, and a differential learning rate between encoder and heads.
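The shared-bottleneck layout described above can be illustrated with a small dependency-free sketch. It is not the project's `MultiTaskHantaBERT` code: the class name `ToyMultiTaskHead`, the `tanh` activation, the toy dimensions, and the random weights are all assumptions chosen for illustration, and the pooled encoder output is replaced by a random vector. Only the three task names and their class counts (23, 3, 7) come from the results reported here.

```python
import math
import random

# Class counts per task, taken from the results table; all else is illustrative.
NUM_CLASSES = {"species": 23, "host": 3, "geo": 7}


def linear(x, weight, bias):
    """Compute weight @ x + bias for a single input vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(rng, n_out, n_in):
    """Return (weight, bias) with small random weights and zero bias."""
    weight = [[rng.gauss(0.0, 0.02) for _ in range(n_in)] for _ in range(n_out)]
    return weight, [0.0] * n_out


class ToyMultiTaskHead:
    """Hypothetical stand-in: one shared bottleneck feeding three independent heads."""

    def __init__(self, encoder_dim, bottleneck_dim, seed=0):
        rng = random.Random(seed)
        self.bottleneck = init_layer(rng, bottleneck_dim, encoder_dim)
        self.heads = {
            task: init_layer(rng, n, bottleneck_dim) for task, n in NUM_CLASSES.items()
        }

    def forward(self, pooled):
        # Shared embedding (assumed tanh activation), reused by every head.
        z = [math.tanh(v) for v in linear(pooled, *self.bottleneck)]
        # Each head produces its own probability distribution from the same z.
        probs = {task: softmax(linear(z, *head)) for task, head in self.heads.items()}
        return z, probs


# Toy sizes for speed; the text above reports a 768-d bottleneck.
rng = random.Random(1)
pooled = [rng.gauss(0.0, 1.0) for _ in range(32)]  # stands in for the encoder output
model = ToyMultiTaskHead(encoder_dim=32, bottleneck_dim=16)
embedding, probs = model.forward(pooled)
```

Because every head reads the same vector `z`, a single forward pass yields all three distributions plus the embedding, which is the property the paragraph above relies on.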
Headline results
Held-out test set (883 sequences), neural classification heads:
| Task | Classes | Test accuracy |
|---|---|---|
| Species / lineage | 23 | 96.7% |
| Host (Rodent / Human / Others) | 3 | 91.4% |
| Geographic origin | 7 | 80.5% |

Figure: training progression. Validation accuracy climbs to 96.7% / 91.4% / 80.5% over 10 epochs.
Figure: species confusion matrix. Clean diagonal across 23 lineages.
Emergent phylogenetic structure
A UMAP projection of all 8,822 bottleneck embeddings reveals clean per-lineage clusters, with further substructure per genome segment (S, M, L) even though the segment is never an explicit training label. The S/M/L separation tracks differences in selective pressure (conserved N protein on S, antigenic positive selection on Gn/Gc in M, active RdRp motifs on L).
Figures: UMAP projections of the bottleneck embeddings for all 8,822 sequences by lineage, Seoul virus (1,391), and Puumala virus (2,709).
Project components
The HantaBERT stack spans data collection, modeling, a public API, and a web interface, each in its own repository.
Data pipeline & dataset
Automated extraction, cleaning, multi-task labeling, and geocoding of Orthohantavirus genomic records from NCBI GenBank (Biopython + Nominatim). Produces the ready-to-train dataset of S/M/L RNA segments with standardized host, species, and geography labels.
- Dataset: HantaBERT/Orthohantavirus-Genome-Atlas: `raw` (9,950), `interim` (9,846), default `processed` (9,846)
- Code: github.com/HantaBERT/data-pipeline
Model: training & fine-tuning
The multi-task fine-tuning code: `MultiTaskHantaBERT` (DNABERT-2 encoder → shared bottleneck → 3 heads), weighted loss 1.0·L_species + 0.5·L_host + 0.3·L_geo, full train / evaluate / visualize scripts, and an SVM-on-embeddings baseline.
- Model: HantaBERT/HantaBERT
- Code: github.com/HantaBERT/HantaBERT
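As a worked illustration of the weighted loss in this section, the sketch below combines three per-task losses with the stated weights 1.0, 0.5, and 0.3. It assumes each per-task loss is a plain cross-entropy on a single example. The function names, the toy probabilities, and the omission of class balancing are illustrative assumptions, not the repository's implementation.

```python
import math

# Task weights as stated in this section: 1.0 * L_species + 0.5 * L_host + 0.3 * L_geo.
TASK_WEIGHTS = {"species": 1.0, "host": 0.5, "geo": 0.3}


def cross_entropy(probs, target):
    """Negative log-likelihood of the target class (assumed per-task loss)."""
    return -math.log(max(probs[target], 1e-12))  # clamp to avoid log(0)


def combined_loss(probs_by_task, targets, weights=TASK_WEIGHTS):
    """Weighted sum of the per-task losses for one example."""
    return sum(
        weights[task] * cross_entropy(probs_by_task[task], targets[task])
        for task in weights
    )


# Toy predicted distributions and integer class targets (made-up values).
probs = {
    "species": [0.7, 0.2, 0.1],
    "host": [0.5, 0.25, 0.25],
    "geo": [0.1, 0.9],
}
targets = {"species": 0, "host": 0, "geo": 1}
loss = combined_loss(probs, targets)
```

With these weights, an error on the species head moves the total loss about three times as much as an equally sized error on the geography head, which suggests species is treated as the primary task.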
Inference API
FastAPI + Uvicorn service, packaged with Docker. Accepts raw DNA/RNA or FASTA, auto-converts U→T, and returns top-N probabilistic predictions per task.
- Live: hantabert-api.faizath.com/docs
- Code: github.com/HantaBERT/HantaBERT-API
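The input handling described for the API (raw sequence or FASTA, U→T conversion, top-N ranking) could look roughly like the sketch below. The helper names `normalize_sequence` and `top_n` are hypothetical and not taken from the HantaBERT-API code, and the sketch assumes a single-record FASTA input; the service's actual validation rules and response schema are not shown here.

```python
def normalize_sequence(text):
    """Drop FASTA header lines and whitespace, uppercase, and map RNA 'U' to DNA 'T'.

    Assumes a single record: sequence lines from several records would be
    concatenated into one string.
    """
    lines = [line.strip() for line in text.strip().splitlines()]
    sequence = "".join(line for line in lines if line and not line.startswith(">"))
    return sequence.upper().replace("U", "T")


def top_n(probs, labels, n=3):
    """Return the n most probable (label, probability) pairs, best first."""
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]


# Usage with made-up inputs: a FASTA record containing RNA letters,
# and a toy host distribution over the three host classes.
fasta = ">example_record\nAUGc\nguu\n"
clean = normalize_sequence(fasta)
host_ranking = top_n([0.1, 0.7, 0.2], ["Human", "Rodent", "Others"], n=2)
```

Normalizing to the DNA alphabet before tokenization lets RNA and DNA submissions of the same sequence reach the model as identical input.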
Web interface
Pure static HTML/CSS/JS frontend with an interactive world map (D3 + TopoJSON). Paste a sequence or upload a FASTA file and explore ranked predictions across all three tasks.
- Live: hantabert.faizath.com
- Code: github.com/HantaBERT/HantaBERT-Web
Paper
HantaBERT: Multi-Task Hantavirus Classification with DNABERT-2 Fine-Tuning, an IEEE-style conference paper (English & Indonesian), written for the IF3211 Domain-Specific Computation (Bioinformatics) course at STEI ITB.
- Source & PDFs: github.com/HantaBERT/paper
Authors
Muhammad Faiz Atharrahman · Muhammad Rafi Dhiyaulhaq · Lydia Gracia, School of Electrical Engineering and Informatics (STEI), Institut Teknologi Bandung.
Developed as the final project for the IF3211 Domain-Specific Computation (Bioinformatics) course at STEI ITB.
Released under the Apache-2.0 license, consistent with the DNABERT-2 backbone. If you use HantaBERT, please also cite DNABERT-2 (Zhou et al., 2023).




