arxiv:2606.06117

p-adic Bi-Filtrations for Topological Machine Learning on Genomic Sequences

Published on Jun 4

Abstract

pVR is a topological machine learning framework that combines p-adic numbers with topological data analysis for genomic sequence classification. It outperforms four established alignment-free baselines on three of six low-sample datasets.

We introduce pVR, a topological machine learning framework for alignment-free genomic sequence classification that combines p-adic numbers with topological data analysis. Each DNA sequence is encoded along two complementary axes: a p-adic distance on k-mer prefixes, which captures hierarchical positional structure, and a compositional L_1 distance on k-mer frequencies, which captures local sequence content. The two distances jointly parameterise a bi-filtered Vietoris-Rips complex, and per-sequence topological summaries from this bi-filtration serve as features for standard machine learning classifiers. We establish theoretical guarantees for the construction: stability under metric perturbations and invariance to the choice of prime, alongside a result that explains why a single p-adic axis is topologically uninformative and why the bi-filtration recovers nontrivial homology. On twelve genomic benchmarks (28 to 500 sequences, 3 to 7 classes), pVR outperforms four established alignment-free baselines on three of six low-sample datasets, with gains of up to 21 percentage points; it underperforms only on a SARS-CoV-2 variant benchmark whose point-mutation divergence violates the hierarchical assumption, and all methods saturate in the large-sample regime. pVR also outperforms zero-shot frozen embeddings from the 500M-parameter Nucleotide Transformer v2 by 6.7 to 11.4 percentage points on three low-sample benchmarks. The pVR codebase is publicly available at https://github.com/MAHI-Group/pVR.
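The two distances named in the abstract can be illustrated with a minimal Python sketch. This is not the pVR implementation: the digit encoding `DIGIT`, the function names, the default values of `k` and `p`, and the choice to compare plain strings are all assumptions made for illustration, since the abstract does not specify them. The sketch computes a p-adic distance on length-k prefixes, an L_1 distance on normalised k-mer frequencies, and the edges of a Vietoris-Rips graph present at a bi-grade (r, s).

```python
from collections import Counter
from itertools import combinations

# Hypothetical digit encoding: any injective map into {0, ..., p-1} would do.
DIGIT = {"A": 0, "C": 1, "G": 2, "T": 3}

def padic_distance(s, t, k=4, p=5):
    """p-adic distance |x - y|_p between the length-k prefixes of s and t.

    Each prefix is read as a base-p integer, first nucleotide = lowest digit,
    so the distance is p**(-v) with v the length of the shared prefix.
    Assumes both strings have at least k characters and p > max(DIGIT.values()).
    """
    x = sum(DIGIT[c] * p**i for i, c in enumerate(s[:k]))
    y = sum(DIGIT[c] * p**i for i, c in enumerate(t[:k]))
    diff = x - y
    if diff == 0:
        return 0.0
    v = 0  # p-adic valuation of x - y
    while diff % p == 0:
        diff //= p
        v += 1
    return float(p) ** (-v)

def kmer_profile(s, k=2):
    """Normalised k-mer frequency profile of s."""
    counts = Counter(s[i:i + k] for i in range(len(s) - k + 1))
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def l1_distance(s, t, k=2):
    """L_1 distance between the k-mer frequency profiles of s and t."""
    f, g = kmer_profile(s, k), kmer_profile(t, k)
    return sum(abs(f.get(w, 0.0) - g.get(w, 0.0)) for w in set(f) | set(g))

def bifiltration_edges(points, r, s, k_prefix=4, k_freq=2, p=5):
    """Edges (i, j) present at bi-grade (r, s): both distances within threshold."""
    return [
        (i, j)
        for i, j in combinations(range(len(points)), 2)
        if padic_distance(points[i], points[j], k_prefix, p) <= r
        and l1_distance(points[i], points[j], k_freq) <= s
    ]

points = ["ACGTAC", "ACGAAC", "TTTTTT"]
print(bifiltration_edges(points, r=0.01, s=1.0))  # only the pair sharing "ACG"
```

The p-adic distance is an ultrametric: d(x, z) <= max(d(x, y), d(y, z)). For any ultrametric, "distance at most r" is an equivalence relation, so a Vietoris-Rips complex built on it alone is a disjoint union of simplices at every scale and has no homology above degree zero. This standard fact is consistent with the abstract's remark that a single p-adic axis is topologically uninformative, although the paper's own argument may differ. Intersecting with the L_1 threshold breaks the equivalence-relation structure, which is what allows cycles to appear.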
