Science Cast

Why phylogenies compress so well: combinatorial guarantees under the Infinite Sites Model

librarian, March 27, 2026 11:56am

Connected to paper. This paper is a preprint and has not been certified by peer review.

bioRxiv, March 27, 2026 12:00am

Authors

Hendrychova, V.; Brinda, K.

Abstract

One important question in bacterial genomics is how to represent and search modern million-genome collections at scale. Phylogenetic compression effectively addresses this by guiding compression and search via evolutionary history, and many related methods similarly rely on tree- and ordering-based heuristics that leverage the same underlying phylogenetic signal. Yet, the mathematical principles underlying phylogenetic compression remain poorly understood. Here, we introduce the first formal framework to model phylogenetic compression mechanisms. We study genome collections represented as RLE-compressed SNP, k-mer, unitig, and uniq-row matrices and formulate compression as an optimization problem over genome orderings. We prove that while the problem is NP-hard for arbitrary data, for genomes following the Infinite Sites Model it becomes optimally solvable in polynomial time via Neighbor Joining (NJ). Finally, we experimentally validate the model's predictions with real bacterial datasets using an exact Traveling Salesperson Problem (TSP) solver. We demonstrate that, despite numerous simplifying assumptions, NJ orderings achieve near-optimal compression across dataset types, representations, and k-mer ranges. Altogether, these results explain the mathematical principles underlying the efficacy of phylogenetic compression and, more generally, the success of tree-based compression and indexing heuristics across bacterial genomics.
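
To make the ordering formulation in the abstract concrete, here is a minimal Python sketch. It is not the authors' code. The toy matrix, the function names (`rle_runs`, `path_cost`, `best_order`), and the brute-force search are all assumptions made for illustration. Rows are genomes, columns are binary sites, and the cost of an ordering is the total number of runs across columns. For a binary matrix, that total equals the number of columns plus the sum of Hamming distances between consecutive genomes, which is one way to see why finding the best ordering resembles a path version of TSP.

```python
from itertools import permutations


def rle_runs(matrix, order):
    """Total number of runs over all columns when rows are arranged in `order`."""
    runs = 0
    for c in range(len(matrix[0])):
        column = [matrix[g][c] for g in order]
        # One run to start, plus one more for every change between neighbours.
        runs += 1 + sum(a != b for a, b in zip(column, column[1:]))
    return runs


def hamming(u, v):
    """Number of positions where two rows differ."""
    return sum(a != b for a, b in zip(u, v))


def path_cost(matrix, order):
    """Sum of Hamming distances between consecutive rows in `order`."""
    return sum(hamming(matrix[a], matrix[b]) for a, b in zip(order, order[1:]))


def best_order(matrix):
    """Brute-force search over all orderings (feasible only for tiny inputs)."""
    return min(permutations(range(len(matrix))), key=lambda o: rle_runs(matrix, o))


# Hypothetical toy data: 4 genomes x 4 sites, each site mutated once on a
# tree in which rows 0 and 2 are siblings and rows 1 and 3 are siblings.
# The rows are deliberately listed in an interleaved order.
genomes = [
    [1, 1, 0, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
]

given = (0, 1, 2, 3)      # rows as listed: 12 runs
by_clade = (0, 2, 1, 3)   # siblings placed next to each other: 8 runs

print(rle_runs(genomes, given), rle_runs(genomes, by_clade))
print(rle_runs(genomes, best_order(genomes)))
```

In this toy case, placing sibling genomes next to each other already reaches the minimum of 8 runs, since each of the four non-constant columns needs at least two runs. The exhaustive search grows factorially with the number of genomes, so it serves only as a reference for very small examples.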
