Why phylogenies compress so well: combinatorial guarantees under the Infinite Sites Model
Why phylogenies compress so well: combinatorial guarantees under the Infinite Sites Model
Hendrychova, V.; Brinda, K.
AbstractOne important question in bacterial genomics is how to represent and search modern million-genome collections at scale. Phylogenetic compression effectively addresses this by guiding compression and search via evolutionary history, and many related methods similarly rely on tree- and ordering-based heuristics that leverage the same underlying phylogenetic signal. Yet, the mathematical principles underlying phylogenetic compression remain little understood. Here, we introduce the first formal framework to model phylogenetic compression mechanisms. We study genome collections represented as RLE-compressed SNP, k-mer, unitig, and uniq-row matrices and formulate compression as an optimization problem over genome orderings. We prove that while the problem is NP-hard for arbitrary data, for genomes following the Infinite Sites Model it becomes optimally solvable in polynomial time via Neighbor Joining (NJ). Finally, we experimentally validate the model's predictions with real bacterial datasets using an exact Traveling Salesperson Problem (TSP). We demonstrate that, despite numerous simplifying assumptions, NJ orderings achieve near-optimal compression across dataset types, representations, and k-mer ranges. Altogether, these results explain the mathematical principles underlying the efficacy of phylogenetic compression and, more generally, the success of tree-based compression and indexing heuristics across bacterial genomics.