Summarizing RNA Structural Ensembles via Maximum Agreement Secondary Structures
Summarizing RNA Structural Ensembles via Maximum Agreement Secondary Structures
Gu, X.; Ivanovic, S.; Feng, D. W.; El-Kebir, M.
AbstractSummarizing a collection P of related RNA secondary structures is a key challenge in applications like evolutionary analysis, alternative fold studies and mRNA vaccine design. This requires both clustering the input structures into similar groups and identifying the core structural motifs on which they agree or differ. Existing methods fail by focusing on only one of these goals: clustering methods do not output shared motifs, while consensus methods overlook the structural diversity present in the collection. Here, we introduce the MAXIMUM AGREEMENT SECONDARY STRUCTURES (MASS) problem, which seeks the largest set F of structural features present in P that partition the input structures into a user-specified number {tau} of distinct clusters. We prove that MASS is NP-hard and also establish its equivalence to a constrained binary matrix projection problem. We present an exact integer linear program, an exact combinatorial algorithm, and a scalable beam-search heuristic. Using simulations we demonstrate the performance of these exact algorithms and heuristics relative to baseline methods that focus on either clustering or identifying a single consensus tree. On real data, we demonstrate that MASS identifies conserved scaffolds in conformational datasets, reveals conserved structural motifs in different species within RNA families, and recovers shared structural features among synonymous transcripts encoding the same protein. MASS provides a general and interpretable framework for summarizing RNA structural organization.