Self-supervised learning for a gene program-centric view of cell states
Moullet, M.; Isobe, T.; Vahidi, A.; Leonardi, C.; Paulas-Condori, L.; Soelistyo, C.; Steele, L.; Ly, K. C. H.; Quiroga Londono, M.; Mende, N.; Stephenson, E.; Iskander, D.; Webb, S.; Goh, I.; Vijayabaskar, M.; Liu, Q.; Jafree, D. J.; Asadollahzadeh, H.; Ramirez-Suastegui, C.; Merchant, A.; Roberts, K.; Rumney, B.; Chan, H. M.; Holland, A.; Prete, M.; Horsfall, D.; Basurto-Lozada, D.; Lee, J. Y. W.; Winheim, E.; Foster, A. R.; Vilarrasa-Blasi, R.; Hannah, R.; Mahil, S. K.; Smith, C.; Roy, A.; Roberts, I.; Laurenti, E.; Gottgens, B.; Vento-Tormo, R.; Haniffa, M.; Wilson, N. K.; Lotfollahi, M.
Abstract
Single-cell omics has extended the biological interrogation of cell state from examining the expression of individual genes to unbiased profiling of tens of thousands of genes at once. However, extracting biological insights from such high-dimensional data remains challenging. To enable downstream analyses, many computational approaches compress cell state into a single latent representation. This can obscure the structure of underlying gene programs (GPs), defined as coordinated sets of biologically related genes, such as signalling pathway response modules or transcription factor targets. Here, we present Tripso, a self-supervised transformer deep learning model that learns multiple GP-specific embeddings from predefined GPs, while also enabling the discovery of novel, data-driven GPs. Tripso facilitates principled comparisons across development, disease, and experimental systems. First, in a dataset of human hematopoietic cells spanning prenatal development through adulthood and aging, and including newly generated data, Tripso resolved age-specific GP patterns, including elevated JAK-STAT activity in pediatric hematopoietic cells and postnatal shifts in IKZF1 GP activity during B cell differentiation. Second, by leveraging Tripso GP embeddings to compare in vivo and in vitro data, we hypothesized and experimentally validated that inhibition of the SEC61 translocon improved maintenance of hematopoietic stem cells in culture. Finally, Tripso's capacity for data-driven GP discovery revealed a previously uncharacterized tissue-resident memory T cell GP with increased activity in atopic dermatitis. Its spatial co-localization with sebaceous gland-associated immune niches was demonstrated in spatial transcriptomic and proteomic data. Thus, by moving beyond single embeddings of cellular states, Tripso enables interpretable and actionable discoveries, demonstrating how GP-centric modelling can generate hypotheses with substantial biomedical relevance.
By anchoring cellular representations in meaningful GPs, Tripso establishes a principled and biologically grounded framework towards the development of interpretable virtual cell models.