A Robust Machine Learning Framework for Keloid Biomarker Discovery Beyond Differential Expression
Daher, A.; Eftimie, R.; Afzal, F.
Abstract
Keloids are fibroproliferative skin disorders that arise after dermal injury and extend beyond the original wound margins. Their pathogenesis remains poorly understood, and current treatments are associated with high recurrence rates. Identifying transcriptomic biomarkers that distinguish keloids from other skin and scar phenotypes may provide insight into disease mechanisms and facilitate the development of targeted therapeutic approaches. However, previous transcriptomic studies have often been limited by small sample sizes, pairwise comparisons between tissue classes, heterogeneous data-integration strategies, and a reliance on conventional differential gene expression (DGE) analysis. Here, we employed a multi-stage machine learning (ML) workflow for robust keloid biomarker discovery using transcriptomic datasets derived from both bulk RNA sequencing and single-cell RNA sequencing (scRNA-seq). We assembled and harmonized what is, to the best of our knowledge, the largest curated cross-study keloid transcriptomic cohort currently available, comprising 81 samples from 13 independent studies spanning four clinically relevant tissue classes: normal skin, normotrophic scar, hypertrophic scar, and keloid scar. Through study-aware cross-validation, feature selection, partition-stability analysis, and bootstrap validation across multiple ML classifiers, we identified a panel of eight highly consistent biomarkers capable of distinguishing keloid from non-keloid samples. These biomarkers were associated with dysregulation of extracellular matrix homeostasis, fibrosis-resolution pathways, vascular remodelling, and metabolic reprogramming. Comparison with conventional DGE analysis demonstrated substantial agreement while also highlighting important differences between the two approaches.
In particular, FASN was consistently identified by the ML workflow as an upregulated discriminatory biomarker despite exhibiting weak, non-significant differential expression in the DGE analysis. Cell-type-specific analysis further supported this finding, revealing significant FASN upregulation in fibroblast and vascular endothelial populations. These results demonstrate that ML and DGE capture complementary aspects of transcriptomic variation. This study provides a robust strategy for cross-study transcriptomic biomarker discovery and identifies candidate genes and pathways for future mechanistic and therapeutic investigation in keloids.
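As an illustration of what "study-aware" cross-validation can mean in a cross-study cohort, the following minimal sketch holds out all samples from one study per fold, so that no study contributes to both training and testing. It is an assumption-laden example, not the authors' implementation: the abstract does not specify the splitting scheme, and the function name `leave_one_study_out` and the toy study labels are hypothetical.

```python
from collections import defaultdict


def leave_one_study_out(study_ids):
    """Yield (held_out_study, train_idx, test_idx), one fold per study.

    Every sample from the held-out study goes to the test set, so the
    classifier is always evaluated on a study it has never seen.
    """
    by_study = defaultdict(list)
    for idx, study in enumerate(study_ids):
        by_study[study].append(idx)

    for study in sorted(by_study):
        test_idx = by_study[study]
        train_idx = sorted(
            i for other, idxs in by_study.items() if other != study for i in idxs
        )
        yield study, train_idx, test_idx


# Hypothetical toy cohort: six samples drawn from three studies.
study_ids = ["S1", "S1", "S2", "S2", "S3", "S3"]
folds = list(leave_one_study_out(study_ids))

for study, train_idx, test_idx in folds:
    print(f"held out {study}: train={train_idx} test={test_idx}")
```

Splitting by study rather than by sample is one way to keep study-specific batch effects from inflating performance estimates, which matters when samples are pooled from many independent sources.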