MLMarker: A machine learning framework for tissue inference and biomarker discovery
Claeys, T.; van Puyenbroeck, S.; Gevaert, K.; Martens, L.
Abstract
Mass spectrometry-based proteomics enables high-throughput profiling of protein expression across tissues, but interpreting complex or sparse datasets remains challenging. Here we present MLMarker, a machine learning-based tool trained on healthy human tissue proteomic data to compute continuous tissue similarity scores for new datasets. Using a Random Forest model trained across 34 tissue types, MLMarker predicts probabilistic tissue scores and explains these predictions with SHAP values at the single-protein level for biological interpretability. To address missingness, an optional penalty factor corrects for absent proteins that contribute to predictions. We demonstrate MLMarker's utility across three public datasets: (i) cerebral melanoma metastases, where brain-like proteomic profiles were linked to poor treatment response and invasive phenotypes, revealing 241 differentially expressed proteins; (ii) a pan-cancer FFPE dataset, where tissue origin predictions reached 79% accuracy using broad tissue mappings despite the absence of cancer-specific training data; and (iii) cerebrospinal fluid and plasma, where the model identified brain and pituitary origin in biofluid samples. MLMarker offers an explainable, robust framework for hypothesis generation in proteomics, applicable across tumours, biofluids, and multi-tissue studies. It is available as a Python package and Streamlit GUI at https://mlmarker.streamlit.app.
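To make the penalty-factor idea concrete, the following minimal sketch shows one plausible way of down-weighting per-protein contributions for proteins that were not detected in a sample. It is an illustration under stated assumptions, not MLMarker's actual implementation: the function name `penalised_tissue_scores`, the linear `(1 - penalty_factor)` weighting, and the toy contribution values are all hypothetical, and the abstract does not specify the exact formula used.

```python
from typing import Dict, Set


def penalised_tissue_scores(
    contributions: Dict[str, Dict[str, float]],
    detected: Set[str],
    penalty_factor: float = 1.0,
) -> Dict[str, float]:
    """Sum SHAP-like per-protein contributions per tissue.

    Hypothetical scheme: a protein absent from `detected` has its
    contribution multiplied by (1 - penalty_factor), so a factor of 0
    leaves it untouched and a factor of 1 removes it entirely.
    """
    if not 0.0 <= penalty_factor <= 1.0:
        raise ValueError("penalty_factor must lie in [0, 1]")

    scores: Dict[str, float] = {}
    for tissue, per_protein in contributions.items():
        total = 0.0
        for protein, value in per_protein.items():
            # Detected proteins count fully; absent ones are down-weighted.
            weight = 1.0 if protein in detected else 1.0 - penalty_factor
            total += weight * value
        scores[tissue] = total
    return scores


# Toy contributions (made-up numbers, not results from the paper).
toy_contributions = {
    "brain": {"GFAP": 0.30, "ALB": 0.05},
    "liver": {"GFAP": 0.02, "ALB": 0.40},
}
toy_detected = {"GFAP"}  # ALB was not observed in this hypothetical sample

unpenalised = penalised_tissue_scores(toy_contributions, toy_detected, 0.0)
penalised = penalised_tissue_scores(toy_contributions, toy_detected, 1.0)
```

In this toy case the unpenalised scores favour liver because of a protein that was never observed, whereas the fully penalised scores favour brain. This illustrates why a correction for absent proteins can matter in sparse datasets.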