Detecting and quantifying overparametrization in RNA language models with REDIAL
Teng, D.; Qiu, Y.; Sakthivel, G.; Aranganathan, A.; Herron, L.; Tiwary, P.
Abstract
While RNA language models (LMs) have served as foundation models (FMs) to advance structural prediction, their evaluation relies heavily on supervised downstream tasks. Such tasks can often mask FM inefficiencies and instead reflect memorization of the downstream training set. To address this, we introduce REDIAL (RNA Embedding perturbation Diagnostics for Language models), a zero-shot, unsupervised framework designed to extract coevolutionary signals directly from the high-dimensional latent spaces of RNA language models. Applying REDIAL in a layer-wise dissection and ablation study, we uncover stark disparities in how popular RNA LMs internalize structural constraints. Our results show how this layer-wise behavior deviates from that of protein LMs and traces back to design flaws in the architectures. Specifically, we show that current RNA LMs are severely overparameterized relative to the limited sequence diversity of available RNA databases, leading to profound parameter inefficiency and overfitting. Furthermore, we establish that structure-guided pre-training fundamentally improves the signal-to-noise ratio of learned coevolutionary couplings compared to sequence-only baselines. Ultimately, this unsupervised evaluation paradigm exposes critical flaws in current parameter-scaling strategies and provides a rigorous diagnostic benchmark to guide the development of more efficient, generalizable foundation models for RNA therapeutics and de novo design.
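To make the idea of an embedding-perturbation diagnostic concrete, the sketch below shows one plausible zero-shot realization: mutate each position of a sequence in silico, re-embed the mutant, and record how strongly the embeddings at every other position shift. This is an illustrative reconstruction under stated assumptions, not the published REDIAL procedure; the `embed` callable (mapping an RNA string to a per-position embedding matrix from some chosen layer of an RNA LM) and the L2-shift coupling score are assumptions introduced here for exposition.

```python
import numpy as np

NUCLEOTIDES = "ACGU"


def perturbation_coupling_map(seq, embed):
    """Build an (L x L) sensitivity map by single-nucleotide perturbation.

    `embed` is any user-supplied function mapping an RNA string of length L
    to a per-position embedding matrix of shape (L, d), e.g. the hidden
    states of one layer of an RNA language model. Hypothetical sketch:
    the coupling score used here is an assumption, not the authors' method.
    """
    L = len(seq)
    base = embed(seq)                      # (L, d) reference embeddings
    coupling = np.zeros((L, L))
    for i in range(L):
        shifts = []
        for nt in NUCLEOTIDES:
            if nt == seq[i]:
                continue                   # skip the wild-type nucleotide
            mutant = seq[:i] + nt + seq[i + 1:]
            diff = embed(mutant) - base    # (L, d) embedding displacement
            shifts.append(np.linalg.norm(diff, axis=1))  # per-position L2 shift
        # Average response of every position j to mutations at position i.
        coupling[i] = np.mean(shifts, axis=0)
    # Symmetrize and zero out the trivial self-response on the diagonal.
    coupling = 0.5 * (coupling + coupling.T)
    np.fill_diagonal(coupling, 0.0)
    return coupling
```

Under this reading, positions whose embeddings move strongly when a distant position is mutated are candidate coevolutionary partners; sweeping `embed` over a model's layers and comparing the top-scoring pairs against known base pairs would yield the kind of layer-wise signal-to-noise readout the abstract describes.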