Contrastive learning for antibody-antigen sequence-to-specificity prediction
Lee, H.; Castro, K.; Renwick, S.; Stalder, L.; Glanzer, W.; Kumar, R.; Chen, N.; Scheck, A.; Yermanos, A.; Mason, D.; Reddy, S. T.
Abstract

Predicting which antibodies bind to which antigens directly from primary amino acid sequences remains a major challenge, as no current method can reliably determine this specificity at both a repertoire and proteome scale. Structure-based protein design frameworks can propose antibody binders to specified antigenic epitopes, but they do not solve the "sequence-to-specificity" task of mapping antibodies to cognate epitopes, and vice versa. Here, we introduce CALM (Cross-attention Adaptive Immune Receptor-Antigen Language Model), a dual-encoder plus cross-attentive decoder architecture that treats antibody-antigen recognition as molecular translation. Using contrastive learning, antigen and antibody encoders learn a shared embedding space that aligns cognate epitope-paratope binding pairs. CALM-1.0 is trained and evaluated on 4,138 curated antibody-antigen pairs obtained from the PDB-derived structural antibody database (SAbDab). On a leakage-controlled test split drawn from sequence clusters at 80% identity and unseen during training, CALM-1.0 achieves a mean top-1 retrieval (R@1) of 7%, with consistent performance across both directions (Ab to Ag and Ag to Ab). CALM establishes a foundation for bidirectional antibody-antigen sequence-to-specificity prediction with the potential to unify retrieval and generative design.
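To make the contrastive objective and the R@1 metric concrete, the following is a minimal sketch and not the CALM implementation. The abstract does not state which loss is used, so a CLIP-style symmetric InfoNCE loss over cosine similarities is assumed here. The temperature of 0.07, all function names, and the toy two-dimensional embeddings are hypothetical and chosen only for illustration. Row i of the antibody embeddings is taken to be the cognate partner of row i of the antigen embeddings.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def similarity_matrix(ab_embs, ag_embs):
    """sim[i][j] = cosine similarity between antibody i and antigen j."""
    ab = [l2_normalize(v) for v in ab_embs]
    ag = [l2_normalize(v) for v in ag_embs]
    return [[sum(a * g for a, g in zip(u, w)) for w in ag] for u in ab]


def transpose(m):
    return [list(col) for col in zip(*m)]


def row_cross_entropy(sim, temperature):
    """Mean over rows of -log softmax(sim[i] / T)[i]; the diagonal is the positive pair."""
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[i]
    return total / len(sim)


def symmetric_info_nce(sim, temperature=0.07):
    """Average of the Ab-to-Ag and Ag-to-Ab losses (assumed, CLIP-style)."""
    return 0.5 * (row_cross_entropy(sim, temperature)
                  + row_cross_entropy(transpose(sim), temperature))


def recall_at_1(sim):
    """Fraction of queries (rows) whose top-scoring candidate is the true partner."""
    hits = 0
    for i, row in enumerate(sim):
        best = max(range(len(row)), key=row.__getitem__)
        hits += int(best == i)
    return hits / len(sim)


def bidirectional_recall_at_1(sim):
    """Return (Ab-to-Ag R@1, Ag-to-Ab R@1, mean of the two)."""
    ab_to_ag = recall_at_1(sim)
    ag_to_ab = recall_at_1(transpose(sim))
    return ab_to_ag, ag_to_ab, 0.5 * (ab_to_ag + ag_to_ab)


# Hypothetical toy embeddings: three antibody-antigen pairs in a 2-D shared space.
toy_ab = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_ag = [[0.9, 0.1], [0.2, 0.8], [1.0, -0.2]]

toy_sim = similarity_matrix(toy_ab, toy_ag)
toy_loss = symmetric_info_nce(toy_sim)
toy_recall = bidirectional_recall_at_1(toy_sim)
print("loss:", toy_loss)
print("R@1 (Ab to Ag, Ag to Ab, mean):", toy_recall)
```

In this toy case the third pair is misaligned in both directions, so two of three queries retrieve their partner each way. For scale, with N candidates in the retrieval pool a random ranking would give an expected R@1 of 1/N. The abstract does not report the pool size, so no chance-level comparison is made here.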