Evaluating Data Partitioning Strategies for Accurate Prediction of Protein-Ligand Binding Free Energy Changes in Mutated Proteins

Liangxu Xie, Guoming Bao, Dawei Zhang, Lei Xu, Xiaojun Xu, Shan Chang

Computational and Structural Biotechnology Journal, Volume 27, 2025, Pages 4418-4430

Highlights

Evaluated the impact of different data partitioning strategies on predicting mutation-induced changes in binding free energy.

UniProt-based partitioning reduces model prediction accuracy, highlighting potential overestimation from conventional methods.

Proposed an anchor-query partitioning framework, leveraging limited reference data to improve predictive generalization.
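The difference between the two partitioning strategies named in the highlights can be illustrated with a minimal sketch. It is not the authors' pipeline: the record layout, the `uniprot_id` field name, and the toy values are assumptions made for illustration only. A random split shuffles individual samples, so mutants of one protein can appear in both training and test sets, whereas a UniProt-based split holds out whole proteins.

```python
import random


def random_split(samples, test_fraction=0.2, seed=0):
    """Shuffle individual samples; one protein may land on both sides."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def uniprot_split(samples, test_fraction=0.2, seed=0):
    """Hold out whole proteins: all samples of a UniProt ID go to one side."""
    rng = random.Random(seed)
    ids = sorted({s["uniprot_id"] for s in samples})
    rng.shuffle(ids)
    n_test = max(1, round(len(ids) * test_fraction))
    test_ids = set(ids[:n_test])
    train = [s for s in samples if s["uniprot_id"] not in test_ids]
    test = [s for s in samples if s["uniprot_id"] in test_ids]
    return train, test


# Toy records with made-up values (not taken from MdrDB):
# 5 hypothetical proteins with 4 mutants each.
samples = [
    {"uniprot_id": f"P{p}", "mutation": f"M{m}", "ddg": 0.1 * (p + m)}
    for p in range(5)
    for m in range(4)
]

train, test = uniprot_split(samples)
shared = {s["uniprot_id"] for s in train} & {s["uniprot_id"] for s in test}
print(len(train), len(test), shared)
```

With the grouped split, the set of protein identifiers shared between training and test data is empty by construction, which is the independence property that random splitting does not guarantee.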

Abstract

Accurate prediction of the relative free energy of protein-ligand binding, especially for mutated proteins, is vital for drug design and for interpreting drug resistance. However, machine learning (ML) and deep learning (DL) methods often struggle to generalize, and their measured performance depends on the dataset partitioning strategy. Random data partitioning can produce spuriously high correlations that inflate performance estimates. UniProt-based splitting preserves data independence but yields lower prediction accuracy. In this study, we first evaluate six distinct ML/DL models on the MdrDB database using these two data partitioning methods. Protein sequences are embedded using the ESM-2 protein large language model, integrating wild-type and mutant features. Although all models show high predictive correlations (Pearson coefficients up to 0.70) under random partitioning, their performance declines under UniProt-based partitioning. To address this issue, we propose a query-anchor pairwise learning framework that uses known states as anchor points for predicting unknown query states. The proposed method is validated across three systems, revealing that even a small amount of reference data can significantly enhance prediction accuracy. This enhancement suggests that leveraging known states as anchor points allows more precise prediction of unknown query states.
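The anchor-query idea can be sketched at inference time as follows. The abstract does not specify the model architecture, so this is only one plausible reading: a pairwise model predicts the difference between a query and an anchor, and each anchor's known value plus that predicted difference gives one estimate of the query value. The function names, the linear `linear_delta` stand-in for a trained model, and the two-dimensional toy features (in place of ESM-2 embeddings) are all assumptions.

```python
from statistics import mean


def predict_from_anchors(query_features, anchors, delta_model):
    """Average (anchor label + predicted query-minus-anchor difference)."""
    estimates = [
        anchor_label + delta_model(query_features, anchor_features)
        for anchor_features, anchor_label in anchors
    ]
    return mean(estimates)


def linear_delta(query_features, anchor_features, weights=(1.0, -0.5)):
    """Hypothetical stand-in for a trained pairwise model:
    a linear function of the feature difference."""
    return sum(
        w * (q - a)
        for w, q, a in zip(weights, query_features, anchor_features)
    )


# Toy anchors with made-up values: (features, known label).
anchors = [
    ((0.0, 0.0), 0.0),
    ((2.0, 0.0), 2.0),
]
query = (1.0, 1.0)

prediction = predict_from_anchors(query, anchors, linear_delta)
print(prediction)
```

In this toy setup the anchor labels are consistent with the stand-in model, so both anchors yield the same estimate for the query. With a real trained model the estimates would differ, and averaging over several anchors is one simple way to combine them.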

Keywords

Mutant proteins

Relative free energy

Protein language model

Data partitioning strategy


© 2025 The Author(s). Published by Elsevier B.V. on behalf of Research Network of Computational and Structural Biotechnology.
