Integrating Protein Language Models and Structural Graph Learning for Accurate Ionizable Residue pKa Estimation

Grjan Besai; Parth Tandon

Authors

Grjan Besai, Department of Computer Science, Binghamton University, Binghamton, NY, USA.
Parth Tandon, Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO, USA.

Keywords:

protein pKa prediction, protein language models, graph neural networks, ionizable residues, structural bioinformatics, deep learning infrastructure, fairness

Abstract

Accurate estimation of ionizable residue pKa values is essential for understanding protein stability, enzymatic mechanism, and molecular recognition, yet it remains a formidable challenge due to the complex interplay of local electrostatics, solvent exposure, and conformational dynamics. Traditional empirical and continuum electrostatic methods have served as workhorses for decades, but they often falter in highly perturbed protein interiors or at catalytic sites. Recent advances in deep learning, particularly protein language models and graph neural networks, open new avenues for data-driven pKa prediction by capturing evolutionary sequence signatures and geometric constraints. This paper presents a systems-level investigation into the integration of protein language model embeddings with structural graph learning for pKa estimation, moving beyond incremental algorithmic improvement to examine the full lifecycle of such models. We analyze the architectural trade-offs between sequence-derived embeddings and three-dimensional graph representations, the data infrastructure required to assemble and curate training corpora, and the robustness of hybrid predictors under distributional shift. We further address fairness considerations arising from imbalanced representation of protein families and taxonomic groups, and discuss the interpretability demands placed on models deployed in drug discovery pipelines. Governance frameworks for integrating predictions into experimental workflows, the sustainability of large-scale model training, and strategies for continuous deployment are examined in depth. By synthesizing cross-domain insights from computational biophysics, machine learning, and socio-technical infrastructure studies, this work proposes a blueprint for designing, evaluating, and responsibly deploying integrated pKa prediction systems.
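As a rough illustration of what combining sequence-derived embeddings with a three-dimensional graph representation can look like, the following minimal sketch builds a residue contact graph from coordinates, runs one round of neighbour averaging over per-residue embeddings, and applies a linear readout. It is not the architecture studied in this paper: the function names, the 8 Å cutoff, the mean-aggregation step, and the random stand-in embeddings and weights are all assumptions made for illustration, and the outputs are untrained and carry no physical meaning.

```python
import numpy as np

def contact_graph(coords, cutoff=8.0):
    """Adjacency matrix linking residues whose representative atoms lie
    within `cutoff` angstroms (the cutoff value is an illustrative assumption)."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    adj = (dist < cutoff).astype(float)
    np.fill_diagonal(adj, 0.0)  # no self-loops
    return adj

def message_pass(features, adj):
    """One round of mean aggregation over neighbours, concatenated with
    each residue's own features."""
    degree = adj.sum(axis=1, keepdims=True)
    # Residues with no neighbours get a zero neighbour vector.
    neighbour_mean = adj @ features / np.maximum(degree, 1.0)
    return np.concatenate([features, neighbour_mean], axis=1)

def predict_pka(embeddings, coords, weights, bias=0.0, cutoff=8.0):
    """Hypothetical hybrid predictor: sequence embeddings + structural graph."""
    adj = contact_graph(coords, cutoff)
    hidden = message_pass(embeddings, adj)
    return hidden @ weights + bias  # one value per residue

# Toy inputs: random numbers stand in for language model embeddings,
# residue coordinates, and (untrained) readout weights.
rng = np.random.default_rng(0)
n_res, emb_dim = 5, 4
embeddings = rng.normal(size=(n_res, emb_dim))
coords = rng.uniform(0.0, 15.0, size=(n_res, 3))
weights = rng.normal(size=2 * emb_dim)

predictions = predict_pka(embeddings, coords, weights)
print(predictions.shape)
```

In a real system the embeddings would come from a pretrained protein language model and the readout would be trained against experimental pKa measurements; the sketch only shows where the two information sources meet.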

