Structure-to-Function Learning: Predicting Enzyme Catalytic Residue Activity Through pKa-Aware Graph Representations
Keywords:
enzyme function prediction, graph neural networks, pKa prediction, catalytic residues, structural bioinformatics, system design, computational sustainability, fairness in AI
Abstract
The accurate prediction of enzyme catalytic residue activity from structural data stands as one of the central challenges in computational biology, with profound implications for drug discovery, industrial biocatalysis, and protein engineering. While recent advances in deep learning have enabled remarkable progress in protein function prediction, most existing methods either treat atomic-level chemical environments in an overly idealized manner or fail to integrate the critical physicochemical property of ionizable residue pKa values into their representations. This paper presents a framework for structure-to-function learning that constructs pKa-aware graph representations of enzyme active sites, jointly capturing three-dimensional spatial organization, evolutionary sequence features, and protonation-state energetics. At the system level, we examine the architectural trade-offs involved in encoding pKa information through physically inspired feature engineering within message-passing neural networks, including the balance between precomputed pKa predictions and on-the-fly electrostatic calculations. We provide a comprehensive analysis of the infrastructure required for large-scale deployment, spanning distributed training strategies, data provenance pipelines, and model-serving latency constraints in high-throughput screening contexts. Robustness is addressed through systematic evaluation of model sensitivity to structural perturbations, conformational sampling, and pKa prediction errors propagated from upstream modules. Fairness and bias considerations are discussed with respect to the overrepresentation of certain protein families in structural databases and the implications for generalizability to understudied enzyme classes. Sustainability concerns related to the computational footprint of training graph networks on massive structural datasets are evaluated alongside emerging efficient architectures. Finally, we outline governance and policy recommendations for the responsible dissemination of predictive models that could inform biocatalyst design, highlighting intellectual property boundaries, dual-use considerations, and the need for open benchmarking standards. Through this multidisciplinary lens, the work positions pKa-aware graph learning not merely as a technical advance in bioinformatics, but as a complex socio-technical system whose design choices reverberate through scientific practice, industrial deployment, and regulatory frameworks.
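To make the idea of a pKa-aware graph representation concrete, the following is a minimal sketch and not the framework described in the abstract. It assumes one representative coordinate per residue, precomputed pKa predictions supplied by some upstream tool, a distance-cutoff graph, and a single mean-aggregation message-passing step. All names (`MODEL_PKA`, `build_graph`, `node_features`, `message_pass`), the 8 Å cutoff, the three-element feature layout, and the toy inputs are hypothetical choices made for illustration; the reference pKa values are approximate textbook figures that differ between sources.

```python
import numpy as np

# Approximate reference ("model compound") pKa values; illustrative only,
# published values differ slightly between sources.
MODEL_PKA = {"ASP": 3.7, "GLU": 4.3, "HIS": 6.5, "CYS": 8.3,
             "TYR": 10.1, "LYS": 10.5, "ARG": 12.5}


def protonated_fraction(pka, ph):
    """Henderson-Hasselbalch fraction of the protonated form at a given pH."""
    return 1.0 / (1.0 + 10.0 ** (ph - pka))


def build_graph(coords, cutoff=8.0):
    """Adjacency matrix linking residues closer than `cutoff` angstroms."""
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    adj = (dist < cutoff).astype(float)
    np.fill_diagonal(adj, 0.0)  # no self-loops
    return adj


def node_features(names, pred_pka, ph=7.0):
    """Per-residue features: [is_ionizable, pKa shift, protonated fraction]."""
    feats = []
    for name, pka in zip(names, pred_pka):
        ref = MODEL_PKA.get(name)
        if ref is None or pka is None:
            feats.append([0.0, 0.0, 0.0])  # non-ionizable residue
        else:
            feats.append([1.0, pka - ref, protonated_fraction(pka, ph)])
    return np.array(feats)


def message_pass(adj, h, w_self, w_nbr):
    """One message-passing step: mean over neighbours, linear maps, ReLU."""
    deg = adj.sum(axis=1, keepdims=True)
    nbr_mean = (adj @ h) / np.maximum(deg, 1.0)  # isolated nodes get zeros
    return np.maximum(h @ w_self + nbr_mean @ w_nbr, 0.0)


# Toy active site: three residues with made-up coordinates and pKa predictions.
coords = np.array([[0.0, 0.0, 0.0], [5.0, 0.0, 0.0], [20.0, 0.0, 0.0]])
names = ["HIS", "ASP", "GLY"]
pred_pka = [7.0, 3.0, None]  # hypothetical upstream predictions

adj = build_graph(coords)
feats = node_features(names, pred_pka, ph=7.0)

rng = np.random.default_rng(0)
w_self = rng.normal(size=(3, 4))
w_nbr = rng.normal(size=(3, 4))
out = message_pass(adj, feats, w_self, w_nbr)
print(out.shape)
```

Supplying pKa values as precomputed node features corresponds to the "precomputed pKa predictions" side of the trade-off mentioned above. Perturbing `pred_pka` and re-running the forward pass is one simple way to probe sensitivity to upstream pKa errors in a sketch like this.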
License
Copyright (c) 2026 Bioinformatics Insights and Analytics

This work is licensed under a Creative Commons Attribution 4.0 International License.



