Multimodal Fusion of Sequence, Structure, and Electrostatic Features for Protein Ionization State Modeling

Mikkel Graves; Arthur Lindberg; Lingtian Jia

Authors

Mikkel Graves, Department of Computer Science and Engineering, University of Nevada, Reno, Reno, NV, USA.
Arthur Lindberg, School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR, USA.
Lingtian Jia, Department of Computer Science, University of New Hampshire, Durham, NH, USA.

Keywords:

multimodal fusion, protein ionization, electrostatic features, graph neural networks, deep learning deployment, computational infrastructure, fairness in bioinformatics, model governance

Abstract

Accurate modeling of protein ionization states is a foundational challenge in computational biophysics, with direct implications for understanding enzyme catalysis, pH-dependent protein stability, and rational drug design. Traditional physics-based methods, while grounded in rigorous continuum electrostatics, often struggle with accuracy and scalability when applied across the growing corpus of protein structures. Recent advances in deep learning offer new pathways, yet the heterogeneity of relevant data sources (amino acid sequence, three-dimensional structure, and local electrostatic environments) demands integrative architectures capable of true multimodal reasoning. This paper presents a systems-level analysis of multimodal fusion frameworks for protein ionization state prediction, emphasizing architectural trade-offs, infrastructure demands, and governance challenges that emerge when these models are developed and deployed at scale. We examine how sequence-derived embeddings, graph-based structural representations, and grid-based electrostatic potentials can be combined through early, intermediate, and attention-based late fusion regimes, each carrying distinct computational and performance characteristics. The discussion extends beyond algorithmic design to encompass the end-to-end pipeline: data provenance, high-performance computing requirements, containerized deployment, and the carbon footprint of large-scale training. Robustness and fairness considerations are given particular attention, as imbalances in protein structure databases can propagate systemic biases into predictions, with consequences for understudied organisms and rare disease targets. Finally, the paper addresses governance, reproducibility standards, and the responsible stewardship of modeling capabilities that may influence molecular design decisions. By synthesizing technical architecture with sociotechnical infrastructure, we argue that multimodal fusion for protein ionization state modeling must be conceptualized not as a narrow prediction task but as a complex systems engineering endeavor, requiring interdisciplinary coordination across machine learning, computational chemistry, and science policy.
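
The three fusion regimes named in the abstract can be illustrated with a toy sketch. The abstract does not specify any implementation, so everything below is an assumption made for illustration: the feature values, the dimensions, the untrained random weights, and the helper names (`linear`, `init`, `early_fusion`, `intermediate_fusion`, `attention_late_fusion`) are all hypothetical. The sketch only shows where the modalities are merged in each regime, not how the paper's models are built.

```python
import math
import random

def linear(x, w, b):
    """Dense layer: w holds one row of weights per output unit."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bj for row, bj in zip(w, b)]

def init(n_out, n_in, rng):
    """Random weights and zero biases for a toy, untrained layer."""
    w = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return w, [0.0] * n_out

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def early_fusion(feats, head):
    # Concatenate the raw features of all modalities, then apply one shared head.
    x = [v for f in feats for v in f]
    return linear(x, *head)[0]

def intermediate_fusion(feats, encoders, head):
    # Encode each modality separately, then concatenate the hidden vectors.
    hidden = []
    for f, enc in zip(feats, encoders):
        hidden.extend(math.tanh(h) for h in linear(f, *enc))
    return linear(hidden, *head)[0]

def attention_late_fusion(feats, predictors, scorers):
    # Each modality makes its own prediction; softmax weights combine them.
    preds = [linear(f, *p)[0] for f, p in zip(feats, predictors)]
    weights = softmax([linear(f, *s)[0] for f, s in zip(feats, scorers)])
    return sum(w * p for w, p in zip(weights, preds)), weights

rng = random.Random(0)
# Hypothetical per-residue features: a sequence embedding, a structural
# descriptor, and samples of the local electrostatic potential.
seq, struct, elec = [0.2, -0.1, 0.4, 0.0], [1.0, 0.3, -0.7], [0.05, -0.2]
feats = [seq, struct, elec]
dims = [len(f) for f in feats]
hidden_dim = 3

early_out = early_fusion(feats, init(1, sum(dims), rng))
inter_out = intermediate_fusion(
    feats,
    [init(hidden_dim, d, rng) for d in dims],
    init(1, hidden_dim * len(feats), rng),
)
late_out, late_weights = attention_late_fusion(
    feats,
    [init(1, d, rng) for d in dims],
    [init(1, d, rng) for d in dims],
)
print(early_out, inter_out, late_out, late_weights)
```

In this sketch, early fusion needs a single head sized to the combined input, intermediate fusion adds one encoder per modality, and late fusion keeps a full predictor per modality and exposes the per-modality weights, which is one reason the regimes may differ in compute cost and interpretability.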

