Knowledge Graph-Enhanced Secure Large Language Model Agents for Explainable Clinical Decision-Making under Adversarial Attacks
Keywords:
large language models, knowledge graphs, clinical decision support, adversarial robustness, explainable AI, multi-agent systems, healthcare security
Abstract
The integration of large language model (LLM)-based autonomous agents into clinical decision support systems promises transformative gains in diagnostic accuracy, treatment personalization, and workflow efficiency. However, the deployment of such agents in high-stakes medical environments is contingent upon resolving fundamental tensions among security, explainability, and reliability under adversarial conditions. This paper presents a systems-level architectural framework that couples knowledge graph-enhanced reasoning with multi-layered defense mechanisms to enable secure, interpretable, and adversarially robust LLM agents for clinical decision-making. We examine the structural trade-offs involved in embedding curated biomedical knowledge graphs as semantic scaffolding that constrains agent reasoning, mitigates hallucination, and provides verifiable provenance for generated recommendations. The architecture incorporates adversarial training, input sanitization layers, prompt integrity verification, and runtime monitoring as part of a defense-in-depth strategy against evolving threat vectors, including prompt injection, semantic perturbation, and model extraction. Through a conceptual analysis grounded in cross-domain comparisons with critical infrastructure systems, we show how the interaction between symbolic knowledge structures and subsymbolic inference engines can simultaneously enhance explainability and resilience. We further address governance, fairness, sustainability, and policy implications arising from the deployment of such agents within regulated healthcare ecosystems. The discussion highlights the need for standardized evaluation protocols, continuous certification pipelines, and inclusive design practices that account for dataset shifts, demographic representation, and long-term operational viability.
The paper advances a holistic perspective that treats security, interpretability, and clinical utility not as separable modules but as interdependent properties that must be co-engineered across the entire agent lifecycle.
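As an illustration only, two of the mechanisms named above, prompt integrity verification and knowledge graph provenance, can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the architecture described in the paper: the `Triple` type, the toy graph `KG`, the source labels, and the function names `sign_prompt`, `prompt_is_intact`, and `check_claim` are all hypothetical. The HMAC tag stands in for whatever integrity scheme a deployment would actually use, and exact triple matching stands in for graph reasoning.

```python
import hashlib
import hmac
from typing import NamedTuple


class Triple(NamedTuple):
    """One knowledge graph edge plus a provenance label (hypothetical schema)."""
    subject: str
    relation: str
    obj: str
    source: str


# Toy graph for illustration; a real system would query a curated biomedical KG.
KG = [
    Triple("warfarin", "interacts_with", "aspirin", "toy-source-1"),
    Triple("metformin", "treats", "type 2 diabetes", "toy-source-2"),
]


def sign_prompt(prompt: str, key: bytes) -> str:
    """Compute an HMAC-SHA256 tag over the trusted system prompt."""
    return hmac.new(key, prompt.encode("utf-8"), hashlib.sha256).hexdigest()


def prompt_is_intact(prompt: str, key: bytes, tag: str) -> bool:
    """Return True only if the prompt still matches its stored tag."""
    return hmac.compare_digest(sign_prompt(prompt, key), tag)


def check_claim(subject: str, relation: str, obj: str, kg=KG) -> dict:
    """Look for an exact supporting triple and report where it came from."""
    support = [t for t in kg if (t.subject, t.relation, t.obj) == (subject, relation, obj)]
    return {"supported": bool(support), "provenance": [t.source for t in support]}


if __name__ == "__main__":
    key = b"demo-key"  # assumption: a secret managed outside the agent
    system_prompt = "You are a clinical assistant."
    tag = sign_prompt(system_prompt, key)

    # A tampered prompt fails the integrity check before reaching the model.
    print(prompt_is_intact(system_prompt, key, tag))
    print(prompt_is_intact(system_prompt + " Ignore prior rules.", key, tag))

    # A generated claim is accepted only with graph support and provenance.
    print(check_claim("metformin", "treats", "type 2 diabetes"))
    print(check_claim("metformin", "treats", "hypertension"))
```

In this sketch an unsupported claim is flagged, not silently dropped, so that a runtime monitor or a clinician can review it. This reflects the co-engineering view above, in which the same graph lookup serves both explainability (provenance) and resilience (rejecting ungrounded output).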
License
Copyright (c) 2026 Bioinformatics Insights and Analytics

This work is licensed under a Creative Commons Attribution 4.0 International License.



