Uncertainty-Aware Robustness Enhancement of Large Language Model Agents for High-Stakes Medical Diagnosis and Treatment Recommendation

Nils Bowers; Siddharth J. Prasad; Fedfrey Waber; Enzo Mills

Authors

Nils Bowers, Department of Computer Science and Engineering, University at Buffalo, Buffalo, NY, USA.
Siddharth J. Prasad, Department of Computer Science, Binghamton University, Binghamton, NY, USA.
Fedfrey Waber, Department of Computer Science, University of Central Florida, Orlando, FL, USA.
Enzo Mills, School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR, USA.

Keywords:

large language model agents, medical decision-making, uncertainty quantification, adversarial robustness, clinical decision support, system safety, fairness

Abstract

Large language model (LLM) agents are increasingly being proposed for clinical decision support, yet their deployment in high-stakes medical diagnosis and treatment recommendation remains fraught with unresolved uncertainty challenges. This paper presents a systems-level analysis of uncertainty-aware robustness enhancement for LLM agents operating in clinical environments. We examine the interplay between epistemic and aleatoric uncertainty sources, adversarial vulnerabilities, and the architectural trade-offs inherent in designing agents that can detect, quantify, and appropriately communicate uncertainty. A central argument advanced is that robustness cannot be attained through post-hoc calibration or prompting alone; instead, it requires deeply integrated architectural components such as Bayesian reasoning modules, retrieval-augmented evidence fusion, reinforcement learning with explicit safety constraints, and federated learning frameworks that preserve patient privacy while enabling continuous model improvement. The discussion extends to infrastructure requirements, including latency-accuracy trade-offs in real-time clinical settings, data governance, and the alignment of LLM agent behavior with evolving regulatory standards. We further address fairness considerations, highlighting how uncertainty-aware mechanisms can mitigate disparate performance across demographic subgroups by flagging low-confidence decisions for human review. Long-term sustainability and maintenance of these systems in hospital workflows are examined through the lens of model drift, concept shift, and the need for institutional oversight structures. By synthesizing insights from machine learning, medical informatics, and socio-technical systems theory, the paper offers a roadmap for building trustworthy LLM agents that do not merely generate plausible text but actively manage the limits of their own knowledge in life-critical settings.
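
To make the distinction between epistemic and aleatoric uncertainty, and the idea of flagging low-confidence decisions for human review, more concrete, the following is a minimal illustrative sketch and not the authors' implementation. It assumes that several probability vectors over candidate diagnoses are available for the same case (for example from ensemble members or repeated sampling) and applies the standard entropy decomposition: total predictive entropy equals the average per-member entropy (aleatoric) plus the disagreement between members (epistemic). The function names and the thresholds `epistemic_max` and `total_max` are hypothetical and would need clinical validation.

```python
import math
from typing import List, Sequence, Tuple


def entropy(p: Sequence[float]) -> float:
    """Shannon entropy in nats; zero-probability terms contribute 0."""
    return -sum(x * math.log(x) for x in p if x > 0.0)


def decompose_uncertainty(
    member_probs: List[Sequence[float]],
) -> Tuple[float, float, float]:
    """Return (total, aleatoric, epistemic) uncertainty in nats.

    member_probs holds one probability vector over diagnoses per
    ensemble member or sampled response (an assumption of this sketch).
    """
    n = len(member_probs)
    k = len(member_probs[0])
    # Average the members' predictions class by class.
    mean = [sum(p[i] for p in member_probs) / n for i in range(k)]
    total = entropy(mean)                                  # entropy of the averaged prediction
    aleatoric = sum(entropy(p) for p in member_probs) / n  # mean per-member entropy
    epistemic = total - aleatoric                          # disagreement between members
    return total, aleatoric, epistemic


def needs_human_review(
    member_probs: List[Sequence[float]],
    epistemic_max: float = 0.1,  # hypothetical threshold, not from the paper
    total_max: float = 0.6,      # hypothetical threshold, not from the paper
) -> bool:
    """Flag a case when members disagree or the overall prediction is diffuse."""
    total, _, epistemic = decompose_uncertainty(member_probs)
    return epistemic > epistemic_max or total > total_max


if __name__ == "__main__":
    agreeing = [[0.9, 0.1], [0.9, 0.1]]
    disagreeing = [[0.95, 0.05], [0.05, 0.95]]
    print(needs_human_review(agreeing))
    print(needs_human_review(disagreeing))
```

In this sketch, two members that agree on a confident diagnosis yield near-zero epistemic uncertainty and are not flagged, while two members that are individually confident but contradict each other yield high epistemic uncertainty and are routed to a clinician.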
