Improving Biomedical Knowledge Graph Construction through Large Language Model Driven Literature Mining and Semantic Relationship Extraction

Jeremy Langston; Patrick Pennington; Gordon Pennington; Tristan Whitaker

Authors

Jeremy Langston, Department of Computer Science, University of Nevada, Reno
Patrick Pennington, School of Biomedical Informatics, The University of Texas Health Science Center at Houston
Gordon Pennington, Department of Information Systems, University of Arkansas at Little Rock
Tristan Whitaker, Department of Health Data Science, Saint Louis University

Keywords:

Biomedical knowledge graphs, large language models, literature mining, semantic relationship extraction, biomedical informatics, transformer architectures, ontology integration, artificial intelligence infrastructure, scientific text mining, translational medicine

Abstract

Biomedical knowledge graphs have emerged as foundational infrastructure for organizing heterogeneous biomedical information across genomics, clinical medicine, pharmacology, public health, and translational research. However, the rapid expansion of scientific literature, fragmented biomedical ontologies, inconsistent terminologies, and continuously evolving semantic relationships have constrained the scalability and reliability of conventional approaches to biomedical knowledge graph construction. Traditional rule-based extraction pipelines and manually curated semantic integration frameworks frequently struggle to maintain semantic consistency, contextual precision, and scalability as data volumes grow. Recent advances in large language models open new possibilities for literature mining, semantic reasoning, entity alignment, and contextual relationship extraction across large-scale biomedical corpora. This paper examines how large language model-driven literature mining and semantic relationship extraction are changing the architecture of biomedical knowledge graph construction. The study analyzes the integration of transformer-based language architectures into biomedical information pipelines, focusing on system interoperability, semantic robustness, governance infrastructures, computational sustainability, and deployment challenges across distributed biomedical environments. Particular attention is devoted to the interaction between biomedical ontologies, language model reasoning capabilities, domain adaptation mechanisms, and human-in-the-loop validation systems. The paper further evaluates structural trade-offs between automation and interpretability, between centralized and federated infrastructures, and between generative inference and symbolic biomedical reasoning.
Through a systems-oriented analysis, the study demonstrates that large language model-enhanced biomedical knowledge graph ecosystems can significantly improve semantic coverage, contextual accuracy, and cross-domain integration, while also introducing new governance, bias, reproducibility, and infrastructural risks. The paper concludes by proposing a framework for sustainable, trustworthy, and policy-aware biomedical knowledge graph infrastructures capable of supporting next-generation precision medicine and translational biomedical discovery.
