When it comes to generating accurate ICU discharge summaries, not all large language models (LLMs) are the same, according to new research.
AI offers an opportunity to optimise ICU care delivery; algorithms that predict patient deterioration and mortality have already gained traction.
A study conducted by the Department of Anaesthesia and Intensive Care Medicine at University Hospital Galway, along with the School of Computer Science at the University of Galway, investigated the ability of ChatGPT, GPT-4 API and Llama2 to generate accurate and concise ICU discharge summaries.
The study, published in the latest issue of the journal Intensive Care Medicine Experimental, found considerable variability between the three models in accuracy, comprehensiveness and hallucinations.
Anonymised clinical notes generated by nurses, doctors and pharmacists during consecutive ICU admissions were used as input for the LLMs. The notes consisted of unstructured text containing clinical terminology and abbreviations.
In the development phase, text from five ICU episodes was used to develop a series of prompts that best elicited clinical summaries. In the testing phase, summaries produced by each LLM from a further six ICU episodes were evaluated.
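The article does not reproduce the study's prompts, but the workflow it describes, feeding unstructured notes to a model with a summarisation instruction, might look roughly like the following sketch using the OpenAI Python client. The prompt wording, function name and model string here are illustrative assumptions, not the study's actual materials.

```python
# Illustrative sketch only: the study's actual prompts and model versions are
# not published in this article. Assumes the OpenAI Python client (openai>=1.0).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def summarise_icu_episode(clinical_notes: str) -> str:
    """Ask the model for a concise discharge summary of one ICU episode."""
    response = client.chat.completions.create(
        model="gpt-4",  # the study's GPT-4 API arm; exact version unknown
        messages=[
            {"role": "system",
             "content": "You are an ICU clinician writing a discharge summary."},
            {"role": "user",
             "content": "Summarise the following ICU notes into a concise, "
                        "chronologically ordered discharge summary:\n\n"
                        + clinical_notes},
        ],
        temperature=0,  # favour reproducible output for evaluation
    )
    return response.choices[0].message.content
```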
Each summary was scored on the inclusion of a pre-defined list of relevant clinical events. Readability, organisation, succinctness and accuracy were assessed on a five-point scale, with one being the best and five being the worst.
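In the study, inclusion of each pre-defined clinical event was judged by trained intensivists. Purely to illustrate the metric itself, a crude automated version of the event-recall check could look like the sketch below; the event checklist and substring matching are invented for illustration.

```python
# Crude illustration of the event-recall metric. In the study, event inclusion
# was judged by intensivists, not by string matching; this sketch only shows
# how the percentage of captured events is computed.

def event_recall(summary: str, expected_events: list[str]) -> float:
    """Fraction of pre-defined clinical events mentioned in the summary."""
    text = summary.lower()
    found = [event for event in expected_events if event.lower() in text]
    return len(found) / len(expected_events)

# Hypothetical event checklist for one episode (invented for illustration).
events = ["intubation", "vasopressor support", "acute kidney injury",
          "ventilator-associated pneumonia", "tracheostomy"]

recall = event_recall("Patient required intubation and vasopressor support...",
                      events)
print(f"{recall:.0%} of pre-defined events captured")  # prints: 40%
```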
GPT-4 API far exceeded ChatGPT and Llama2 in its ability to recall the pre-defined list of important clinical events. Appropriate sequencing of facts was also best for GPT-4 API, followed by ChatGPT and then Llama2. GPT-4 API scored significantly better for organisation and succinctness, and slightly better for readability and accuracy, than ChatGPT. Llama2 had the worst scores on all parameters.
Llama2 produced generic and repetitive summaries that did not capture all clinical events. GPT-4 API and ChatGPT were noted to have good readability but omitted clinical events.
Hallucinations were noted in GPT-4 API summaries only.
Overall, none of the LLMs could identify more than 40 per cent of events considered by trained intensivists to be important, with major differences between open-source and commercially available LLM providers.
According to the authors, this research demonstrates the varying capabilities of LLMs in handling complex medical data and highlights the challenges in achieving optimal accuracy and coherence in automated discharge summaries.
The findings show that further work is needed on the safety and comprehensiveness of LLM-generated summaries before they are incorporated into clinical practice, particularly in ensuring correct documentation of all clinically meaningful events.