Referral Notes:
- LLMs could be used to help clinicians respond to patient messages, potentially reducing workload.
- In a pilot study, one of the first of its kind to evaluate LLM performance in thyroid disease, LLM responses to patient questions were more accurate, complete, empathetic, and consistent than online physician responses.
- Before LLM-generated responses are used in clinical care, clinicians would be expected to review them, reducing the risk that patients receive inaccurate information.
Burnout among physicians is being fueled partly by patient electronic messaging. Recognizing this problem, a team of experts from NYU Langone Health came together to study whether large language models (LLMs), powered by artificial intelligence (AI), could be used to assist in responding to patient questions and reduce the clinician workload—focusing specifically on thyroid disease.
At the 2024 American Association of Endocrine Surgeons (AAES) meeting, Kepal N. Patel, MD, and Insoo Suh, MD, presented data supporting this idea. LLM responses to patient questions about thyroid disease were not only accurate; they were more accurate, complete, empathetic, and consistent than physician responses.
“After ten hours of surgery, the last thing I want is to have to manually respond to 35 electronic messages,” says Dr. Patel, chief of the Division of Endocrine Surgery at NYU Langone. “As a busy surgeon, my goal is to be efficient with my administrative workload so I can focus on my patients and provide the best care.”
Endocrine surgery fellow Rajam Raghunathan, MD, served as lead author on the study and was awarded the Jon Van Heerden Award for Best Paper at the 2024 AAES meeting. The study is one of the first to evaluate LLM responses in thyroid disease.
Testing the LLMs
The researchers randomly selected 33 patient questions about thyroid cancer or thyroid disease from the Reddit website’s AskDocs section, an online health community where users can post questions about personal medical conditions and receive answers from verified physicians.
The researchers then tested the ability of two different LLMs, ChatGPT-3.5 and GPT-4, to respond to these questions, comparing the LLM-generated responses with the verified physician responses from AskDocs. All responses were anonymized, and seven blinded providers scored them for accuracy, quality, and empathy on a four-point Likert scale.
Of the selected questions, 39 percent focused on thyroid nodules, 30 percent concerned thyroid cancer or cancer worry, 24 percent addressed hypo/hyperthyroidism or goiter concerns, and 6 percent involved postoperative complications related to thyroid surgery.
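The reported percentages add up to 99 rather than 100, which is what rounding would produce. As a rough check, the short Python sketch below converts each percentage back into a whole number of questions out of 33. The resulting counts are an inference from the rounded figures, not numbers published by the study, and the names `reported_percent` and `implied_count` are illustrative only.

```python
# Percentages as reported in the article. The counts derived below are an
# inference from rounding, not figures published by the study.
TOTAL_QUESTIONS = 33

reported_percent = {
    "thyroid nodules": 39,
    "thyroid cancer or cancer worry": 30,
    "hypo/hyperthyroidism or goiter": 24,
    "postoperative complications": 6,
}


def implied_count(percent: int, total: int) -> int:
    """Return the nearest whole number of questions for a rounded percentage."""
    return round(percent / 100 * total)


implied_counts = {
    topic: implied_count(percent, TOTAL_QUESTIONS)
    for topic, percent in reported_percent.items()
}

for topic, count in implied_counts.items():
    share = 100 * count / TOTAL_QUESTIONS
    print(f"{topic}: {count} questions ({share:.1f}%)")

print("total questions:", sum(implied_counts.values()))
print("sum of reported percentages:", sum(reported_percent.values()))
```

If the implied counts add up to the 33 selected questions, the one-point shortfall in the percentages is explained by rounding alone.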
Analyzing the Responses
The team noted that differences in quality ratings should be interpreted in light of how each response is organized. LLM responses often include a bulleted summary that addresses the concerns raised by the patient and then structure the rest of the response from there.
In a sample electronic exchange about a high-risk TR5 thyroid nodule found on an ultrasound, a patient inquired about cancer risk. In this case, the online physician responded by encouraging the patient to get a biopsy for evaluation and potential treatment. However, the response lacked comprehensiveness, empathy, and synchronicity.
Verified-Physician Response |
---|
A TR5 nodule is considered relatively high-risk nodule and the risk of cancer is higher in these nodules. Either way, the nodule is small and they are recommending biopsy. I’d encourage you to get the biopsy done to further evaluate as while there’s an association, biopsy remains better at predicting actual risk in many cases. Depending on the pathology, they may recommend either half the thyroid be taken out, or the entire thing. In someone who is hyperthyroid there may be benefit to total thyroidectomy. |
In contrast, the LLM responses were more comprehensive, empathetic, and synchronous, evidenced by their length, style, and structure.
ChatGPT-3.5 |
---|
A TI-RADS 5 (TR5) nodule, especially one that is taller than wide and solid, can be concerning, but it doesn’t automatically mean it’s cancerous. The risk of malignancy is higher with TR5 nodules compared to lower TI-RADS categories. However, many TR5 nodules still turn out to be benign. The best way to determine whether the nodule is cancerous or not is through a fine-needle aspiration (FNA) biopsy. This procedure involves taking a small sample of tissue from the nodule and analyzing it for cancerous cells. The FNA results will provide more accurate information about the nodule’s nature. It’s essential to follow your healthcare provider’s recommendations and proceed with the FNA biopsy to get a definitive diagnosis. Thyroid cancer is generally treatable, especially when detected early, so timely evaluation and appropriate management are crucial. Your healthcare team will guide you through the necessary steps based on the biopsy results. |
GPT-4 |
---|
A TR5 (TIRADS 5) classification on thyroid ultrasound indicates a high suspicion for malignancy, but it’s important to remember that not all TR5 nodules are cancerous. The classification is based on certain ultrasound characteristics, like being hypoechoic and taller-than-wide, which are more commonly associated with thyroid cancer. However, the definitive diagnosis of cancer can only be made through a biopsy. In your case, the presence of hyperthyroidism symptoms and a significant weight loss, along with the ultrasound findings, warrant a thorough evaluation. The recommended fine-needle aspiration (FNA) biopsy of the hypoechoic nodule will provide more definitive information. FNA is a key tool in determining the nature of thyroid nodules, especially those with suspicious features. It’s understandable to be concerned about the possibility of cancer, but keep in mind that even in TR5 nodules, there are instances where the biopsy results come back benign. The outcome of the biopsy will guide the next steps in terms of treatment and management. Stay in close communication with your healthcare provider to discuss the biopsy results and the appropriate course of action based on those findings. Your medical team will consider your overall health, symptoms, and the biopsy results to provide the best care plan for you. |
Some experts have raised concerns about LLMs providing dangerous and false information in response to patient questions. In this study, however, 13 percent of online physician responses were flagged as containing dangerous or false information, compared with only 1.3 percent of ChatGPT-3.5 responses and 0.4 percent of GPT-4 responses.
“The results were striking,” says Dr. Suh, associate vice chair of surgical innovation and principal investigator of the study. “While LLMs or AI are not at a stage where they can replace us, they have a lot to offer in helping us perform better.”
Next Steps
Going forward, the researchers will prioritize using LLMs to interact with patients seeking more information on thyroid nodules, thyroid cancer, and postoperative care instructions. Across all metrics, LLMs performed best when answering questions on these topics, Dr. Patel explains.
“We see this technology playing a critical role in helping physicians respond to patients promptly and effectively,” says Dr. Patel. “The surgeon will still review the responses before they are shared with the patient, ensuring there is minimal concern that the LLM is providing false or inaccurate information.”
Before broader implementation, the real-world safety, efficiency, and accessibility of LLMs need to be established. Future studies will test the ability of LLMs to appropriately triage and escalate patient safety concerns, as well as fact-check information and references.