Abstract
Introduction: This study aimed to evaluate the accuracy, completeness, precision, and readability of outputs generated by three large language models (LLMs), ChatGPT (OpenAI), BARD (Google), and Bing (Microsoft), in comparison with patient education material on pelvic organ prolapse (POP) provided by the Royal College of Obstetricians and Gynecologists (RCOG). Methods: A total of 15 questions were retrieved from the RCOG website and input into the three LLMs. Two independent reviewers evaluated the outputs for accuracy, completeness, and precision. Readability was assessed using the Simple Measure of Gobbledygook (SMOG) score and the Flesch-Kincaid Grade Level (FKGL) score. Results: Accuracy did not differ significantly across the models, but significant differences were observed in completeness and precision. ChatGPT ranked highest in completeness (66.7% of responses rated comprehensive), whereas Bing led in precision (100%). In terms of readability, ChatGPT's outputs were more difficult to read than those of BARD, Bing, and the original RCOG answers. Conclusion: While all models displayed a variable degree of correctness in answering RCOG questions on patient information for POP, ChatGPT produced the most comprehensive answers, significantly surpassing BARD and Bing, although these answers were harder to read. Bing provided the most precise and concise answers. These findings highlight the potential of LLMs in health information dissemination and the need for careful interpretation of their outputs.
Highlights of the Study
This study compares the performance of three large language models (LLMs) in generating patient education content.
It analyzes the concordance of LLM outputs with an authoritative patient information text in gynecology and obstetrics.
This establishes a new framework for evaluating AI-generated medical content.
The study also introduces a readability evaluation of patient information generated by artificial intelligence in women’s health.
Introduction
Pelvic organ prolapse (POP) is a common manifestation of a weakened support system in the pelvic floor, which comprises a network of muscles, ligaments, and nerves [1]. Epidemiological data suggest that almost one in ten women undergo surgical operations for POP or urinary incontinence by the age of 80 [2]. Although POP is quite common, not all cases require intervention. Treatment options range from conservative approaches to various surgical procedures. The decision to pursue treatment depends on factors such as individual patient needs, the level of functional impairment experienced, and how it affects their overall quality of life [3].
Considering the potential impact of surgical choices on the well-being of patients, it is crucial to prioritize comprehensive patient education and informed decision-making [4]. While online sources may provide access to useful medical information, there is a corresponding risk of encountering biased or inaccurate content [5].
A significant recent development is the introduction of GPT by OpenAI, one of the large language models (LLMs) [6]. LLMs have been transformative and have found applications in various fields, including the dissemination of health information. These models excel at generating relevant responses in natural language, making the content understandable and user-friendly. However, their outputs must be approached with caution: because the models predict likely words or phrases, concerns are sometimes raised about the accuracy of the information they provide [7]. In 2022, OpenAI released GPT-3.5, a substantial advance over earlier models such as BERT and GPT-2 in language understanding and generation [6, 8]. With improved linguistic clarity and expanded text-generation capabilities, GPT-3.5 set new standards. LLMs hold significant potential to innovate the management and use of patient information. For instance, they could transform medical care, including cancer treatment, by serving as “virtual assistants” for patients and doctors. Moreover, LLMs such as ChatGPT have been evaluated for their ability to generate new content and respond to inquiries in oncology, demonstrating their applicability in enhancing patient and caregiver communication [9]. Nonetheless, this evolving technology should be approached with an understanding of its current developmental stage and with the requisite prudence in clinical application.
ChatGPT, particularly in its GPT-4 version, is recognized as a leader in this field [10]. While numerous investigations have explored GPT’s contributions to patient education [11‒13], the capabilities and impact of other LLMs, including BARD and Bing, remain less documented in the research landscape [14]. Although Bing is built on the GPT-4 model, it applies the model differently, customizing it to return concise, summarized answers suited to a search engine.
Table 1. Accuracy and completeness of the outputs generated by the LLMs

| | GPT-4 | BARD | Bing | p value¥ |
| --- | --- | --- | --- | --- |
| RCOG questions (all), n (%) | | | | |
| Accuracy (n = 15) | | | | 0.127 |
| Correct | 14 (93.30) | 9 (60) | 12 (80) | |
| Nearly all correct | 1 (6.70) | 5 (33.30) | 1 (6.70) | |
| More correct than incorrect | 0 | 1 (6.70) | 2 (13.30) | |
| Completeness (n = 15) | | | | <0.001 |
| Comprehensive | 10 (66.70)a | 2 (13.30)b | 3 (20)b | |
| Adequate | 5 (33.30)a | 13 (86.70)b | 8 (53.30)a,b | |
| Incomplete | 0a | 0a | 4 (26.70)a | |
| RCOG questions (definition), n (%) | | | | |
| Accuracy (n = 4) | | | | 0.273 |
| Correct | 4 (100) | 3 (75) | 2 (50) | |
| Nearly all correct | 0 | 1 (25) | 0 | |
| More correct than incorrect | 0 | 0 | 2 (50) | |
| Completeness (n = 4) | | | | 0.709 |
| Comprehensive | 1 (25) | 0 | 1 (25) | |
| Adequate | 3 (75) | 4 (100) | 2 (50) | |
| Incomplete | 0 | 0 | 1 (25) | |
| RCOG questions (diagnostic), n (%) | | | | |
| Accuracy (n = 3) | | | | 0.679 |
| Correct | 3 (100) | 1 (33.30) | 2 (66.70) | |
| Nearly all correct | 0 | 2 (66.70) | 1 (33.30) | |
| Completeness (n = 3) | | | | 0.679 |
| Comprehensive | 2 (66.70) | 0 | 1 (33.30) | |
| Adequate | 1 (33.30) | 3 (100) | 2 (66.70) | |
| RCOG questions (treatment), n (%) | | | | |
| Accuracy (n = 8) | | | | 0.273 |
| Correct | 7 (87.50) | 5 (62.50) | 8 (100) | |
| Nearly all correct | 1 (12.50) | 2 (25) | 0 | |
| More correct than incorrect | 0 | 1 (12.50) | 0 | |
| Completeness (n = 8) | | | | 0.004 |
| Comprehensive | 7 (87.50)a | 2 (25)b | 1 (12.50)b | |
| Adequate | 1 (12.50)a | 6 (75)b | 4 (50)a,b | |
| Incomplete | 0a | 0a | 3 (37.50)a | |
Each superscript letter (a, b and c) denotes a subset of LLM categories whose column proportions do not differ significantly from each other at the 5% level.
¥Fisher-Freeman-Halton test.
Table 2. Precision ratios of the outputs generated by the LLMs

| | GPT-4 | BARD | Bing | p value¥ |
| --- | --- | --- | --- | --- |
| RCOG questions (all) | | | | |
| Precise (n = 15) | 10 (66.70)a | 6 (40)a | 15 (100)b | <0.001 |
| RCOG questions (definition) | | | | |
| Precise (n = 4) | 0a | 0a | 4 (100)b | 0.006 |
| RCOG questions (diagnostic) | | | | |
| Precise (n = 3) | 3 (100) | 1 (33.30) | 3 (100) | 0.250 |
| RCOG questions (treatment) | | | | |
| Precise (n = 8) | 7 (87.50) | 5 (62.50) | 8 (100) | 0.273 |
Results are given as n (%). Each superscript letter (a, b and c) denotes a subset of LLM categories whose column proportions do not differ significantly from each other at the 5% level.
¥Fisher-Freeman-Halton test.
Table 3. Readability scores of the outputs generated by the LLMs and the RCOG reference guide

| | SMOG score | FKG score |
| --- | --- | --- |
| RCOG questions (all) | | |
| GPT-4 (n = 15) | 13.02±1.11 | 13.99±1.25 |
| BARD (n = 15) | 9.86±1.34 | 9.83±1.20 |
| Bing (n = 15) | 9.20±2.84 | 10.25±2.88 |
| RCOG reference (n = 15) | 10.14±2.70 | 10.66±3.28 |
| p valueŦ | <0.001 | <0.001 |
| RCOG questions (definition) | | |
| GPT-4 (n = 4) | 13.05 (11.40–14.60) | 14.20 (13.30–16.50) |
| BARD (n = 4) | 9.30 (8.30–10.80) | 9.75 (8.40–11.10) |
| Bing (n = 4) | 6.85 (5.50–8) | 8.65 (6.10–10.40) |
| RCOG reference (n = 4) | 10.05 (8.90–15.80) | 10.15 (9.30–18.40) |
| p valueƔ | 0.010 | 0.042 |
| RCOG questions (diagnostic) | | |
| GPT-4 (n = 3) | 12.20 (11.90–13.20) | 13.30 (12.90–14.90) |
| BARD (n = 3) | 8.70 (8.20–9.40) | 9.70 (8.30–10.30) |
| Bing (n = 3) | 7.60 (7.20–10.10) | 8.90 (7.90–11.10) |
| RCOG reference (n = 3) | 8.30 (7.20–10.40) | 9.70 (7.80–11.40) |
| p value | NA | NA |
| RCOG questions (treatment) | | |
| GPT-4 (n = 8) | 13.24±1.18 | 13.83±1.28 |
| BARD (n = 8) | 10.50±1.32 | 10.03±1.34 |
| Bing (n = 8) | 10.74±2.92 | 11.51±3.20 |
| RCOG reference (n = 8) | 10.18±2.85 | 10.38±3.33 |
| p valueƔ | 0.039 | 0.021 |
Data are presented as mean ± standard deviation or median (minimum–maximum) values.
SMOG, Simple Measure of Gobbledygook; FKG, Flesch-Kincaid Grade Level.
NA: comparisons between the LLMs and the RCOG reference could not be performed because of an insufficient number of questions.
ŦANOVA test.
ƔKruskal-Wallis test.
Several studies have been conducted on patient information using LLMs within various areas of medicine, yielding varying degrees of accuracy and quality [11‒13]. A study evaluating the efficacy of ChatGPT in providing patient education on thyroid nodules found that 69.2% of the responses to 120 questions were correct and that prompting significantly influenced the provision of correct responses and references [12]. Another study, which explored the use of ChatGPT in creating patient education materials for implant-based breast reconstruction, concluded that expert-generated content was more readable and accurate than that produced by ChatGPT [11]. A comparison of the readability of patient education materials from the American Society of Ophthalmic Plastic and Reconstructive Surgery with those generated by the AI chatbots ChatGPT and Google Bard revealed that, while ChatGPT’s unprompted patient education materials were less readable, both AI models significantly improved readability when prompted to produce materials at a 6th-grade level [13].
Within the specialty of obstetrics and gynecology, studies have highlighted the potential of LLMs, especially ChatGPT, in aiding diagnosis and clinical management. Grunebaum et al. [15] evaluated the ability of ChatGPT to handle clinically related queries and found its performance promising, although the model could not truly grasp the complexity of human language and conversation. A cross-sectional study assessed the diagnostic and therapeutic accuracy of ChatGPT: 30 cases spanning six different areas were evaluated by both a gynecologist and the ChatGPT model. ChatGPT accurately diagnosed and managed 27 of the 30 cases, demonstrating high precision and offering coherent and informed responses even when the diagnosis was incorrect [16].
To the best of our knowledge, no studies have evaluated the performance of any LLM in the context of patient education for POP. Our study aimed to evaluate the accuracy, completeness, and precision of outputs generated by different LLMs: GPT (GPT-4) by OpenAI, BARD by Google, and Bing by Microsoft. We also aimed to compare these LLM outputs with patient education material on POP provided by the Royal College of Obstetricians and Gynecologists (RCOG) and to compare the readability of the generated outputs and RCOG patient information.
Methods
Data Collection
A total of 15 questions were retrieved from https://www.rcog.org.uk/for-the-public/browse-our-patient-information/pelvic-organ-prolapse-patient-information-leaflet/. The questions were classified as “definition” (questions 1–4), “diagnostic” (questions 5–7), and “treatment” (questions 8–15) and were input into the three LLMs (ChatGPT by OpenAI, BARD by Google, and Bing by Microsoft) in August 2023.
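For illustration only, the grouping above can be represented as a small data structure; the indices below simply mirror the classification described in this section and carry no information beyond it.

```python
# Hypothetical, 1-based grouping of the 15 RCOG leaflet questions by category.
CATEGORIES = {
    "definition": list(range(1, 5)),   # questions 1-4
    "diagnostic": list(range(5, 8)),   # questions 5-7
    "treatment":  list(range(8, 16)),  # questions 8-15
}

for name, questions in CATEGORIES.items():
    print(f"{name}: n = {len(questions)}")  # prints 4, 3, and 8
```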
Review Process
Two independent reviewers evaluated the outputs from each LLM for accuracy, completeness, and precision. The first reviewer is a gynecologist with extensive experience in pelvic health, and the second is a urologist specializing in female pelvic medicine and reconstructive surgery; both are based in a tertiary referral center. Following their assessments, the reviewers convened to reach a consensus on any discrepancies. To evaluate the consistency between the reviewers’ scores, the intraclass correlation coefficient was calculated, yielding a value of 0.92, which indicates excellent reliability [17].
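An intraclass correlation of this kind can be reproduced with standard statistical software. As a minimal sketch only (assuming the open-source pingouin package and using entirely hypothetical ratings, not the study data), agreement between two raters scoring the same questions could be computed as follows.

```python
import pandas as pd
import pingouin as pg  # third-party package assumed here purely for illustration

# Hypothetical long-format ratings: one row per question per reviewer,
# scored on the 1-6 accuracy scale of Johnson et al.
ratings = pd.DataFrame({
    "question": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "reviewer": ["A", "B"] * 5,
    "score":    [6, 6, 5, 6, 6, 5, 4, 4, 6, 6],
})

icc = pg.intraclass_corr(data=ratings, targets="question",
                         raters="reviewer", ratings="score")
# The ICC2 row corresponds to a two-way random-effects, absolute-agreement model.
print(icc[["Type", "ICC", "CI95%"]])
```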
Evaluation Metrics
Accuracy
Correctness of the outputs was gauged using a scale proposed by Johnson et al. [18] as follows: 1 = completely incorrect; 2 = more incorrect than correct; 3 = approximately equal parts correct and incorrect; 4 = more correct than incorrect; 5 = nearly all correct; 6 = correct.
Completeness
The depth and thoroughness of the outputs were measured using the following criteria [18]: incomplete = addresses some aspects of the question, but significant parts are missing or incomplete; adequate = addresses all aspects of the question and provides the minimum amount of information required to be considered complete; comprehensive = addresses all aspects of the question and provides additional information or context beyond what was expected.
Precision
The precision evaluation was based on a binary criterion (“yes” or “no”): an output generated by an LLM was deemed precise only if it exclusively provided relevant answers without any superfluous content. Responses including additional information beyond the direct answer were classified as imprecise.
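As an illustrative sketch only (not part of the study protocol), the three rubrics above can be captured in a single rating record per question and model; the field names and example values below are hypothetical.

```python
from dataclasses import dataclass

# The 1-6 accuracy scale of Johnson et al., as described above.
ACCURACY_SCALE = {
    1: "completely incorrect",
    2: "more incorrect than correct",
    3: "approximately equal parts correct and incorrect",
    4: "more correct than incorrect",
    5: "nearly all correct",
    6: "correct",
}

@dataclass
class Rating:
    question_id: int
    model: str           # "GPT-4", "BARD", or "Bing"
    accuracy: int        # 1-6, per the scale above
    completeness: str    # "incomplete", "adequate", or "comprehensive"
    precise: bool        # True only if no superfluous content was included

example = Rating(question_id=8, model="Bing", accuracy=6,
                 completeness="adequate", precise=True)
print(ACCURACY_SCALE[example.accuracy], example.precise)
```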
Readability Assessment
The readability of each LLM output and of the corresponding RCOG answer was assessed using the Simple Measure of Gobbledygook (SMOG) score and the Flesch-Kincaid Grade Level (FKG) score, both of which estimate the school grade level required to understand a text.
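The grade-level formulas behind these two scores are standard and can be computed directly. The sketch below applies the published SMOG and Flesch-Kincaid Grade Level formulas with a crude vowel-group syllable counter, so its estimates will differ slightly from those of the dedicated readability tools presumably used in practice; the example text is illustrative only.

```python
import math
import re

def count_syllables(word: str) -> int:
    """Very rough syllable count based on vowel groups (heuristic, not a dictionary)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(text: str) -> dict:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    polysyllables = sum(1 for w in words if count_syllables(w) >= 3)

    # Flesch-Kincaid Grade Level
    fkgl = 0.39 * (len(words) / len(sentences)) + 11.8 * (syllables / len(words)) - 15.59
    # SMOG grade (the published formula assumes a sample of about 30 sentences)
    smog = 1.0430 * math.sqrt(polysyllables * (30 / len(sentences))) + 3.1291
    return {"FKGL": round(fkgl, 2), "SMOG": round(smog, 2)}

print(readability("Pelvic organ prolapse happens when the pelvic floor weakens. "
                  "Symptoms often improve with pelvic floor exercises."))
```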
Statistical Analysis
Comparisons among the LLMs for accuracy, completeness, and precision were performed using the Fisher-Freeman-Halton test, with Bonferroni-corrected p values applied in the subgroup analyses conducted after overall significance was obtained. The normality of the readability scores was examined with the Shapiro-Wilk test. Normally distributed readability scores are expressed as mean ± standard deviation and were compared using one-way ANOVA, with the Bonferroni test used for the subsequent subgroup analyses. Readability scores that did not follow a normal distribution are expressed as median (minimum–maximum) values and were compared using the Kruskal-Wallis test, with subgroup analyses after overall significance performed using the Dunn-Bonferroni test. Analyses were conducted using SPSS Statistics for Windows, version 23.0 (IBM Corp., Armonk, NY, USA), and the type I error rate was set at 5% for all statistical comparisons.
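The analysis itself was run in SPSS. Purely as an illustrative sketch of the decision flow described above (normality check, then a parametric or non-parametric group comparison), the same logic could look as follows in Python with SciPy, using hypothetical score arrays rather than the study data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical SMOG scores for the four sources (the real values are in Table 3).
scores = {
    "GPT-4": rng.normal(13.0, 1.1, 15),
    "BARD":  rng.normal(9.9, 1.3, 15),
    "Bing":  rng.normal(9.2, 2.8, 15),
    "RCOG":  rng.normal(10.1, 2.7, 15),
}

# Shapiro-Wilk normality check for every group
all_normal = all(stats.shapiro(values).pvalue > 0.05 for values in scores.values())

if all_normal:
    statistic, p = stats.f_oneway(*scores.values())   # one-way ANOVA
    test = "one-way ANOVA"
else:
    statistic, p = stats.kruskal(*scores.values())    # non-parametric alternative
    test = "Kruskal-Wallis"

print(f"{test}: statistic = {statistic:.3f}, p = {p:.4f}")
```

Post hoc pairwise comparisons with Bonferroni or Dunn-Bonferroni correction would follow only after an overall significant result, mirroring the workflow described above.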
Results
The results of the analysis, which evaluate the outputs generated by LLMs according to the metrics of accuracy and completeness, are presented in Table 1, while the precision ratios and related analysis results are shown in Table 2. In the context of the RCOG questions, the LLM models showed nonsignificant variations in accuracy (p = 0.127). Significant differences were observed in completeness and precision metrics. Regarding completeness, ChatGPT ranked highest with 66.7%, while Bing scored 20% and BARD scored 13.3% (p < 0.001). For precision, Bing achieved a perfect score of 100%, outperforming ChatGPT and BARD, which scored 66.7% and 40%, respectively (p < 0.001).
For RCOG questions of “definition,” the models exhibited nonsignificant differences in accuracy and completeness (p = 0.273 and p = 0.709, respectively). However, Bing significantly led in precision with a score of 100%, while both ChatGPT and BARD scored 0% (p = 0.006). Regarding RCOG questions of “diagnosis,” no significant variations were noted across all metrics (p = 0.679 for both accuracy and completeness and p = 0.250 for precision). In assessing RCOG questions of “treatment,” nonsignificant differences were observed in accuracy and precision (p = 0.273 for both). Nonetheless, ChatGPT excelled in completeness with a score of 87.5%, significantly surpassing BARD’s 25% and Bing’s 12.5% (p = 0.004).
Readability scores of the outputs generated by the LLMs and the RCOG reference guide are provided in Table 3. Across all RCOG questions, there were significant differences in readability scores (p < 0.001). ChatGPT had the lowest readability, scoring 13.02 ± 1.11 for SMOG and 13.99 ± 1.25 for FKG. BARD scored 9.86 ± 1.34 for SMOG and 9.83 ± 1.20 for FKG, while Bing scored 9.20 ± 2.84 and 10.25 ± 2.88, respectively. The RCOG reference guide provided a baseline of 10.14 ± 2.70 for SMOG and 10.66 ± 3.28 for FKG. Subgroup analyses indicated that ChatGPT’s scores were significantly higher, indicating lower readability, than those of BARD (p = 0.001), Bing (p < 0.001), and the RCOG reference (p = 0.003).
For RCOG questions of “definition,” significant differences were found in readability scores (p = 0.010 for SMOG and p = 0.042 for FKG). ChatGPT had median scores of 13.05 and 14.20 for SMOG and FKG, respectively; BARD scored 9.30 and 9.75, and Bing scored 6.85 and 8.65. The RCOG reference in this subset had median scores of 10.05 and 10.15. Subgroup analyses showed that ChatGPT’s SMOG and FKG scores were significantly higher than Bing’s, with no other significant pairwise differences. Due to the limited number of RCOG questions of “diagnosis,” this analysis was inconclusive. For RCOG questions of “treatment,” although the overall test for SMOG scores reached significance (p = 0.039), subgroup analyses revealed no significant pairwise differences in SMOG scores among the LLMs and the RCOG reference; ChatGPT scored 13.24 ± 1.18, BARD 10.50 ± 1.32, Bing 10.74 ± 2.92, and the RCOG reference 10.18 ± 2.85. A significant overall difference was also found in FKG scores among the LLMs and the RCOG reference: ChatGPT had a mean of 13.83 ± 1.28, BARD 10.03 ± 1.34, Bing 11.51 ± 3.20, and the RCOG reference 10.38 ± 3.33. In subgroup analyses, the content produced by ChatGPT was less readable than that produced by BARD (p = 0.016), with no other significant pairwise differences in FKG scores.
Discussion
We evaluated the performance of several LLMs in answering a set of RCOG questions on patient information for POP. The results can help patients and clinicians estimate the quality of outputs from various AI tools. Regarding accuracy, there was no significant difference among the LLMs, either for all questions combined or for any individual category, although ChatGPT’s rate was the highest. The proportion of correct answers generated by ChatGPT ranged between 87.50% and 100% across all questions and categories. In contrast, the accuracy of BARD varied from 33.30% to 75%, while that of Bing ranged from 50% to 100%. These data underscore the varying levels of accuracy among the models in our study. Overall, the three models displayed a variable degree of correctness without being consistently incorrect.
In the context of patient information, several studies have reported conflicting results regarding the reliability of ChatGPT’s accuracy in various medical disciplines. One study assessed the accuracy and consistency of ChatGPT in responding to common questions about pediatric urology, using criteria from established guidelines and evaluations by expert urologists [21]. It showed high accuracy rates of 92–93.6%, with no completely incorrect responses, indicating that ChatGPT could play a significant role in healthcare, particularly in pediatric urology [21]. On the other hand, another study found that the performance of ChatGPT was suboptimal for patient information on prostate cancer [22]. In a study assessing the accuracy of four LLMs (Claude, BARD, ChatGPT-4, and Bing) in the domain of medical consultation and patient education regarding urolithiasis, Claude demonstrated the highest accuracy, with most answers scoring 4 points or higher on a 5-point Likert scale, and ChatGPT-4 ranked second [23]. BARD’s responses exhibited the lowest accuracy, with most questions scoring 3 points or less; specific accuracy rates for Bing were not reported. That study also evaluated the completeness of the outputs and found a significantly higher level of completeness in the responses provided by Claude and ChatGPT compared with BARD. Another study evaluating LLMs for ocular symptoms found that the responses generated by ChatGPT-3.5, ChatGPT-4.0, and Google BARD were equally comprehensive [24]. In the present study, the completeness rates of ChatGPT were significantly higher than those of the other models for all questions and for the treatment subcategory. For all questions, 66.70% of the responses generated by ChatGPT were evaluated as comprehensive, compared with 13.30% for BARD and 20% for Bing. In the treatment subcategory, 87.50% of ChatGPT’s responses were evaluated as comprehensive, compared with 25% for BARD and 12.50% for Bing.
Another important aspect of the output is its precision, referring to the direct relevance of the information provided in response to the query, without superfluous content. The existing literature, while extensive on the completeness of LLMs, offers little insight into their precision. In the present study, the precision of Bing’s outputs was 100%, compared with 66.7% for GPT-4 and 40% for BARD across all questions; the differences were significant for all questions overall and for the “definition” subcategory. In addition to the accuracy, completeness, and precision of LLM-generated output, readability plays an important role in ensuring effective communication of medical information to patients. Lim et al. [25] evaluated the effectiveness of LLMs in aiding patients with hand trauma nerve injuries and found that ChatGPT achieved the highest FKGL scores; no significant difference in FKGL scores was found between Bing and ChatGPT, whereas significant differences were observed between BARD and Bing and between BARD and ChatGPT [25]. Xie et al. [26] assessed the efficacy of three distinct LLMs as clinical decision-support tools for physicians and reported that ChatGPT exhibited the highest readability scores. Our study used the SMOG and FKG metrics to assess readability. Compared with BARD, Bing, and the original RCOG answers, ChatGPT’s outputs were more difficult to read. This pattern was consistent across all question types, particularly in the “treatment” section, where ChatGPT displayed greater reading difficulty than BARD, Bing, and the primary RCOG reference. We observed no significant differences in readability between the other LLMs and the primary RCOG reference, and in the “definition” section only Bing differed significantly from ChatGPT, producing more readable answers.
The present study contributes a distinctive insight into gynecology and obstetrics as it evaluated and compared the outputs of LLMs with a reference text. It sets a precedent by rigorously analyzing the adherence of the generated contents to the established standard of patient education materials. Importantly, this study also introduces a novel assessment of the readability of these AI-generated materials, providing a benchmark for the clarity and comprehension of patient-directed information. The implications of these novel insights are substantial, signaling a potential shift in how medical professionals can leverage AI to enhance patient education and engagement in women’s health.
This study has a few limitations. One limitation is that the number of questions used to evaluate the LLMs was relatively small, which may limit the generalizability of our results.
Moreover, the subcategorization of these questions, while intended to provide detailed insights, may not have lent itself to robust statistical analysis because of the small sample within each category. Another limitation is the absence of validated tools specifically designed to evaluate the accuracy of AI-generated content for patient education. Recognizing this challenge, we adapted our methodology in line with the protocols of Johnson et al. [18]. This limitation may affect the reproducibility of our results. A further limitation is the source of the patient questions and concerns, which were derived primarily from standard medical literature and guidelines. While this provided a solid foundation for assessing the AI’s capabilities, frequently asked questions from large healthcare institutions or patient queries from social media support groups were not included. This decision may have restricted our insight into how well AI handles the more diverse, novel, or nontraditional patient inquiries that are increasingly prevalent in digital forums. LLMs are also dynamic in nature, meaning their capabilities can change significantly within short periods of time, which could make our observations less relevant over time. Finally, our analysis emphasized accuracy, completeness, precision, and readability; crucial aspects such as cultural sensitivity and patient-centricity were not addressed. Future research should evaluate the cultural sensitivity and patient-centric nature of LLM outputs. Understanding how these models cater to diverse populations and ensuring accurate and patient-focused information will be pivotal in the broader application of LLMs in medical practice.
Conclusions
This study contributes to the literature by evaluating the performance of various LLMs in providing patient education on POP, a topic that has not been extensively explored in the existing LLM literature. We observed that while all models displayed a variable degree of correctness, ChatGPT excelled in completeness, significantly surpassing BARD and Bing. However, Bing led in precision, providing the most relevant and concise answers. Regarding readability, ChatGPT’s outputs were more difficult to read than those of BARD, Bing, and the original RCOG answers, suggesting that while ChatGPT may provide more comprehensive answers, these answers may be harder for patients to understand. These findings underscore the potential of LLMs as tools for disseminating health information but also highlight the need for caution in interpreting their outputs, given the variability in accuracy, completeness, and readability. As LLMs continue to evolve and improve, further research is needed to assess their utility and reliability in the medical field.
Statement of Ethics
This study did not require ethical approval per institutional and national guidelines as it involved evaluating publicly available information and did not include human or animal subjects, and it did not fall within the purview of ethics review.
Conflict of Interest Statement
The authors have no relevant financial or nonfinancial interests to disclose.
Funding Sources
None.
Author Contributions
Sakine Rahimli Ocakoglu: conceptualization, investigation, validation, writing – original draft, writing – review and editing, data collection, results, discussion, and literature review. Burhan Coskun: data curation, formal analysis, investigation, methodology, validation, writing – original draft, and writing – review and editing.
Data Availability Statement
Data are available upon request.