Introduction: This study aimed to evaluate the accuracy, completeness, precision, and readability of outputs generated by three large language models (LLMs): GPT by OpenAI, BARD by Google, and Bing by Microsoft, in comparison with patient education material on pelvic organ prolapse (POP) provided by the Royal College of Obstetricians and Gynecologists (RCOG). Methods: A total of 15 questions were retrieved from the RCOG website and input into the three LLMs. Two independent reviewers evaluated the outputs for accuracy, completeness, and precision. Readability was assessed using the Simple Measure of Gobbledygook (SMOG) score and the Flesch-Kincaid Grade Level (FKGL) score. Results: Significant differences were observed in the completeness and precision metrics. ChatGPT ranked highest in completeness (66.7%), while Bing led in precision (100%). No significant differences in accuracy were observed across the models. In terms of readability, ChatGPT’s outputs were more difficult to read than those of BARD, Bing, and the original RCOG answers. Conclusion: While all models displayed a variable degree of correctness in answering RCOG questions on patient information for POP, ChatGPT excelled in completeness, significantly surpassing BARD and Bing, although its answers were the hardest to read; Bing led in precision, providing the most relevant and concise answers. These findings highlight the potential of LLMs in health information dissemination and the need for careful interpretation of their outputs.

Highlights of the Study

  • Studies have been performed to compare the content production performance of large language models (LLMs).

  • The concordance of LLM outputs with authoritative gynecology and obstetrics patient information texts was analyzed.

  • This establishes a new framework for evaluating the medical content of artificial intelligence.

  • This study also introduces a readability evaluation of patient information generated by artificial intelligence in women’s health.

Pelvic organ prolapse (POP) is a common manifestation of a weakened support system in the pelvic floor, which comprises a network of muscles, ligaments, and nerves [1]. Epidemiological data suggest that almost one in ten women undergo surgical operations for POP or urinary incontinence by the age of 80 [2]. Although POP is quite common, not all cases require intervention. Treatment options range from conservative approaches to various surgical procedures. The decision to pursue treatment depends on factors such as individual patient needs, the level of functional impairment experienced, and how it affects their overall quality of life [3].

Considering the potential impact of surgical choices on the well-being of patients, it is crucial to prioritize comprehensive patient education and informed decision-making [4]. While online sources may provide access to useful medical information, there is a corresponding risk of encountering biased or inaccurate content [5].

A significant recent development is the introduction of GPT by OpenAI, one of the large language models (LLMs) [6]. These LLMs have been transformative and have found applications in various fields, including the dissemination of health information. They excel at generating relevant responses in natural language, making the content understandable and user-friendly. However, it is crucial to approach their outputs with caution, as they are based on models that predict likely words or phrases, which sometimes raises concerns about the accuracy of the information provided [7]. In 2022, OpenAI introduced GPT-3.5, a substantial advance in LLM development [6]. This model improved on earlier ones such as BERT and GPT-2 in both language understanding and generation [8]. With improved linguistic clarity and expanded text generation, GPT-3.5 has set new standards. LLMs hold significant potential to innovate the management and use of patient information. For instance, they could transform medical care, including cancer treatment, by serving as “virtual assistants” for patients and doctors. Moreover, LLMs such as ChatGPT have been evaluated for their capabilities in generating new content and responding to inquiries in oncology, showcasing their applicability in enhancing patient and caregiver communication [9]. However, it is essential to approach this evolving technology with an understanding of its current developmental stage and with the requisite prudence in its application within clinical settings.

ChatGPT, particularly in its GPT-4 version, is recognized as a leader in this field [10]. While numerous investigations have explored GPT’s contributions to patient education [11‒13], the capabilities and impact of other LLMs, including BARD and Bing, remain less documented in the research landscape [14]. Although Bing is built on the GPT-4 model, it applies the model differently, customizing it to return concise, summary-style answers suited to a search engine.

Table 1.

Accuracy and completeness of the outputs generated by LLMs

Rating | GPT-4 | BARD | Bing | p value¥
RCOG questions (all), n (%)
 Accuracy (n = 15) | | | | 0.127
  Correct | 14 (93.30) | 9 (60) | 12 (80) |
  Nearly all correct | 1 (6.70) | 5 (33.30) | 1 (6.70) |
  More correct than incorrect | – | 1 (3.70) | 2 (13.30) |
 Completeness (n = 15) | | | | <0.001
  Comprehensive | 10 (66.70)a | 2 (13.30)b | 3 (20)b |
  Adequate | 5 (33.30)a | 13 (86.70)b | 8 (53.30)a,b |
  Incomplete | 0a | 0a | 4 (26.70)a |
RCOG questions (definition), n (%)
 Accuracy (n = 4) | | | | 0.273
  Correct | 4 (100) | 3 (75) | 2 (50) |
  Nearly all correct | – | 1 (25) | – |
  More correct than incorrect | – | – | 2 (50) |
 Completeness (n = 4) | | | | 0.709
  Comprehensive | 1 (25) | – | 1 (25) |
  Adequate | 3 (75) | 4 (100) | 2 (50) |
  Incomplete | – | – | 1 (25) |
RCOG questions (diagnostic), n (%)
 Accuracy (n = 3) | | | | 0.679
  Correct | 3 (100) | 1 (33.30) | 2 (66.70) |
  Nearly all correct | – | 2 (66.70) | 1 (33.30) |
 Completeness (n = 3) | | | | 0.679
  Comprehensive | 2 (66.70) | – | 1 (33.30) |
  Adequate | 1 (33.30) | 3 (100) | 2 (66.70) |
RCOG questions (treatment), n (%)
 Accuracy (n = 8) | | | | 0.273
  Correct | 7 (87.50) | 5 (62.50) | 8 (100) |
  Nearly all correct | 1 (12.50) | 2 (25) | – |
  More correct than incorrect | – | 1 (12.50) | – |
 Completeness (n = 8) | | | | 0.004
  Comprehensive | 7 (87.50)a | 2 (25)b | 1 (12.50)b |
  Adequate | 1 (12.50)a | 6 (75)b | 4 (50)a,b |
  Incomplete | 0a | 0a | 3 (37.50)a |

Each superscript letter (a, b and c) denotes a subset of LLM categories whose column proportions do not differ significantly from each other at the 5% level.

¥Fisher-Freeman-Halton test.

Table 2.

Precision ratios of the outputs generated by LLMs

Category | GPT-4 | BARD | Bing | p value¥
RCOG questions (all)
 Precise (n = 15) | 10 (66.70)a | 6 (40)a | 15 (100)b | <0.001
RCOG questions (definition)
 Precise (n = 4) | 0a | 0a | 4 (100)b | 0.006
RCOG questions (diagnostic)
 Precise (n = 3) | 3 (100) | 1 (33.30) | 3 (100) | 0.250
RCOG questions (treatment)
 Precise (n = 8) | 7 (87.50) | 5 (62.50) | 8 (100) | 0.273

Results are given as n (%). Each superscript letter (a, b and c) denotes a subset of LLM categories whose column proportions do not differ significantly from each other at the 5% level.

¥Fisher-Freeman-Halton test.

Table 3.

Readability scores of the outputs generated by LLMs and the RCOG reference guide

Group | SMOG score | FKG score
RCOG questions (all)
 GPT-4 (n = 15) | 13.02±1.11 | 13.99±1.25
 BARD (n = 15) | 9.86±1.34 | 9.83±1.20
 Bing (n = 15) | 9.20±2.84 | 10.25±2.88
 RCOG reference (n = 15) | 10.14±2.70 | 10.66±3.28
 p valueŦ | <0.001 | <0.001
RCOG questions (definition)
 GPT-4 (n = 4) | 13.05 (11.40–14.60) | 14.20 (13.30–16.50)
 BARD (n = 4) | 9.30 (8.30–10.80) | 9.75 (8.40–11.10)
 Bing (n = 4) | 6.85 (5.50–8) | 8.65 (6.10–10.40)
 RCOG reference (n = 4) | 10.05 (8.90–15.80) | 10.15 (9.30–18.40)
 p valueƔ | 0.010 | 0.042
RCOG questions (diagnostic)
 GPT-4 (n = 3) | 12.20 (11.90–13.20) | 13.30 (12.90–14.90)
 BARD (n = 3) | 8.70 (8.20–9.40) | 9.70 (8.30–10.30)
 Bing (n = 3) | 7.60 (7.20–10.10) | 8.90 (7.90–11.10)
 RCOG reference (n = 3) | 8.30 (7.20–10.40) | 9.70 (7.80–11.40)
 p value | NA | NA
RCOG questions (treatment)
 GPT-4 (n = 8) | 13.24±1.18 | 13.83±1.28
 BARD (n = 8) | 10.50±1.32 | 10.03±1.34
 Bing (n = 8) | 10.74±2.92 | 11.51±3.20
 RCOG reference (n = 8) | 10.18±2.85 | 10.38±3.33
 p valueƔ | 0.039 | 0.021

Data were represented as mean ± standard deviation or median (minimum-maximum) values.

SMOG, Simple Measure of Gobbledygook; FKG, Flesch-Kincaid Grade Level.

NA, comparisons between the LLMs and the RCOG reference could not be performed due to an insufficient number of questions.

ŦANOVA test.

ƔKruskal-Wallis test.

Several studies have examined patient information generated by LLMs across various areas of medicine, with varying degrees of accuracy and quality [11‒13]. A study evaluating the efficacy of ChatGPT in providing patient education on thyroid nodules found that 69.2% of responses to 120 questions were correct, and prompting significantly influenced the provision of correct responses and references [12]. Another study, which explored the use of ChatGPT in creating patient education materials for implant-based breast reconstruction, concluded that expert-generated content was more readable and accurate than that produced by ChatGPT [11]. A comparison of the readability of patient education materials from the American Society of Ophthalmic Plastic and Reconstructive Surgery with those generated by the AI chatbots ChatGPT and Google Bard revealed that while ChatGPT’s unprompted patient education materials were less readable, both AI models significantly improved readability when prompted to produce materials at a 6th-grade level [13].

Within the specialty of obstetrics and gynecology, studies have highlighted the potential of LLMs, especially ChatGPT, in aiding diagnosis and clinical management. Grunebaum et al. [15] evaluated the ability of ChatGPT to handle clinically related queries and found that its performance was promising, although the model could not truly grasp the complexity of human language and conversation. A cross-sectional study assessed the diagnostic and therapeutic accuracy of ChatGPT; 30 cases spanning six different areas were evaluated both by a gynecologist and by the ChatGPT model. ChatGPT accurately diagnosed and managed 27 of the 30 cases. The model demonstrated high precision and offered coherent and informed responses, even when the diagnosis was incorrect [16].

To the best of our knowledge, no studies have evaluated the performance of any LLM in the context of patient education for POP. Our study aimed to evaluate the accuracy, completeness, and precision of outputs generated by different LLMs: GPT (GPT-4) by OpenAI, BARD by Google, and Bing by Microsoft. We also aimed to compare these LLM outputs with patient education material on POP provided by the Royal College of Obstetricians and Gynecologists (RCOG) and to compare the readability of the generated outputs and RCOG patient information.

Data Collection

A total of 15 questions were retrieved from https://www.rcog.org.uk/for-the-public/browse-our-patient-information/pelvic-organ-prolapse-patient-information-leaflet/. The questions were classified as “definition” (questions 1–4), “diagnostic” (questions 5–7), and “treatment” (questions 8–15) and were input into three LLMs (GPT by OpenAI, BARD by Google, and Bing by Microsoft) in August 2023.
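For reference, the sketch below shows one way the question set can be organized for downstream tabulation in Python; the category boundaries follow the grouping described above, while the variable and function names are illustrative only and not part of the study protocol.

```python
# Illustrative organization of the 15 RCOG leaflet questions by category.
# Question texts are not reproduced here; the indices follow the grouping
# described above (1-4 definition, 5-7 diagnostic, 8-15 treatment).

RCOG_QUESTION_CATEGORIES = {
    "definition": [1, 2, 3, 4],
    "diagnostic": [5, 6, 7],
    "treatment": [8, 9, 10, 11, 12, 13, 14, 15],
}

LLMS = ["GPT-4", "BARD", "Bing"]

def category_of(question_id: int) -> str:
    """Return the category label for a given question number."""
    for category, ids in RCOG_QUESTION_CATEGORIES.items():
        if question_id in ids:
            return category
    raise ValueError(f"Unknown question id: {question_id}")
```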

Review Process

Two independent reviewers evaluated the outputs from each LLM for accuracy, completeness, and precision. The first reviewer is a gynecologist with extensive experience in pelvic health, and the second reviewer is a urologist specializing in female pelvic medicine and reconstructive surgery. Both reviewers are based in a tertiary referral center. After their assessments, the reviewers convened to reach a consensus on any discrepancies. To evaluate the consistency between the reviewers’ scores, the intraclass correlation coefficient was calculated, resulting in a value of 0.92, which indicates excellent reliability [17].
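A minimal sketch of such an inter-rater reliability check, assuming the two reviewers’ ratings are available in long format; it uses the pingouin package as one possible tool, and the column names and scores shown are illustrative rather than study data.

```python
# Sketch: intraclass correlation between two reviewers' ratings (illustrative data).
import pandas as pd
import pingouin as pg

ratings = pd.DataFrame({
    "item":     [1, 1, 2, 2, 3, 3],        # question/output being rated
    "reviewer": ["R1", "R2"] * 3,          # the two independent reviewers
    "score":    [6, 6, 5, 6, 4, 4],        # e.g., 1-6 accuracy ratings
})

icc = pg.intraclass_corr(data=ratings, targets="item",
                         raters="reviewer", ratings="score")
# Inspect the two-way random-effects, absolute-agreement estimate (ICC2).
print(icc.loc[icc["Type"] == "ICC2", ["Type", "ICC", "CI95%"]])
```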

Evaluation Metrics

Accuracy

Correctness of the outputs was gauged using a scale proposed by Johnson et al. [18] as follows: 1 = completely incorrect; 2 = more incorrect than correct; 3 = approximately equal parts correct and incorrect; 4 = more correct than incorrect; 5 = nearly all correct; 6 = correct.

Completeness

The depth and thoroughness of the outputs were measured using the following criteria [18]: incomplete = addresses some aspects of the question, but significant parts are missing or incomplete; adequate = addresses all aspects of the question and provides the minimum amount of information required to be considered complete; comprehensive = addresses all aspects of the question and provides additional information or context beyond what was expected.

Precision

The precision evaluation was based on a binary criterion (“yes” or “no”). An output generated by an LLM was deemed precise only if it exclusively provided relevant answers without any superfluous content; responses including additional information beyond the direct answer were classified as imprecise.
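The three rubrics above can be summarized programmatically as follows; this is an illustrative encoding of the criteria, not an instrument used in the study, and the label strings are paraphrased.

```python
# Compact encoding of the three scoring rubrics used for the reviews.

ACCURACY_SCALE = {                   # 6-point scale of Johnson et al. [18]
    1: "completely incorrect",
    2: "more incorrect than correct",
    3: "approximately equal parts correct and incorrect",
    4: "more correct than incorrect",
    5: "nearly all correct",
    6: "correct",
}

COMPLETENESS_SCALE = {
    "incomplete": "addresses some aspects; significant parts missing",
    "adequate": "addresses all aspects with the minimum required information",
    "comprehensive": "addresses all aspects and adds context beyond expectations",
}

PRECISION_CRITERION = {              # binary yes/no judgement
    True: "precise: only directly relevant content, no superfluous material",
    False: "imprecise: includes information beyond the direct answer",
}
```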

Readability Assessment

The understandability of the outputs was quantified using the Simple Measure of Gobbledygook (SMOG) score [19] and the Flesch-Kincaid Grade Level (FKGL) score [20].
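For illustration, a minimal sketch of the two grade-level formulas, assuming a naive vowel-group heuristic for syllable counting; dedicated readability calculators (or packages such as textstat) handle syllabification and sentence splitting more rigorously and should be preferred when reproducing the reported scores.

```python
# Sketch: SMOG and Flesch-Kincaid Grade Level with a rough syllable heuristic.
import re

def count_syllables(word: str) -> int:
    """Rough syllable count: number of vowel groups, with a minimum of 1."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(text: str) -> tuple[float, float]:
    """Return (SMOG grade, Flesch-Kincaid Grade Level) for a block of text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = [count_syllables(w) for w in words]
    polysyllables = sum(1 for s in syllables if s >= 3)
    n_sent, n_words = max(1, len(sentences)), max(1, len(words))

    # SMOG grade (McLaughlin, 1969): based on polysyllabic words per 30 sentences.
    smog = 1.0430 * (polysyllables * (30 / n_sent)) ** 0.5 + 3.1291
    # Flesch-Kincaid Grade Level: words per sentence and syllables per word.
    fkgl = 0.39 * (n_words / n_sent) + 11.8 * (sum(syllables) / n_words) - 15.59
    return smog, fkgl

print(readability("Pelvic organ prolapse occurs when the pelvic organs descend."))
```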

Statistical Analysis

Comparisons between the LLM models with respect to accuracy, completeness, and precision were performed using the Fisher-Freeman-Halton test, and Bonferroni-corrected p values were used in the subgroup analyses performed after overall significance was obtained. The conformity of the readability scores to a normal distribution was examined with the Shapiro-Wilk test. Normally distributed readability scores were expressed as mean and standard deviation and compared using one-way ANOVA, with the Bonferroni test used for the subsequent subgroup analyses. Readability scores that did not conform to a normal distribution were described with median, minimum, and maximum values and compared using the Kruskal-Wallis test; subgroup analyses conducted after overall significance was obtained were performed with the Dunn-Bonferroni test. Analyses were conducted using SPSS (IBM SPSS Statistics for Windows, Version 23.0; IBM Corp., Armonk, NY, USA; released 2015), and the type I error rate was set at 5% for all statistical comparisons.
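As an illustration, the continuous readability-score comparisons described above could be reproduced along the following lines in Python with SciPy; the scores shown are placeholders, and the Fisher-Freeman-Halton exact test and the Bonferroni-corrected post hoc comparisons performed in SPSS are not reproduced here.

```python
# Sketch: normality check followed by ANOVA or Kruskal-Wallis on placeholder scores.
from scipy import stats

gpt4 = [13.0, 12.5, 14.1, 13.8]   # placeholder SMOG scores per group
bard = [9.9, 10.2, 9.1, 9.6]
bing = [9.2, 8.8, 10.5, 9.7]
rcog = [10.1, 11.0, 9.4, 10.3]

# Shapiro-Wilk normality check for every group.
normal = all(stats.shapiro(g).pvalue > 0.05 for g in (gpt4, bard, bing, rcog))

if normal:
    # Parametric comparison across the four groups (one-way ANOVA).
    result = stats.f_oneway(gpt4, bard, bing, rcog)
else:
    # Non-parametric alternative (Kruskal-Wallis test).
    result = stats.kruskal(gpt4, bard, bing, rcog)

print(result.statistic, result.pvalue)
```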

The results of the analysis, which evaluate the outputs generated by LLMs according to the metrics of accuracy and completeness, are presented in Table 1, while the precision ratios and related analysis results are shown in Table 2. In the context of the RCOG questions, the LLM models showed nonsignificant variations in accuracy (p = 0.127). Significant differences were observed in completeness and precision metrics. Regarding completeness, ChatGPT ranked highest with 66.7%, while Bing scored 20% and BARD scored 13.3% (p < 0.001). For precision, Bing achieved a perfect score of 100%, outperforming ChatGPT and BARD, which scored 66.7% and 40%, respectively (p < 0.001).

For RCOG questions of “definition,” the models exhibited nonsignificant differences in accuracy and completeness (p = 0.273 and p = 0.709, respectively). However, Bing significantly led in precision with a score of 100%, while both ChatGPT and BARD scored 0% (p = 0.006). Regarding RCOG questions of “diagnosis,” no significant variations were noted across all metrics (p = 0.679 for both accuracy and completeness and p = 0.250 for precision). In assessing RCOG questions of “treatment,” nonsignificant differences were observed in accuracy and precision (p = 0.273 for both). Nonetheless, ChatGPT excelled in completeness with a score of 87.5%, significantly surpassing BARD’s 25% and Bing’s 12.5% (p = 0.004).

Readability scores of the outputs generated by the LLMs and the RCOG reference guide are provided in Table 3. Our analyses across all RCOG-derived questions showed significant differences in readability scores (p < 0.001). ChatGPT produced the least readable text, scoring 13.02 ± 1.11 for SMOG and 13.99 ± 1.25 for FKG. BARD scored 9.86 ± 1.34 for SMOG and 9.83 ± 1.20 for FKG, while Bing scored 9.20 ± 2.84 and 10.25 ± 2.88, respectively. The RCOG reference guide provided a baseline of 10.14 ± 2.70 for SMOG and 10.66 ± 3.28 for FKG. Subgroup analyses indicated that ChatGPT’s readability scores were significantly higher (i.e., its text was harder to read) than those of BARD (p = 0.001), Bing (p < 0.001), and the RCOG reference (p = 0.003).

For the RCOG questions on “definition,” significant differences were found in readability scores (p = 0.010 for SMOG and p = 0.042 for FKG). ChatGPT had median scores of 13.05 and 14.20 for SMOG and FKG, respectively; BARD scored 9.30 and 9.75, and Bing scored 6.85 and 8.65. The RCOG reference in this subset had median scores of 10.05 and 10.15. Subgroup analyses determined that ChatGPT’s SMOG and FKG scores were higher than Bing’s, while the remaining pairwise comparisons showed no differences. Due to the limited number of “diagnosis” questions, comparisons for this subset could not be performed. For the RCOG questions on “treatment,” although the initial test reached overall significance for SMOG scores (p = 0.039), subgroup analyses revealed no significant differences in SMOG scores among the LLM models and the RCOG reference; ChatGPT scored 13.24 ± 1.18, BARD 10.50 ± 1.32, Bing 10.74 ± 2.92, and the RCOG reference 10.18 ± 2.85. A significant difference was found for FKG scores among the LLMs and the RCOG reference: ChatGPT had a mean of 13.83 ± 1.28, BARD 10.03 ± 1.34, Bing 11.51 ± 3.20, and the RCOG reference 10.38 ± 3.33. In the subgroup analyses, ChatGPT’s FKG score was significantly higher (i.e., less readable) than BARD’s (p = 0.016), whereas the other FKG comparisons showed no differences.

We evaluated the performance of various LLMs in answering a set of RCOG questions on patient information for POP. The study results can help patients and clinicians estimate the quality of outputs from various AI tools. Regarding accuracy, there was no significant difference among the LLM models, either across all questions or within any category, although ChatGPT’s rate of correct answers was the highest. The proportion of correct answers generated by ChatGPT ranged between 87.50% and 100% across all questions and categories. In contrast, the accuracy of BARD varied from 33.30% to 75%, while that of Bing ranged from 50% to 100%. These data underscore the varying levels of accuracy among the LLM models in our study. We found that the three models displayed a variable degree of correctness without being consistently incorrect.

In the context of patient information, several studies have reported conflicting results regarding the reliability of ChatGPT’s accuracy across medical disciplines. One study assessed the accuracy and consistency of ChatGPT in responding to common questions about pediatric urology, using criteria from established guidelines and evaluations by expert urologists [21]. This study showed high accuracy rates of 92–93.6%, with no completely incorrect responses, indicating that ChatGPT could play a significant role in healthcare, particularly in pediatric urology [21]. On the other hand, another study found that the performance of ChatGPT was suboptimal for patient information on prostate cancer [22]. In a study assessing the accuracy of four LLMs (Claude, BARD, ChatGPT-4, and Bing) in medical consultation and patient education regarding urolithiasis, Claude demonstrated the highest accuracy, with most answers scoring 4 points or higher on a 5-point Likert scale, and ChatGPT-4 ranked second [23]. BARD’s responses exhibited the lowest accuracy, with most questions scoring 3 points or less, and specific accuracy rates for Bing were not reported. That study also evaluated the completeness of the outputs generated by these LLMs and found a significantly higher level of completeness in the responses provided by Claude and ChatGPT compared with BARD. Another study evaluating LLMs for ocular symptoms found that the responses generated by ChatGPT-3.5, ChatGPT-4.0, and Google BARD were equally comprehensive [24]. In the present study, the completeness rates of ChatGPT were significantly higher than those of the other models for all questions and for the treatment subcategory. Across all questions, 66.70% of the responses generated by ChatGPT were evaluated as comprehensive, compared with 13.30% for BARD and 20% for Bing. In the treatment subcategory, 87.50% of ChatGPT’s responses were evaluated as comprehensive, compared with 25% for BARD and 12.50% for Bing.

Another important aspect of the output is its precision, that is, the direct relevance and conciseness of the information provided in response to the query. Existing literature, while extensive on the completeness of LLMs, provides little insight into their precision. In the present study, the precision rates across all questions were 100% for Bing, 66.7% for GPT-4, and 40% for BARD, with significant differences both overall and in the “definition” subcategory. In addition to the accuracy, completeness, and precision of the output generated by LLMs, readability also plays an important role in ensuring effective communication of medical information to patients. Lim et al. [25] conducted a study to evaluate the effectiveness of LLMs in aiding patients with hand trauma nerve injuries. The researchers found that ChatGPT achieved the highest FKGL scores in terms of readability; however, no significant difference in FKGL scores was found between Bing and ChatGPT, while significant distinctions were observed between BARD and Bing and between BARD and ChatGPT [25]. In their study, Xie et al. [26] assessed the efficacy of three distinct LLMs as clinical decision-support tools for physicians and found that ChatGPT exhibited the highest readability scores. Our study used the SMOG and FKG metrics to assess readability; compared with BARD, Bing, and the original RCOG answers, ChatGPT’s outputs were more difficult to read. This pattern was consistent across all question types, particularly in the “treatment” section, where ChatGPT displayed greater difficulty in readability than BARD, Bing, and the primary RCOG reference. We observed no significant differences in readability between the other LLMs and the primary RCOG reference. In the “definition” section, only Bing differed significantly from ChatGPT, producing more readable output.

The present study contributes a distinctive insight into gynecology and obstetrics as it evaluated and compared the outputs of LLMs with a reference text. It sets a precedent by rigorously analyzing the adherence of the generated contents to the established standard of patient education materials. Importantly, this study also introduces a novel assessment of the readability of these AI-generated materials, providing a benchmark for the clarity and comprehension of patient-directed information. The implications of these novel insights are substantial, signaling a potential shift in how medical professionals can leverage AI to enhance patient education and engagement in women’s health.

This study has a few limitations. One limitation is that the number of questions used to evaluate the LLMs was relatively small, which may limit the generalizability of our results.

Moreover, the subcategorization of these questions, while intended to provide detailed insights, may not have lent itself to robust statistical analysis due to the small sample within each category. Another limitation of this study is the absence of validated tools specifically crafted to evaluate the accuracy of AI-generated content for patient education. Recognizing this challenge, we have adapted our methodology in line with the protocols of Johnson et al. [18]. This limitation may potentially impact the reproducibility of our results. Another limitation of our study is the source of patient questions and concerns, primarily derived from standard medical literature and guidelines. While this provided a solid foundation for assessing the AI’s capabilities, the inclusion of frequently asked questions from large healthcare institutions or patient queries from social media support groups was not pursued. This decision may have restricted our insight into the proficiency of AI with more diverse, novel, or nontraditional patient inquiries that are increasingly prevalent in digital forums. LLMs are dynamic in nature, which means their capabilities can change significantly within short periods of time. This could potentially make our observations less relevant over time. Our analysis primarily emphasized accuracy, completeness, and readability. However, crucial aspects such as cultural sensitivity and patient-centricity were not addressed. Future research should evaluate the cultural sensitivity and patient-centric nature of LLM outputs. Understanding how these models cater to diverse populations and ensuring accurate and patient-focused information will be pivotal in the broader application of LLMs in medical practice.

This study contributes to the literature by evaluating the performance of various LLMs in providing patient education on POP, a topic that has not been extensively explored in existing literature on LLM. We observed that while all models displayed a variable degree of correctness, ChatGPT excelled in completeness, significantly surpassing BARD and Bing. However, Bing led in precision, providing the most relevant and concise answers. Regarding readability, ChatGPT exhibited higher difficulty than BARD, Bing, and the original RCOG answers. This suggests that while ChatGPT may provide more comprehensive answers, these answers may be more challenging for patients to understand. These findings underscore the potential of LLMs as tools for disseminating health information. However, they also highlight the need for caution in interpreting their outputs due to the variability in accuracy, completeness, and readability. As LLMs continue to evolve and improve, further research is needed to assess their utility and reliability in the medical field.

This study did not require ethical approval per institutional and national guidelines as it involved evaluating publicly available information and did not include human or animal subjects, and it did not fall within the purview of ethics review.

The authors have no relevant financial or nonfinancial interests to disclose.

None.

Sakine Rahimli Ocakoglu: conceptualization, investigation, validation, writing – original draft, writing – review and editing, data collection, results, discussion, literature review, and writing. Burhan Coskun: data curation, formal analysis, investigation, methodology, validation, original draft, and writing review and editing.

Data are available upon request.

1. DeLancey JO. What’s new in the functional anatomy of pelvic organ prolapse. Curr Opin Obstet Gynecol. 2016;28(5):420–9.

2. Olsen AL, Smith VJ, Bergstrom JO, Colling JC, Clark AL. Epidemiology of surgically managed pelvic organ prolapse and urinary incontinence. Obstet Gynecol. 1997;89(4):501–6.

3. Chung SH, Kim WB. Various approaches and treatments for pelvic organ prolapse in women. J Menopausal Med. 2018;24(3):155–62.

4. Drost LE, Stegeman M, Gerritse MB, Franx A, Vos MC; SHADE-POP study group, et al. A web-based decision aid for shared decision making in pelvic organ prolapse: the SHADE-POP trial. Int Urogynecol J. 2023;34(1):79–86.

5. Suarez-Lledo V, Álvarez Gálvez J. Prevalence of health misinformation on social media: systematic review. J Med Internet Res. 2021;23(1):e17187.

6. OpenAI. Introducing ChatGPT; 2023 [cited 2023 September 9th]. Available from: https://openai.com/blog/chatgpt.

7. Doyal AS, Sender D, Nanda M, Serrano RA. ChatGPT and artificial intelligence in medical writing: concerns and ethical considerations. Cureus. 2023;15(8):e43292.

8. Thirunavukarasu AJ, Ting DSJ, Elangovan K, Gutierrez L, Tan TF, Ting DSW. Large language models in medicine. Nat Med. 2023;29(8):1930–40.

9. Iannantuono GM, Bracken-Clarke D, Floudas CS, Roselli M, Gulley JL, Karzai F. Applications of large language models in cancer care: current evidence and future perspectives. Front Oncol. 2023;13:1268915.

10. Nori H, King N, McKinney SM, Carignan D, Horvitz E. Capabilities of GPT-4 on medical challenge problems. arXiv preprint arXiv:2303.13375; 2023.

11. Hung YC, Chaker SC, Sigel M, Saad M, Slater ED. Comparison of patient education materials generated by chat generative pre-trained transformer versus experts: an innovative way to increase readability of patient education materials. Ann Plast Surg. 2023;91(4):409–12.

12. Campbell DJ, Estephan LE, Sina EM, Mastrolonardo EV, Alapati R, Amin DR, et al. Evaluating ChatGPT responses on thyroid nodules for patient education. Thyroid. 2024;34(3):371–7.

13. Eid K, Eid A, Wang D, Raiker RS, Chen S, Nguyen J. Optimizing ophthalmology patient education via ChatBot-generated materials: readability analysis of AI-generated patient education materials and the American Society of Ophthalmic Plastic and Reconstructive Surgery patient brochures. Ophthalmic Plast Reconstr Surg. 2023;10:1097.

14. Giannakopoulos K, Kavadella A, Aaqel Salim A, Stamatopoulos V, Kaklamanos EG. Evaluation of the performance of generative AI large language models ChatGPT, Google Bard, and Microsoft Bing Chat in supporting evidence-based dentistry: comparative mixed methods study. J Med Internet Res. 2023;25:e51580.

15. Grünebaum A, Chervenak J, Pollet SL, Katz A, Chervenak FA. The exciting potential for ChatGPT in obstetrics and gynecology. Am J Obstet Gynecol. 2023;228(6):696–705.

16. Allahqoli L, Ghiasvand MM, Mazidimoradi A, Salehiniya H, Alkatout I. Diagnostic and management performance of ChatGPT in obstetrics and gynecology. Gynecol Obstet Invest. 2023;88(5):310–3.

17. Koo TK, Li MY. A guideline of selecting and reporting intraclass correlation coefficients for reliability research. J Chiropr Med. 2016;15(2):155–63.

18. Johnson D, Goodman R, Patrinely J, Stone C, Zimmerman E, Donald R, et al. Assessing the accuracy and reliability of AI-generated medical responses: an evaluation of the Chat-GPT model. Res Sq. 2023:rs.3.rs-2566942.

19. Mc Laughlin GH. SMOG grading - a new readability formula. J Reading. 1969;12:639–46.

20. Friedman DB, Hoffman-Goetz L. A systematic review of readability and comprehension instruments used for print and web-based cancer information. Health Educ Behav. 2006;33(3):352–73.

21. Caglar U, Yildiz O, Meric A, Ayranci A, Gelmis M, Sarilar O, et al. Evaluating the performance of ChatGPT in answering questions related to pediatric urology. J Pediatr Urol. 2024;20(1):26.e1–26.e5.

22. Coskun B, Ocakoglu G, Yetemen M, Kaygisiz O. Can ChatGPT, an artificial intelligence language model, provide accurate and high-quality patient information on prostate cancer? Urology. 2023;180:35–58.

23. Song H, Xia Y, Luo Z, Liu H, Song Y, Zeng X, et al. Evaluating the performance of different large language models on health consultation and patient education in urolithiasis. J Med Syst. 2023;47(1):125.

24. Pushpanathan K, Lim ZW, Er Yew SM, Chen DZ, Hui’En Lin HA, Lin Goh JH, et al. Popular large language model chatbots’ accuracy, comprehensiveness, and self-awareness in answering ocular symptom queries. iScience. 2023;26(11):108163.

25. Lim B, Seth I, Bulloch G, Xie Y, Hunter-Smith DJ, Rozen WM. Evaluating the efficacy of major language models in providing guidance for hand trauma nerve laceration patients: a case study on Google’s AI BARD, Bing AI, and ChatGPT. Plast Aesthet Res. 2023;10:43.

26. Xie Y, Seth I, Hunter-Smith DJ, Rozen WM, Seifman MA. Investigating the impact of innovative AI chatbot on post-pandemic medical education and clinical assistance: a comprehensive analysis. ANZ J Surg. 2024;94(1–2):68–77.