Abstract
Objectives: The objective of this study was to examine the potential of artificial intelligence (AI) in gynecologic oncology decision-making. Design: A feasibility study was conducted. Participants: Fictitious case vignettes of patients with gynecologic carcinomas were used. Setting: The setting was simulated; no real patients were involved. Methods: Fictitious case vignettes of gynecologic carcinomas were created and evaluated by physicians with varying levels of professional experience, as well as by the language models ChatGPT 4.0, Google Gemini, and Bing Copilot. Treatment approval decisions were based on standardized clinical and laboratory criteria. Results: Two cases of breast cancer and one case each of ovarian, cervical, and endometrial cancer were evaluated. All three language models were able to evaluate all clinical cases and make therapy-relevant suggestions, with ChatGPT providing the clearest and most concise recommendations, which were fully consistent with the physician assessments in three cases. Limitations: This study was limited to a feasibility study based on five fictitious case vignettes. Conclusions: The study demonstrates that AI models, such as ChatGPT, can to some extent evaluate clinical cases, recognize clinical and/or laboratory abnormalities, and make therapy-related suggestions. Despite high overall agreement, differences were predominantly noted in the more complex cases, rendering human interpretation necessary. The findings underscore the benefits of AI in terms of clarity, time efficiency, and cost-effectiveness. Future research should further explore the application of AI to real patient data and the development of hybrid decision models to optimize integration into clinical practice.
Introduction
In the context of gynecologic oncology, the integration of artificial intelligence (AI) in the multidisciplinary approach can contribute to efficiency and accuracy. AI can assist in the analysis of tumor markers, imaging studies, and other diagnostic tests by recognizing patterns faster and more accurately [1]. This enables a faster and more accurate assessment of the patient situation, which is particularly beneficial in times of high workload [2]. By automating and optimizing these processes, physicians can spend more time on direct patient care while ensuring that decisions regarding treatment initiation remain based on rigorous clinical- and laboratory-based criteria [3].
In addition, newer therapies, such as immunotherapy, pose new challenges even for physicians with many years of experience with chemotherapies, as the spectrum of previously unfamiliar side effects is becoming increasingly complex. The complexity of the new side effects of therapies with antibody-drug conjugates, other immunotherapies, and combinations of conventional chemotherapies and new agents requires even more thorough and detailed monitoring of patients [4, 5].
The use of AI and in particular advanced language models such as ChatGPT in medicine marks a significant turning point. These technologies show the potential to analyze complex data, search medical literature, and even interact with patients in natural language [6]. In the context of consultations in certified oncology centers, where cancer patients have been shown to have a higher chance of survival [7], such models could help support medical decisions by providing evidence-based recommendations or rapidly processing complex data to refine diagnoses [8]. Nevertheless, issues of reliability, privacy concerns, and the need for human interpretation of results are paramount. These issues need to be carefully considered to ensure that the use of AI in medical practice is both effective and ethical [9, 10].
Even in an era of remarkable advances in medical diagnosis and treatment, the role of the physician in the oncology consultation remains irreplaceable. In the treatment of cancer patients, the decision to approve treatment is often based on a complex assessment of clinical findings and laboratory results. However, the introduction of AI into medical practice raises the question of whether and to what extent this technology could supplement or even replace traditional medical expertise. This article examines the potential and limitations of AI algorithms in decision-making for treatment approval in gynecologic cancer cases and highlights the challenges and ethical aspects associated with the integration of this technology into clinical practice.
Materials and Methods
Fictitious case vignettes of patients with gynecologic cancers were created that contained both clinical- and laboratory-based information. These vignettes were used to simulate the decision to approve oncologic therapy. The fictitious cases were created based on the most common side effects described and classified in the international guidelines and official drug information of the respective substance [11‒15]. These case vignettes were checked for plausibility and completeness by two independent gynecologic specialists with a focus on gynecologic oncology and 10 years of professional experience. The vignettes were presented to the language models in text form with information on patient age, stage of disease, treatment cycle of the planned therapy, and clinical and laboratory adverse events. The language models were then asked whether the patients were eligible for treatment on the basis of the specified parameters (online suppl. File 1; for all online suppl. material, see https://doi.org/10.1159/000544751).
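For illustration, the structured information supplied to the models could be represented as follows. This is a sketch only: the field names mirror the parameters listed above (age, stage of disease, treatment cycle, adverse events), while the concrete values and the prompt wording are invented here; the exact vignette texts are given in online suppl. File 1.

```python
# Illustrative sketch only: field names follow the parameters described in
# the Methods; the example values do NOT reproduce the study's vignettes.
from dataclasses import dataclass, field

@dataclass
class CaseVignette:
    age: int                   # patient age in years
    diagnosis: str             # tumor entity
    stage: str                 # stage of disease
    treatment_cycle: str       # planned therapy and cycle number
    adverse_events: list[str] = field(default_factory=list)  # clinical and laboratory findings

    def to_prompt(self) -> str:
        """Render the vignette as a free-text eligibility question for a language model."""
        events = "; ".join(self.adverse_events)
        return (
            f"A {self.age}-year-old patient with {self.diagnosis}, {self.stage}, "
            f"is scheduled for {self.treatment_cycle}. "
            f"Current adverse events: {events}. "
            "Based on these parameters, is the patient eligible for treatment?"
        )

# Hypothetical example values, loosely modeled on case 1:
vignette = CaseVignette(
    age=52,
    diagnosis="breast cancer",
    stage="cT2 cN1 M0",
    treatment_cycle="cycle 3 of neoadjuvant chemotherapy",
    adverse_events=["grade II neutropenia (1,100 neutrophils/uL)"],
)
prompt = vignette.to_prompt()
```

The same prompt text was then submitted unchanged to each of the three language models, so that differences in the answers reflect the models rather than the input.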
The study collective included a first-year medical doctor, a fourth-year medical doctor, a gynecology specialist with 8 years of experience, a gynecology specialist with a specialization in gynecologic oncology, and the three language models ChatGPT 4.0, Google Gemini, and Bing Copilot. The medical participants were selected according to their experience and expertise to ensure a broad range of assessments and to allow for an objective comparison between the decisions made by the doctors.
Each participant independently assessed the case vignettes for suitability for treatment approval. The assessments were based on the standardized clinical- and laboratory-based criteria commonly used in oncology practice, along with personal clinical experience. First, the decisions made by the doctors were compared in order to explore any potential interindividual discrepancies. Then, the decisions of the language models were reviewed by two independent gynecologic specialists with a focus on gynecologic oncology and 10 years of professional experience and assessed based on the established Global Quality Score for objective evaluation [16]. According to this scoring system, one point is awarded for “poor quality, very unlikely to be of use to patients”; two points for “poor quality but some information available, of very limited use to patients”; three points for “suboptimal information flow, some information is covered but important topics are missing, reasonably useful to patients”; four points for “good quality and flow, most important topics are covered; useful to patients”; and five points for “excellent quality and flow; very useful to patients.” The decisions of the doctors and the language models were then compared. The agreement and potential discrepancies in the decisions were analyzed to assess the performance of AI compared to human doctors. Finally, the doctors had the chance to read the suggestions made by AI and to review and, if appropriate, revise their decisions based on these suggestions.
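The five-point rubric quoted above can be captured as a simple lookup table; the descriptors below are taken verbatim from the scoring system [16], while the helper function is only an illustration.

```python
# Global Quality Score rubric as quoted in the Methods (from reference [16]).
GQS_RUBRIC = {
    1: "poor quality, very unlikely to be of use to patients",
    2: "poor quality but some information available, of very limited use to patients",
    3: "suboptimal information flow, some information is covered but important topics are missing, reasonably useful to patients",
    4: "good quality and flow, most important topics are covered; useful to patients",
    5: "excellent quality and flow; very useful to patients",
}

def describe(score: int) -> str:
    """Return the rubric descriptor for a Global Quality Score of 1-5."""
    if score not in GQS_RUBRIC:
        raise ValueError("Global Quality Score must be an integer from 1 to 5")
    return GQS_RUBRIC[score]
```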
Results
Two cases of patients with breast cancer and one case each of ovarian, cervical, and endometrial cancer were evaluated. The decisions made by the doctors were similar in the majority of cases. In the first case, ChatGPT, Google Gemini, and Bing Copilot all assessed the patient as fit for treatment despite second-degree neutropenia, which was consistent with the assessment of most physicians.
ChatGPT recommended supportive measures and reevaluation of symptoms for case 2 before making a final decision on whether to continue chemotherapy. This recommendation received the highest rating for clarity and precision.
For case 3, all AI models recommended a break in bevacizumab therapy and measures for wound treatment and blood glucose control, which was consistent with the physician assessments. ChatGPT and the doctors agreed in case 4 that the patient was fit for treatment provided the elevated TSH levels were treated and monitored.
ChatGPT and Bing Copilot assessed the last patient as fit for treatment, albeit with regular checks of the liver values. This assessment differed from some medical assessments, which recommended a break in treatment.
All case vignettes were evaluated by the language models, which provided therapy recommendations of varying quality. ChatGPT generally gave the clearest and most precise recommendations, while those of Google Gemini and Bing Copilot varied in detail and accuracy.
The results showed that ChatGPT provided clear and precise recommendations that often matched the doctors’ assessments. This was confirmed by the Global Quality Score, which assessed the quality and usefulness of the recommendations. ChatGPT received the highest score with 19 points compared to Google Gemini with 17 points and Bing Copilot with 11 points.
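The totals can be reproduced by summing the per-case Global Quality Scores reported in Table 2; the snippet below is a simple arithmetic check for illustration, not part of any study software.

```python
# Per-case Global Quality Scores (cases 1-5) as reported in Table 2.
gqs = {
    "ChatGPT":       [3, 5, 5, 5, 1],
    "Google Gemini": [3, 4, 4, 3, 3],
    "Bing Copilot":  [3, 3, 2, 2, 1],
}

# Sum each model's scores across the five cases.
totals = {model: sum(scores) for model, scores in gqs.items()}
# totals == {"ChatGPT": 19, "Google Gemini": 17, "Bing Copilot": 11}
```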
Tables 1 and 2 summarize the main results of the study. Figure 1 shows the results of the evaluation of the decisions of the language models according to the Global Quality Score.
Comparison of decisions made by doctors
| Cases | Decision first-year medical doctor | Decision fourth-year medical doctor | Decision gynecology consultant without specialization | Decision gynecology consultant with specialization |
|---|---|---|---|---|
| Case 1 | Not suitable for treatment. In accordance with the official drug information, patients may only be treated once the neutrophil granulocyte count has returned to ≥1,500 cells/μL | Therapy-capable. The neutrophil granulocytes are over 1,000 cells/μL | Therapy-capable. The neutrophil granulocytes are over 1,000 cells/μL | Suitable for therapy due to high remission pressure. The neutrophil granulocytes are at 1,100 cells/μL, and for neoadjuvant chemotherapy, the value must be above 1,000 |
| Case 2 | Therapy-capable without dose reduction. Addition of supportive measures for skin toxicity, polyneuropathy, and emesis | Therapy-capable with dose reduction due to polyneuropathy, otherwise antiemesis and supportive therapy for the treatment of skin toxicity and psycho-oncological support in case of psychological stress | Therapy-capable. Supportive therapy for the treatment of skin toxicity | Give antibodies, pause taxane and, if symptoms persist, discuss dose reduction or discontinuation of chemotherapy with continuation of antibody therapy |
| Case 3 | Not suitable for treatment. Pause therapy with bevacizumab until wound healing is complete, especially if the wound is superinfected. If necessary, start antibiotics, medication to control blood pressure and diabetes. First adjust diabetes and blood pressure | Bevacizumab pause until local findings improve, swab taken, antibiotic treatment, and diabetology connection | Not suitable for therapy. Bevacizumab is an angiogenesis inhibitor | No administration of bevacizumab. Pause niraparib. Blood sugar control. Blood pressure control. Clarification of proteinuria via urine collection. If local wound improves, restart niraparib. If blood pressure has been adjusted and proteinuria has been clarified, restart bevacizumab if necessary |
| Case 4 | Therapy-capable | In the case of latent hypothyroidism, first, a nuclear medicine consultation and possibly the start of low-dose drug substitution. In the case of newly occurring hypothyroidism under pembrolizumab, drug substitution of levothyroxine and monitoring | Therapy-capable. Clarification of the thyroid gland and start of drug substitution if necessary | Give bevacizumab. Give pembrolizumab. Further clarification of thyroid gland including antibody determination. If necessary, start thyrostatic drug in the course of treatment |
| Case 5 | Not suitable for treatment. Interrupt treatment as described in the information for healthcare professionals and resume treatment when toxicity decreases to grade 0 or 1. If the liver values continue to rise despite treatment discontinuation, clarification by means of hepatitis serology, cholestasis parameters, and sonography of the liver | Pause therapy in case of grade II hepatotoxicity and grade II hyperbilirubinemia. If necessary, perform liver sonography and laboratory control for cholestasis and hepatitis serology | Therapy-capable. Check liver values. If liver value elevations persist, liver sonography and exclusion of hepatitis | No administration of chemotherapy. Clarification of liver by means of further laboratory tests including virology, ultrasound. If liver values improve, then possibly rechallenge including taxane |
Comparison of decisions made by the language models
| Cases | ChatGPT decision | Gemini decision | Bing Copilot decision |
|---|---|---|---|
| Case 1 | Therapy-capable. Second-degree neutropenia without clinical consequences. Need for monitoring and prophylactic measures if necessary | Most likely treatable due to mild neutropenia and good general condition | Therapy-capable. Second-degree neutropenia without clinical consequences. Need for monitoring and prophylactic measures if necessary |
| | Global Quality Score: 3 | Global Quality Score: 3 | Global Quality Score: 3 |
| Case 2 | Initially initiate supportive measures. Only after reevaluation of symptoms and quality of life should a final decision be made on whether to continue chemotherapy | The decision is complex due to the serious side effects and psychological stress. Optimization of supportive therapy and psychological support | Suitable for treatment depending on the risk/benefit assessment |
| | Global Quality Score: 5 | Global Quality Score: 4 | Global Quality Score: 3 |
| Case 3 | Bevacizumab discontinuation. Measures for wound treatment and blood glucose control. After successful treatment of the wound and stabilization of the blood glucose level, the ability to undergo therapy can be reassessed and maintenance therapy can be continued if necessary | Complex decision as to whether treatment is possible due to the open, infected wound, diabetes mellitus, and uncontrolled blood glucose levels. Optimization of wound care, antibiotic therapy, optimization of blood glucose control | Individual decision via oncologists |
| | Global Quality Score: 5 | Global Quality Score: 4 | Global Quality Score: 2 |
| Case 4 | Treatment-capable if the elevated TSH value is confirmed and treated. Initiate appropriate hormone replacement therapy and monitor thyroid function | Decision as to whether treatment is possible is complex due to the newly discovered elevated TSH value. The cause of the thyroid dysfunction must be further clarified and the need for treatment assessed | Treatment-capable in view of the unremarkable vital signs and laboratory results and the absence of specific complaints. Carry out further tests if necessary |
| | Global Quality Score: 5 | Global Quality Score: 3 | Global Quality Score: 2 |
| Case 5 | Treatment-capable. Regular checks of liver values and possible adjustment of therapy. Reevaluate if there are signs of deterioration in liver function or the appearance of new symptoms | Complex decision as to whether treatment is possible due to the newly occurring liver damage. Close monitoring of liver values, supportive measures and, if necessary, dose adjustments or breaks in therapy | Suitable for therapy. Pay particular attention to liver values |
| | Global Quality Score: 1 | Global Quality Score: 3 | Global Quality Score: 1 |
Global Quality Score: one point is awarded for “poor quality, very unlikely to be of use to patients,” and five points for “excellent quality and flow, very useful to patients.”
Comparison of the decisions made by ChatGPT, Google Gemini, and Bing Copilot in accordance with the Global Quality Score. x axis: Global Quality Score. One point is awarded for “poor quality, very unlikely to be of use to patients,” and five points for “excellent quality and flow, very useful to patients.”
In a second run, the doctors were asked whether they would change their decisions after seeing the recommendations from AI. Since ChatGPT offered the best recommendations in terms of clarity compared to the other two language models, the recommendations from ChatGPT were selected for the evaluation. Table 3 summarizes the results of the second run.
Change in the doctor’s decision after viewing the recommendations from ChatGPT
| Cases | Decision first-year medical doctor | Decision fourth-year medical doctor | Decision gynecology consultant without specialization | Decision gynecology consultant with specialization |
|---|---|---|---|---|
| Case 1 | No change. No justification based on studies or expert information | No change | Same decision. However, no statement regarding the administration of paclitaxel over 1,000 cells/μL | No change. Justification by AI only partially correct. However, it should be emphasized that the therapy is not only for good clinical outcome, but also for neutropenia II° |
| Case 2 | No change. Reasoning on the part of AI appropriate | No change. However, no recommendation regarding dose reduction | Same decision. Good recommendations on the part of AI | Same decision. Good recommendations on the part of AI |
| Case 3 | Same decision. If necessary, add antibiotics after laboratory control | Same decision | Same decision | Same decision |
| Case 4 | Same decision | Same decision | Same decision | Same decision |
| Case 5 | No change. No justification based on studies or expert information | No change. In case of hepatotoxicity II° and hyperbilirubinemia II°, pause therapy, and perform diagnostics, e.g., cholestasis or infection parameters | Same decision | No change. Unable to undergo therapy due to changes in liver values |
Discussion
Main Findings
The results of this study provide a direct comparison between the decisions of human physicians and the recommendations of different AI language models regarding treatment approval for gynecologic cancers. While the physicians made different decisions in some cases, there were also differences in decision-making between the AI models. ChatGPT provided the clearest and most accurate recommendations and often aligned with physicians’ assessments, indicating that advanced AI models have high potential to support medical decisions. However, in more complex cases, the AI models’ recommendations sometimes differed from those of the physicians, underscoring the necessity of human interpretation and assessment.
Our findings are consistent with other studies on the use of AI in medical decision-making. Several studies have demonstrated that AI systems, especially advanced language models, can analyze complex medical data and make informed treatment decisions that often agree with the assessments of experienced physicians [17, 18]. These studies confirm that AI is capable of supporting evidence-based decisions and thus enriching clinical practice. At the same time, however, they also emphasize the importance of human supervision and interpretation to ensure that all relevant clinical- and patient-specific factors are taken into account [19‒22].
Our study can be compared with other studies that have assessed the efficiency of ChatGPT vs. Google Gemini or Bing Copilot [23, 24]. One advantage of ChatGPT over Google Gemini and Bing Copilot in our study was the clarity and precision of its recommendations. Our study showed that the recommendations generated by ChatGPT were markedly more structured and comprehensible. While Google Gemini and Bing Copilot also made sound recommendations, these were often less detailed and sometimes harder to interpret. The clarity of ChatGPT’s suggestions can help speed up decision-making processes and improve communication between doctors and patients.
Another advantage of using AI in therapy decisions is the time and cost factor. AI models such as ChatGPT can analyze large amounts of medical data in a very short time and derive well-founded recommendations [25]. This is in contrast to human doctors, who need significantly more time to analyze data and make decisions. Given the high workload and scarce resources in healthcare, the use of AI can lead to significant time savings [26]. In addition, the cost of operating and maintaining AI systems, once implemented, is potentially lower than the ongoing cost of physician time [27].
AI also has the potential to significantly reduce the workload of doctors, especially when it comes to triage during busy periods. In gynecologic oncology, AI could make quick and precise assessments by analyzing extensive patient data in order to prioritize urgent cases. This would allow doctors to use their resources more efficiently and reduce waiting times.
Strengths and Limitations
This paper presents a comparative analysis between physicians and different AI language models to evaluate their performance in gynecologic oncology. The use of fictitious case vignettes covering four tumor entities tests the robustness of the AI decisions. Weaknesses of the study are the use of fictitious case vignettes instead of real patient data, the small number of vignettes, and the limitation to gynecologic cancers. The limited number of participating doctors is also a limitation. The rapid development of AI technologies could quickly render the tested models obsolete.
Interpretation
For future research, AI models should be validated on real patient data and long-term studies should be conducted to evaluate the impact of AI-supported decisions. The development of hybrid decision-making models, the ethical and legal aspects of AI use, and the interactive capabilities of AI systems could be further investigated. In particular, the feasibility of implementing language models in decision-making processes during gynecologic oncology tumor conferences presents a promising yet complex opportunity. Language models can synthesize vast amounts of medical literature, clinical guidelines, and patient data to provide evidence-based recommendations, potentially enhancing the accuracy and efficiency of multidisciplinary discussions. In settings where patients undergo different therapies and therapy regimens, these models can offer tailored insights, highlight relevant clinical trials, and identify emerging treatment protocols.
However, challenges such as ensuring the accuracy and reliability of AI-generated suggestions, integrating these models seamlessly into existing clinical workflows, and addressing ethical concerns regarding data privacy and decision-making autonomy must be carefully managed. The accuracy of AI models is heavily dependent on the quality and comprehensiveness of the data used for training, which may not always reflect real-world variability in clinical presentations or treatment responses. Moreover, AI models, including advanced language models like ChatGPT, often lack the ability to consider the broader clinical context – such as patient preferences, comorbidities, or socioeconomic factors – that can significantly influence treatment choices. Thus, while AI can support decision-making, its recommendations must be rigorously validated, continuously updated, and supplemented by clinical expertise to ensure they remain reliable and relevant in diverse clinical settings. In medicine, particularly in oncology, every decision can have profound implications for patient outcomes, and inaccurate or unreliable recommendations could lead to suboptimal treatment choices. Given that AI models like ChatGPT are trained on vast datasets that include medical literature, clinical guidelines, and patient data, there is always the risk that the information the model draws upon may be outdated, incomplete, or not representative of the unique aspects of a specific patient’s case. These models are not trained only on professional and scientifically proven medical data, but on all medical data available to them, regardless of credibility. A possible solution would be to train a large language model from the ground up using only credible sources. This, however, comes with the cost of training and maintaining such a model.
Moreover, these models are designed to identify patterns but are not capable of the nuanced, context-sensitive reasoning that experienced physicians bring to decision-making. The need for continuous validation and real-time feedback loops is critical to improving the reliability of AI models. It is essential to validate AI recommendations not only against historical data but also in real-world settings, where patient circumstances and complex comorbidities may differ from those captured in training datasets. Additionally, AI should be used as a decision support tool rather than a sole decision-maker, particularly in cases involving intricate clinical details or patient preferences. Notably, large language models fall under the umbrella of non-explainable artificial intelligence: it is not possible to ascertain which training data the model draws upon for a specific response. The use of such a black-box model in medicine must be thoroughly considered before implementation.
AI models might also introduce or perpetuate biases if the data they are trained on do not accurately represent the diverse patient populations they are meant to serve. For example, if a model is predominantly trained on data from one demographic group, it may fail to account for variations in treatment response across different racial, ethnic, or socioeconomic groups.
The use of AI in clinical settings raises significant ethical issues, particularly regarding data privacy, informed consent, and the potential for human oversight to be compromised. AI systems that handle patient data must be carefully regulated to ensure patient confidentiality is maintained. Moreover, reliance on AI models for decision-making could unintentionally lead to a reduction in physician accountability. Ensuring that human clinicians maintain the final say in treatment decisions is paramount for preserving patient safety and trust.
The widespread adoption of large language models like ChatGPT in healthcare, while offering numerous benefits in terms of efficiency and decision support, also raises concerns regarding their environmental impact. The training and deployment of these models require significant computational resources, which translate into substantial energy consumption and carbon emissions. Training state-of-the-art AI models involves processing vast datasets across powerful data centers, consuming large amounts of electricity, often sourced from nonrenewable energy. As healthcare systems increasingly rely on AI tools, the environmental footprint of such technologies could grow considerably. Given the scale at which AI is expected to be integrated into clinical workflows, including gynecologic oncology, it is essential to consider sustainable AI practices – such as optimizing energy efficiency, using renewable energy sources, and developing more energy-efficient algorithms. While the immediate benefits of AI in improving healthcare delivery are clear, it is critical that the healthcare sector also weighs the long-term environmental sustainability of these technologies to ensure that the ecological cost does not outweigh the potential clinical advantages.
Pilot studies and robust validation processes are essential to evaluate the practical benefits and limitations of these models, ensuring they complement rather than replace the clinical expertise and judgment of healthcare professionals in gynecologic oncology.
Conclusion
The results of this feasibility study show that AI models can provide sound recommendations for gynecologic cancers. ChatGPT in particular provided clear and precise recommendations. However, complex cases demonstrated the need for human expertise to take additional clinical and contextual factors into account. These results underline the potential of AI as a supportive tool that can understand and, to some extent, evaluate clinical cases, recognize clinical and/or laboratory abnormalities, and make therapy-related suggestions. Future trials, however, still need to verify its applicability in routine clinical practice.
Acknowledgments
ChatGPT 4.0 (OpenAI, https://chat.openai.com) was used to proofread the final draft.
Statement of Ethics
An ethics statement was not required for this study type since no human or animal subjects or materials were used. Written informed consent was obtained from all doctors participating in this study.
Conflict of Interest Statement
The authors have no conflicts of interest to declare.
Funding Sources
No funding was received.
Author Contributions
Conception: I.P. and J.E.; case analysis: N.S., F.A.S., F.H., I.P., and J.E.; writing: I.P.; and supervision: P.A.F., M.W.B., S.B., A.K., P.P., and J.E.
Data Availability Statement
The data that support the findings of this study are available as an online supplementary file. Further inquiries can be directed to the corresponding author.