Abstract
Introduction: Books and papers are the most relevant sources of theoretical knowledge for medical education. New artificial intelligence technologies can be designed to assist in selected educational tasks, such as reading a corpus made up of multiple documents and extracting relevant information in a quantitative way. Methods: Thirty experts were selected transparently through an online public call on the website of the sponsor organization and on its social media. Six books edited or co-edited by members of this panel, containing general knowledge of breast cancer or specific surgical knowledge, were acquired. This collection was used by a team of computer scientists to train an artificial neural network based on a technique called Word2Vec. Results: The corpus of six books contained about 2.2 billion words, embedded as 300-dimensional vectors. A few tests were performed, in which we evaluated the cosine similarity between different words. Discussion: This work represents an initial attempt to derive formal information from a textual corpus. It can be used to perform an augmented reading of the corpus of knowledge available in the books and papers of a discipline. This can generate new hypotheses and provide a quantitative estimate of their association within expert opinion. Word embedding can also be a useful tool for accruing narrative information from clinical notes, reports, etc., and for producing predictions about outcomes. More work is expected in this promising field to generate “real-world evidence.”
Introduction
Books and papers are the most relevant sources of theoretical knowledge for medical education. New artificial intelligence technologies can be designed to assist in selected educational tasks, such as reading a corpus made up of multiple documents and extracting relevant information in a quantitative way [1].
Natural language processing is a branch of computer science dealing with the interpretation of human language, making it possible to analyze texts or speech, extract meaningful information, and categorize and organize it [2, 3]. This study aimed to demonstrate how an artificial intelligence system trained on a corpus of books edited by senior experts was able to extract meaningful associations between words and recreate semantic contexts.
Methods
In March 2021, the G.Re.T.A. (Group for Therapeutic and Reconstructive Advancements) Fondazione ETS gathered a core team made up of senior oncoplastic surgeons representing the most important societies dealing with surgical oncology and oncoplastic surgery of the breast (ETHOS collaborative group). A subgroup of senior oncoplastic surgeons (facilitators) was invited to coordinate this activity.
Thirty experts were selected transparently through an online public call on the website of the sponsor organization and on its social media. Six books edited or co-edited by members of this panel, containing general knowledge of breast cancer or specific surgical knowledge, were acquired (see Table 1 for the list of books). This collection was used by a team of computer scientists at the University of Catania, Dipartimento di Scienze del Farmaco, to train an artificial neural network based on a technique called Word2Vec [4‒6].
Table 1. Complications and their cosine similarity values (words selected by the experts from a list of 1,000 words produced by ETHOS, ranked by cosine similarity)
| Complication | Cosine similarity |
|---|---|
| Infection | 0.597 |
| Hematoma | 0.502 |
| Necrosis | 0.424 |
| Extrusion | 0.393 |
| Contracture | 0.310 |
| Hernia | 0.257 |
| Scarring | 0.2426 |
This is a natural language processing algorithm able to learn word associations, suggest additional words, and detect synonyms within a large corpus of written documents. A mathematical function (the “cosine similarity” between words) indicates the level of semantic association [7‒11]. Using this function, each word is assigned its own coordinates in a vector space, and the text is represented by vectors derived from the occurrences of each word in the documents. A software tool named ETHOS is the final product of this process. A graphic interface allows the calculation of a few sub-functions based on cosine similarity (shown in Table 1).
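As a rough illustration of the cosine-similarity function described above, the following minimal sketch computes it for toy vectors (the word names and 3-dimensional values are hypothetical; the actual ETHOS model uses 300-dimensional embeddings learned by Word2Vec):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors:
    1.0 for identical directions, ~0.0 for unrelated (orthogonal),
    -1.0 for opposite directions."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional "embeddings" for three words.
vectors = {
    "infection":  [0.9, 0.1, 0.2],
    "hematoma":   [0.8, 0.3, 0.1],
    "mastectomy": [0.1, 0.9, 0.4],
}

# Semantically close words point in similar directions, giving a higher value.
print(cosine_similarity(vectors["infection"], vectors["hematoma"]))
print(cosine_similarity(vectors["infection"], vectors["mastectomy"]))
```

With real embeddings, the same function yields the values reported in Table 1.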
The Word2Vec artificial neural network was interrogated with the purpose of exploring associations between different clusters of words. More specifically, words representative of surgical techniques (e.g., implant-based-reconstructions, mastectomy, breast-conserving surgery) or specific clinical conditions (old-diabetes, etc.) were associated with a potential list of outcomes (extrusion-hematoma, etc.). Cosine similarity can also be calculated between clusters of words; this function was used to estimate how “close” clusters of words were in the vector space. The closer the distance, the higher the semantic association. Values range from -1 to 1, with CS = 1 representing two identical words.
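One common way to compare clusters of words, consistent with the description above, is to average the member vectors into a centroid and take the cosine between centroids. The sketch below assumes this averaging approach (the source does not specify how ETHOS aggregates clusters), and the vectors are hypothetical toy values:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

# Hypothetical 3-d embeddings for a technique cluster and an outcome cluster.
technique = [[0.2, 0.8, 0.1], [0.3, 0.7, 0.2]]  # e.g., "implant", "reconstruction"
outcomes  = [[0.3, 0.6, 0.3], [0.2, 0.9, 0.1]]  # e.g., "extrusion", "hematoma"

# Cluster-to-cluster similarity: cosine between the two centroids.
print(cosine(centroid(technique), centroid(outcomes)))
```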
In this specific task, this function was used to generate a list of the 1,000 words closest to the word “complication(s).” Among these, the panelists manually identified a second list representative of multiple potential clinical scenarios. For each of the explored situations, the function produced an estimate from which the panel subsequently drew inferences in order to extract meaningful associations.
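Generating the list of 1,000 words closest to “complication(s)” amounts to a nearest-neighbour query in the embedding space. A minimal sketch with a toy vocabulary (all names and vector values are hypothetical):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical 3-d vocabulary; a real Word2Vec model has thousands of 300-d entries.
vocab = {
    "infection":  [0.9, 0.1, 0.2],
    "hematoma":   [0.8, 0.3, 0.1],
    "mastectomy": [0.1, 0.9, 0.4],
    "suture":     [0.4, 0.5, 0.5],
}

def nearest_words(query_vec, vocab, top_n=3):
    """Rank vocabulary words by cosine similarity to the query vector
    and return the top_n closest ones (the query in the study used top_n=1000)."""
    ranked = sorted(vocab, key=lambda w: cosine(query_vec, vocab[w]), reverse=True)
    return ranked[:top_n]

complication_vec = [0.85, 0.2, 0.15]  # hypothetical vector for "complication"
print(nearest_words(complication_vec, vocab, top_n=2))
```

The panelists' manual selection then corresponds to filtering this ranked list down to clinically meaningful terms.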
Results
The corpus of six books contained about 2.2 billion words, embedded as 300-dimensional vectors [12‒17]. A few tests were performed by the facilitators group; we report some examples here.
First, the words closest (according to cosine similarity) to the word “complication” were assessed. The experts then selected, from a suggested list of 1,000 words, those meaningfully associated with “complications”: infection (CS = 0.509), hematoma (CS = 0.502), dehiscence (CS = 0.4842), etc. (shown in Table 1).
After this, cosine similarity was assessed between “autologous-breast-reconstructions” and the words representing complications identified above. The same was done with “implant-based-reconstructions” (shown in Tables 2 and 3).
Table 2. Cosine similarity values associated with autologous breast reconstruction
| Technique | Complication | Cosine similarity |
|---|---|---|
| Autologous breast reconstruction | Necrosis | 0.597 |
| Autologous breast reconstruction | Infection | 0.465 |
| Autologous breast reconstruction | Scarring | 0.447 |
| Autologous breast reconstruction | Hematoma | 0.426 |
| Autologous breast reconstruction | Hernia | 0.404 |
Table 3. Cosine similarity values associated with implant-based reconstruction
| Technique | Complication | Cosine similarity |
|---|---|---|
| Implant-based reconstruction | Infection | 0.603 |
| Implant-based reconstruction | Extrusion | 0.595 |
| Implant-based reconstruction | Contracture | 0.587 |
| Implant-based reconstruction | Hematoma | 0.232 |
A second, more complex clinical scenario was tested. We calculated the cosine similarity between words related to clinical cases with opposite features: clinical scenario 1, “young-small-breast-conserving-surgery” versus “complications,” cosine similarity = 0.274; clinical scenario 2, “old-diabetes-breast-reconstruction” versus “complications,” cosine similarity = 0.3736.
Some other scenarios were explored in the semantic area of oncological outcomes. For instance, the words “poor-outcome” and “triple-negative” retained a higher cosine similarity value (CS = 0.24) than the words “poor-outcome” and “hormone-receptor-positive” (CS = 0.1901).
Discussion
The ETHOS Word2Vec artificial neural network tested in this study, trained on the corpus of knowledge of senior experts in breast cancer surgery, can be used for two purposes. First, it can be used to perform an enhanced and quantitative reading, so that one can explore semantic associations and extract meaningful information. Hypotheses generated by this enhanced reading can inform surveys or contribute to the design of clinical trials. For instance, the ETHOS survey [18, 19] used this tool to inform the panel about the semantic relevance of a list of preoperative features and postoperative outcomes. The panel expressed its opinion on approving or rejecting the proposed drivers after reviewing the quantitative information received. In this regard, the actual corpus of six books can probably be considered representative of a rather narrow sample of the available narrative information. Some examples of artificial neural networks available online are based on databases (http://epsilon-it.utu.fi/wv_demo/) that can reach 4.5 billion words or even more, in most cases derived from general-language texts. The current corpus could be enlarged in terms of number of words by using a larger dataset that also includes journal articles or other books. On the other hand, to obtain a more refined view of selected subspecialties, it would be possible to restrict the corpus to texts from a specialized subset (for instance, surgery), accepting the risk of losing some relevant information.
Alternatively, the system can be developed to reconstruct semantic scenarios (made up of patients’ features and outcomes) that are described in unstructured narrative form and to associate them with structured information (laboratory tests, ICD9 codes, SNOMED, BMI, etc.). Using this strategy, clinical notes, letters to GPs, and other narrative text (including social media, chats with breast care nurses, etc.) can recreate personal profiles that indicate potential outcomes [20‒22]. In this way, the corpus of knowledge on a clinical condition would be derived not only from clinical trials, observational studies, or expert opinions but also from so-called “real-world data” [23]. To do this, a training population including a large number of patients should be used to train the artificial neural network. Narrative information would be collected prospectively and retrospectively from textual sources, together with structured data about the clinical condition of the patient. A second population (the testing population) would then be used to test the ability of the system to identify the selected outcomes. A final external validation may be required for further improvement of the model.
The word embedding technique still has some limitations. First, cosine similarity estimates the distance between two words within a corpus, assuming that closer words are likely to retain some kind of semantic association. This is not entirely true: some words, even very close ones (e.g., “breast” and “cancer”), can be highly inter-related without being similar or synonymous. In the ETHOS tool, for instance, they retain a high similarity value (CS = 0.81), as do “breast” and “patient” (CS = 0.70), but the meaning behind these two examples is completely different: the first pair represents a single entity (a disease), while the second indicates two different entities (a human being and one of its organs).
Conclusion
This work represents an initial attempt to derive formal information from a textual corpus. It can be used to perform an augmented reading of the corpus of knowledge available in the books and papers of a discipline. This can generate new hypotheses and provide a quantitative estimate of their association within expert opinion. These data can also be accrued prospectively and provide insight into changes over time. Word embedding can also be a useful tool for accruing narrative information from clinical notes, reports, etc., and for producing predictions about outcomes. More work is expected in this promising field to generate “real-world evidence.”
Statement of Ethics
An ethics statement was not required for this study type as no human or animal subjects or materials were used.
Conflict of Interest Statement
The authors have no conflicts of interest to declare.
Funding Sources
There are no funding sources to declare.
Author Contributions
Nicola Rocco, Giuseppe Catanuto, Maurizio Bruno Nava, Yazan Masannat, Konstantina Balafa, Andreas Karakatsanis, Anna Maglia, Peter Barry, Francesco Pappalardo, and Francesco Caruso: all the authors have been involved in the design and preparation of the manuscript.
Data Availability Statement
All data generated or analyzed during this study are included in this article. Further inquiries can be directed to the corresponding author.