Background: The aim of this study is to systematically review the literature to summarize the evidence surrounding the clinical utility of artificial intelligence (AI) in the field of mammography. Databases from PubMed, IEEE Xplore, and Scopus were searched for relevant literature. Studies evaluating AI models in the context of prediction and diagnosis of breast malignancies that also reported conventional performance metrics were deemed suitable for inclusion. From 90 unique citations, 21 studies were considered suitable for our examination. Data was not pooled due to heterogeneity in study evaluation methods. Summary: Three studies showed the applicability of AI in reducing workload. Six studies demonstrated that AI can aid in diagnosis, with up to 69% reduction in false positives and an increase in sensitivity ranging from 84 to 91%. Five studies show how AI models can independently mark and classify suspicious findings on conventional scans, with abilities comparable with radiologists. Seven studies examined AI predictive potential for breast cancer and risk score calculation. Key Messages: Despite limitations in the current evidence base and technical obstacles, this review suggests AI has marked potential for extensive use in mammography. Additional works, including large-scale prospective studies, are warranted to elucidate the clinical utility of AI.
Mammography remains a critical tool for screening and diagnosing breast cancers. Advocates for mammography screening refer to its widely documented contribution in reducing breast cancer mortality rates [1-4]. While mammographic screening has established reduction in mortality, like any examination, there is a false-positive rate associated with screening mammography. While only 7–12% of women are falsely recalled after only one mammogram, over 50% of women who have undergone annual mammography screening for 10 years will be recalled incorrectly [5, 6]. These false positives translate into increased benign biopsies, increased spending, and negative psychological effects for patients involved [7-9]. Likewise, potentially malignant neoplasms are at risk of being missed due to their small size or surrounding dense fibroglandular tissue . False-negative mammogram results have a higher incidence in women aged 50–89 with previous benign biopsies. However, the rate of false-negative results is still relatively low, reported in 1.0 to 1.5 per 1,000 women . Although its accuracy continues to improve with technical improvements, diagnostic mammography is the gold standard for evaluation of breast cancer . In order to further increase the accuracy and to further reduce the rates of false positives and false negatives, recent advances in artificial intelligence (AI) have been exploited to develop software capable of aiding radiologists in clinical practice.
AI is a field incorporating computer science, engineering, and statistics to produce intelligent computer programs capable of performing tasks that would otherwise require human intelligence . AI has been widely implemented in healthcare, ranging from one of its earliest uses in blood disease diagnosis to current implementations in genomic medicine to characterize mutations [14, 15]. Although most AI in medicine is limited to research, certain commercial algorithms have already been approved for use by radiologists in clinical practice to help decrease erroneous readings [16-18]. Currently, many of these AI-based tools designed for aiding radiologists and interpreting mammograms are developed with machine learning. Machine learning is a specific domain of AI and is concerned with constructing algorithms used by computers to perform certain tasks without using explicit instructions, but instead relying on inference and patterns and are able to improve their performance with experience .
In regards to mammography, the program usually trains on clean labelled mammograms with abnormal diagnosis and normal anatomy serving as labels. As the program trains with greater and more diverse images, it adjusts the algorithms used with the ultimate goal of becoming more accurate in reading images and predicting patterns. As recent advancements allow for the increased scalability of AI software into clinical practice, it is imperative to examine the current environment surrounding AI in mammography.
In this literature review, we critically evaluate published literature describing machine-learning algorithms currently used in mammography and their efficacy in reducing erroneous readings. We conclude with future speculations of the field with a focus on the effects of AI.
Methods and Materials
The primary aim was to evaluate existing literature that used AI in mammography. Several databases were searched including PubMed, IEEE Xplore, and Scopus, with the search terms described in Table 1. Studies were limited to the interval January 2015 to March 2020. Exclusion criteria included (1) lack of conventional performance metrics such as sensitivity, specificity, area under the receiver operating characteristic curve (AUROC), etc., (2) the study investigated the accuracy of image segmentation rather than disease classification, (3) the study only published guidelines, review articles, and abstracts, (4) animal studies, (5) non-English sources, (6) sample size smaller than 10. After removing duplicate studies and those with redundant or non-novel information, 60 unique studies were ultimately included in this review.
Reducing Diagnostic Workload
One of the main potentials of AI in mammography is to reduce workload by helping expedite the interpretation of more obvious cases and thereby allowing radiologists to focus on more challenging cases. Most of the models that classify images use deep learning (DL). DL is a subset of machine learning where algorithms are created and operate similarly to those in machine learning, but there are several “layers” of these algorithms – each providing a unique interpretation of the image feature (e.g., shape, morphology, texture) it analyzes . DL involves training a complex neural network, in which data from the input layer feeds into the first hidden layer, which processes the data further and is sent to the next layer of algorithms (each layer consists of nodes and is loosely modelled from neurons), until the output layer is reached. Before the DL system is implemented, however, it is important to examine whether the reduced caseload will cause any detrimental effects to radiologists’ abilities to classify.
A study by Rodríguez-Ruiz et al.  used a commercially available DL system (Transpara 1.4.0, Screenpoint Medical BV) to pre-select digital mammogram images based on the machine learning interpreted likelihood of breast cancer. The system assigned a numerical likelihood score to each image from 1 to 10 (10 implying high likelihood cancer was present in the exam). Pre-selection scenarios included exams-to-be-evaluated as those with scores greater than a certain likelihood score. For example, one scenario would only include images scored 1 or higher to be evaluated by radiologists. This was done for all likelihood scores 1–9. The radiologists’ average -AUROC, a metric used to evaluate ability to classify, was compared before versus after pre-selection for each scenario. An AUROC of 1 represents perfect classification ability, while 0.5 represents incorrect random classification. Although the absolute area under the curve was not reported, the average AUROC of the radiologists was non-inferior at a margin of 0.05 for 8 out of the 9 scenarios tested (p < 0.05) . Thus, pre-selecting cases did not affect the ability of radiologists to classify in a majority of the scenarios tested.
Another recent study by Yala et al.  developed an in-house DL model to triage out true-negative cases. After comparing the performance of the radiologists during their original screening assessment to a retrospective scenario where the radiologists did not read any of the DL-triaged cases, the model showed a workload reduction of 19.3%. The study did not specify how accurate the machine learning algorithm was at identifying true negatives. The radiologists in the study showed a significant improvement in specificity (93.5–94.2%; p = 0.002) and a non-inferior sensitivity (90.6 vs. 90.1%; p < 0.001) at a margin of 0.05 .
Another method by which these models can reduce workload is through consensus. Some clinical settings utilize a double-reading process where two radiologists examine the same image. The study by McKinney et al.  replaced the second reader with their AI model. The conclusion of the first reader was deemed final when the model and first reader agreed. Any disagreement invoked the second reader’s opinions as usual. The study concluded that this process can alleviate workload of the second reader up to 88% . These studies illustrated the ability of AI to reduce caseload while not affecting the overall detection performance of radiologists, and may in fact aid their detection performance. Of note, two readers are not a standard of care clinical practice yet in the US due to cost; however, this study points towards future possibility towards reduction in cost for a second reader if aided by machine learning if clinical standard of care moves in that direction in future.
In addition to reducing caseload, models have been developed to aid radiologists examining mammograms in real-time. Rodríguez-Ruiz et al.  compared breast cancer detection performance of radiologists aided versus unaided by a commercial AI system. The system provided radiologists with interactive decision support and cancer likelihood scores. On average, AUROC increased with AI support compared to unaided (0.89 vs. 0.87, respectively; p = 0.002). Sensitivity increased (83–86%; p = 0.046) and reading time per case was similar (146 s unaided vs. 149 s aided; p = 0.15) . An additional study showed similar results where radiologists who used an AI-based computer-aided detection (AI-CAD) software showed an increase in their classification performance; AUROC increased with AI-CAD support (0.76 to 0.815; p < 0.01) (Table 2) .
A drawback of currently available CAD systems is the high rate of false-positive markings, which may hinder performance of radiologists. A recent retrospective study compared a commercial AI-CAD (cmAssist, CureMetrix) to a conventional CAD system (ImageChecker, Hologic) with respect to the false-positive findings marked. The findings showed an overall 69% reduction in false positives per image (FPPI) using AI-CAD compared to CAD. Specifically, AI-CAD exhibited 83% reduction in FPPI for calcifications and 56% reduction for masses. There was no significant decrease in sensitivity . Further studies have also created their own in-house models to be able to mark suspicious masses for radiologists to scrutinize more carefully. The mass segmentation model from Hmida et al.  exhibits 91.1% sensitivity, while another DL model from Sapate et al.  also shows 91% specificity and a sensitivity of 84%.
To improve digital mammogram readings through a different approach, Zeng et al.  incorporated radiologists’ subjective thresholds from Breast Imaging Reporting and Data System (BI-RADS). Using the BI-RADS descriptors and decision classification categories assigned to mammograms by radiologists, a probabilistic Bayesian network was modelled. A Bayesian network represents the probability distribution of random variables with possible causal relationships . Zeng et al.  hypothesized that each BI-RADS assessment category was thought to have a probability associated with it leading the radiologist to classify the mammogram as a positive or negative finding. The probability thresholds used by each radiologist are subjective. Therefore, if a Bayesian network incorporated the unique threshold observed from a radiologist, it may aid to reduce erroneous readings. Indeed, the study found that this method resulted in a 28.9% reduction of false positives . This novel technique exemplifies how AI can gauge the conservativeness of a radiologist and use the data to support its classification decisions.
Independent Detection and Classification
Programs that can interpret and identify abnormalities in mammograms without radiologist intervention have also been evaluated. When comparing commercial AI software (Transpara 1.4.0, Screenpoint Medical BV) to radiologists in the detection of breast cancer in digital mammograms, the performance of the AI system was found to be statistically non-inferior to the average performance of the radiologists at a non-inferiority margin of 0.05 . Agnes et al.  recently proposed a DL model that can categorize mammograms into normal, malignant, and benign categories with an overall sensitivity of 96%. Another group evaluated three different DL models for classifying a breast lesion as benign or malignant. The overall accuracies ranged from 88 to 95% .
While it is useful to distinguish between negative and positive cases, it is also clinically important to discriminate recalled-benign from malignant cases. Aboutalib et al.  developed a DL model that boasts an AUROC of up to 0.96 for correctly identifying recalled but biopsy-benign cases from malignant and negative cases.
DL tools are also being developed for novel breast-imaging techniques. Digital breast tomosynthesis (DBT) is an emerging tool to overcome limitations of conventional full-field digital mammography (FFDM) and allows volumetric reconstruction of the whole breast. DL models have already been tested on DBT datasets, with future plans to incorporate DBT and FFDM images for more accurate diagnosis . Likewise, images from contrast-enhanced digital mammography are being used to train DL models with the goal of extracting more image characteristics for better diagnostic ability .
Breast Cancer Prediction
Previous studies have explored breast cancer risk factors related to hormonal and genetic information [34, 35]. Mammographic breast density, which corresponds to the amount of fibroglandular tissue, has received substantial attention as a major risk factor . Indeed, commonly used breast cancer risk assessments, such as the Tyrer-Cuzick model, are based on mammographic density estimations and questionnaires . However, the use of breast density as a substitute to mammograms for risk prediction is limited because density estimates vary across radiologists and compressing the detailed information of digital scans into a single number loses some of the inherent data . Studies have shown the possibility of unique image features extractable by DL, beyond breast density, which can be used to create more accurate breast cancer risk models.
In a retrospective study, Debrower et al.  developed a DL model that calculated breast cancer risk scores (DL risk score) from mammograms and compared the prediction ability of these scores to conventional dense area and percentage density values. The AUROC for using the age-adjusted DL risk scores were significantly higher than using the dense area and percentage density values (DL risk score: 0.65, dense area: 0.60, percentage density: 0.57; p < 0.001) . A similar study communicated a DL model that showed greater predictive potential (odds ratio = 4.42 [95% CI: 3.4–5.7]) compared to using breast density alone (odds ratio = 1.67 [95% CI: 1.4–1.9]) . Arefan et al.  compared two DL models against a basic linear regression model using breast density, showing the models’ two different prior normal image types (mediolateral oblique, craniocaudal) to assess risk prediction efficacy. They found that both DL models consistently exhibited superiority in predicting the short-term breast cancer risk than the density-based model .
Breast density is also an important risk factor for interval breast cancers, which are cancers detected within 12 months after normal mammographic screening and comprise up to 28% of all breast cancers in the United States . Hinton et al.  implemented a model to predict whether a pre-cancer mammogram will result in a screen-detected or interval cancer within 12 months from imaging. When using the BI-RADS density alone, the capacity to correctly distinguish interval cancers from screen-detected cancers was moderate (AUROC = 0.65). Yet, when optimized with the mammograms, the model achieved an AUROC of 0.82 (Table 2) . These studies have revealed how AI can use subtle imaging features, not otherwise detectable by humans, to develop better prediction tools than common density-based models.
Hybrid-DL models that incorporate both patient history and radiologic images have also been tested. When used to predict biopsy malignancy, a hybrid-DL algorithm attained 87% sensitivity, 77.3% specificity, and an overall AUROC of 0.91. Even when trained on clinical data alone, the model performed significantly better than the Gail model, another commonly used breast cancer risk tool (AUROC, 0.78 vs. 0.54, respectively; p < 0.004) . These findings are similar to those from Yala et al. , whose hybrid-DL model was compared to a risk factor-based regression model (RF-LR) that used traditional risk factor information only and the established Tyrer-Cuzick model (TC). The findings showed that the hybrid-DL model better predicted the development of breast cancer within 5 years based from the initial FFDM exam when compared to the RF-LR and TC (AUROC, hybrid-DL: 0.70, RF-LR: 0.67, TC: 0.62, p < 0.05) (Table 3) .
This review derived its findings from retrospective studies and small reader studies. While there have been preliminary discussions for implementing standards for integration of AI into clinical practice, benchmarks are controlled by multiple stakeholders with a current lag in guidelines . Despite these shortcomings, the potential for AI in mammography is encouraging as evidenced by recent research. Independent groups, plus commercial organizations, have demonstrated the ability of machines to construct more efficient workloads for radiologists, all the while assisting in diagnosis and predicting cancer risk. As technological advancement continues, we expect AI will play a critical role to assist radiologists practicing mammography.
One of the largest obstacles facing the progression of AI in clinical medicine is the limitation to development and reliability of complex algorithms with the capability of handling multivariable data that factors into physician decision-making . This also brings in the question of responsibility when evaluating complex medical cases. Furthermore, the usage of AI requires the implementation of an ethics code promoting transparency and reliability. Geis et al.  call attention to the fact that various AI models are relatively straightforward to produce and develop, which emphasizes the importance of streamlining existing perspectives on the subject to focus on radiology specifically. Few existing studies satisfactorily evaluate how AI can be utilized in a way that maximizes impact in a responsible way, and further evaluation of the accuracy and reliability of AI is required in order for this technology to be confidently used in clinical application. Despite these limitations, our general expectation is that AI is capable of playing a significant role in mammography.
AI is a field that does not yield to one approach or solution. There are numerous algorithms that can be deployed, not to mention countless ways of training, testing, and validating the models. This makes choosing the right framework still more difficult, but also for the same reason, allows great flexibility for the integration of AI into mammography. Nevertheless, as AI is poised to enter into the field at an increasing rate, there are certain issues that must be anticipated.
After training a model, validation is required. The validation stage refers to evaluating the model on how well it has been trained and tuning its hyperparameters. The steps of training and validation are extremely important when developing a clinically relevant model with optimum predictive potential . An increasing worry revolves around creating fair AI practices that promote equity in diagnosis and treatment . Accordingly, developers will need to prove their model’s applicability for large and diverse populations, including minority ethnic/racial groups and those with less common risk factors. This will require cooperation with large hospital networks and other organizations with heterogeneous imaging data to be able to properly train and validate the AI programs . Indeed, AI systems that are generalizable across large populations are already being tested .
While FFDM is the most common imaging modality currently used, emerging screening technologies such as DBT are rapidly gaining popularity. Future AI programs must be able to handle the transition from 2-dimensional to 3-dimensional data, the latter of which will require greater storage and computing needs. As with FFDM, these models will need to train on large and diverse DBT datasets to remain clinically relevant . Moving forward, AI designers must adapt to any other novel mammographic screening techniques or risk lagging behind the clinical setting.
Given the esoteric nature of both AI algorithms and mammography, it is crucial for AI programmers and radiologists to work together to alleviate knowledge gaps from both areas. These groups must also be transparent with regulatory agencies, such as the United States Food and Drug Administration (FDA), who ultimately approve whether an AI model can be used as a legitimate medical tool. Most studies evaluating AI are retrospective and small-scale reader studies aimed to demonstrate non-inferiority compared to mammography experts; it is uncertain if prospective or other types of trials are required to convince stakeholders of a model’s clinical utility .
There have also been growing concerns over cybersecurity as the amount of data stored on medical databases, including imaging data, increases . In order to determine the true performance of a model, only mammograms linked to clinically confirmed outcomes are used. Thus, developers not only require a diverse array of mammogram registries to use, but also detailed patient history – especially if the model is designed to use both imaging and clinical data. Novel technologies for disseminating medical imaging data are being implemented, such as blockchain . Nonetheless, imaging data organizations may be reluctant to share millions of mammograms and patient history to research groups without complex data use agreements. Even once the AI tool has been approved and implemented, AI algorithms need to be continuously monitored for possible improvements and security risks to mitigate probability of imaging data sabotage .
Programs using AI may be useful with increased efficiency, assistance in diagnosis, and risk prediction in mammography. Future work will need to consider appropriate standards when evaluating the suitability of a model to be used regularly in clinical practice.
Statement of Ethics
This article does not contain any studies with human participants performed by the author. Our research complies with the guidelines for human studies and was conducted ethically in accordance with the World Medical Association Declaration of Helsinki (includes also research on identifiable human material and data).
Conflict of Interest Statement
The authors have no conflicts of interest to declare.
No funding was received for this study.
J.W. formulated the experimental design. S.B., F.L., and A.A. gathered data and interpreted results. M.U. provided guidance. The manuscript was written and edited by all authors. All authors read and approved the final manuscript.