Abstract
Background: While skin cancers are less prevalent in people with skin of color, they are more often diagnosed at later stages and have a poorer prognosis. The use of artificial intelligence (AI) models can potentially improve early detection of skin cancers; however, the lack of skin color diversity in training datasets may only widen the pre-existing racial discrepancies in dermatology. Objective: The aim of this study was to systematically review the technique, quality, accuracy, and implications of studies using AI models trained or tested in populations with skin of color for classification of pigmented skin lesions. Methods: PubMed was used to identify any studies describing AI models for classification of pigmented skin lesions. Only studies that used training datasets with at least 10% of images from people with skin of color were eligible. Outcomes on study population, design of AI model, accuracy, and quality of the studies were reviewed. Results: Twenty-two eligible articles were identified. The majority of studies were trained on datasets obtained from Chinese (7/22), Korean (5/22), and Japanese populations (3/22). Seven studies used diverse datasets containing Fitzpatrick skin type I–III in combination with at least 10% from black Americans, Native Americans, Pacific Islanders, or Fitzpatrick IV–VI. AI models producing binary outcomes (e.g., benign vs. malignant) reported an accuracy ranging from 70% to 99.7%. Accuracy of AI models reporting multiclass outcomes (e.g., specific lesion diagnosis) was lower, ranging from 43% to 93%. Reader studies, where dermatologists’ classification is compared with AI model outcomes, reported similar accuracy in one study, higher AI accuracy in three studies, and higher clinician accuracy in two studies. A quality review revealed that dataset description and variety, benchmarking, public evaluation, and healthcare application were frequently not addressed. Conclusions: While this review provides promising evidence of accurate AI models in populations with skin of color, the majority of the studies reviewed were obtained from East Asian populations and therefore provide insufficient evidence to comment on the overall accuracy of AI models for darker skin types. Large discrepancies remain in the number of AI models developed in populations with skin of color (particularly Fitzpatrick type IV–VI) compared with those of largely European ancestry. A lack of publicly available datasets from diverse populations is likely a contributing factor, as is the inadequate reporting of patient-level metadata relating to skin color in training datasets.
Introduction
Skin cancer is the most common malignancy worldwide, with melanoma representing the deadliest form. While skin cancers are less prevalent in people with skin of color, they are more often diagnosed at a later stage and have a poorer prognosis when compared to Caucasian populations [1‒3]. Even when diagnosed at the same stage, Hispanic, Native, Asian, and African Americans have significantly shorter survival time than Caucasian Americans (p < 0.05) [4]. Skin cancers in people with skin of color often present differently from those with Caucasian skin and are often underrepresented in dermatology training [5, 6].
The use of artificial intelligence (AI) algorithms for image analysis and detection of skin cancer has the potential to decrease healthcare disparities by removing unintended clinician bias and improving accessibility and affordability [7]. Skin lesion classification by AI algorithms has to date performed equivalently to [8] and, in some cases, better than dermatologists [9]. Human-computer collaboration can increase diagnostic accuracy further [10]. However, most AI advances have used homogeneous datasets [11‒15] collected from countries with predominantly European ancestry [16]. Exclusion of skin of color from training datasets poses the risk of incorrect diagnosis or of missing skin cancers entirely [8] and risks widening racial disparities that already exist in dermatology [8, 17].
While multiple reviews have compared AI-based model performances for skin cancer detection [18‒20], the use of AI in populations with skin of color has not been evaluated. The objective of this study was to systematically review the current literature for AI models for classification of pigmented skin lesion images in populations with skin of color.
Methods
Literature Search
The systematic review follows the PRISMA guidelines [21]. A protocol was registered with PROSPERO (International Prospective Register of Systematic Reviews) and can be accessed at https://www.crd.york.ac.uk/prospero/display_record.php?ID=CRD42021281347.
A PubMed search in March 2021 used search terms relating to artificial intelligence, skin cancer, and skin lesions (search strings in online suppl. eTable; for all online suppl. material, see www.karger.com/doi/10.1159/000530225). No date range was applied, language was restricted to English, and only original research was included. Covidence software was used for screening administration. Titles and abstracts of search results were screened by two independent reviewers (Y.L. 100% and B.B.S. 20%) using the eligibility criteria described in Table 1. The remaining articles were assessed for eligibility by reviewing the methods or full text. Disagreements were resolved following discussion with a third independent reviewer (C.P.).
Table 1. Inclusion and exclusion criteria used for screening and assessing eligibility of articles
| Inclusion criteria | Exclusion criteria |
|---|---|
| 1. Any computer modeling or use of AI on diagnosis of skin conditions | 1. No population description of the training datasets (demographic, racial, or ethnicity breakdown) or from a country with predominantly Caucasian population of European ancestry |
| 2. Datasets provide information on the population (racial or Fitzpatrick skin type breakdown) or datasets obtained from countries with predominantly skin of color population | 2. Dataset description with >90% Caucasian population, or fair skin type, or Fitzpatrick skin type I–III |
| 3. Uses dermoscopic, clinical, 3D, or other photographic images of the skin surface | 3. Solely used images from ISIC [56], PH2 [13], IAD [57], ISBI [58], HAM10000 [12], MED-NODE [14], ILSVRC [59], DermaQuest [60], DERMIS [61], DERM IQA [62], DermNet NZ [63], and datasets known to be of predominantly European ancestry |
| 4. Includes the assessment of malignant and/or non-malignant pigmented skin lesions | |
DERM IQA, Dermoscopic Image Quality Assessment [62]; DERMIS, dermatology information system; ISIC, International Skin Imaging Collaboration [56]; DermNet NZ, DermNet New Zealand [63]; IAD, Interactive Atlas of Dermoscopy [57]; ISBI, International Symposium on Biomedical Imaging [58]; ILSVRC, ImageNet Large Scale Visual Recognition Challenge [59]; HAM10000, Human Against Machine with 10,000 training images [12]; MED-NODE, computer-assisted MElanoma Diagnosis from NOn-DErmoscopic images [14].
Data Extraction and Synthesis
Data extraction was performed by author Y.L. using a standardized form and confirmed by V.K. The following parameters were recorded: reference, ethnicity/ancestry/race, lesion number, sex, age, location, skin condition, public availability of dataset, number of images, type of images, methods of confirmation, deep learning system, model output, comparison with human input, and any missing data reported. Algorithm performance was recorded as accuracy, sensitivity, specificity, and/or area under the receiver operating characteristic curve (AUC). A narrative synthesis of the extracted data was used to present findings, as a meta-analysis was not feasible due to heterogeneity of study designs, AI systems, skin lesions, and outcomes.
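For reference, the sketch below (Python with scikit-learn) shows how these performance measures are typically computed from the outputs of a binary benign/malignant classifier; the arrays and the 0.5 threshold are illustrative assumptions, not values from any reviewed study.

```python
# Minimal sketch of the performance measures extracted in this review
# (accuracy, sensitivity, specificity, AUC) for a binary benign/malignant
# classifier. The arrays and the 0.5 threshold are illustrative only.
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])                    # 1 = malignant, 0 = benign
y_score = np.array([0.1, 0.4, 0.8, 0.7, 0.3, 0.2, 0.9, 0.6])   # model probabilities

y_pred = (y_score >= 0.5).astype(int)                          # threshold the scores
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)          # true-positive rate (malignant correctly flagged)
specificity = tn / (tn + fp)          # true-negative rate (benign correctly cleared)
auc = roc_auc_score(y_true, y_score)  # threshold-independent discrimination

print(f"accuracy={accuracy:.2f}, sensitivity={sensitivity:.2f}, "
      f"specificity={specificity:.2f}, AUC={auc:.2f}")
```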
Quality Assessment
Quality was assessed using the Checklist for Evaluation of Image-Based Artificial Intelligence Reports in Dermatology (CLEAR Derm) Consensus Guidelines [22]. This 25-point checklist offers comprehensive recommendations on factors critical to the development, performance, and application of image-based AI algorithms in dermatology [22].
Author Y.L. performed the quality assessment of all included studies, and author B.B.S. assessed 20%. The inter-rater agreement rate was 87%, with disagreements resolved by a third independent reviewer (V.K.). Each criterion was rated as fully, partially, or not addressed and scored 1, 0.5, or 0, respectively, using the scoring rubric in online supplementary eTable 2.
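To illustrate how the rubric aggregates to a per-study score (a minimal sketch assuming the fully/partially/not mapping above; the item keys and ratings are hypothetical):

```python
# Sketch of the CLEAR Derm scoring used here: each of the 25 checklist items is
# rated fully, partially, or not addressed and mapped to 1, 0.5, or 0; the
# per-study score is the sum. The item keys and ratings below are illustrative.
SCORES = {"fully": 1.0, "partially": 0.5, "not": 0.0}

def clear_derm_score(ratings):
    """Sum the per-item scores for one study's 25 checklist ratings."""
    return sum(SCORES[rating] for rating in ratings.values())

example_ratings = {f"item_{i}": "fully" for i in range(1, 26)}
example_ratings["item_8"] = "partially"   # e.g., skin tone only partially reported
example_ratings["item_12"] = "not"        # e.g., no external test set

print(clear_derm_score(example_ratings))  # 23.5 out of a possible 25
```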
Results
The database search identified 993 articles, including 13 duplicates. After screening titles/abstracts, 535 records were excluded, and the remaining 445 records were screened by methods, with 63 articles reviewed by full text. Forward and backward citations search revealed no additional articles. A total of 22 studies were included in the final review (PRISMA flow diagram in online supplementary eFig. 1).
Study Design
All 22 studies were performed between 2002 and 2021 [23‒32], with 11 (50%) studies published between 2020 and 2021 [33‒44]. An overview of study characteristics is displayed in Table 2. The median number of total images used in each study for all datasets combined was 5,846 (ranging from 212 to 185,192). The median dataset size for training, testing, and validation was 4,732 images (range: 247–22,608), 362 (range: 100–40,331), and 1,258 (range: 109–14,883), respectively.
Table 2. Overview of study characteristics
| First author, year | Ethnicity/ancestry/race/location | Dataset information | Public availability of dataset | Image type/Image No. | Validation (H = histology, C = clinical diagnosis) | Deep learning system | Model output |
|---|---|---|---|---|---|---|---|
Piccolo et al. (2002) [23] | Fitzpatrick I–V | Lesion n = 341Patient n = 289F 65% (n = 188)Average age 33.6 | No | DermoscopyTotal 341Training 0Testing 341Validation 0 | All – H | Network architecture DEM-MIPS (artificial neural network designed to evaluate different colorimetric and geometric parameters) | Binary (melanoma, non-melanoma) |
Iyatomi et al. (2008) [24] | ItalianAustrianJapanese | n/a | No | DermoscopyTotal 1,258Training 247Testing NAValidation 1,258 | Dataset A and B– HDataset C – H + C | Network architectureANN (back-propagation artificial neural networks) | Binary (malignant or benign)Malignancy (risk) score (0–100) |
Chang et al. (2013) [25] | Taiwanese | Lesion n = 769Patient n = 676F 56% (n = 380)Average age 47.6 | No | ClinicalTotal 1,899Training NATesting NAValidation NA | Benign – CMalignant – H | Network architectureComputer-aided diagnosis (CADx) system built on 91 conventional features of shape, texture, and colors (developing software – MATLAB) | Binary (benign or malignant) |
Chen et al. (2016) [26] | American Indian, Alaska Native, Asian, Pacific Islander, black or African American, Caucasian | Community datasetPatient n = 1,900F 52.3% (n = 993)Age >50% under 35 | PartiallyDermNet NZ – YesCommunity – No | ClinicalTotal 12,000Training 11,780Testing 337Validation NA | Community dataset (benign and malignant) – CDermNet – H | Network architecturePatented image-search algorithm that builds on proven computer vision methods from the field of CBIR | Binary (melanoma and non-melanoma) |
Yang et al. (2017) [27] | Korean | Patient n = 110F 50% (n = 55) | No | DermoscopyTotal 297Training 0Testing 297Validation 0 | All – H | Network architecture3 stage algorithm, pre-processing, stripe pattern detection and automatic discrimination (MATLAB) | Binary (LM, nevus) |
Han et al. (2018) [28] | KoreanCaucasian | n/a | PartiallyMED-NODE, Atlas, Edinburgh, Dermofit – yesOthers – no | ClinicalTotal 182,044Training 178,875Testing 1,276Validation 22,728 | ASAN – C+ HMultiple other dataset (5) used with unclear validation | Neural network – CNNNetwork ArchitectureMicrosoft ResNet – 152Google Inception | 12-class skin tumor types |
Yu et al. (2018) [29] | Korean | Lesion n = 275 | No | DermoscopyTotal 724Training 372Testing 362Validation 109 | All – H | Neural network – CNNNetwork architecture modified VGG – 16 | Binary (melanoma/non-melanoma) |
Zhang et al. (2018) [30] | Chinese | Lesion n = 1,067 | No | DermoscopyTotal 1,067Training 4,867Testing 1,142Validation NA | Benign and malignant – CThree dermatologists disagree – H | Neural network – CNNNetwork architectureGoogLeNet Inception v3 | Four class classifier (BCC, SK, melanocytic nevus, and psoriasis) |
Fujisawa et al. (2019) [31] | Japanese | Patients n = 2,296 | No | ClinicalTotal 6,009Training 4,867Testing 1,142Validation NA | Melanocytic nevus, split nevus, lentigo simplex – COthers – PE | Neural network – DCNNNetwork architectureGoogLeNet DCNN model | 1. Two class classifier (benign vs. malignant)2. Four class classifier (malignant epithelial lesion, malignant melanocytic lesion, benign epithelial lesion, benign melanocytic lesion)3. 14 class classification4. 21 class classification |
Jinnai et al. (2019) [38] | Japanese | Patient n = 3,551 | No | DermoscopyTotal 5,846Training 4,732Testing 666Validation NA | Malignant – HBenign tumor – C | Neural network – FRCNNNetwork architecture – VGG-16 | Binary (benign/malignant)Six-class classifications (6 skin conditions) |
Zhao et al. (2019) [32] | Chinese | n/a | No | ClinicalTotal 4,500Training 3,375Testing 1,125Validation NA | Benign – CMalignant – PE | Neural network – CNNNetwork architecture – Xception | Risk (low/high/dangerous) |
Cho et al. (2020) [33] | Korean | Patient n = 404 | No | ClinicalTotal 2,254Training 1,629Testing 625Validation NA | Benign – CMalignant – H | Neural network – DCNNNetwork architecture – Inception-ResNet-V2 | Binary classification (benign or malignant) |
Han et al. (2020) [36] | KoreanCaucasian | ASNA, Normal, SNUPatient n = 28,222F 55% (n = 15,522)Average age 40MED-NODE, Web, EdinburghNA | PartiallyEdinburgh – YesSNU – upon request | ClinicalTotal 224,181Training 220,680Testing 2,441Validation 3,501 | ASAN – C+ HEdinburgh – HMed-Node – HSNU – C + HWeb – image finding | Neural network – CNNNetwork architectureSENetSe-ResNet-50Visual geometry group (VGG-19) | Binary (malignant, non-malignant)Binary (steroids, antibiotics, antivirals, antifungals)Multiple class classification (134 skin disorders) |
Han et al. (2020) [35] | Korean | Patients n = 673Lesions n = 673F 54% (n = 363)Average age 58 | No | ClinicalTotal 185,192Training 182,348Testing NAValidation 2,844 | All – H | Neural network – CNNNetwork architectureSENetSE-ResNeXt-50SE-ResNet-50) | Risk output (risk of malignancy) |
Han et al. (2020) [34] | Korean | Patient n = 9,556Lesion n = 10,426F 55% (5,255)Average age 52 | No | ClinicalTotal 40,331Training 1106,886aTesting NAValidation 40,331 | All – H | Neural network – RCNNNetwork architectureSENetSe-ResNeXt-50 | Binary (malignant, non-malignant)32 class classification |
Huang et al. (2020) [37] | Chinese | Lesion n = 1,225 | No | ClinicalTotal 3,299Training 2,474Testing 825Validation NA | All – PE | Neural network - CNNNetwork architectureInception V3Inception-ResNet V2DenseNet 121ResNet 50 | Binary (SK/BCC) |
Li (2020) [44] | Chinese | Patient n = 106 | No | Dermoscopy and clinicalTotal 212Training 200,000aTesting 212Validation NA | All – H | Network architectureYouzhi AI software (Shanghai Maise Information Technology Co., Ltd., Shanghai, China) | Binary (benign or malignant)14 class classification |
Liu et al. (2020) [39] | Fitzpatrick type I–VI | Patient n = 15,640Lesion n = 20,676 | No | ClinicalTotal 79,720Training 64,837Testing NAValidation 14,483 | Benign – CMalignant – H | Neural network – DLSNetwork architectureInception – v4 | 26 class classification (primary output)419 class classification (secondary output) |
Wang et al. (2020) [40] | Chinese, with Fitzpatrick type IV | n/a | No | DermoscopyTotal 10,307Training 8,246Testing 1,031Validation 1,031 | BCC – C+HOthers – C | Neural network – CNNNetwork architectureGoogLeNet Inception v3 | Binary classification (psoriasis and others)Multi-class classification |
Huang et al. (2021) [41] | TaiwaneseCaucasian | KCGMHPatient no. 1,222F 52.4% (n = 640)Average age 62HAM10000 n/a | PartiallyKCGMH – noHAM10000 – yes | ClinicalTotal 1,287Training 1,031Testing 128Validation 128 | All – H | Neural network – CNNNetwork architectureDenseNet 121 – binary classificationEfficientNet E4 – five class classification | Binary (benign/malignant)5 class classification (BCC, BK, MM, NV, SCC)7 class (AK, BCC, BKL, SK, DF, MM, NV) |
Minagawa et al. (2021) [42] | CaucasianJapanese | Patient n = 50 | PartiallyISIC–yesShinshu – no | DermoscopyTotal 12,948Training 12,848Testing 100Validation NA | Benign – CMalignant – H | Neural network - DNNNeural architecture – Inception-ResNet-V2 | 4 class classification (MM/BCC/MN/BK) |
Yang et al. (2021) [43] | Chinese | n/a | No | ClinicalTotal 12,816Training 10,414Testing 300Validation 2,102 | All – C | Neural network – DCNNNeural architectureDenseNet-96ResNet-152ResNet-99Converged network (DenseNet – ResNet fusion) | 6 class classification (Nevi, Melasma, cafe-au-lait, SK, and acquired nevi) |
BCC, basal cell carcinoma; BK, benign keratosis; CNN OR DCNN, convolutional neural network; DF, dermatofibroma; SCC, squamous cell carcinoma; SK, seborrheic keratosis; MM, melanoma; MN, melanocytic; PE, pathological examination (insufficient information provided whether histopathology and/or clinical evaluation was used); n/a, not available; n, number; CBIR, content-based image retrieval.
aAlgorithm previously trained on a different dataset; therefore, dataset numbers are not included.
The majority of studies (15/22, 68%) analyzed clinical images (i.e., wide-field or regional images), while seven studies analyzed dermoscopy images [23, 24, 27, 29, 30, 40, 42], and one study included both [44]. All but one study included both malignant and benign pigmented skin lesions, with one investigating only benign pigmented facial lesions [43].
Histopathology was used as the ground truth for all malignant lesions in 15 studies and partially in two studies [24, 26], while one study used histopathology only to resolve clinician disagreements [23]. Seven studies used histopathology as the ground truth for benign lesions [23, 27, 29, 34, 35, 41, 44]. In nine studies, ground truth was established by consensus of experienced dermatologists [25, 30‒32, 38‒40, 42, 43]. In the remaining studies, ground truth was established by a mix of both [24, 26, 33, 36] or was not clearly defined [28, 37].
The number of pigmented skin lesion classifications used for AI model evaluation ranged from binary outcomes (e.g., benign vs. malignant) to classification of up to 419 skin conditions [39]. While most studies (19/22, 86%) evaluated lesions across all body sites, one study exclusively analyzed the lips/mouth [33], another assessed only facial skin lesions [43], and one study specifically addressed acral melanoma [29].
Population
Homogeneous datasets were collected from Chinese/Taiwanese (n = 8, 36%) [25, 30, 32, 37, 40, 41, 43, 44], Korean (n = 5, 23%) [27‒29, 33‒35], and Japanese (n = 3, 14%) [31, 38, 42] populations. Seven studies (32%) included Caucasian/Fitzpatrick skin type I–III populations [23, 24, 26, 28, 36, 39, 42], with at least 10% American Indian [26], Alaska Native [26], black or African American [26], Pacific Islander [26], Native American [26], or Fitzpatrick IV–VI [23, 39] in the training and/or test set (Table 2).
Outcome and Performance
The outcomes of the classification algorithms used either a diagnostic model, a categorical risk model (e.g., low, medium, or high), or a combination of both. An overview of AI model performance is provided in Table 3. The majority of studies (20/22, 91%) used a diagnostic model, either with binary classification of benign or malignant [23‒27, 29, 33, 35, 37], multiclass classification of specific lesion diagnoses [28, 30, 32, 39, 42, 43], or both [31, 34, 36, 38, 40, 41, 44]. One study used categorical risk as the outcome [32], and another reported both a diagnostic model and a categorical risk model [24].
Table 3. Measures of output and performance for AI models included in the review
| Reference | Accuracy (%) | Sensitivity (%) | Specificity (%) | AUC |
|---|---|---|---|---|
Binary classification models | ||||
Piccolo et al. (2002) [23] | n/a | 92 | 74 | n/a |
Iyatomi et al. (2008) [24] | n/a | 86 | 86 | 0.93 |
Chang et al. (2013) [25] | 91 | 86 | 88 | 0.95 |
Chen et al. (2016) [26] | 91 | 90 | 92 | n/a |
Yang et al. (2017) [27] | 99.7 | 100 | 99 | n/a |
Yu et al. (2018) [29] | 82 | 93 | 72 | 0.80 |
Cho et al. (2020) [33] | n/a | Dataset 1: 76Dataset 2: 70 | Dataset 1: 80Dataset 2: 76 | Dataset 1: 0.83Dataset 2: 0.77 |
Huang et al. (2020) [37] | 86 | n/a | n/a | 0.92 |
Han et al. (2020) [35] | n/a | 77 | 91 | 0.91 |
Fujisawa et al. (2019) [31] | 93 | 96 | 90 | n/a |
Jinnai et al. (2019) [38] | 92 | 83 | 95 | n/a |
Han et al. (2020) [36] | n/a | n/a | n/a | Edinburgh dataset: 0.93SNU dataset: 0.94 |
Han et al. (2020) [34] | n/a | Top 1: 63 | Top 1: 90 | 0.86 |
Li et al. (2020) [44] | 86 | 75 | 93 | n/a |
Wang et al. (2020) [40] | 77 | n/a | n/a | n/a |
Multiclass classification models | ||||
Han et al. (2018) [28] | n/a | ASAN dataset: 86Edinburgh dataset: 85 | ASAN dataset: 86Edinburgh dataset: 81 | |
Zhang et al. (2018) [30] | Dataset A: 87Dataset B: 87 | n/a | n/a | |
Fujisawa et al. (2019) [31] | 77 | n/a | n/a | |
Jinnai et al. (2019) [38] | 87 | 86 | 87 | |
Liu et al. (2020) (26-classification model) [39] | Top 1: 71Top 3: 93 | Top 1: 58Top 3: 88 | n/a | |
Han et al. (2020) [36] | Top 1Edinburgh dataset: 57SNU dataset: 45Top 3Edinburgh dataset: 84SNU dataset: 69Top 5Edinburgh dataset: 92SNU dataset: 78 | n/a | n/a | |
Han et al. (2020) [34] | Top 1: 43Top 3: 62 | n/a | n/a | |
Li et al. (2020) [44] | 73 | n/a | n/a | |
Wang et al. (2020) [40] | 82 | n/a | n/a | |
Minagawa et al. (2021) [42] | 90 | n/a | n/a | |
Yang et al. (2021) [43] | Algorithm A: 88Algorithm B: 77Algorithm C: 90Algorithm D: 87 | Algorithm A: 83Algorithm B: 63Algorithm C: 81Algorithm D: 80 | Algorithm A: 98Algorithm B: 90Algorithm C: 99Algorithm D: 98 | |
Huang et al. (2021) [41] | 5 class (KCGMH dataset): 72; 7 class (HAM10000 dataset): 86 | n/a | n/a | |
Risk categorical classification | ||||
Zhao et al. (2019) [32] | 83 | Benign: 93Low risk: 85High risk: 86 | Benign: 88Low risk: 85High risk: 91 | Benign: 0.96Low risk: 0.92High risk: 0.95 |
Top: top-n accuracy indicates that the correct diagnosis is among the top n predictions output by the model. For example, top-3 accuracy means that any of the model's three highest-probability predictions matches the expected answer.
AUC, area under the curve.
The AI models using binary classification (16/22) reported accuracies ranging from 70% to 99.7%. Of these, 6/16 reported ≥90% accuracy [25‒27, 31, 38, 41], three reported between 80% and 90% [29, 37, 44], and one reported <80% [40]. Twelve AI models reported sensitivity and specificity as performance measures, which ranged from 58% to 100% and 72% to 99%, respectively. Eight studies provided an area under the curve (AUC), with 5/8 reporting values >0.9 [24, 25, 35‒37] and the remaining three models scoring between 0.77 and 0.86 [29, 33, 34].
For the 13 studies using multiclass output (i.e., >2 diagnoses), accuracy of models ranged from 43% to 93%. Six of these studies (6/13) scored <80% accuracy [31, 34, 36, 39, 41, 44], six others scored between 80 and 90% accuracy [30, 32, 38, 40, 42, 43], and one provided sensitivity and specificity of 86% and 86%, respectively, as a measure of performance [28].
Reader Studies
Reader studies, in which the performance of AI models is compared with clinician classification, were performed in 14/22 studies, with results provided in Table 4 [23, 25, 29, 31‒39, 42, 44]. Six studies compared AI outcomes with classification by experts, e.g., dermatologists [25, 32, 34, 36, 42, 44]. Eight studies compared outcomes for both experts and non-experts, e.g., dermatology residents and general practitioners [23, 29, 31, 33, 35, 37‒39].
Table 4. Reader studies between AI models and human experts (e.g., dermatologists) and non-experts (e.g., dermatology residents, GPs)
| Reference | AI performance | Expert performance | Non-expert performance |
|---|---|---|---|
Piccolo et al. (2002) [23] | Sensitivity: 92%Specificity: 74% | Sensitivity: 92%Specificity: 99% | Sensitivity: 69%Specificity: 94% |
Chang et al. (2013) [25] | Accuracy Melanoma: 91%Non-melanoma: 83%Sensitivity: 86%Specificity: 88% | Accuracy: 81%Sensitivity: 83%Specificity: 86% | |
Yu et al. (2018) [29] | Accuracy: 82%Sensitivity: 93%Specificity: 72%AUC: 0.80 | Accuracy: 81%Sensitivity: 97%Specificity: 67%AUC: 0.80 | Accuracy: 65%Sensitivity: 45%Specificity: 84%AUC: 0.65 |
Huang et al. (2020) [37] | Sensitivity: 90%AUC 0.94 | Sensitivity: 85%Specificity: 90% | Sensitivity: 66%Specificity: 72% |
Han et al. (2020) [35] | Sensitivity: 89%Specificity: 78%AUC: 0.92 | Sensitivity: 95%Specificity: 72%AUC: 0.91 | AccuracyDermatology resident: 94%Non-dermatology clinician: 77%SensitivityDermatology resident: 69%Non-dermatology clinician: 65%AUC Dermatology resident: 0.88Non-dermatology clinician: 0.73 |
Fujisawa et al. (2019) [31] | AccuracyBinary: 92%Multiclass: 75% | AccuracyBinary: 85%Multiclass: 60% | AccuracyBinary: 74%Multiclass: 42% |
Jinnai et al., (2019) [38] | Accuracy: 92%Sensitivity: 83%Specificity: 95% | Accuracy: 87%Sensitivity: 86%Specificity: 87% | Accuracy: 85%Sensitivity: 84%Specificity: 86% |
Zhao et al. (2019) [32] | SensitivityBenign: 90%Low risk: 90%High risk: 75% | SensitivityBenign: 61%Low risk: 50%High risk: 64% | |
Cho et al. (2020) [33] | SensitivityDataset 1: 76%Dataset 2: 70%SpecificityDataset 1: 80%Dataset 2: 76%AUCDataset 1: 0.83Dataset 2: 0.77 | Sensitivity-Without algorithm: 90%-With algorithm: 90%Specificity-Without algorithm: 58%-With algorithm: 61% | SensitivityDermatology resident-Without algorithm: 80%-With algorithm: 85%Non-dermatology clinician-Without algorithm: 65%-With algorithm: 74%SpecificityDermatology resident-Without algorithm: 53%-With algorithm: 71%Non-dermatology clinician-Without algorithm: 46%-With algorithm: 49%AUCDermatology resident-Without algorithm: 0.33-With algorithm: 0.42Non-dermatology clinician-Without algorithm: 0.11-With algorithm: 0.23 |
Han et al. (2020) [36] | Multiclass modelAccuracyTop 1: 45%Top 3: 69%Top 5: 78% | Multiclass modelAccuracy (without algorithm)Top 1: 50%Top 3: 67% (with algorithm)Top 1: 53%Top 3: 74%Binary modelAccuracy-Without algorithm: 77%-With algorithm: 85% | |
Han et al. (2020) [34] | Binary modelSensitivity: 67%Specificity: 87%Multiclass accuracyTop 1: 50%Top 3: 70% | Binary modelSensitivity: 66%Specificity: 67%Multiclass accuracyTop 1: 38%Top 3: 53% | |
Li et al. (2020) [44] | AccuracyBinary: 73%Multiclass: 86% | AccuracyBinary: 83%Multiclass: 74% | |
Liu et al. (2020) [39] | AccuracyTop 1: 66%Top 3: 90% | AccuracyTop 1: 63%Top 3: 75% | AccuracyPrimary care physicianTop 1: 44%Top 3: 60%Nurse practitionerTop 1: 40%Top 3: 55% |
Minagawa et al. (2021) [42] | Accuracy: 71% | Accuracy: 90% |
In reader studies comparing binary classification between AI and experts (n = 11), one study reported similar diagnostic accuracy/specificity [29], three showed higher accuracy for AI models [25, 31, 38], and two reported higher accuracy for experts [42, 44]. Five studies reported specificity, sensitivity, and AUC instead of accuracy, with varying outcomes [23, 32, 33, 35, 37]. In reader studies between AI and non-experts (n = 7), AI showed higher accuracy, specificity, sensitivity, and AUC in most studies [23, 29, 31, 33, 35, 37, 38]. In reader studies comparing multiclass classification between AI and expert readers (n = 5), three studies reported higher top-1 (i.e., highest-probability diagnosis) accuracy for AI [31, 34, 44], readers performed better in one study [36], and one study reported similar results for AI and readers [39]. In multiclass reader studies with non-experts (n = 2), AI showed higher accuracy in both studies [31, 39].
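As a point of reference, top-n accuracy as reported in several of these studies can be computed as in the sketch below; the probability matrix and labels are illustrative only, not data from any reviewed study.

```python
# Sketch of top-n accuracy for a multiclass classifier: a case counts as correct
# if the true diagnosis is among the n highest-probability classes. The
# probability matrix and labels below are illustrative only.
import numpy as np

def top_n_accuracy(probs, labels, n):
    """probs: (cases, classes) predicted probabilities; labels: true class indices."""
    top_n = np.argsort(probs, axis=1)[:, -n:]          # n most probable classes per case
    hits = np.any(top_n == labels[:, None], axis=1)    # is the true class among them?
    return hits.mean()

probs = np.array([[0.6, 0.3, 0.1],
                  [0.2, 0.5, 0.3],
                  [0.1, 0.2, 0.7]])
labels = np.array([0, 2, 1])                           # true diagnoses

print(top_n_accuracy(probs, labels, n=1))  # 0.33: only the first case is top-1 correct
print(top_n_accuracy(probs, labels, n=2))  # 1.0: every true class falls within the top 2
```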
Quality Assessments
Studies included in the review were evaluated against the 25-point CLEAR Derm Consensus Guidelines [22], covering four domains (data, technique, technical assessment, and application). For each checklist item, each study was assessed as having fully, partially, or not addressed the criterion. A summary of results is provided in Table 5, with individual study scores available in online supplementary eTable 3. Overall, studies obtained an average score of 17.4 (±2.2) out of a possible 25 points.
Table 5. Summary of quality assessment on AI models reviewed
| Domain | Item | Checklist item | Fully addressed, n (%) | Partially addressed, n (%) | Not addressed, n (%) |
|---|---|---|---|---|---|
Data | 1 | Image types | 21 (95) | 1 (5) | 0 (0) |
2 | Image artifacts | 12 (55) | 5 (23) | 5 (23) | |
3 | Technical acquisition details | 22 (100) | 0 (0) | 0 (0) | |
4 | Pre-processing procedures | 20 (91) | 0 (0) | 2 (9) | |
5 | Synthetic images made public if used | 22 (100)a | 0 (0) | 0 (0) | |
6 | Public images adequately referenced | 22 (100) | 0 (0) | 0 (0) | |
7 | Patient-level metadata | 5 (23) | 17 (77) | 0 (0) | |
8 | Skin tone information and procedure by which skin tone was assessed | 3 (14) | 16 (73) | 3 (14) | |
9 | Potential biases that may arise from use of patient information and metadata | 9 (41) | 7 (32) | 6 (27) | |
10 | Dataset partitions | 12 (55) | 9 (41) | 1 (5) | |
11 | Sample sizes of training, validation, and test sets | 7 (32) | 14 (64) | 1 (5) | |
12 | External test set | 3 (14) | 2 (9) | 17 (77) | |
13 | Multivendor images | 20 (91) | 2 (9) | 0 (0) | |
14 | Class distribution and balance | 5 (23) | 15 (68) | 2 (9) | |
15 | OOD images | 2 (9) | 7 (32) | 13 (59) | |
Technique | 16 | Labeling method (ground truth, who did it) | 15 (68) | 7 (32) | 0 (0) |
17 | References to common/accepted diagnostic labels | 22 (100) | 0 (0) | 0 (0) | |
18 | Histopathologic review for malignancies | 16 (73) | 2 (9) | 4 (18) | |
19 | Detailed description of algorithm development | 14 (64) | 6 (27) | 2 (10) | |
Technical Assessment | 20 | How to publicly evaluate algorithm | 5 (23) | 0 (0) | 17 (77) |
21 | Performance measures | 9 (41) | 13 (59) | 0 (0) | |
22 | Benchmarking, technical comparison, and novelty | 15 (68) | 0 (0) | 7 (32) | |
23 | Bias assessment | 10 (45) | 6 (27) | 6 (27) | |
Application | 24 | Use cases and target conditions (inside distribution) | 16 (73) | 6 (27) | 0 (0) |
25 | Potential impacts on the healthcare team and patients | 3 (14) | 13 (59) | 6 (27) |
aNo studies included synthetic images (checklist item 5), therefore marked as “fully addressed” to not negatively impact quality score.
OOD, out of distribution.
Data
The data domain consists of 15 checklist items; of these, six items were fully addressed by >90% of publications, including checklist items (1) image types, (3) technical acquisition details, (4) pre-processing procedures, and (6) public images adequately referenced. No studies included synthetic images (item 5). Checklist items (2) image artifacts, (9) potential biases, and (10) dataset partitions (e.g., lesion distribution) were only fully addressed by approximately half of the studies. About a third of studies fully addressed the sample size of training, validation, and test sets (item 11) [28, 29, 34, 35, 39, 40, 43]. The most poorly addressed criteria were patient-level metadata (e.g., sex, gender, ethnicity; item 7), skin color information (item 8), using an external test dataset (item 12), class distribution and balance of the images (item 14), and out of distribution images (item 15).
Patient-level metadata (item 7) was partially reported in most studies. While geographical location, hospital location, and race were adequately reported, sex and age specifications were limited, as were anatomical location, relevant past medical history, and history of presenting illness. Skin color information was reported as Fitzpatrick’s skin type in only three studies [23, 39, 40].
Technique
All four checklist items in the technique domain were fully addressed by most publications. Image labeling method (item 16), e.g., information on ground truth, was fully addressed in 68% (n = 15) of publications. All studies used commonly accepted diagnostic labels (item 17). Histopathologic review (item 18) was fully addressed in 73% (n = 16) of studies. A detailed description of algorithm development (item 19) was provided in 64% (n = 14) of studies.
Technical Assessment
Of the four checklist items in this domain, the most poorly addressed was public evaluation of the model (item 20), with only five papers fully addressing this item [29, 34‒36, 39] and the majority (n = 17, 77%) not addressing it at all. Four studies reported public availability of their AI algorithm or offered a public-facing test interface for external testing [28, 34‒36]. Performance measures (item 21) were fully addressed by nine studies. Benchmarking, technical comparison, and novelty (item 22), i.e., comparison with previously developed algorithms and human specialists, was fully addressed by 15 studies, and technical bias assessment (item 23) was discussed by 10 studies.
Application
The application domain consisted of two checklist items. Use cases and target conditions (item 24) were fully addressed by 16 (73%) studies, while potential impacts on the healthcare team and patients (item 25) were fully addressed by only three studies [38, 39, 41].
Discussion
We present, to our knowledge, the first systematic review summarizing existing image-based AI algorithms for the classification of skin lesions in people with skin of color. Our review identified 22 articles in which the algorithms were developed or tested primarily on images from people with skin of color.
Diversity and Transparency
Within the identified AI studies involving skin of color populations, there is further under-representation of darker skin types. We found that Asian populations (Chinese, Korean, and Japanese) were the major ethnic groups included, with limited datasets involving populations from the Middle East, India, Africa, the Pacific Islands, and Hispanic and Native Americans. Only three studies specifically reported skin color categories of Fitzpatrick type IV–VI and/or black or African American. Two North American studies reported only 4.3% [26] and 6.8% [39] black or African American images, and one Italian-led study [23] reported 26.5% of images with Fitzpatrick skin type IV/V.
Adding to the issue was a lack of transparency in reporting skin color in image-based AI studies. A recently published scoping review [17] of 70 studies involving image-based AI algorithms for classification of skin conditions found that only 14 (20%) provided details about race and seven (10%) provided details about skin color. Furthermore, only three studies partially included Fitzpatrick skin type IV–VI populations. The lack of diversity in studies is likely fueled by the shortage of diverse publicly available datasets. A recent systematic review of 39 open access datasets and skin atlases reported that 12 of the 14 datasets reporting country of origin were from countries of predominantly European ancestry [16]. Only one dataset, from Brazil, reported that 5% of its images were of Fitzpatrick skin type IV–VI [45], and another dataset comprised images from a South Korean population [36].
Model Performance
AI models in this review showed reasonably high classification performance using both dermoscopic and clinical images in populations with skin of color. Accuracy was greater than 70% for all binary models reviewed, which is similar to models developed using Caucasian datasets [46]. It has previously been suggested that, instead of training algorithms on more diverse datasets, it may be beneficial to develop separate algorithms for different populations [7]. Han et al. [28] demonstrated a significant difference in performance of an AI model trained exclusively on skin images from Asian populations, which had greater accuracy for classifying basal cell carcinomas and melanomas in Asian populations (96% and 96%, respectively) than in Caucasian populations (90% and 88%, respectively).
There is some evidence that adding out-of-distribution images (i.e., those not originally represented in the training dataset) to a pre-existing dataset and re-training can improve classification accuracy among the newly added group while not affecting accuracy among the original group [22]. Another previously suggested method is to use artificially darkened lesion images in the training dataset [47].
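As a rough illustration of the latter suggestion, such an augmentation could be approximated by a simple brightness reduction applied to training images; the sketch below is only a conceptual example, and the factor range is an assumption rather than the method used in [47].

```python
# Rough sketch of an "artificially darkened" training augmentation: scale image
# brightness down by a random factor. The factor range is an illustrative
# assumption and not a validated way of simulating darker skin tones.
import random
from PIL import Image, ImageEnhance

def darken(image, low=0.5, high=0.85):
    """Return a copy of the lesion image with brightness scaled by a factor < 1."""
    factor = random.uniform(low, high)
    return ImageEnhance.Brightness(image).enhance(factor)

# Hypothetical usage on a training image file:
# img = Image.open("lesion_0001.jpg")
# darken(img).save("lesion_0001_darkened.jpg")
```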
Quality Assessments
Our review identified several critical items from the CLEAR Derm checklist that were insufficiently addressed by the majority of AI imaging studies in populations with skin of color. Overall, we found incomplete description of image artifacts, lack of skin color information and other patient-level metadata, missing rationale for dataset sizes, and infrequent use of external test sets. Image artifacts have previously been shown to affect algorithm performance in both dermatological images [48, 49] and clinical photographs [50]. Unclear descriptions of image pre-processing can also lead to incorrect diagnoses, as subtle alterations in color balance or rotation can result in a melanoma being incorrectly classified as a benign nevus [49]. Furthermore, less than half of the studies reviewed reported the anatomical locations of imaged lesions. The body site of a lesion can be informative; for example, skin of color populations have a high prevalence of acral melanomas, which are commonly found on the palms, soles, subungual region, and mucous membranes [51]. Future studies should consider adopting established consensus on image acquisition metrics [52] or dermatology-specific guidelines [22] to standardize images and develop robust archives of skin images used for research.
A major concern is that most AI algorithms have only been tested on internal datasets, i.e., those sampled from the same or similar populations as the training data. This issue was similarly highlighted in a recent review [18]. AI algorithms are prone to overfitting, which can result in inflated accuracy measures when they are tested only on internal datasets, and in a significant performance drop when they are tested with images from a different source [9]. The lack of standardized performance measures, publicly available diverse datasets, test interfaces, and reader studies presents a barrier to the comparability and reproducibility of AI algorithms. An alternative solution is for algorithms to be evaluated on standardized public test datasets, with transparency on algorithm performance [22]. Testing models on external datasets and/or out-of-distribution images is a better indicator of AI model performance in a real-world setting. Given that public dermoscopic datasets with diverse skin colors are now available, future algorithms should include external testing as the gold-standard evaluation [53].
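In practice, external testing simply means reporting the same performance measures on images drawn from a source other than the development data, as in the minimal sketch below; `model`, `load_split`, and `load_external` are hypothetical placeholders rather than functions from any reviewed study.

```python
# Sketch of internal vs. external evaluation: the same trained model is scored on
# a held-out split of the development data and on an independently sourced test
# set. `model`, `load_split`, and `load_external` are hypothetical placeholders.
from sklearn.metrics import roc_auc_score

def evaluate_auc(model, images, labels):
    """AUC of the model's malignancy scores on a given image set."""
    scores = model.predict_proba(images)[:, 1]
    return roc_auc_score(labels, scores)

# internal_x, internal_y = load_split("internal_test")        # same source as training
# external_x, external_y = load_external("other_population")  # different population/devices
# print("internal AUC:", evaluate_auc(model, internal_x, internal_y))
# print("external AUC:", evaluate_auc(model, external_x, external_y))
```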
Lastly, only a few of the studies included in the review addressed the potential impacts on the healthcare team and patients. Future studies would benefit from considering the clinical translation of AI models during the early stages of development and from evaluating the clinical utility of AI models, along with their performance, in conjunction with their intended users [36].
Limitations and Outlook
Our systematic review is not without limitations. First, our search was limited to articles published in English. This is particularly of note for populations with skin of color, where English may often not be the primary language. Additionally, as many studies did not report on skin of color status, inclusion was based on assumptions using ethnicity and geographical location. There can be significant variability in skin color even for a population of the same race or geographical location [54]. Reporting bias may also influence the review, as higher performing algorithms are more likely to be reported and accepted for publication.
Conclusion
AI algorithms have great potential to complement skin cancer screening and the diagnosis of pigmented lesions, leading to improved dermatology workflows and patient outcomes. The performance and reliability of image-based AI algorithms depend significantly on the quality of the data on which they are trained and tested. While this review shows promising development of AI algorithms for skin of color in East Asian populations, there are still significant discrepancies in the number of algorithms developed in populations with skin of color, particularly in darker skin types. Without inclusion of images from populations with skin of color, the incorporation of AI models into clinical practice could lead to missed and delayed diagnoses of neoplasms in people with skin of color, further widening existing health outcome disparities [55].
Statement of Ethics
An ethics statement is not applicable because this study is based exclusively on published literature.
Conflict of Interest Statement
H. Peter Soyer is a shareholder of MoleMap NZ Ltd. and e-derm-consult GmbH and undertakes regular teledermatological reporting for both companies. H. Peter Soyer is a medical consultant for Canfield Scientific Inc., MoleMap Australia Pty Ltd., and Blaze Bioscience Inc. and a medical advisor for First Derm. The remaining authors have no conflicts of interests to declare.
Funding Sources
H. Peter Soyer holds an NHMRC MRFF Next Generation Clinical Researchers Program Practitioner Fellowship (APP1137127). Clare Primiero is supported by an Australian Government Research Training Program Scholarship.
Author Contributions
Yuyang Liu: conceptualization (equal), data curation (lead), formal analysis (equal), methodology (lead), validation (equal), and writing – original draft (lead). Clare Primiero: conceptualization (equal), funding acquisition (equal), methodology (equal), supervision (equal), and writing – review and editing (equal). Vishnutheertha Kulkarni: data curation (supporting), formal analysis (supporting), validation (equal), and writing – review and editing (supporting). H. Peter Soyer: conceptualization (supporting), funding acquisition (equal), supervision (supporting), and writing – review and editing (supporting). Brigid Betz-Stablein: conceptualization (equal), data curation (supporting), formal analysis (supporting), methodology (equal), supervision (equal), and writing – review and editing (equal).
Data Availability Statement
This review is based exclusively on published literature. No publicly available datasets were used and no data were generated. Further inquiries can be directed to the corresponding author.