Introduction: Progressive cognitive decline is the cardinal behavioral symptom of most dementia-causing diseases such as Alzheimer’s disease. Because most well-established measures of cognition are ill-suited to tomorrow’s decentralized remote clinical trials, digital cognitive assessments will gain importance. We present the evaluation of a novel digital speech biomarker for cognition (SB-C) following the Digital Medicine Society’s V3 framework: verification, analytical validation, and clinical validation. Methods: Evaluation was conducted in two independent clinical samples: the Dutch DeepSpA dataset (N = 69 subjective cognitive impairment [SCI], N = 52 mild cognitive impairment [MCI], and N = 13 dementia) and the Scottish SPeAk dataset (N = 25 healthy controls). Two anchor scores were used for validation: the Mini-Mental State Examination (MMSE) and the Clinical Dementia Rating (CDR) scale. Results: Verification: The SB-C could be reliably extracted for both languages using an automatic speech processing pipeline. Analytical Validation: In both languages, the SB-C was strongly correlated with MMSE scores. Clinical Validation: The SB-C differed significantly between clinical groups (including MCI and dementia), was strongly correlated with the CDR, and tracked clinically meaningful decline. Conclusion: Our results suggest that the ki:e SB-C is an objective, scalable, and reliable indicator of cognitive decline, fit for purpose as a remote assessment in clinical early dementia trials.
Progressive cognitive decline is the cardinal behavioral symptom of most dementia-causing diseases such as Alzheimer’s disease. Whereas classic neuropsychological tests often have excellent psychometric properties for measuring cognitive decline in dementia, there are scenarios in which they are less suitable or not applicable at all. Since most traditional assessments require the physical presence of clinicians and are therefore not well suited to decentralized remote clinical trials, digital cognitive assessments will gain in importance. Digital cognitive assessments are better suited for automated patient-administered screening or prescreening at low cost to accelerate trial inclusion. Furthermore, a high level of automation can easily scale up outreach to draw unbiased and representative trial populations beyond established clinical site and hospital networks. Within clinical trials, digital markers also deliver objective, high-frequency data to guide interventional trial decision-making and make evaluation more efficient. Speech-based digital biomarkers are especially promising, as they can be deployed remotely and extracted in a highly automated fashion, allowing monitoring and diagnostic solutions to scale both for clinical trials and for health care.
Like traditional “wet” biomarkers, digital speech biomarkers (SBs) measure indicators of normal biological processes, pathogenic processes, or responses to an exposure or intervention. If a biomarker is collected using a digital sensing product, it is called a digital biomarker. If this sensing product captures speech (e.g., using a microphone), it qualifies as a speech digital biomarker, or simply an SB (sometimes also called a voice biomarker).
When referring to SBs, it is important to carefully consider the technical framework that a specific SB is embedded in, as given by its intended use. Depending on this framework, use-case-specific sensor setups need to be evaluated as part of the SB evaluation (e.g., a lower audio sampling rate for telephone-based deployment compared to app-based usage).
From a conceptual/clinical point of view, it is important to define how speech readouts connect to a concept of interest (e.g., a symptom or syndrome) and eventually a meaningful aspect of health. This conceptual framework eventually also drives the validation efforts. Conceptually, an SB often does not measure the capability of a human being to talk and use language but rather uses speech as a medium to measure language-distant pathogenic processes such as breathiness in spoken language due to asthma or other respiratory diseases.
As far as SBs for cognition (SB-C) are concerned, there is a special case in which an SB measures language as a cognitive function, so that the medium (speech) overlaps directly with the concept of interest (speech/language capabilities; for an overview see ). However, this paper focuses on SBs that also measure aspects of cognition other than language, such as learning and memory, executive function, or processing speed (for a breakdown of this conceptual framework of the ki:e SB-C, see Fig. 1).
Over the last decade, substantial work has been done on algorithms using artificial intelligence and speech analysis to predict cognitive decline in elderly populations. However, the main limitations of the field are poor standardization and limited comparability of results, which prevent such innovative solutions from being integrated into clinical practice or trials.
Addressing this challenge, the Digital Medicine Society (DiMe) established the V3 framework to evaluate digital assessments (including SBs) as fit for purpose for use in clinical trials. This paper evaluates the ki:e SB-C following DiMe’s V3 framework, presenting results on verification, analytical validation, and clinical validation from a Dutch elderly clinical cohort and a Scottish healthy cohort.
The ki:e SB-C is a composite score built from more than 50 automatically extracted speech features that compose three distinct neurocognitive domain scores (learning and memory, executive function, and processing speed). From the three domain scores, one aggregated global score for cognition is derived. The ki:e SB-C takes as input speech recordings from two standard neuropsychological assessments: the Rey Auditory Verbal Learning Test (RAVLT) and the Semantic Verbal Fluency task (SVF). Speech from both tests is automatically processed using the proprietary speech analysis pipeline from ki:elements, which involves automatic speech recognition to transcribe speech, followed by feature extraction. Subsequently, the domain scores and the global score for cognition are calculated (for the processing steps, see Fig. 2a). The ki:e SB-C can be collected fully automatically both over traditional landline telephone infrastructure and in face-to-face on-site settings using mobile front ends.
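As a purely illustrative sketch (the actual ki:e feature set, weights, and normalization are proprietary and not specified in this paper), a feature-to-score composite of this general kind can be thought of as a weighted aggregate of normalized features per domain, averaged into one global score:

```python
import statistics

# Hypothetical illustration only: feature names and weights below are invented;
# the real ki:e SB-C pipeline is proprietary.

def domain_score(features: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of normalized speech features for one cognitive domain."""
    return sum(weights[name] * value for name, value in features.items())

def global_score(domain_scores: list[float]) -> float:
    """Aggregate the domain scores into one global score (here: a simple mean)."""
    return statistics.mean(domain_scores)
```

In practice, each domain aggregate would be calibrated against the psychometric anchor for that domain, as described for the H70 development cohort below.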
The ki:e SB-C was developed on the Swedish H70 birth cohort (recruited 2019–2021), leveraging a population-representative sample of 404 non-demented 75-year-old participants who, among other psychometric testing, also produced recordings of the RAVLT and SVF. From those recordings, meaningful linguistic features were extracted and combined into neurocognitive domain scores, validated against the established psychometric anchors available for each domain in the H70 dataset. See Figure 2b for a schematic overview of how the ki:e SB-C and its domains are built.
In this paper, we evaluated the ki:e SB-C as fit for purpose for use in clinical trials in an elderly population of potential candidates for prodromal or preclinical AD trials. The V3 framework established by DiMe provides a unified evaluation framework for digital tools such as SBs. V3 comprises three distinct phases in sequential order: verification, analytical validation, and clinical validation. For each of the three phases, different data must be collected and statistically analyzed to provide the necessary results.
Evaluation of the ki:e SB-C is based on two clinical studies: the Dutch DeepSpA study [13, 14] and the English SPeAk dataset (based on the EPAD Generation Scotland cohort). Both studies collected the ki:e SB-C via telephone or an app-based front end following the protocol described above. For an overview of the datasets, see Table 1.
A total of 140 participants were recruited at the memory clinic of the Maastricht University Medical Center as part of the BioBank Alzheimer Center Limburg study (BBCL), including participants with subjective cognitive impairment (SCI), mild cognitive impairment (MCI), or dementia. For this research, we excluded 6 of the original 140 subjects due to poor audio quality and operational issues. Participants underwent yearly in-person assessments (T0, baseline; T12, month 12) at the clinic, during which the speech data for the ki:e SB-C were collected using a mobile application. For the follow-up assessment (T15, month 15), participants were contacted via the fully automatic telephone front end, which performed the assessment. Not all participants completed all timepoints, either because BBCL study procedures prevent patients with dementia from participating in follow-ups, or, especially for the on-site data collection at T12 (Q4 of 2021), because multiple on-site clinic visits had to be canceled due to COVID-19 preventive measures. Overall, only 29 of the initial 140 participants completed both T12 and T15. Mini-Mental State Examination (MMSE) and Clinical Dementia Rating (CDR) data are available for the yearly on-site visits T0 and T12. Based on the CDR total score, which takes values of 0, 0.5, 1, 2, or 3 corresponding to dementia stages, participants were classified as decliners (N = 10; change from 0 to 0.5 between T0 and T12) or non-decliners (N = 47; no change between T0 and T12).
As part of their on-site visit, all patients underwent a 2-hour extensive neuropsychological assessment in addition to the standard clinical assessments. A clinical diagnosis of MCI or dementia was made based on the DSM-5 criteria for minor and major neurocognitive disorder and on everyday functioning. All other patients were categorized as having SCI, since they had cognitive complaints that were not objectified in cognitive testing. Diagnoses were established in a panel including neuropsychologists, caregivers, and neurologists within the standard memory clinic routine.
Twenty-five healthy participants were recruited from the EPAD Generation Scotland readiness cohort. Participants underwent an in-person baseline assessment as part of their yearly cohort assessments (T0) at the clinic (for the study protocol, see ). The ki:e SB-C was collected during each assessment using a mobile application. For the follow-up assessment (T3, month 3), participants were contacted via the fully automatic telephone front end, which performed the assessment.
Across the V3 phases, different statistical tests are performed on the abovementioned datasets to evaluate the ki:e SB-C.
Verification of SBs entails the systematic evaluation of sample-level sensor outputs against prespecified criteria. The ki:e SB-C uses transcribed speech as input and no acoustic features. Therefore, the most critical part of the sensor output and preprocessing pipeline is the automatic transcription of speech (automatic speech recognition, ASR). The ki:e SB-C uses a proprietary speech processing pipeline based on the Google Speech API. The performance of ASR systems is determined using the Word Error Rate (WER). To calculate WER, we compared ASR output against manually corrected transcripts from trained clinical personnel. Manual transcripts, ASR transcripts, and WER were obtained for all participants in both datasets. WER is computed as the word-level error between the target words in the manual transcripts and the words in the ASR transcripts. In the RAVLT, only correctly remembered words are considered for the WER calculation. Based on previous literature in similar scenarios, a mean WER of 20% is considered acceptable [21-23]. However, across the clinical stages, WER can vary substantially at the individual level, which is why the median WER is a better measure. For verification, we evaluate ASR performance in two languages: Dutch (DeepSpA) and English (SPeAk).
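The standard WER computation can be sketched as a word-level Levenshtein alignment (a minimal implementation; the production pipeline’s exact text normalization rules are not specified in this paper):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) divided by the
    number of words in the reference, via word-level edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)
```

For example, `wer("a b c d", "a x c")` counts one substitution and one deletion against a four-word reference, giving 0.5.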
Analytical validation of SBs evaluates the performance of the algorithms in measuring a certain concept of interest (similar to construct validity). The ki:e SB-C measures cognition as the concept of interest. For the analytical validation, we compare the ki:e SB-C score with a gold standard anchor measure for cognition in this population, evaluate its stability within a short time frame, and assess its ability to detect change in a nonclinical sample.
For the comparison with a gold standard anchor, we compute the Spearman rank partial correlation between the digital biomarker score and the MMSE score in DeepSpA T0, taking confounding age effects into account. Moreover, we calculate Spearman rank partial correlations between the digital biomarker sub-scores and the MMSE subdomain scores (executive function and processing speed with MMSE attention/concentration; memory with MMSE delayed recall), again controlling for the effect of age.
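One common way to compute a Spearman rank partial correlation controlling for a covariate, shown here as an assumed implementation rather than the study’s actual analysis code, is to rank-transform all variables, regress the covariate’s ranks out of both variables of interest, and correlate the residuals:

```python
import numpy as np
from scipy import stats

def spearman_partial(x, y, covar):
    """Spearman rank partial correlation between x and y, controlling for covar:
    rank-transform all variables, regress the covariate's ranks out of x and y
    by ordinary least squares, then Pearson-correlate the residuals."""
    rx, ry, rc = (stats.rankdata(v).astype(float) for v in (x, y, covar))
    design = np.column_stack([np.ones_like(rc), rc])  # intercept + covariate ranks
    res_x = rx - design @ np.linalg.lstsq(design, rx, rcond=None)[0]
    res_y = ry - design @ np.linalg.lstsq(design, ry, rcond=None)[0]
    r, p = stats.pearsonr(res_x, res_y)
    return r, p
```

When the covariate is unrelated to x and y, the result approaches the plain Spearman correlation; when the covariate explains their shared variance, the partial correlation shrinks accordingly.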
To evaluate measurement stability, we calculate test-retest reliability within 3 months for DeepSpA (T12–T15) and SPeAk (T0–T3). Little to no learning effects can be expected within 3 months, nor any significant clinical progression of patients.
Finally, we evaluate the score’s ability to detect nonclinical, age-related cognitive change by comparing young and old cognitively unimpaired participants (SCI subgroups) in their digital biomarker scores. To this end, we perform a median split within the DeepSpA SCI population (participants who completed T0 and the T12 follow-up) and compare the biomarker scores of the two age groups using the nonparametric Kruskal-Wallis test. Age has a significant effect on cognition, which should result in observable but not clinically meaningful differences in the biomarker score.
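The median-split group comparison can be sketched on synthetic data (the ages, scores, and effect size below are invented for illustration; only the procedure mirrors the analysis described above):

```python
import numpy as np
from scipy import stats

# Synthetic illustration: 60 participants with a mild negative age effect on a
# hypothetical cognition score. Values are invented, not study data.
rng = np.random.default_rng(42)
ages = rng.integers(50, 85, size=60)
scores = 1.0 - 0.01 * ages + rng.normal(0, 0.05, size=60)

# Median split on age, then nonparametric group comparison of the scores.
median_age = np.median(ages)
young = scores[ages < median_age]
old = scores[ages >= median_age]
h, p = stats.kruskal(young, old)
```

`stats.kruskal` returns the H statistic and a p value; with a genuine age effect the younger group should tend toward higher scores.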
Clinical validation of SBs evaluates whether algorithms validly measure clinically meaningful change within an intended scenario, including a specified clinical population. The ki:e SB-C is built to measure clinically meaningful change in cognition, with a focus on MCI and dementia populations. We perform group comparisons of biomarker scores between the different diagnostic groups (ANCOVA controlling for age between the SCI, MCI, and dementia groups) and between groups that differ minimally in a clinical anchor (ANCOVA controlling for age with grouping factor CDR-GS 0 vs. CDR-GS 0.5). Additionally, we analyze the correlation between the biomarker score and a clinical anchor measure in the DeepSpA T0 population (Spearman rank partial correlation between the ki:e SB-C score and CDR-SOB, controlling for the effect of age). To evaluate the biomarker’s ability to detect disease progression, we compare biomarker score changes from T0 to T12 (DeepSpA) between decliners (progression in CDR-GS from 0 to 0.5) and non-decliners (no change in the CDR total score). This was done by comparing the SB-C score difference between timepoints T12 and T0 across decliners and non-decliners using the nonparametric Kruskal-Wallis test.
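An ANCOVA of group on score controlling for age can be expressed as a nested-model F-test: the group factor is tested by comparing a full model (group dummies plus age) against a reduced model (age only). The sketch below is a generic implementation on hypothetical data, not the study’s actual analysis code:

```python
import numpy as np
from scipy import stats

def ancova_f(scores, groups, age):
    """One-way ANCOVA via nested OLS models: F-test of the group factor after
    adjusting for age (full model: intercept + age + group dummies;
    reduced model: intercept + age)."""
    scores = np.asarray(scores, float)
    age = np.asarray(age, float)
    levels = sorted(set(groups))
    # Dummy-code the group factor (first level as reference category).
    dummies = np.column_stack([[1.0 if g == lv else 0.0 for g in groups]
                               for lv in levels[1:]])
    intercept = np.ones(len(scores))
    X_full = np.column_stack([intercept, age, dummies])
    X_red = np.column_stack([intercept, age])

    def rss(X):
        beta, *_ = np.linalg.lstsq(X, scores, rcond=None)
        resid = scores - X @ beta
        return resid @ resid

    df1 = len(levels) - 1                    # numerator df: group contrasts
    df2 = len(scores) - X_full.shape[1]      # denominator df: residual
    f = ((rss(X_red) - rss(X_full)) / df1) / (rss(X_full) / df2)
    p = stats.f.sf(f, df1, df2)
    return f, p
```

Dedicated packages (e.g., statsmodels) provide the same test with richer output; the nested-model form makes explicit what "controlling for age" means.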
We report results according to the V3 framework: verification, analytical validation, and clinical validation.
Results show a median WER below 16% for both the Dutch and English languages. The median WER in the Dutch DeepSpA sample was higher than in the English SPeAk sample, but both were within the acceptable range (<20%). Additionally, there was no significant difference between the WERs of the two tasks (SVF and RAVLT). See also Table 2.
In the Dutch DeepSpA T0 sample, the MMSE was strongly correlated with the ki:e SB-C global score (r = 0.54, p < 0.001, d = 1.28; see Fig. 3). In addition, the ki:e SB-C sub-scores were significantly correlated with the MMSE subdomain scores (executive function and MMSE attention/concentration: r = 0.28, p < 0.01, d = 0.58; memory and MMSE delayed recall: r = 0.40, p < 0.001, d = 0.87; processing speed and MMSE attention/concentration: r = 0.28, p < 0.001, d = 0.59).
Moreover, we found strong correlations between the ki:e SB-C global scores at two consecutive assessments 3 months apart, both in the Dutch (DeepSpA T12 & T15; r = 0.72, p < 0.001, d = 2.05) and the English (SPeAk T0 & T3; r = 0.57, p < 0.01, d = 1.40) data. As no disease-related changes are to be expected within a 3-month time frame, this test serves as a check of test-retest reliability.
There was a significant difference in the ki:e SB-C global score between the median-split age groups ([Age < 65] < [Age ≥ 65]) in the Dutch DeepSpA sample, both at T0 (χ2(1) = 10.40, p < 0.01, Cohen’s d = 1.09) and at T12 (χ2(1) = 9.82, p < 0.01, Cohen’s d = 1.05). A similar trend in group differences was observed in the MMSE scores available from T0 (see Table 3).
Clinical validation was performed on the DeepSpA T0 and T12 samples. There was a significant difference in the ki:e SB-C score between the three clinical groups at DeepSpA T0 (SCI > MCI > dementia; F(2,130) = 31.96, p < 0.001, d = 1.40; see Fig. 4a) and between groups that differ minimally in a clinical anchor (CDR-GS 0 vs. 0.5; F(1,118) = 8.79, p < 0.001). Additionally, we observed a strong correlation of the biomarker score with the clinical anchor score CDR-SOB (r = −0.42, p < 0.001, d = −0.93; see Fig. 4b). Comparing T12 and T0, the Kruskal-Wallis test revealed a significant difference between decliners and non-decliners in the biomarker score change between timepoints T12 and T0 (χ2(1) = 7.55, p < 0.001; see Fig. 5).
The digital speech biomarker ki:e SB-C was developed as a measure of cognition in an elderly population, with a special focus on dementia-related cognitive impairment. The current study aimed to evaluate whether this novel biomarker is fit for purpose in clinical trials, using two independent samples and the V3 framework developed by DiMe. The ki:e SB-C can be calculated automatically and robustly from two standard neuropsychological tests (SVF and RAVLT) and, as our results revealed, can validly measure cognitive function in an elderly population at risk of developing dementia.
The novel biomarker ki:e SB-C was systematically evaluated to determine fitness for purpose by following the V3 framework. One important aspect and strength of digital biomarkers in general, and of the ki:e SB-C in particular, is automation. Speech data can easily be collected by telephone or face-to-face via mobile front ends, and the ki:e SB-C can be reliably calculated using the proprietary automatic speech analysis pipeline. The verification step of this study revealed that the automatic speech processing pipeline, including speech recognition, performs at an acceptable level across different languages and tasks, and therefore that automatic processing works reliably for the purposes of the biomarker. The aim of the analytical validation was threefold: to evaluate (1) the ki:e SB-C against a cognitive gold standard measure (MMSE), (2) the biomarker’s retest reliability, and (3) how well the ki:e SB-C reflects age-related but clinically not relevant changes. The results of the analytical validation revealed that the ki:e SB-C validly measures cognitive abilities relevant to the target population, as seen by its high correlation with MMSE scores even when corrected for the effect of age. Furthermore, the automatically calculated biomarker was stable in retesting, as assessed by a test-retest analysis.
It is well established that aging is a risk factor for cognitive decline and disease; however, not all changes are pathological. It is important for a biomarker that serves as a surrogate for cognition to reflect subtle changes at a subclinical level. Our results showed that the ki:e SB-C can detect age-related changes in cognitive function. These changes are also reflected in the MMSE scores as a trend. In the direct comparison with the MMSE in the healthy DeepSpA population, the SB-C shows greater variability, marked by a larger standard deviation relative to the mean. This may be because the SB-C is a more fine-grained measure than the MMSE, which, especially in the healthy population, may lack the resolution to measure cognitive changes at this higher level of functioning and therefore elicits less variance.
Another important characteristic of a biomarker is its clinical meaningfulness, i.e., a biomarker should be related to a concept of interest that is clinically relevant for the target population and the context of use. Our results show that the ki:e SB-C score is associated with scores on the CDR scale, meaning that the biomarker is related to the presence and severity of dementia. This finding is further supported by the significant difference in biomarker scores between the diagnostic groups, i.e., the ki:e SB-C reflects clinically defined disease stages. Moreover, the change in biomarker score differed between decliners and non-decliners, which further confirms the biomarker’s clinical relevance. Altogether, the ki:e SB-C is sensitive to minimal clinical differences between people and can detect clinical changes within a person.
The results of our validation study are in line with studies using speech as a marker of cognitive health and impairment [27-29]. Our work extends these previous findings by rigorously validating the ki:e SB-C as a novel biomarker within the V3 framework, in both healthy and clinical samples and in two different languages (Dutch and English). To our knowledge, this is the first study to systematically evaluate a speech-based digital marker using the V3 framework.
One limitation of our work is that we could not perform predictive analyses due to the small number of decliners in our sample, which is likely related to the pandemic, the small longitudinal sample size, and the relatively short follow-up period for neurodegenerative diseases. The next step would be to evaluate the performance of the ki:e SB-C as a predictor of change over time in a larger longitudinal sample with a longer follow-up period.
The aim of the current study was to validate a novel digital speech-based biomarker for cognitive impairment related to MCI and dementia diagnoses, the ki:e SB-C. The biomarker has undergone considerable validation following the V3 framework. The results of the validation analyses revealed that the ki:e SB-C validly measures cognitive impairment associated with MCI and dementia. The ki:e SB-C is a promising digital biomarker with the potential to detect subtle prodromal changes longitudinally as well as to discriminate between diagnostic groups cross-sectionally.
The ki:e SB-C can be utilized to select target populations for clinical studies and may in the future function as a surrogate disease marker. The ki:e SB-C can be further used to identify patients in preclinical disease stages who have a low disease burden as targets for prevention and early treatment.
Statement of Ethics
Both studies that provided data to this research have been conducted in compliance with the Ethical Principles for Medical Research Involving Human Subjects, as defined in the Declaration of Helsinki and the European General Data Protection Regulation. For the Dutch DeepSpA study, the local Medical Ethical Committee (METC MUMC/UM) approved the study (MEC 15-4-100). For the English SPeAk study, the study has been approved by the Edinburgh Medical School Research Ethics Committee (REC reference 20-EMREC-007). All participants provided informed consent before completing any study-related procedures; participants had to have capacity to consent to participate in this study. For both studies, each participant gave written informed consent before the assessment.
Conflict of Interest Statement
Daphne ter Huurne, Nina Possemis, and Inez Ramakers have no conflicts of interest. Johannes Tröger, Ebru Baykara, Jian Zhao, Elisa Mallick, Simona Schäfer, Louisa Schwed, Mario Mina, and Nicklas Linz are employees of the digital biomarker company ki:elements. Johannes Tröger and Nicklas Linz hold shares of the digital biomarker company ki:elements. Craig Ritchie has received consultancy fees from Biogen, Eisai, MSD, Actinogen, Roche, and Eli Lilly, as well as payment or honoraria from Roche and Eisai.
DeepSpA has been funded by EIT-Health project grant agreement number 19249, as well as supported by Janssen Pharmaceutica NV through a collaboration agreement (award/grant number is not applicable). SPeAk was supported by Janssen Pharmaceutica NV through a collaboration agreement (award/grant number is not applicable).
Johannes Tröger conceptualized this work; he drafted the manuscript and edited the final version. Ebru Baykara contributed to the overall interpretation of the work and drafting of the manuscript. Elisa Mallick, Simona Schäfer, Louisa Schwed, and Mario Mina implemented the biomarker, analyzed the speech, conducted the statistical work, as well as drafted the methods and results sections of this article. Daphne ter Huurne and Nina Possemis acquired parts of the data, contributed to the clinical interpretation of the results, and revised the document. Jian Zhao oversaw the design of the V3 framework validation pipeline from a regulatory standpoint and revised the document. Nicklas Linz contributed to the overall concept of this research and revised the manuscript. Inez Ramakers is responsible for the DeepSpA study and data acquisition, drafted DeepSpA relevant parts of the document, and revised the manuscript. Craig Ritchie is the principal investigator of SPeAk, responsible for the concept and data acquisition, drafted SPeAk relevant parts of the document, and revised the manuscript.
Data Availability Statement
The datasets used for this research might be available to qualified researchers worldwide. Requests are handled by the respective PI or contact person: Craig Ritchie for SPeAk and Inez Ramakers for DeepSpA. Inquiries can be initially directed to the corresponding author.