Speech represents a promising novel biomarker by providing a window into brain health, as shown by its disruption in various neurological and psychiatric diseases. As with many novel digital biomarkers, however, rigorous evaluation is currently lacking and is required for these measures to be used effectively and safely. This paper outlines and provides examples from the literature of evaluation steps for speech-based digital biomarkers, based on the recent V3 framework (Goldsack et al., 2020). The V3 framework describes 3 components of evaluation for digital biomarkers: verification, analytical validation, and clinical validation. Verification includes assessing the quality of speech recordings and comparing the effects of hardware and recording conditions on the integrity of the recordings. Analytical validation includes checking the accuracy and reliability of data processing and computed measures, including understanding test-retest reliability, demographic variability, and comparing measures to reference standards. Clinical validation involves verifying the correspondence of a measure to clinical outcomes, which can include diagnosis, disease progression, or response to treatment. For each of these components, we provide recommendations for the types of evaluation necessary for speech-based biomarkers and review published examples. The examples in this paper focus on speech-based biomarkers, but they can be used as a template for digital biomarker development more generally.
Research into digital biomarkers is an area of rapid growth in digital medicine. Recent reviews have shown how digital measures will bring benefits to clinical research and practice [1-6]. For example, digital biomarkers are more ecologically valid measures than many current clinical assessments of cognition and function, allowing assessments to occur during everyday activities or with little instruction. They also offer a promising solution for known problems associated with current clinical tools, especially repeated assessment effects, interrater variability, time investment, and expense. Importantly, the use of digital measures will facilitate remote testing, improving accessibility and reducing the risks inherent in visiting health care centers. Digital biomarkers also facilitate frequent testing, which has the potential to provide richer and more detailed data. This approach holds considerable promise for yielding more sensitive measures of symptoms and disease, but many of these potential advantages must be borne out with empirical testing.
Systematic and rigorous evaluation of digital biomarkers is crucial to ensure that they are providing accurate measurement and can serve as suitable surrogate endpoints for detecting and monitoring disease. Following Strimbu and Tavel, we prefer to use the term “evaluation” rather than “validation” to refer to the overall process of assessing a biomarker, since this is a continuous process and may not have a single, conclusive outcome. We approach biomarker evaluation as a systematic series of studies that evaluate and quantify the suitability of a given biomarker and its context for use. In this paper, we review recommendations for digital biomarker evaluation using speech-based biomarkers as an illustrative case [2, 7-11].
Speech offers rich insights into cognition and function and is affected by many psychiatric and neurodegenerative diseases [12-16]. By requiring the coordination of various cognitive and motor processes, even a short sample of speech may provide a sensitive snapshot of cognition and functioning relevant to many disease areas. Speech can encompass a broad range of behaviors, from the simple production of sounds or single words to spontaneous, natural language produced in conversation (Fig. 1). In this paper, we consider speech to include any task involving the oral articulation of sounds and words, but we acknowledge that how speech is elicited and collected can affect its relevance to disease. For example, reading scripted passages may be well-suited to capture changes in acoustic and motoric aspects of speech, but unstructured, open-ended speech tasks may be needed to capture changes in the organization or complexity of language. Structured speech tasks require instruction and active participation, while less structured speech such as interviews or conversations can be collected passively. Speech can be collected with widely available technology, such as smartphones, thereby facilitating remote and frequent monitoring, which can reduce measurement error. With advances in natural language-processing and machine-learning techniques, speech can be automatically and objectively analyzed, producing high-dimensional data.
Together, we argue that the factors discussed above make speech ideally suited for use as a potential biomarker, with applications in several disease areas. Speech-based biomarkers could facilitate more efficient clinical research and more sensitive monitoring of disease progression and response to treatment. Systematic evaluation of the suitability of speech-based biomarkers is required, however, to achieve these goals. In the following sections, we summarize a recently proposed framework for digital biomarker evaluation, with recommendations of how each aspect could be applied to speech-based measures, highlighting examples from the field.
Evaluation of Speech-Based Digital Biomarkers
In recent papers, members of the Digital Medicine Society (DiMe) have proposed a framework and common vocabulary for the evaluation of digital biomarkers [2, 9, 17, 18]. This rigorous framework provides an excellent resource for researchers developing digital health tools, and adherence to a common vocabulary will allow for more consistency across evaluation studies. Based on this V3 framework, we summarize the components of evaluation of speech-based biomarkers in 3 subcategories: verification, analytical validation, and clinical validation. For each, we briefly define the evaluation step, discuss how it could be applied in the context of speech-based biomarkers, and review applicable examples (Fig. 2).
Verification describes the process of validating the hardware and sensors involved in recording a digital measurement. This form of evaluation, sometimes referred to as bench-testing, is often performed without collecting human data. For speech-based measures, verification primarily involves evaluating the recording devices and determining the conditions required for adequate recording quality. This includes environmental conditions like setting, background noise, and microphone placement, and the properties of the recording technology like sampling rate, audio format, and upload and storage parameters, which can vary across devices. It is therefore necessary to define the acceptable devices (e.g., smartphones, tablets, and computer microphones) and conditions (e.g., in the clinic, at home, and real-world environments) for recording speech in the context of a speech biomarker, and to perform quantitative comparisons of the quality of recordings across recording conditions and devices. In addition, the design of the user interface could also affect the quality of recordings, e.g., by including empty periods in the recordings to estimate ambient noise, or by providing feedback to the participant in the form of beeps, text, or voice prompts to improve task compliance.
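As a minimal sketch of one such quantitative comparison, the following Python example estimates a signal-to-noise ratio from a recording that begins with an empty period, as described above. It is an illustration under stated assumptions rather than a validated procedure: the function names (`rms_dbfs`, `snr_db`), the assumption that samples are floats scaled to [-1, 1], and the assumption that the empty period sits at the start of the recording are all hypothetical choices. Levels are expressed in dB relative to full scale, which is not a calibrated sound pressure level in dBA.

```python
import math


def rms_dbfs(samples):
    """Root-mean-square level in dB relative to full scale.

    Assumes samples are floats scaled to the range [-1, 1].
    """
    if not samples:
        raise ValueError("no samples provided")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0:
        return float("-inf")  # digital silence
    return 10.0 * math.log10(mean_square)


def snr_db(samples, sample_rate, silence_seconds):
    """Estimate SNR using a silent lead-in as the noise reference.

    Assumption (hypothetical protocol): the first `silence_seconds` of the
    recording contain ambient noise only. The remaining samples contain
    speech plus noise, so this is strictly a (signal + noise)-to-noise ratio.
    """
    n_noise = int(sample_rate * silence_seconds)
    noise, speech = samples[:n_noise], samples[n_noise:]
    return rms_dbfs(speech) - rms_dbfs(noise)


# Synthetic example: a quiet lead-in followed by a louder segment.
noise_part = [0.01 * (-1) ** i for i in range(1000)]
speech_part = [0.1 * (-1) ** i for i in range(1000)]
print(round(snr_db(noise_part + speech_part, 1000, 1.0), 2))
```

The same function could be applied to recordings of the same task made on different devices or in different environments, so that they can be compared on a common scale.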
Several recent papers have examined the comparability of audio recordings made across different devices, including smartphones [19-23]. While several studies found that smartphones yield recordings of acceptable quality compared to standard microphones [20, 22, 24], others reported mixed findings with the differences relating to the particular audio outcome measures used [19, 21, 23]. A few studies have examined the effects of ambient noise on voice recordings, with one recommending a threshold of 50 dBA as an acceptable level for their particular speech tasks and outcome measures [21, 25]. Thus, depending on the type of recording device used and the outcome measures, research is needed to establish which devices and recording parameters are acceptable for speech-based biomarkers. Ideally, these conditions should be able to be identified by the user of the biomarker, e.g., via an audio calibration task, to ensure that recordings are of acceptable quality and will yield reliable results.
Analytical validation involves checking that the measurements obtained via a digital biomarker are accurately measuring the intended phenomena. For speech-based biomarkers, this requires verifying that any property or metric extracted from a speech sample, which we refer to as a feature, measures the associated aspects of speech accurately. The features that can be extracted from speech are numerous and diverse, and they vary according to the task and the processing procedures. Common features include acoustic parameters that reflect mathematical properties of the sound wave, such as fundamental frequency, shimmer, jitter, and Mel-frequency cepstral coefficients (MFCCs). Linguistic features, reflecting parts of speech, vocabulary diversity, syntactic structures, sentiment, and higher-level organization of language, can be obtained using natural language-processing tools or manual coding methods. The type of features used to create a speech-based biomarker will therefore depend on the speech task, processing method, and disease area in question. For example, assessing a motor speech impairment may be best accomplished by examining acoustic features in the context of a structured speech task, like sustained phonation. In contrast, assessing irregularities in the organization and content of speech, reflective of cognitive impairment, may require analyzing the linguistic features derived from a less structured speech task, like picture description (Fig. 1).
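To make two of the acoustic features named above concrete, the sketch below computes local jitter and local shimmer using their common definition: the mean absolute difference between consecutive glottal periods (jitter) or consecutive peak amplitudes (shimmer), divided by the mean value. It assumes that the period and amplitude sequences have already been extracted from a sustained phonation by a pitch tracker, which is the harder step and is not shown. The function name `local_perturbation` and the example values are hypothetical.

```python
def local_perturbation(values):
    """Mean absolute difference between consecutive values, relative to the mean.

    Applied to glottal period durations this gives local jitter; applied to
    per-cycle peak amplitudes it gives local shimmer. Both are ratios
    (multiply by 100 for a percentage).
    """
    if len(values) < 2:
        raise ValueError("at least two cycles are required")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    mean_diff = sum(diffs) / len(diffs)
    mean_value = sum(values) / len(values)
    return mean_diff / mean_value


# Hypothetical per-cycle measurements from a sustained vowel.
periods_s = [0.0100, 0.0101, 0.0099, 0.0100, 0.0102]  # seconds per cycle
amplitudes = [0.80, 0.82, 0.79, 0.81, 0.80]           # peak amplitude per cycle

jitter = local_perturbation(periods_s)
shimmer = local_perturbation(amplitudes)
print(f"jitter={jitter:.4f}, shimmer={shimmer:.4f}")
```

Because both features depend on accurate cycle detection, their analytical validation is in practice tied to the validation of the underlying pitch detection algorithm.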
The elements of analytical validation will therefore vary based on the relevant features and methods used to compute a specific speech biomarker. As a first step, any speech processing, including segmenting audio based on speaker identity and transcription of the audio into text (via automated or manual methods), needs to be verified for accuracy. This can be accomplished by comparing multiple raters in the case of manual transcription, or by comparing automated to manual methods in the case of automated transcription. Speech processing must also be evaluated to determine how the properties of the recording affect its accuracy. Factors such as a speaker’s age, education, diagnosis, or accent should all be tested to determine if and how they affect the accuracy of speech processing and the subsequent extraction of features. Examples of this type of analytical validation can be found in 2 recent studies comparing automatic speech recognition (ASR) to human transcription [26, 27]. While neither study found ASR to be equivalent to human transcription in terms of accuracy, they both determined the accuracy of ASR in different clinical groups and compared algorithms to select the most accurate processing procedure.
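A common way to quantify the agreement between automated and manual transcripts is the word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the reference into the ASR output, divided by the number of words in the reference. The following is a minimal sketch; it is not the method used in the cited studies. The function name `word_error_rate`, the lowercasing, and the whitespace tokenization are simplifying assumptions, and punctuation would need to be stripped beforehand in real transcripts.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length.

    `reference` is the manual transcript; `hypothesis` is the ASR output.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from ASR output
            insertion = curr[j - 1] + 1  # extra word in ASR output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


manual = "the boy is on the stool"
automatic = "the boy on a stool"
print(round(word_error_rate(manual, automatic), 3))  # 2 errors / 6 words
```

Computing this metric separately for each clinical group, age band, or accent would show whether transcription accuracy differs systematically across the factors listed above.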
Features extracted from speech vary in terms of the demands for analytical validation. Many acoustic features, such as MFCCs, are computed via mathematical transformations of the sound wave, and can therefore be validated using mathematical models of speech production. As an example of analytical validation of an acoustic feature, one study compared several pitch detection algorithms in terms of accuracy and robustness to background noise for analyzing the voice recordings of patients with Parkinson’s disease, comparing against a gold standard calculation method. This approach to analytical validation is, however, difficult to apply in cases in which gold standard referents do not exist.
Analytical validation can be less straightforward for higher-level linguistic features such as the type and characteristics of the words used, the grammatical complexity, or markers of sentiment. Evaluation of such features often involves a comparison with judgments made by human raters or standard linguistic corpora [29, 30], but this can be expensive and time-consuming. In cases in which validation against such standards is impossible or impractical, concurrent validation may be performed by comparing speech measures to metrics obtained from other sensors, if available, for example by comparing a vocal measure of fatigue to electrophysiological measurements.
Finally, composite measures or machine-learning models based on multiple features present an even more challenging target for validation since there may not be any existing reference measures (e.g., a score reflecting word-finding difficulty could comprise features reflecting speech rate, pauses, errors, etc., and a classification model could be based on the weighted combination of hundreds of variables). In these cases, analytical validation can be carried out on the component features of the composite or model, if possible, with additional clinical validation of the overall outcome score. Analytical validation of models can be achieved by cross-validation, independent replication of results, testing the generalizability of complex speech models to new datasets, and ensuring that confounding factors do not bias the datasets. For example, analytical validation of a classification model for dementia based on speech could involve testing the model on an independent data set and ensuring that factors such as age, sex, education level, and accent are not unevenly distributed in the training or testing data sets, since such imbalances could bias the model.
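One simple way to check whether a continuous factor such as age is unevenly distributed between two groups (e.g., patients and controls within a training set, or the training and testing sets themselves) is the standardized mean difference. The sketch below is illustrative only: the function name and the example ages are hypothetical, and categorical factors such as sex or accent would require a comparison of proportions instead.

```python
import math
import statistics


def standardized_mean_difference(group_a, group_b):
    """Difference in means divided by the pooled standard deviation.

    Values near 0 indicate that the two groups are balanced on this factor.
    Each group needs at least two values.
    """
    mean_a = statistics.mean(group_a)
    mean_b = statistics.mean(group_b)
    pooled_sd = math.sqrt(
        (statistics.variance(group_a) + statistics.variance(group_b)) / 2
    )
    if pooled_sd == 0:
        return 0.0 if mean_a == mean_b else float("inf")
    return (mean_a - mean_b) / pooled_sd


# Hypothetical ages (years) in a training set.
patient_ages = [70, 72, 74]
control_ages = [60, 62, 64]
print(standardized_mean_difference(patient_ages, control_ages))
```

A large value, as in this toy example, would suggest that a classifier could separate the groups using age-related voice changes rather than disease-related ones.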
Another important aspect of analytical validation is determining the mathematical properties of values derived from a measurement. For example, whether a measure is normally distributed, bimodal, or significantly skewed across a population will affect how it is interpreted. It is important to determine if there are floor or ceiling effects, and in what situations a measure may not be computed or may yield missing data. Other important components of analytical validation include determining if a given measure demonstrates learning effects with repeated testing (and at what intervals), and how it may vary according to demographics such as age, sex, education level, or accent. An example of this type of analytical validation was provided in a study of language measures derived from a picture description task, which reported normative data for each language measure and the respective effects of age, sex, and education.
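The distributional checks described above can be sketched in a few lines. The example below summarizes one measure across a sample, reporting the proportion of missing values, the skewness, and the proportions of observations at the floor and ceiling. The function name `describe_measure`, the use of `None` to mark values that could not be computed, and the example scores are assumptions made for illustration.

```python
import statistics


def describe_measure(values, floor, ceiling):
    """Summarize the distribution of one measure across participants.

    `values` may contain None where the measure could not be computed.
    Skewness is the population (biased) moment-based estimate.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values")
    n = len(observed)
    mean = statistics.fmean(observed)
    sd = statistics.pstdev(observed)
    if sd == 0:
        skewness = 0.0
    else:
        skewness = sum(((v - mean) / sd) ** 3 for v in observed) / n
    return {
        "missing": (len(values) - n) / len(values),
        "mean": mean,
        "sd": sd,
        "skewness": skewness,
        "at_floor": sum(v <= floor for v in observed) / n,
        "at_ceiling": sum(v >= ceiling for v in observed) / n,
    }


# Hypothetical scores on a 0-10 scale; one recording failed processing.
scores = [0, 0, 0, 1, 10, None]
print(describe_measure(scores, floor=0, ceiling=10))
```

In this toy sample most observations sit at the floor and the skewness is positive, which would argue against interpreting the mean as a typical value.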
Clinical validation is the process of evaluating if a digital biomarker provides meaningful clinical information [9, 33]. For example, a digital biomarker could be used for disease diagnosis, measuring disease or symptom severity, monitoring change in disease/symptoms over time, predicting disease onset, or measuring the response to treatment or therapy [8, 9, 11]. An ideal biomarker might serve all of these functions, but some measures may be limited to only one or two of these contexts and still offer significant clinical utility. To determine these types of clinical validity, a suitable clinical reference standard must be defined, and novel digital biomarkers should be compared against this standard using appropriate techniques, depending on whether the measures in question are binary, categorical, or continuous.
For some diseases, there may be a clear gold standard reference measure, such as a diagnostic test; however, for many diseases and disorders, the existing measures are themselves surrogate endpoints. For example, current gold standards for a diagnosis of Alzheimer’s disease include the detection of amyloid pathology via the testing of cerebrospinal fluid, PET scan, or autopsy, all of which are invasive and expensive, and are therefore not practical for use in many trials. As a result, a variety of neurological, cognitive, and functional assessments are used as surrogate endpoints in clinical research and practice. Ultimately, the choice of reference measure limits the clinical validation, since a measure can only be shown to be as good as the measure used to validate it. Especially problematic are cases in which a selected reference measure has limited validity for a disease or high interrater variability, since it may be difficult to achieve consistent validity according to this measure and irregularities in the reference measure could introduce bias into the novel biomarker. Thus, it is also important to consider the validity of the reference measure used to assess clinical validity. In cases in which gold standard measures are not available, we recommend comparison with a number of surrogate measures, to provide corroborating validation checks and avoid the limitations of any single measure.
A growing body of work is highlighting the ongoing clinical validation of speech-based measures in a variety of clinical contexts. Speech has been demonstrated to have diagnostic validity for Alzheimer’s disease (AD) and mild cognitive impairment (MCI) in studies using machine-learning classification models to differentiate individuals with AD/MCI from healthy individuals based on speech samples [34-41]. Additionally, speech analysis has been shown to be able to detect individuals with depression [42-45], schizophrenia [46-49], autism spectrum disorder, and Parkinson’s disease [51, 52], and can differentiate the subtypes of primary progressive aphasia and frontotemporal dementia [53-55]. Classification models provide diagnostic validity for speech measures and could be used to develop tools for disease screening and diagnosis. In these types of diagnostic clinical validation studies, it is important to report the accuracy of the classification, as well as other metrics such as sensitivity and specificity. It is also valuable to compare performance with current clinical standards, both for distinguishing the disease from healthy controls and from related diagnoses, to demonstrate the utility of a novel measure. In addition, when possible, exploring what variables drive classification models can help with clinical interpretability, by providing a better understanding of the symptoms and changes that accompany a disease. Interpreting the variables that contribute to a model can help guard against “black box” models and detect artifacts in the data. For example, if a disease is more prevalent in women, and women tend to have higher-pitched voices, the finding that vocal pitch was driving a classification model may only reflect the higher incidence of that disease in women. For diagnostic validation, we therefore recommend reporting both model accuracy and the features that contribute to classification.
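The diagnostic metrics mentioned above can be computed directly from a model's predictions and the clinical reference labels. The following minimal sketch assumes binary labels coded as 1 (disease) and 0 (healthy control); the function name `diagnostic_metrics` and the example labels are hypothetical.

```python
def diagnostic_metrics(y_true, y_pred):
    """Accuracy, sensitivity, and specificity for binary labels (1 = disease)."""
    if len(y_true) != len(y_pred):
        raise ValueError("label sequences differ in length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),  # proportion of cases detected
        "specificity": tn / (tn + fp),  # proportion of controls correctly cleared
    }


# Hypothetical reference diagnoses and model predictions for 10 participants.
reference = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
print(diagnostic_metrics(reference, predicted))
```

Reporting sensitivity and specificity alongside accuracy matters most when the classes are imbalanced, since a model can then reach high accuracy while missing most cases.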
Other forms of clinical validation of speech-based measures include measuring disease severity and tracking changes over time as a measure of disease progression, prognosis, or the response to treatment. Various studies have shown how speech measures relate to disease severity by demonstrating associations between speech features and the presence of pathology in primary progressive aphasia as well as between clinician-rated symptoms and speech features in depression and schizophrenia [57-59]. Several longitudinal case studies suggest that speech features may have prognostic validity for predicting the onset of Alzheimer’s disease and show changes in years prior to the diagnosis [60-65]. This research requires further validation in more general populations. We highlight the need for the collection of longitudinal speech data in clinical populations and the comparison of these measures with current clinical standards. Validated measures for tracking speech changes over time could provide continuous measurement of disease risk, more frequent assessment of disease progression, and better detection of response to treatment.
This article summarizes recommended approaches to the evaluation of digital biomarkers in the context of recent research into speech-based measures in neurological diseases and psychiatric disorders. Speech biomarkers potentially offer many advantages for clinical research and practice; they are objective, naturalistic, can be collected remotely, and require minimal instruction and time compared to current clinical standards. Examples from the literature illustrate the active research in this area, offering promising results for the development of speech-based measures as biomarkers in multiple disease areas including Alzheimer’s disease, Parkinson’s disease, frontotemporal dementia, depression, and schizophrenia. While these findings demonstrate the potential utility of speech-based biomarkers, it is important to note that the particular speech features analyzed differ widely across studies, and no speech measure has yet been comprehensively evaluated across all 3 categories, i.e., verification, analytical validation, and clinical validation. While the measures under development can provide exploratory insights for clinical research, it is necessary to continue to rigorously evaluate speech-based digital biomarkers to achieve valid surrogate endpoints for use in clinical research and practice. The recommendations in this paper are targeted for speech-based biomarkers, but they generalize to other novel digital biomarkers and can be broadly applied.
Conflict of Interest Statement
J.R., L.D.K., W.S., and M.Y. are employees of Winterlight Labs. J.E.H. reports receipt of personal fees in the past 3 years from AlzeCure, Aptinyx, Astra Zeneca, Athira Therapeutics, Axon Neuroscience, Axovant, Biogen Idec, BlackThornRx, Boehringer Ingelheim, Cerecin, Cognition Therapeutics, Compass Pathways, CRF Health, Curasen, EIP Pharma, Eisai, FSV7, G4X Discovery, GfHEU, Heptares, Lundbeck, Lysosome Therapeutics, MyCognition, Neurocentria, Neurocog, Neurodyn Inc, Neurotrack, Novartis, Nutricia, Probiodrug, Regeneron, Rodin Therapeutics, Samumed, Sanofi, Servier, Signant, Syndesi Therapeutics, Takeda, Vivoryon Therapeutics, vTv Therapeutics, and Winterlight Labs. Additionally, he holds stock options in Neurotrack Inc. and is a joint holder of patents with My Cognition Ltd. F.R. is an employee of Surgical Safety Technologies.
Funding Sources
F.R. is supported with a CIFAR Chair in Artificial Intelligence.
Author Contributions
J.R. drafted the original manuscript. J.E.H., L.D.K, F.R., W.S., and M.Y. all substantively revised the manuscript. All authors approved the submitted version.