Abstract
Digital measures are becoming more prevalent in clinical development. Methods for robust evaluation are increasingly well defined, yet the primary barrier to digital measures transitioning beyond exploratory usage often lies in comparison to existing standards. This article focuses on how researchers should approach the complex issue of comparing across assessment modalities. We discuss comparisons of subjective versus objective assessments, or performance-based versus behavioral measures, and we pay particular attention to the situation where the expected association may be poor or nonlinear. We propose that, rather than seeking to replace the standard, research should focus on a structured understanding of how the new measure augments established assessments, with the ultimate goal of developing a more complete understanding of what is meaningful to patients.
Introduction
Digital measures, derived using computational methods from at-home monitoring technologies, including wearables and smartphones [1], offer a range of benefits. These include objective, continuous insights into patient behavior and physiology that are unencumbered by recall effects observed in subjective, patient-reported tools. Digital measures also provide a way to detect intermittent/rare events or create novel measures that more sensitively assess patient experience [2, 3].
Digital measures are becoming increasingly adopted into clinical trials and clinical practice [4‒6]. At the same time, the field is becoming ever more aligned on best practice for evaluating what constitutes a fit-for-purpose tool [7, 8]. Nevertheless, the most significant hurdle in demonstrating fit for purpose is clinical validation, which is typically what prevents a new measure from transitioning beyond exploratory usage. Can a digital measure replace the established measure? Should both be used in combination? What claims can be made based on the digital measure?
The critical step in evaluation of a digital measure is the determination of whether the measure is able to capture “clinically meaningful” changes. This is a highly complex question requiring examination of change on both group and individual levels [9] to ensure that the patient experience of change is accurately captured [10]. While the US Food and Drug Administration (FDA) has embraced patient-focused drug development, which includes selecting endpoints that matter to patients, there may be limited tools available with which to generate evidence supporting such patient-centered claims [11]. For example, the existence of an effective therapy can enable critical experiments in evaluating a novel measure (steroid treatment in Duchenne muscular dystrophy was central in evidence generation for qualification of a wearable-derived measure of stride velocity [12, 13]), but what if such interventions do not yet exist or are not effective? In practice, evaluation will therefore focus on comparison of the novel digital measure to established assessments that have been shown to support clinically valid inferences. A strong association provides validation by proxy for the new measure, but what if there is not a good association? What counts as “good”?
This is reflected in the very small number of health authority-accepted or -qualified digital measures [13, 14] and the very few documented examples of digital measures being used as primary outcomes [15]. While many digital measures have progressed to later stages of evaluation and wider use, these are largely “apples-to-apples” cases that capture a remote version of an established assessment, for example, FLOODLIGHT [16], mPOWER [17], Cognition Kit [18], or PARADE [19].
This perspective piece focuses on how researchers should approach the complex issue of comparing across assessment modalities, with the ultimate goal of developing more patient-centered measures. In scope are comparisons of subjective versus objective assessments, or performance-based versus behavioral measures, and we pay particular attention to the situation where the expected association may be poor or nonlinear.
Barriers to Adoption
The rising uptake of digital technologies into clinical trials is a highly promising indication of increasing confidence in, and availability of, digital measures [4]. Nevertheless, a lack of standardization across devices (what is worn/where/when, what is measured/derived as endpoints, etc.) [20] and a lack of transparency in the validation of devices (evidence that derived values are reliable and accurate, that similar values are comparable across devices, etc.) [21] remain barriers to widespread adoption and incorporation of these tools into clinical decision making.
Traditionally, the fit for purpose of wearable/digital data was examined using the standard patient-reported outcome (PRO) strategy, which included examining wearable data against scores from PRO measures (PROMs). In recent years, several recommendations and frameworks have been proposed for the evaluation of digital measures (e.g., see: CPATH [22], CTTI [23], a joint framework proposed by members of the Drug Information Association’s [DIA] Study Endpoint Community, CTTI, the ePRO Consortium, and the Digital Medicine Society [DiMe] [24], and “V3” [7]), which have led to new guidance documents being released by health authorities, including the European Medicines Agency Innovation Task Force [25‒27] and the FDA Center for Drug Evaluation and Research [28]. These guidelines aim to improve our ability to use digital data and offer ever clearer mechanisms for early engagement [29]. Together, they should lead to greater adoption of wearable/digital data within clinical research [30].
Concepts versus Measures
How a new digital measure is established and evaluated is determined by whether what is being measured (the “concept”) is novel, whether the measurement itself is novel, or both [31]. It is also crucial that measures matter to patients by assessing meaningful aspects of health (MAH).
MAH broadly defines an aspect of a disease that a patient (a) does not want to become worse, (b) wants to improve, or (c) wants to prevent [32, 33]. In the case of Duchenne muscular dystrophy, patients report specific, important daily activities, such as not being able to navigate stairs or wanting to walk longer distances [34]. Such activities can be grouped into an MAH category of ambulatory activities that can be readily assessed in a real-world setting. Once the MAH is identified, a concept of interest (COI) can be defined. The COI is a simplified or narrowed element of an MAH that can be practically measured, as shown in Figure 1.
Figure 1. On the right-hand side, the hierarchy links meaningful aspects of health (MAHs) to concepts of interest (COIs) to outcomes and endpoints. On the left-hand side, critical patient input is highlighted. Defining the MAH and COI should take precedence over the technical aspects of defining outcomes and endpoints. For a given individual and condition, multiple MAHs can be relevant; equally, multiple COIs can inform a given MAH. These relationships may change over time and across individuals. Reproduced with permission from [68].
Selecting an appropriate COI that is meaningful to patients is a crucial step in narrowing the MAH into a targeted aspect for actual measurement, before selecting sensors or devices that can capture specific, measurable characteristics of the disease and before symptom-to-sensor mapping occurs. In some cases, a digital measure focuses on remote capture of an already established assessment or battery of assessments. mPOWER [17], Cognition Kit [18], FLOODLIGHT [16], and PARADE [19], for example, all primarily focus on increasing sampling density and lowering patient burden for assessments that are known to be relevant for their respective indications. A direct comparison can be made between the existing assessments and their digital implementation.
More challenging is the development of novel measures which address an established concept, such as a novel measure of behavior which must be compared to established performance metrics. Measures of real-world mobility in sarcopenia [35, 36] or schizophrenia [37] (in a small case-control study), for example, can be compared to established objective measures which address the same concepts in those conditions. Similarly, a sensor-derived, objective digital measure may address a concept which is also captured by a PRO, for example, social anxiety [38] (in a small pilot study), mood disorders [39] (in a small pilot study), depression [40], or stress [41]. In all these cases, development and validation of the new measure rely heavily on comparisons to established measures that address the same or a related concept, which often means comparing across measurement modalities (e.g., behavior versus performance, objective versus subjective measures).
The situation in which the concept itself also needs to be established, that is, where no reference measures exist, is highly challenging, but it is out of scope for this article as it has been covered elsewhere [33].
Apples and Pineapples
Perhaps the greatest strength of subjective assessments is that they directly answer the question of whether the patient “feels” better, specifically whether they perceive improvements in their quality of life or related MAH.
One of the most prevalent issues that has likely deterred the widespread adoption of digital/wearable outcomes in clinical trials is the discordance between objective variables obtained from mobile devices and wearables and subjective PROs. Associations between PROMs and digital measures have been mixed. While objective wearable data often demonstrate lower associations with PROM scores of purportedly similar concepts (sleep quality, scratching) than lay expectations would posit [42‒45], other areas have found PROMs to be highly aligned with sensor-derived variables [46, 47]. For instance, Bahej et al. [47] used objective measures to forecast PROs and achieved accuracies of around 70–80% for predicting subjectively reported mobility. Similar analyses have found equally promising results across objective and subjective outcomes in cognition [48, 49] and stress [41].
While these results could be interpreted as evidence that objective/digital data and subjective/PROs are assessing similar constructs (and such analyses are useful for providing supportive evidence regarding the validity of new digital data outcomes), we argue that these comparisons, regardless of results (supportive or not), may often not be appropriate; that is, we are comparing apples and pineapples and attempting to interpret results that may not be conceptually meaningful.
While values/scores from a PROM and a digital variable can be given the same name (e.g., sleep efficiency, physical functioning), that does not mean the same concept is actually being measured by the different modalities/devices. For instance, a self-rated physical functioning PRO score may reflect a patient’s lived experience of physical functioning due to a disease/condition. At the same time, this patient may still be able to function physically (e.g., measured by steps per day) at a high level relative to other patients with the same disease/condition due to their precondition health status. Neither the patient-reported nor the digital variable is inaccurate or “wrong” in this case; rather, it demonstrates one of the many possible ways in which subjective/PROM and objective/digital variables could both be “correct” and yet not be in agreement.
As a more concrete example, take the research area of atopic dermatitis, which is characterized by inflamed skin: there are PROMs to assess itch severity, skin pain, sleep (which can be disturbed by severe dermatitis), and patient-reported dermatitis severity, and there are also actigraphy devices and other digital methods to objectively measure scratching behaviors and sleep parameters [50]. Results across these 2 sources of patient information have found PROMs for itch or sleep and objective scratching/sleep variables to exhibit extremely limited associations. While the terms “itching” and “scratching” are often used interchangeably in everyday language, it is important to note that, according to their formal definitions, they are distinct [51‒53]. Itch and scratch are well-defined terms referring to “an uneasy irritating sensation in the upper surface of the skin usually held to result from mild stimulation of pain receptors” and “to scrape or rub lightly (as to relieve itching),” respectively [54, 55]. So while they are related, even at their base definition one is a subjective experience (itching) and the other is an observable behavior (scratching); it seems reasonable to posit that if a person is itchy then they are also scratching, but other relationships are possible. Given the chronic nature of the condition and the knowledge that scratching can exacerbate dermatitis, a patient may be extremely itchy but not scratch. Conversely, another patient may have developed a nonconscious habit of scratching regardless of their itch level. While some researchers have attributed the lack of a relationship between objective and subjective itch/scratch measures to limitations in the subjective ratings [43], it is also plausible that conceptually different things are being measured, and, when viewed in this light, the lack of association could be considered supportive of the discriminant validity of both types of measures.
Further examples can be found in the perception of pain and clinically measured joint function from the Osteoarthritis Initiative [56]. This large initiative has provided several examples of how perceived pain and clinically measured parameters, both of which we would argue are critical to assessing joint function, often bear very weak relationships to each other. Examples include hand pain versus joint deterioration assessed by clinical imaging [57], and physical activity versus knee pain [58]. While the direct relationship is very weak (a weakly negative linear association), the importance of measuring joint function from multiple perspectives is demonstrated by the finding that perceived pain, when present over longer periods of time, is predictive of longer-term clinical outcomes [59].
In addition to the possible conceptual differences across subjective and objective measures, there are also technical data differences that make comparisons across these 2 sources of patient input questionable. One of the key benefits of digital data, as noted previously, is that it provides a continuous record of what is being measured (spatial coordinates that are translated into number of steps, etc.). While this eliminates the need for recall (e.g., “In the last 2 weeks, I have had trouble walking a city block.”), which may be prone to bias or error (see, e.g., Stull et al. [60], for a review of selecting appropriate recall periods and possible sources of bias in PRO recall), it also produces an enormous amount of data. The most common approach to working with the mass of data points is to create summary scores (e.g., hourly/nightly/daily/weekly summary values from means, medians, or other statistics). Aggregation is obviously necessary at some level for analyses to be feasible and results to be interpretable (i.e., analysis of second-by-second tracking of patients’ heart rate is not likely useful or interpretable), but creating weekly or biweekly averages from digital data variables (to match common recall periods used by PROMs) does not guarantee that a 2-week recall PROM and a 2-week summary of contemporaneously collected data will be well aligned. Additionally, creating such summaries from digital data variables discards an enormous amount of information (e.g., variability within a person from hour to hour, variability from day to day or night to night, or variability within a person on weekdays versus weekends). To effectively leverage the data collected via digital devices into useful and nuanced statistical results regarding patient health, researchers and analysts will likely need to move away from the analyses typically used with clinical trial data and begin investigating novel analysis methods, such as n-of-1 analyses [61, 62], intensive longitudinal models [63, 64], or random-effects models [65, 66]. Only then can we fully access the knowledge waiting to be uncovered in the wealth of information that digital data collection provides and answer a question of great importance to patients: “Based on my personal characteristics, what can I expect my outcomes to be?”
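As a minimal illustration of the aggregation issue described above, the following sketch (in Python, using a purely hypothetical, simulated minute-level step stream) shows how collapsing dense sensor data into summaries that match a 2-week PROM recall period discards within-person variability, and how a simple dispersion statistic can retain part of that information alongside the mean.

```python
import numpy as np
import pandas as pd

# Hypothetical example: simulated minute-level step counts for one patient over 4 weeks.
rng = np.random.default_rng(0)
minutes = pd.date_range("2023-01-02", periods=4 * 7 * 24 * 60, freq="min")
steps = rng.poisson(lam=5, size=len(minutes))  # stand-in for a raw sensor stream
df = pd.DataFrame({"steps": steps}, index=minutes)

# Daily totals preserve day-to-day variability ...
daily = df["steps"].resample("D").sum()

# ... whereas a single 2-week mean (chosen to match a 2-week PROM recall period)
# collapses that variability into one number per recall window.
biweekly_mean = daily.resample("14D").mean()

# Keeping a dispersion statistic alongside the mean retains some of the
# within-person information that the mean alone discards.
biweekly_sd = daily.resample("14D").std()

print(pd.DataFrame({"mean_daily_steps": biweekly_mean, "sd_daily_steps": biweekly_sd}))
```

Retaining such within-person dispersion alongside the mean is one small step toward the richer person-level analyses (n-of-1, intensive longitudinal, or random-effects models) referenced above.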
A Path Forward
Given the complex relationships among “similar” variables and the analysis considerations above, care and thought are needed in specifying expected relationships among objective and subjective assessments of purportedly similar constructs, particularly if a goal is to provide supportive evidence regarding the construct validity of a new digital measure. In collecting information to support the validity of inferences made from any variable, the ideal is to use a “gold standard” measure and demonstrate that the new measure results in similar conclusions regarding patient outcomes on the concept intended to be measured. Our targeted summary of relevant literature has established that PROMs of purportedly similar concepts are likely not appropriate for this use when attempting to establish the validity of digital data variables. We believe that this is primarily because, rather than one source of data being “correct” or “incorrect,” or more or less accurate, PROMs and digital devices are unique tools for addressing different questions. Rather than pitting them against one another, researchers and regulators would likely best be served by adopting the perspective that information derived from each serves to broaden and deepen our understanding of health-related concepts that are important to patients when assessing the benefits/drawbacks of interventions. This thinking also applies to outcome assessments such as the 6-minute walk test that, from the patient and clinician-investigator perspectives, are considered neither patient-centered nor gold standards [67]. Where a COI as multifaceted as mobility and independence (as in the case of the 6-minute walk test) is important to patient quality of life, we would argue that it is extremely risky to address this COI using only a single assessment. Equally, if there are aspects of this COI that are not covered by existing measures and for which novel measures are developed, then requiring these novel measures to perform very similarly to an existing assessment would appear to be self-defeating.
Previous research and current recommendations [33] imply that alternative, more objective measures are likely to be the most useful in establishing that a new digital data source is assessing the concept it intends to assess (e.g., actigraphy steps confirmed by video capture; scratches per hour confirmed by multiple observer ratings). As noted in Walton et al. [33], the feature of digital data variables they term “analytical validity,” encompassing “technical performance characteristics such as their accuracy, reliability, precision, consistency over time, uniformity across mobile sensor generations and/or technologies, and across different environment conditions,” and the content validity of some outcomes derived from digital data sources will be inextricably linked. Regardless of the source of the variables used when attempting to validate a new measure, researchers should make specific, testable hypotheses about the relationships to be tested for COIs meaningful to patients prior to interacting with the data, and should be able to justify these a priori expectations through theory and/or previous research.
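One lightweight way to make such a priori expectations concrete is to record the hypothesized direction and minimum strength of each relationship before any data are examined and then test against those criteria. The sketch below illustrates the idea; all variable names and thresholds are hypothetical, and the test itself is simply SciPy’s Spearman correlation.

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical pre-specified expectations, recorded before interacting with the data:
# each entry lists the expected sign and the minimum absolute correlation
# that would be considered supportive evidence.
expected = {
    ("daily_step_count", "prom_physical_function"): {"sign": +1, "min_abs_r": 0.4},
    ("nightly_scratch_events", "prom_itch_severity"): {"sign": +1, "min_abs_r": 0.3},
}

def evaluate_hypothesis(x, y, sign, min_abs_r, alpha=0.05):
    """Compare an observed Spearman correlation to the pre-specified expectation."""
    r, p = spearmanr(x, y, nan_policy="omit")
    supported = (np.sign(r) == sign) and (abs(r) >= min_abs_r) and (p < alpha)
    return r, p, supported

# Usage, given a DataFrame `df` containing the hypothetical columns above:
# for (sensor_var, prom_var), crit in expected.items():
#     print(sensor_var, prom_var, evaluate_hypothesis(df[sensor_var], df[prom_var], **crit))
```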
Finally, given our preferred perspective that PROMs and digital data outcomes provide unique but likely complementary information, a more systematic analytical program should be undertaken to fully understand how objective and subjective outcomes that, on the surface, are measuring similar concepts relate to one another. Rather than simply correlating a PROM score with a variable derived from digital data and despairing when the association is low, systematic research should examine multiple PROMs and variables from multiple devices using methods such as longitudinal latent variable models or item response theory, in order to construct an empirically supported understanding of how these variables relate to one another. Given the status of digital data as the “new kid on the block,” it may also be the case that, until a preponderance of evidence is available that explicates a meaningful, theoretically based, and logically consistent relationship among subjective and objective measures of broader concepts, claims stemming from digital data may need to focus on the specific feature measured (e.g., steps taken, number of scratches) rather than broader, more nebulous “concepts” (e.g., physical functioning). If qualitative work with patients finds that these specific, digitally measured outcomes are meaningful and understandable to patients (i.e., they exhibit content validity), claims to broader concepts from digital data variables may not even be useful.
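As one sketch of what such a systematic, longitudinal analysis might look like in practice, the example below relates a daily sensor-derived variable to repeated PROM scores while separating between-patient and within-patient variation. The file and column names are hypothetical, and the model is a generic random-intercept regression fitted with statsmodels, not a specific published analysis.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical long-format data: one row per patient per assessment day, with a
# sensor-derived variable, a numeric study-day index, and a contemporaneous PROM score.
df = pd.read_csv("daily_measures.csv")  # columns: patient_id, day, daily_steps, prom_mobility

# Random-intercept mixed model: between-patient differences are absorbed by the
# patient-level intercepts, so the daily_steps coefficient reflects how the PROM
# covaries with the sensor variable within patients over time.
model = smf.mixedlm(
    "prom_mobility ~ daily_steps + day",
    data=df,
    groups=df["patient_id"],
)
result = model.fit()
print(result.summary())
```

Longitudinal latent variable and item response theory models extend this same long-format structure to multiple PROM items and multiple device-derived variables simultaneously.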
Regardless of whether we are using measures from subjective reports or objective, digital technologies, the most important issue is to measure what is meaningful to the people whom we are seeking to help. Where the measures fall in the hierarchy of examined endpoints should be strongly informed by qualitative research with patients, and it seems extremely likely that in most cases, the answer to “Should we use a subjective or objective measure?” will be, “Yes.” While apples and pineapples are both good on their own, who does not like a nice mixed fruit salad?
Acknowledgment
The authors would like to thank Christine Manta for permission to reproduce Figure 1.
Statement of Ethics
This work involves no human subjects, animals, or trial data of any kind, and as such did not require ethics committee approval.
Conflict of Interest Statement
C.R.H. is an employee of Vector Psychometric Group, LLC. B.P.-L. is an employee of and holds stock options in Evidation Health. She consults for Bayer and is on the Scientific Leadership Board of the Digital Medicine Society. I.C. is an employee of and holds stock options in Evidation Health. He has received payment for lecturing on Digital Health at the ETH Zurich and FHNW Muttenz. He is an Editorial Board Member at Karger Digital Biomarkers and a founding member of the Digital Medicine Society. R.J.W. is a managing member and employee of Vector Psychometric Group, LLC.
Funding Sources
This work received no direct funding.
Author Contributions
All authors contributed to the conceptualization, design, drafting, and final approval of the manuscript.