Objective: In the perceptual assessment of dysarthria, various approaches are used to examine the accuracy of listeners’ speech transcriptions and their subjective impressions of speech disorder. However, less attention has been given to the effort and cognitive resources required to process speech samples. This study explores the relationship between transcription accuracy, comprehensibility, subjective impressions of speech, and objective measures of reaction time (RT) to further examine the challenges involved in processing dysarthric speech. Patients and Methods: Sixteen listeners completed 3 experimental listening tasks: a sentence transcription task, a rating scale task, and an RT task that required responses to veracity statements. In each task, the stimuli comprised speech from 8 individuals with dysarthria. Results: Measurements from the 3 tasks were significantly related, with a correlation coefficient of –0.94 between average RT and transcription-based intelligibility scores and –0.89 between RT and listener ratings of dysarthria. Interrater reliability of RT measurements was relatively low when considering a single listener’s responses to stimuli. However, reliability reached an acceptable level when a mean was taken from 8 listeners. Conclusions: RT tasks could be developed as a reliable adjunct in the assessment of listener effort and speech processing.

Historically, listener rating scales and orthographic transcriptions have been used as the primary tools for assessing speech impairment in dysarthria [1]. However, there are some limitations inherent in both forms of evaluation. Ceiling effects, listener biases, and time-consuming assessment procedures can all affect a clinician’s ability to reliably monitor changes in the same speaker over time [2-4]. In addition, intelligibility measurements may fail to index some of the subtler effects of dysarthria on speech processing and listener effort. This study seeks to further explore these limitations by examining an objective measurement of speech processing: listener reaction time (RT).

At present, auditory-perceptual assessment methods are the “gold standard” for diagnosing and monitoring speech impairment in speakers with dysarthria [5]. There are 2 main ways in which this is assessed: (1) based on listener transcriptions (a numeric count of words understood) and (2) based on subjective ratings of a speech sample. It is important to highlight that these assessments are not entirely interchangeable. Studies indicate that listeners’ subjective ratings are sensitive to speech differences in people who are 100% intelligible in transcription tasks [4]. For this reason, it has been argued that the difficulty in processing dysarthric speech extends beyond a listener’s ability to correctly identify words – and it is likely that listeners must also expend greater effort and cognitive resources when listening to a person with dysarthria [6]. Unfortunately, measurements of listener effort and difficulty are rarely examined outside of subjective rating scales [e.g., 7].

An increase in the mental demands required to process speech is likely to affect participation in social interactions. For example, Brady et al. [8] suggested that communication partners were more likely to limit their conversations to “easier” topics, avoiding discussions of more complex ideas, like politics and religion, when speaking to a person with dysarthria. To understand the cause of these participation restrictions, it is important to fully characterize the challenges in processing dysarthric speech. To explore these issues, we will first discuss the limitations of orthographic transcription and rating scale assessments in assessing listeners’ attentional demands. We will then consider the role of RT and comprehensibility measurements as an adjunct to these evaluations.

Measurements from Listeners’ Orthographic Transcription

Assessments of intelligibility using orthographic transcription involve calculating the percentage of words a listener correctly identifies or the percentage of phonemes correctly identified in a word. A higher percentage of errors is considered to reflect less intelligible speech. In their seminal paper, Tikofsky and Tikofsky [9] used these procedures to demonstrate a significant difference in the intelligibility of speakers with dysarthria compared to those without. Differences were also found between speakers with dysarthria, which led to further work examining orthographic transcription as a measure of classifying dysarthria severity [10, 11].

Despite their ubiquitous use in clinical practice, transcription-based measurements have limitations. For example, transcription accuracy is susceptible to ceiling effects. This occurs when a speaker has speech that is clearly affected by dysarthria (e.g., speech that is slurred, breathy, with an unnatural rate or rhythm) but still has 100% of their words correctly identified by listeners. Sussman and Tjaden [4] compared orthographic transcription and rating scale estimates in healthy control speakers, speakers with multiple sclerosis, and speakers with Parkinson’s disease. They reported that the rating scale estimates of speech severity were more sensitive to the presence of dysarthria than word or sentence intelligibility scores calculated from listener transcriptions. Transcription accuracy scores are also readily influenced by different speech stimuli and listener variables, which make their values sensitive to changes in the testing procedure [2, 12-14]. Thus, it is possible for the same speaker to generate quite different intelligibility scores based on the test procedure employed.

To produce consistent and generalizable results, orthographic transcription assessments typically include a range of words or sentences. This allows the assessor to average the number of words correctly identified across a number of short speech samples. For example, in the sentence intelligibility test (SIT), at least 2 listeners must hear and then transcribe 11 sentences to get an intelligibility score for each speaker [15]. Recording, transcribing, and scoring multiple speech samples is a time-consuming procedure, which may reduce the regularity with which this type of assessment is used.

Listener Rating Scale Measurements

In clinical research, rating scales are often employed as a faster method of quantifying listeners’ perceptual impressions of speech, including its level of naturalness, intelligibility, or severity. For example, listeners are commonly asked to rate either the speakers’ intelligibility or “how easy” the speaker is to understand [16-19]. In many rating methods, listeners are provided with an ordinal scale to rate speech samples. Unfortunately, the magnitude of the intelligibility difference between numbers on these scales is often not equal, which can lead to issues when attempting to add, subtract, or average scores [20]. For example, the amount of change from a score of 4 to a score of 5 may be larger than the change from a score of 1 to a score of 2. Ordinal scales also limit the degree to which listeners can rate fine-grained differences in speech samples (e.g., when documenting changes in the same person over time). To address this issue, researchers sometimes use visual analog scales or direct magnitude estimates to document treatment changes in dysarthria [e.g., 21]. However, when using a more continuous scale, it is not always clear whether small changes in ratings are clinically significant [22].

One of the main criticisms of rating scale estimates is their inherent subjectivity. In any type of listener rating procedure, subjectivity can produce meaningless variation in scores that is difficult to control and interpret. Thus, it is hard to meaningfully compare 2 rating scale scores that are provided by different listeners. In addition, it has been hypothesized that listeners might express some biases in their ratings of intelligibility, which are unrelated to their difficulty in processing dysarthric speech. For example, Street et al. [23] found that listeners exhibited a natural bias toward speakers with rates of speech similar to, or slightly faster than, their own. It is possible that these biases have an undue influence on ratings of intelligibility, especially in speakers with mild dysarthria who can be transcribed with 100% accuracy. However, without more objective measurements of listener effort and speech processing, the nature of these influences cannot be determined. Together, these limitations highlight the need for further development of fast and objective assessment techniques.

Assessing RT to Auditory Stimuli

Listening and responding to dysarthric speech are likely to require increased processing time and attention. For example, studies find that listeners with larger working memories, vocabularies, and greater cognitive flexibility tend to be more successful in correctly identifying words spoken by people with dysarthria [24, 25]. This performance difference is hypothesized to occur because dysarthric speech requires greater cognitive resources to process. Indeed, increased processing demands associated with dysarthria are likely to occur very early upon hearing a word or phrase. In an electroencephalography study, Theys and McAuliffe [26] demonstrated that the N100 (a measure of earlier sensory processing) showed increased amplitude and decreased latency for dysarthric speech, as compared to healthy speech samples. Theys and McAuliffe [26] postulated that the quality of dysarthric speech may affect listeners’ early sensory processing, as they immediately allocate greater cognitive resources to the acoustic signal.

RT measurements are often used to examine subtle differences in cognitive processing [27]. RT is thought to reflect mental demand, with a longer RT indicating increased cognitive workload [28]. Prolonged RT has been used numerous times as evidence of increased cognitive difficulty and processing time [29-32]. In studies of auditory perception, RT has been used to understand differences when listening to foreign versus familiar accents [30, 33], synthetic speech versus live speech [34, 35], and speakers with and without dysphonia [36]. RT is also thought to have a certain degree of listener objectivity as the measurement is designed to elicit an immediate, automatic response [37].

Across studies, listeners’ RTs have consistently been longer under more challenging listening conditions. For example, Munro and Derwing [30] compared RTs for listeners hearing native English speakers and Mandarin-accented English speakers. They found that despite both groups of speakers often being fully intelligible to listeners, listeners rated the accented speakers as more difficult to understand and took longer to respond to their speech. Orthographic transcription scores did not track the rating scale scores as well as the RT measure did, owing to their lack of sensitivity to mild differences between the speakers.

Unfortunately, RTs to dysarthric speech have rarely been examined in the speech pathology literature. Most notably, Cote-Reschny and Hodge [38] explored transcription response time in a study of 4 young children with spastic dysarthria whose speech was presented in noise. In this study, response time was calculated as the duration between word onset and the time taken to enter the first letter of that word in a transcription task. Similar to the findings of Munro and Derwing [30], the results demonstrated that response times were more closely associated with ratings of perceived effort than listener accuracy in identifying the words spoken. Hence, although the study was not designed to compare differences in RT between speakers, it provided clear evidence of the utility of these measurements in assessing aspects of communication impairment not fully explained by intelligibility scores.

Veracity Statements

A veracity statement is a true or false statement that is straightforward and can be easily determined using everyday knowledge. These statements have a history of use in RT experiments, where participants are required to listen to a sentence and then respond (select true or false) based on the information they heard. RT to veracity statements is hypothesized to be fairly sensitive to changes in speech quality. For example, Manous et al. [39] used veracity statements to investigate RT to natural voices and synthetic voices. They found significant effects of voice type on RT. More recently, both Munro and Derwing [30] and Wilson and Spaulding [33] used veracity statements to assess differences between response time to native and accented speakers. Both studies reported significant differences between the groups.

There are several reasons veracity statements might be appropriate for an RT measurement of dysarthria severity. First, veracity statements allow us to examine how listeners identify and process words in combination. Second, responding to a veracity statement is a more functional, everyday task than digit recall or word identification. Listening to a sentence and responding based on the information heard is, in essence, the very nature of communication. Hence, examining the time taken to process these statements provides insight into the real-world cognitive demands faced when listening to dysarthric speech.

Veracity Statements and Measurements of Listener Comprehension

In addition to providing RT data, veracity statements allow researchers to index whether the listener understands a speaker’s statements based on the accuracy with which they respond in the forced-choice paradigm. This idea of comprehending a message, as opposed to simply identifying the words within it, has been discussed in previous dysarthria literature [40-42]. It has been suggested that speakers who have the same intelligibility scores might differ in their ability to be understood. This is because accurate speech comprehension requires more complex semantic processing skills and integration of real-world knowledge, skills that are not assessed in orthographic transcription tasks [40]. In previous studies of dysarthric speech comprehension, longer narratives have been used, where the listener would be expected to actively draw upon their world knowledge and selectively remember key ideas [40-42]. However, many of these skills, such as selectively remembering and summarizing information, are not required in order to respond correctly to veracity statements.

The accuracy of listener responses to veracity statements relies primarily on word recognition and semantic processing. As veracity statements in the current study consist of only 3 words, the accuracy of a listener’s response generally requires correct recognition of all 3 words. In this context, listener comprehension is expected to be closely tied to the intelligibility of words in each statement. However, unlike orthographic intelligibility measurements, listeners do not receive partial marks for the correct perception of only one key word. This, alongside the extra attention required to semantically process the sentence, may highlight some of the additional challenges faced when conversing with a person with dysarthria.

In summary, RT to veracity statements may offer a useful method of indexing the increased processing effort required to understand dysarthric speech while offering insight into listeners’ understanding. RT measures are objective and functionally interpretable. These features are ideal when trying to track and monitor a speech disorder over time. In addition, these measurements may act as an important adjunct in the assessment of motor speech disorders, offering insight into subtle differences in the effort and processing demands experienced by communication partners.

Study Rationale

The current study has several aims. The primary purpose is to determine whether RT measurements are sensitive to dysarthria severity and whether this measure is comparable to established forms of dysarthria assessment. To address these aims, the following research questions were developed:

1. How reliable and consistent are listeners’ RTs to dysarthric speech?

2. Are results from an RT paradigm correlated with those obtained from common measures of speech impairment (orthographic transcription and rating scales)?

3. How accurate are responses to veracity statements at different levels of speech severity?

Speakers

Approval to undertake this study was provided by the University of Canterbury Human Ethics Committee. All speakers and listeners provided informed, written consent prior to their participation. Male and female speakers of New Zealand English with a diagnosis of dysarthria, between the ages of 43 and 80 years, provided speech samples that were analyzed in this study. Digital audio recordings were made via an Audix HT2 headset condenser microphone positioned approximately 5 cm from the mouth with a sampling rate of 48 kHz and 16 bits of quantization. There were initially 12 speakers recorded, with a subset of 8 speakers chosen to represent a wide spectrum of mild through to severe dysarthria. In selecting participants, individual speech features (e.g., strained voicing or equal-and-excess stress) were not specifically controlled for, and the participants presented with a range of dysarthria etiologies. This meant that the sample represented a variety of dysarthria presentations, as might be seen in a typical clinic setting. Eight healthy speakers were also recruited to provide control data for the RT experiment. These speakers approximately matched the speakers with dysarthria for both age and gender. Demographics and etiological information can be found in Table 1.

Table 1.

Speaker demographic and etiological information


Stimuli

Orthographic Transcription Stimuli

Each speaker was recorded saying 11 sentences ranging from 5 to 15 words in length. These sentences were sourced from the SIT Manual [15] and were unique to each speaker. During administration of the test, if there were any reading errors or disruptions (e.g., omissions or substituted words), the participants were immediately prompted to repeat the sentence. The error-free sentence was then selected for use in the transcription task.

Rating Scale Stimuli

Each speaker was recorded reading “The Grandfather Passage” [43]. A single sentence was then extracted from each recording to be used in the rating scale assessment. This sentence was: “He slowly takes a short walk in the open air each day.” The sentence was chosen because it was free of reading errors and disruptions across all speakers. The use of matching stimuli for listener rating procedures is consistent with previous literature [16, 17].

Veracity Statements

True or false statements were taken from a previous study of speech perception [44]. Each speaker was recorded reading aloud 20 statements, selected at random, half of which were true and the other half false. After recording, these sentences were reviewed to check for disruptions and reading errors (e.g., if participants laughed or coughed mid-recording). The sentences were then also checked for ambiguity by 5 independent raters. The raters were asked to select whether a written version of the statement was true or false and to state how confident they were in their answer on a scale of 1–5 (1 = not at all confident, 5 = very confident). All correctly identified sentences with a confidence score of 4 or 5 were considered for inclusion. Ultimately, of the original 20 statements recorded, 15 statements that met the inclusion criteria were randomly selected for each speaker. All statements included only 3 words and were between 3 and 6 syllables in length (e.g., “fire is hot,” “rainbows are colourful,” “zebras have spots,” “sandpaper is smooth”).

Listeners

Sixteen participants were recruited to complete the listening experiments. All listeners were between 18 and 30 years old and had hearing thresholds within normal limits (defined as thresholds at or below 20 dB HL across 4 frequencies: 0.5, 1, 2, and 4 kHz). The participants had limited to no previous experience in listening to people with dysarthria. All listeners were fluent speakers of New Zealand English, either being born and raised in New Zealand or having lived in New Zealand since childhood.

Procedure

The listening experiments were all conducted using custom MATLAB scripts, and participants listened to the auditory stimuli using Sennheiser HD 201 headphones. Participants were seated in front of a desktop computer screen with a keyboard placed on the desk in front of them.

Practice Trial

A practice trial was played, where participants could listen to speech samples from speakers with dysarthria. This trial included 5 sentences that were true or false, and participants responded by pushing a button to indicate their answer (N for true and M for false). They could press the buttons in whatever manner was comfortable, typically using either the index finger of each hand on a response key or the dominant index finger for both keys. After hearing these sentences, participants had the option of increasing or decreasing the volume for the remainder of the tasks. All listeners were happy to have the volume at the preset level.

This practice trial served as a primer for listeners to hear what a speaker with dysarthria sounds like and to practice the task required for the RT experiment. Speakers in the practice trial were not used in any other task to prevent familiarity with speakers and sentences.

Experiment Configuration

The 3 listening experiments discussed in the following sections were counterbalanced between participants to avoid any effects of task order. In addition, speech stimuli within each experiment were presented in a random order, so that every listener was exposed to a unique sequence of trials. After these experiments were completed, there was one final task in which listeners heard 3 pure tones of different frequencies and were asked to push the spacebar as soon as the tone was detected. This final task was used to derive a baseline measure of RT to auditory stimuli, as discussed in the RT data treatment section.

Orthographic Transcription: Sentence Intelligibility Test

This assessment consisted of listeners hearing sentences from the SIT, read by 4 of the 8 speakers with dysarthria, with 44 sentences played in total. Listeners transcribed each sentence individually by typing the words they could hear into a textbox as per the SIT procedures. They were instructed to write whatever they heard, regardless of whether it made sense. They could begin to type as soon as the speaker began talking, to offset issues with memory for the longer sentences. Speakers were split into 2 groups for this experiment, with half of the listeners hearing one group of 4 speakers (P1, P4, P7, P8) and the other half of listeners hearing the remaining 4 speakers (P2, P3, P5, P6). This was done to ensure the experiment was of a reasonable time length. The 2 speaker groups were selected at random, and listeners were randomly assigned to transcribe speech from one of the 2 groups.

Orthographic Transcription: Data Treatment

For each sentence, the proportion of words correct was calculated based on the number of words correct divided by the total number of target words in each sentence. As per SIT protocols [15], any additions were ignored, and spelling mistakes and homonyms (words that sound the same but are spelt differently to the target word) were counted as correct words. This resulted in a percent correct score for each sentence. Since 11 sentences were used per speaker and 8 listeners transcribed each speaker, this gave a total of 88 scores for each speaker.
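As an illustration, the per-sentence scoring rule described above can be sketched as follows. This is a minimal example, not the authors’ scoring code: homonym and spelling-error handling is simplified to exact (case-insensitive) word matching.

```python
def score_sentence(target: str, transcription: str) -> float:
    """Percentage of target words correctly identified in a transcription.

    Additions in the transcription are ignored, per the SIT protocol. For
    brevity, this sketch credits only exact word matches, whereas the
    protocol also credits homonyms and misspellings of the target word.
    """
    target_words = target.lower().split()
    heard = transcription.lower().split()
    correct = 0
    for word in target_words:
        if word in heard:
            heard.remove(word)  # consume the match so repeated words are not double-counted
            correct += 1
    return 100.0 * correct / len(target_words)
```

For example, a 4-word target with 2 words correctly transcribed scores 50%; these per-sentence percentages are then averaged across the 11 sentences recorded for each speaker.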

Rating Scale

This assessment consisted of listeners hearing the same sentence from the grandfather passage (“He slowly takes a short walk in the open air each day”) [43] from each of the 8 speakers. Randomization in the order of stimuli presentation ensured that effects of task order (e.g., increasing stimuli familiarity) did not disproportionally influence any 1 speaker. Listeners were asked to rate each speaker using a visual analogue scale as per Fletcher and McAuliffe [45], with a range from “difficult” at one end to “easy” at the other. Listeners were asked to rate “how difficult is it to understand this speech?”

Rating Scale: Data Treatment

Scores extracted from the rating scale were on a continuous scale ranging from 0 (representing speech that was the most difficult to understand) to 10 (easiest to understand). This resulted in 16 listener ratings for each speaker, one from every listener.

RT Experiment

Two different sets of stimuli were used in the RT experiment, each set consisting of 4 speakers with dysarthria and 4 healthy control speakers, all reading veracity statements. The control speakers in one stimuli set read identical statements to the speakers with dysarthria in the opposite stimuli set, so each listener heard every veracity statement from a speaker either with or without dysarthria. Listeners were instructed to respond the same way they did in the initial practice trials, by pushing a button (N for true and M for false). These buttons were assigned for their location on the keyboard, as they were fairly central for either hand. If the listener was unsure of the correct response, they were instructed to guess. Listeners who heard P1, P4, P7, and P8 in the orthographic transcription task heard P2, P3, P5, and P6 in this experiment. This was done to reduce perceptual learning effects (i.e., becoming more accurate or faster in processing a person’s speech due to experience hearing the speaker in previous experiments).

RT: Data Treatment

To calculate exact RT for each sentence, some additional measurements were required. The overall RT was recorded from the time the sentence began until the time the listener pushed a button to select their answer (true or false). However, each sentence was of a different recorded length, so raw RTs could not be directly compared. To resolve this issue, the earliest point at which a listener could determine whether each sentence was true or false was identified and marked. For most sentences, this point did not occur until the final word, as the pairing of the first and second words alone was ambiguous (e.g., Cars have …, water is …), but for a few sentences the middle word (a verb) was sufficient to determine that the sentence was false (e.g., ovens sew …, clothes heat …). Because spoken words undergo immediate processing by listeners – and can often be correctly identified prior to hearing the complete signal – the onset of the identified word was marked rather than the offset. The time of response determination was therefore identified at the onset of this word in each sentence, which was manually marked by visual inspection of the waveform in Praat [46]. This allowed us to subtract the time between the start of the recording and the point of response determination from our measurement of RT, ensuring that reactions were only recorded from the point at which a decision could be accurately made by the listener.

As discussed in the experiment configuration section, listeners also completed a task in which they responded to pure tones by pressing a button when a tone was played. This measurement was used to account for how long it takes a listener to motorically respond to auditory stimuli. To do this, an average of each listener’s auditory RTs was subtracted from their initial RT to each sentence. It was assumed that any reaction to speech stimuli should be longer than this time, as during the RT experiment the listener would also have to hear the final word of the sentence, process its meaning, and decide if the sentence was true or false, prior to a motoric response. Thus, the final, normalized measurement of sentence RT was defined as the time following the onset of the sentence’s key “decisive” word, with each listener’s average response times to pure tones removed. This measurement was theorized to best reflect the RT that could be attributed to speech processing.
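The normalization described above reduces to a simple subtraction. A sketch, with illustrative variable names (all times in milliseconds); this is not the authors’ analysis code:

```python
def normalized_rt(raw_rt_ms: float, decisive_word_onset_ms: float,
                  mean_tone_rt_ms: float) -> float:
    """Normalized sentence RT attributable to speech processing.

    raw_rt_ms: time from sentence onset to the listener's button press.
    decisive_word_onset_ms: onset of the sentence's key "decisive" word,
        marked from the waveform.
    mean_tone_rt_ms: the listener's average motoric RT to pure tones.
    """
    return raw_rt_ms - decisive_word_onset_ms - mean_tone_rt_ms
```

A negative result would indicate that the listener responded before the decisive word could have been heard and processed.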

RT: Excluded Data

Any RT that was 2 SDs away from that listener’s average RT was excluded, as it was assumed that these items reflected fatigue or distraction during the task, as per Munro and Derwing [30]. Any response time that was negative (i.e., quicker than the average motoric response to a pure tone) was also removed, as this indicated that the listener had responded before hearing the sentence’s key word and was thus not attempting to process the meaning of the sentence. Between 0 and 14 sentences were excluded per listener. Thus, there were between 106 and 120 RT scores per speaker.
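These exclusion rules can be sketched as follows (a minimal illustration; applying both checks against the listener’s full set of normalized RTs in a single pass is an assumption):

```python
from statistics import mean, stdev

def filter_rts(rts: list[float]) -> list[float]:
    """Apply the exclusion criteria described above to one listener's
    normalized RTs: remove negative values (responses made before the
    decisive word) and values more than 2 SDs from the listener's mean."""
    m, sd = mean(rts), stdev(rts)
    return [rt for rt in rts if rt >= 0 and abs(rt - m) <= 2 * sd]
```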

It should be noted that inaccurate responses were not removed from the overall analysis of RTs. This was due to a high level of inaccurate responses recorded for some speakers with dysarthria. For example, in the case of 1 speaker, removal of incorrect responses would have required the deletion of over 40% of their RT data. Hence, the removal of incorrect responses would significantly reduce the data available for certain participants, while producing an unbalanced data set. In cases of severe dysarthria, it was also observed that listeners tended to produce longer response times to sentences that were ultimately perceived incorrectly. Thus, the additional time taken to produce these incorrect responses was considered another indicator of the increased cognitive processing required when listening to an ambiguous speech signal. For this reason, the inclusion of listeners’ reactions to wholly unintelligible speech was considered necessary in order to demonstrate the full effects of dysarthria severity on RT.

A Comparison of Measurement Reliability

To explore the interrater reliability of RT assessments, we examined the relationship between listeners’ responses on a trial-by-trial basis. The comparison of listener responses was completed prior to scores being averaged across trials of the same speaker. This was done to provide a fair comparison between the 3 assessments of dysarthria (which contained different numbers of trials per speaker). For example, the rating scale was a quick assessment, where each listener heard only one sentence per speaker, whereas the average orthographic intelligibility and RT scores were derived from 11 to 15 statements per speaker.

Intraclass correlation coefficients (ICCs) were calculated for each of the 3 assessments. The ICC can be used to assess the reliability of scores that are averaged across multiple judges, as well as the reliability of scores taken from a single rater. In our analysis, we present both results. Single-rater measurements provide a calculation of reliability in the context where scores from a single listener will be used to determine the final rating (see Koo and Li [47] for an explanation of the formula). Listeners’ subjective severity ratings achieved high interrater reliability across trials, with a single-rater ICC (2, 1) of 0.75. The reliability of the average score across judges was higher still, with an ICC (2, 16) of 0.98. As discussed in the methods, listeners were split into 2 groups in the transcription and RT task, with each group hearing a different set of speakers. In the orthographic intelligibility task, the group of listeners who transcribed P1, P4, P7, and P8 achieved a single-rater ICC (2, 1) of 0.78 and an average-measures ICC (2, 8) of 0.97. The second group achieved a single-rater ICC (2, 1) of 0.66 and an average-measures ICC (2, 8) of 0.94.

The first group of listeners was also more reliable in the RT task, despite hearing a different set of speakers. They achieved a single-rater ICC (2, 1) of 0.45 and average-measures ICC (2, 8) of 0.87 for their RT responses. The second group had a single-rater ICC (2, 1) of only 0.29, but a considerably higher average-measures ICC (2, 8) of 0.76.
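The two-way random-effects ICCs reported above follow the Shrout and Fleiss formulation explained in Koo and Li [47]. A minimal pure-Python sketch of the calculation (illustrative only, not the analysis code used in the study):

```python
def icc_2way(data: list[list[float]]) -> tuple[float, float]:
    """ICC(2,1) and ICC(2,k) for an n-subjects x k-raters matrix,
    using the two-way random-effects mean squares."""
    n, k = len(data), len(data[0])
    grand = sum(sum(row) for row in data) / (n * k)
    row_means = [sum(row) / k for row in data]
    col_means = [sum(data[i][j] for i in range(n)) / n for j in range(k)]
    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)  # between-subject
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)  # between-rater
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    single = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
    average = (msr - mse) / (msr + (msc - mse) / n)
    return single, average
```

With perfect rater agreement, both coefficients equal 1; systematic rater offsets lower the single-rater coefficient, mirroring the gap between the single-rater and average-measures values reported above.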

Comparison of RT Measurement between Healthy Speakers and People with Dysarthria

Before comparing the RT measurements to the other dysarthria assessments, we compared the median RTs for speakers with dysarthria against the median RTs for healthy speakers producing the same matched sentences. These data are presented in Table 2. Control speakers had a relatively small range of median RTs (444 to 621 ms). All speakers with dysarthria elicited higher median RTs than their matched control speaker (who read the same stimuli). However, in the case of P1 and P2, the overall distribution of responses across stimuli was not significantly different from that of their matched counterpart. The remaining 6 speakers with dysarthria demonstrated clear differences in listener RTs. A Mann-Whitney-Wilcoxon test indicated that RT was significantly greater for speakers with dysarthria (Mdn = 0.95 s) than for the matched control speakers (Mdn = 0.53 s), U = 236,790, p < 0.001, 95% CI (–0.42 to –0.32).
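A comparison of this kind can be reproduced with SciPy's Mann-Whitney U implementation. The sketch below uses synthetic log-normal RTs; the sample sizes and distribution parameters are illustrative assumptions chosen only to mimic the reported medians, not the study's data:

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
# Hypothetical RTs (seconds), centred near the reported group medians
rt_control = rng.lognormal(mean=np.log(0.53), sigma=0.25, size=200)
rt_dysarthria = rng.lognormal(mean=np.log(0.95), sigma=0.35, size=200)

# Two-sided test of whether the two RT distributions differ overall
u_stat, p_value = mannwhitneyu(rt_dysarthria, rt_control, alternative="two-sided")
```

A small p-value here indicates that the two RT distributions differ as wholes, which is the same nonparametric comparison reported in the text.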

Table 2.

A comparison of RTs between speakers with dysarthria and matched healthy participants


Distribution of Listener Ratings, RTs, and Transcription Accuracy of Speakers with Dysarthria

Figure 1 shows the distribution of RTs exhibited by listeners when responding to speakers with dysarthria. As can be seen in Figure 1, the data are positively skewed: while listeners were often able to respond quickly to the dysarthric stimuli, there were many instances in which more than 2 s were needed to listen to the final word and judge the veracity of the statement.

Fig. 1.

Bar chart frequency distribution of RTs to all stimuli from speakers with dysarthria.


Figure 2 demonstrates the frequency distribution of the 44 transcription scores provided by each listener. As with the RT data, the distribution is markedly skewed. Listeners frequently (n = 225 trials) achieved 100% accuracy in their transcriptions. However, the second most frequent score was 0%, which occurred in 47 trials. In addition, because sentence length was capped at 15 words, it was impossible for an individual sentence to score between 1 and 5% or between 94 and 99% words correct.
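The gaps in attainable scores follow directly from the arithmetic of short sentences. A small enumeration makes this concrete (assuming, as a simplification, that scores are whole-percent roundings of words correct over words total):

```python
def attainable_percentages(max_words=15):
    """Whole-percent intelligibility scores reachable for sentences of at most
    max_words words, scoring round(100 * words_correct / words_total)."""
    scores = set()
    for total in range(1, max_words + 1):
        for correct in range(total + 1):
            scores.add(round(100 * correct / total))
    return scores

reachable = attainable_percentages(15)
```

With a 15-word ceiling, no score falls between 0% and roughly 7% (one word of fifteen), or between roughly 93% (fourteen of fifteen) and 100%, producing the gaps noted above.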

Fig. 2.

Bar chart frequency distribution of orthographic transcription scores to all stimuli from speakers with dysarthria.


Figure 3 reflects the distribution of rating score data from listeners. This figure illustrates considerable variability in the rating scale data, with listeners using the entire length of the scale to describe their perceptions of dysarthria severity.

Fig. 3.

Bar chart frequency distribution of ratings given to stimuli from all speakers with dysarthria.


Comparison of RT Measurements to Standard Assessments of Dysarthria

One of the primary aims of this study was to determine the degree to which RT scores correlate with other forms of perceptual speech assessment, providing insight into the processing demands placed on listeners at different levels of dysarthria severity.

Correlation between RT and Orthographic Transcription

To calculate an orthographic transcription score, mean percent intelligibility for each speaker was determined by averaging intelligibility scores across the 11 sentences that each speaker produced. Since each sentence was transcribed by 8 listeners, the overall score for each speaker was derived from 88 transcriptions. These values were compared with the average RT to each speaker, calculated following the outlier removal procedures discussed in the methods section.

There was a significant negative correlation between RT scores and transcription-based scores for each speaker, with r(6) = –0.94, p < 0.001. This demonstrates that as more words were accurately transcribed in the SIT, there was a shorter RT to veracity statements.
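Per-speaker correlations of this kind can be computed with `scipy.stats.pearsonr`; with 8 speakers, the degrees of freedom are n − 2 = 6, matching the r(6) notation used here. The values below are hypothetical per-speaker means chosen only to illustrate the computation, not the study's data:

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical per-speaker means for 8 speakers (illustrative only)
intelligibility = np.array([98, 95, 90, 72, 60, 48, 35, 20], dtype=float)  # % words correct
mean_rt = np.array([0.55, 0.60, 0.70, 0.95, 1.10, 1.30, 1.55, 1.80])       # seconds

r, p = pearsonr(mean_rt, intelligibility)
```

A strongly negative r reproduces the direction of the reported relationship: faster responses accompany higher intelligibility.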

Correlation between RT and Rating Score

Mean rating scores for each speaker were calculated by taking the average score provided by the 16 listeners. Thus, only 16 trials per speaker were used to determine a rating score. After averaging across listeners, 2 distinct speaker groupings appeared, separating speakers into a less affected (P1, P2, P3) and more affected cluster (P4, P5, P6, P7, P8). There was a significant negative correlation between RT and rating scale measures, with r(6) = –0.89, p = 0.003.

Correlation between Orthographic Transcription and Rating Score

A significant positive correlation was found between average rating scores and percent of words correctly transcribed for each speaker, with r(6) = 0.88, p = 0.004. Higher rating scores corresponded to a higher percentage of correctly transcribed words.

Comparison of Average Veracity Statement Accuracy to Listener RT, Transcription Accuracy, and Rating Scale Scores

The use of veracity statements to measure RT allowed us to extract the number of times each speaker’s statements were responded to correctly. We converted this to an average accuracy percentage for each speaker, presented in Table 3 alongside the speaker’s average scores in the other 3 speech assessments. The accuracy scores demonstrated the clearest relationship to RT scores, with r(6) = –0.92, p = 0.001, in addition to strong linear relationships with transcription-based measurements, r(6) = 0.86, p = 0.006, and rating scale scores, r(6) = 0.76, p = 0.027.

Table 3.

A comparison of mean dysarthria assessment scores across speakers, with SDs in parentheses


Discussion

The purpose of this study was to explore the value of RT measurements as an adjunct assessment of dysarthria. Listeners heard veracity statements from people with a range of dysarthria severities and responded by pushing a button to indicate their answer, with the response timed. It was hypothesized that listeners’ RTs would differ significantly across speakers and that these differences would be related to dysarthria severity, indicating a relationship between increased listener processing demands and the severity of the disorder.

Measurement Reliability

Results from the reliability assessments demonstrated that the absolute agreement in listeners’ RTs and transcription accuracy was variable when listening to single sentences. Typically, ICC values above 0.75 are considered to indicate good reliability, while values above 0.9 indicate excellent reliability [47]. The strongest reliability came from the rating scale measurements, which demonstrated good single-rater reliability, indicating that a single listener’s scores were likely to be very consistent with the mean group ratings. In a clinical context, rating scale measurements are often completed by a single listener after hearing only one speech excerpt, so high reliability from a single trial is a positive finding. Orthographic intelligibility scores are typically averaged across multiple trials (and listeners) to reduce the variation inherent in the measurement. The reliability scores from this study indicate that this is appropriate, as the ICC was excellent in both listener groups when the average score across 8 listeners was considered. A similar pattern was apparent for the RT measures, which demonstrated poor single-rater reliability but good average-rater ICC values when averages from a group of 8 listeners were considered. It is clear that, in order to obtain a meaningful RT score, average values from multiple listeners and stimuli are necessary when assessing dysarthria. A review of data from each speaker (Table 3) indicates that orthographic intelligibility and RT scores have greater variability across listeners and stimuli in cases of more severe dysarthria.

Correlation between RTs and Other Assessments of Dysarthria

The results of this study demonstrate that RTs increase as speakers become more affected by dysarthria. RT had a strong negative correlation with orthographic transcription accuracy (r = –0.94) and was also closely associated with listener rating scores (r = –0.89). The fact that all 3 measures were closely associated provides evidence that demands on listener processing are closely related to dysarthria severity. Previous studies demonstrate that listeners’ RTs are higher under more challenging listening conditions, and prolonged RT has been used numerous times as evidence of increased cognitive difficulty and processing time [29-33]. Higher RTs to speakers with severe dysarthria are likely to reflect the added cognitive resources needed to attend to imprecise stimuli [48], including the increased demand on working memory to store and review the speech signal while top-down processing of speech occurs [49].

Listener Understanding of Veracity Statements and Orthographic Intelligibility

It is interesting to note some discrepancies between speakers’ orthographic intelligibility scores and listener accuracy in the veracity statement task. When considering overall accuracy in a forced-choice true/false paradigm, it is useful to double the number of errors reported, as a listener who cannot understand a statement has a 50% chance of responding accurately. However, even when the number of inaccurate responses is doubled, speakers P3, P4, and P5 showed higher response accuracy than might be expected based on their intelligibility. This may be because the veracity statements used in this study were substantially shorter than the SIT sentences and only required the listener to correctly identify a couple of key content words in order to interpret their meaning. We hypothesize that some speakers with a moderate intelligibility impairment are able to produce shorter sentences with more deliberate, clear speech patterns than the longer SIT stimuli. Furthermore, listeners may experience fewer demands on working memory when processing these shorter sentences. In contrast, speakers P6, P7, and P8 had lower accuracy scores than might be predicted from their intelligibility. This may be because their intelligibility has reduced to a point where phonologically similar content words become almost impossible to discriminate, and clearer speech patterns cannot be utilized for short sentences. If a listener misses a key content word in a veracity statement, they will not have enough information to interpret its meaning. Thus, despite correctly perceiving the occasional word, the listener will be forced to guess whether the statement is correct when speakers present with severe dysarthria. This could account for the differences between SIT scores (where it is possible to gain “partial credit” for understanding a few function words) and listener accuracy in responding to veracity statements (which require accurate perception of multiple content words).
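The "double the errors" rule described above is the standard guessing correction for a two-alternative forced-choice task. A minimal sketch (the function name is ours):

```python
def chance_corrected(observed_pct, guess_rate=0.5):
    """Guessing-corrected accuracy (p - g) / (1 - g), returned as a percentage.
    With g = 0.5 (true/false responses), this is identical to doubling the
    observed error rate; scores at or below chance are floored at 0."""
    p = observed_pct / 100.0
    return max(0.0, (p - guess_rate) / (1.0 - guess_rate)) * 100.0
```

For example, 90% observed accuracy implies 10% errors and hence roughly 10% lucky guesses, giving a corrected score of 80%; an observed score of 50% corrects to 0, i.e., performance indistinguishable from guessing.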

It is also worth noting that there were inconsistencies in speaker P6’s results relative to other speakers (i.e., this speaker had the 4th highest SIT score, but only the 6th highest veracity statement accuracy). This could indicate some additional differences in the speaker’s production of the 2 stimuli, due to their tendency to place careful emphasis on every syllable in the longer SIT sentences, combined with a slight decline in articulatory precision in their production of the veracity statements (perhaps due to fatigue during the recording session).

Limitations and Future Directions

Further testing and development of RT stimuli are needed to ensure that variations in listener knowledge and speech stimuli can be accounted for in the measurement of RT. Future work should consider the phonetic composition, word frequency, and probability of veracity statements, with the aim of more carefully balancing stimuli across speakers. For example, a sentence with more complex vocabulary is likely to require more time to process, affecting the results of RT tests. Future studies may benefit from having groups of speakers repeat the same statements, in order to develop mixed-effects models that can partition variance across speakers, listeners, and stimuli (while also factoring in the order of presentation). This could be used to produce a set of normative RT data for healthy speakers and people with dysarthria.

RT measurements offer a valuable adjunct in the assessment of dysarthria. The measurements are objective, interpretable, and believed to be closely related to the cognitive demands of processing dysarthric speech. In the current study, RT measurements demonstrated strong correlations with orthographic intelligibility measurements and scores derived from listener rating scales. However, the measurements were not particularly sensitive to mild dysarthria, and it was evident that RTs can vary considerably across speech stimuli and listeners. As with orthographic intelligibility measurements, averaging across listeners is required to achieve consistent assessment results. Further standardization of stimuli is needed to develop normative data for this population, and such standardization may also increase the RT assessment’s sensitivity to mild dysarthria. Ultimately, there is compelling evidence to support RT assessments as a valid method of indexing communication impairment. With additional research to standardize stimuli and procedures, RT measures could serve as a useful adjunct in the assessment of dysarthria.

We would like to thank the Oticon Foundation in New Zealand for helping to fund this line of research.

The authors certify that they have no affiliations with or involvement in any organization or entity with any financial interest or nonfinancial interest in the subject matter or materials discussed in this manuscript.

References

1. Duffy JR. Motor Speech Disorders: Substrates, Differential Diagnosis, and Management. Elsevier Health Sciences; 2013.
2. Hustad KC. Estimating the intelligibility of speakers with dysarthria. Folia Phoniatr Logop. 2006;58(3):217–28.
3. Weismer G, Laures JS. Direct magnitude estimates of speech intelligibility in dysarthria: effects of a chosen standard. J Speech Lang Hear Res. 2002 Jun;45(3):421–33.
4. Sussman JE, Tjaden K. Perceptual measures of speech from individuals with Parkinson’s disease and multiple sclerosis: intelligibility and beyond. J Speech Lang Hear Res. 2012 Aug;55(4):1208–19.
5. Bunton K, Kent RD, Duffy JR, Rosenbek JC, Kent JF. Listener agreement for auditory-perceptual ratings of dysarthria. J Speech Lang Hear Res. 2007 Dec;50(6):1481–95.
6. Beukelman DR, Childes J, Carrell T, Funk T, Ball LJ, Pattee GL. Perceived attention allocation of listeners who transcribe the speech of speakers with amyotrophic lateral sclerosis. Speech Commun. 2011;53(6):801–6.
7. Whitehill TL, Wong CC. Contributing factors to listener effort for dysarthric speech. J Med Speech-Lang Pathol. 2006;14(4):335–42.
8. Brady MC, Clark AM, Dickson S, Paton G, Barbour RS. The impact of stroke-related dysarthria on social participation and implications for rehabilitation. Disabil Rehabil. 2011;33(3):178–86.
9. Tikofsky RS, Tikofsky RP. Intelligibility measures of dysarthric speech. J Speech Hear Res. 1964 Dec;7(4):325–33.
10. Tikofsky RS. A revised list for the estimation of dysarthric single word intelligibility. J Speech Hear Res. 1970 Mar;13(1):59–64.
11. Yorkston KM, Beukelman DR. A clinician-judged technique for quantifying dysarthric speech based on single-word intelligibility. J Commun Disord. 1980 Jan;13(1):15–31.
12. Beukelman DR, Yorkston KM. Influence of passage familiarity on intelligibility estimates of dysarthric speech. J Commun Disord. 1980 Jan;13(1):33–41.
13. McAuliffe MJ, Fletcher AR, Kerr SE, O’Beirne GA, Anderson T. Effect of dysarthria type, speaking condition, and listener age on speech intelligibility. Am J Speech Lang Pathol. 2017 Feb;26(1):113–23.
14. Kalikow DN, Stevens KN, Elliott LL. Development of a test of speech intelligibility in noise using sentence materials with controlled word predictability. J Acoust Soc Am. 1977 May;61(5):1337–51.
15. Yorkston K, Beukelman D, Tice R. Sentence Intelligibility Test. Lincoln (NE): Tice Technologies; 1996.
16. Kim Y, Kent RD, Weismer G. An acoustic study of the relationships among neurologic disease, dysarthria type, and severity of dysarthria. J Speech Lang Hear Res. 2011 Apr;54(2):417–29.
17. Tjaden K, Wilding GE. Rate and loudness manipulations in dysarthria: acoustic and perceptual findings. J Speech Lang Hear Res. 2004 Aug;47(4):766–83.
18. Turner GS, Tjaden K, Weismer G. The influence of speaking rate on vowel space and speech intelligibility for individuals with amyotrophic lateral sclerosis. J Speech Hear Res. 1995 Oct;38(5):1001–13.
19. Weismer G, Jeng JY, Laures JS, Kent RD, Kent JF. Acoustic and intelligibility characteristics of sentence production in neurogenic speech disorders. Folia Phoniatr Logop. 2001 Jan-Feb;53(1):1–18.
20. Schiavetti N, Metz DE, Sitler RW. Construct validity of direct magnitude estimation and interval scaling of speech intelligibility: evidence from a study of the hearing impaired. J Speech Hear Res. 1981 Sep;24(3):441–5.
21. El Sharkawi A, Ramig L, Logemann JA, Pauloski BR, Rademaker AW, Smith CH, et al. Swallowing and voice effects of Lee Silverman Voice Treatment (LSVT): a pilot study. J Neurol Neurosurg Psychiatry. 2002 Jan;72(1):31–6.
22. Todd KH, Funk JP. The minimum clinically important difference in physician-assigned visual analog pain scores. Acad Emerg Med. 1996 Feb;3(2):142–6.
23. Street RL Jr, Brady RM, Putman WB. The influence of speech rate stereotypes and rate similarity on listeners’ evaluations of speakers. J Lang Soc Psychol. 1983;2(1):37–56.
24. McAuliffe MJ, Gibson EM, Kerr SE, Anderson T, LaShell PJ. Vocabulary influences older and younger listeners’ processing of dysarthric speech. J Acoust Soc Am. 2013 Aug;134(2):1358–68.
25. Ingvalson EM, Lansford KL, Fedorova V, Fernandez G. Receptive vocabulary, cognitive flexibility, and inhibitory control differentially predict older and younger adults’ success perceiving speech by talkers with dysarthria. J Speech Lang Hear Res. 2017 Dec;60(12):3632–41.
26. Theys C, McAuliffe M. Listening to disordered speech results in early modulations of auditory event-related potentials. 2017.
27. Sternberg S. Memory-scanning: mental processes revealed by reaction-time experiments. Am Sci. 1969;57(4):421–57.
28. Barrouillet P, Bernardin S, Portrat S, Vergauwe E, Camos V. Time and cognitive load in working memory. J Exp Psychol Learn Mem Cogn. 2007 May;33(3):570–85.
29. Gough PB. Grammatical transformations and speed of understanding. J Mem Lang. 1965;4(2):107.
30. Munro MJ, Derwing TM. Processing time, accent, and comprehensibility in the perception of native and foreign-accented speech. Lang Speech. 1995 Jul-Sep;38(Pt 3):289–306.
31. Gatehouse S, Gordon J. Response times to speech stimuli as measures of benefit from amplification. Br J Audiol. 1990 Feb;24(1):63–8.
32. Houben R, van Doorn-Bierman M, Dreschler WA. Using response time to speech as a measure for listening effort. Int J Audiol. 2013 Nov;52(11):753–61.
33. Wilson EO, Spaulding TJ. Effects of noise and speech intelligibility on listener comprehension and processing time of Korean-accented English. J Speech Lang Hear Res. 2010 Dec;53(6):1543–54.
34. Reynolds ME, Fucci D. Synthetic speech comprehension: a comparison of children with normal and impaired language skills. J Speech Lang Hear Res. 1998 Apr;41(2):458–66.
35. Evitts PM, Searl J. Reaction times of normal listeners to laryngeal, alaryngeal, and synthetic speech. J Speech Lang Hear Res. 2006 Dec;49(6):1380–90.
36. Evitts PM, Starmer H, Teets K, Montgomery C, Calhoun L, Schulze A, et al. The impact of dysphonic voices on healthy listeners: listener reaction times, speech intelligibility, and listener comprehension. Am J Speech Lang Pathol. 2016 Nov;25(4):561–75.
37. Steffens MC. Is the implicit association test immune to faking? Exp Psychol. 2004;51(3):165–79.
38. Cote-Reschny KJ, Hodge MM. Listener effort and response time when transcribing words spoken by children with dysarthria. J Med Speech-Lang Pathol. 2010;18(4):24–35.
39. Manous LM, Pisoni DB, Dedina MJ, Nusbaum HC. Comprehension of natural and synthetic speech using a sentence verification task. J Acoust Soc Am. 1986;79(S1):S25.
40. Hustad KC. The relationship between listener comprehension and intelligibility scores for speakers with dysarthria. J Speech Lang Hear Res. 2008 Jun;51(3):562–73.
41. Beukelman DR, Yorkston KM. The relationship between information transfer and speech intelligibility of dysarthric speakers. J Commun Disord. 1979 May;12(3):189–96.
42. Hustad KC, Beukelman DR. Listener comprehension of severely dysarthric speech: effects of linguistic cues and stimulus cohesion. J Speech Lang Hear Res. 2002 Jun;45(3):545–58.
43. Darley FL, Aronson AE, Brown JR. Motor Speech Disorders. 3rd ed. Philadelphia (PA): W.B. Saunders Company; 1975.
44. Utianski RL, Caviness JN, Liss JM. Cortical characterization of the perception of intelligible and unintelligible speech measured via high-density electroencephalography. Brain Lang. 2015 Jan;140:49–54.
45. Fletcher AR, McAuliffe MJ, Lansford KL, Liss JM. Assessing vowel centralization in dysarthria: a comparison of methods. J Speech Lang Hear Res. 2017 Feb;60(2):341–54.
46. Boersma P, Weenink D. Praat: doing phonetics by computer [computer program]. Version 6.0.30; 2017. Retrieved 11 February 2017 from http://www.praat.org/
47. Koo TK, Li MY. A guideline of selecting and reporting intraclass correlation coefficients for reliability research. J Chiropr Med. 2016 Jun;15(2):155–63.
48. Sarampalis A, Kalluri S, Edwards B, Hafter E. Objective measures of listening effort: effects of background noise and noise reduction. J Speech Lang Hear Res. 2009 Oct;52(5):1230–40.
49. Wingfield A, Alexander AH, Cavigelli S. Does memory constrain utilization of top-down information in spoken word recognition? Evidence from normal aging. Lang Speech. 1994 Jul-Sep;37(Pt 3):221–35.