Abstract
Objective: Connected speech is known to differ from single-word speech, yet there are currently few recommendations for transcribing it. This research therefore aimed to develop a clinically feasible protocol for connected speech transcription. The protocol was then used to describe the connected speech of children with childhood apraxia of speech (CAS), as little is known about their connected speech characteristics. Participants and Methods: Following a literature review, the Connected Speech Transcription Protocol (CoST-P) was iteratively developed and trialled. The CoST-P was then used to transcribe 50 connected utterances from each of 12 children (aged 6–13 years) with CAS. The characteristics of participants’ connected speech were examined using independent and relational analyses. Results: The CoST-P was developed, trialled, and determined to have adequate reliability and fidelity. Inter-word segregation (mean = 29 instances) was more frequent than intra-word segregation (mean = 4 instances). Juncture accuracy was significantly correlated with percentage of phonemes correct, but not with percentage of consonants correct. Conclusion: Connected speech transcription is challenging. The CoST-P may be a useful resource for speech-language pathologists and clinical researchers. Use of the CoST-P assisted in displaying CAS speech characteristics unique to connected speech (e.g., inter-word segregation and juncture).
Introduction
Children’s communicative competence in connected speech tasks relates directly to their performance in everyday social interactions [1]. However, connected speech is sampled and transcribed less frequently than single words [2, 3]. Broad phonemic transcription of single-word naming assessments (and subsequent segmental analysis) is the most common procedure in speech assessment [2, 3]. While direct connected speech assessment is not commonly included in assessment batteries, there are some well-established connected speech sampling techniques. For example, connected speech is most frequently elicited through spontaneous conversational utterances [4] with a minimum of 50 utterances recommended to obtain a valid sample (e.g., [4-7]).
Collecting speech samples in different elicitation contexts (i.e., connected speech and single words) allows speech-language pathologists to gain a more complete understanding of the speech of children with speech disorders. The characteristics of speech that are present in single words may differ from those found in connected speech samples of the same child (e.g., [8-10]). Morrison and Shriberg [8] identified that connected speech provides information regarding error patterns present in inter-word transitions (alongside measures of single-word accuracy [e.g., percentage of consonants correct – PCC]). Coarticulatory differences present in spontaneous conversational speech should be documented when transcribed, illustrating the importance of effective transcription.
Morrison and Shriberg [8] (as well as Masterson et al. [9] and Wolk and Meisler [10]) compared the accuracy of single-word and connected speech samples of children with speech disorders using speech analysis procedures based on variations of the Independent and Relational Phonological Analysis framework [11]. Independent analyses frequently reported include inventories of singleton phonemes, consonant clusters, syllable shapes, word length, and lexical stress. Relational analyses include the calculation of the PCC, percentage of vowels correct [12], syllable shapes correct, word lengths correct, and syllable stress patterns correct. Additionally, phonological processes and loss of phonemic contrasts were noted. Although the combinations of these analyses reported in the single-word and connected speech comparison studies [8-10] were comprehensive, they did not consider the unique characteristics of connected speech.
More recent literature has highlighted the presence of characteristics which are unique to connected speech [13]. To speak in a connected string of sounds, as both children and adults do, there must be coordination of the motoric requirements of words, as well as of the prosodic and word juncture behaviours to link those words together [14]. Juncture between words is one of the unique characteristics of connected speech. Close juncture such as the linking /ɹ, w/, and /j/ is present between words where a word final vowel and word initial vowel are adjacent [15]. For example, consider the phrases “two eggs,” “three eggs,” and “four eggs.” Here, most English speakers use a linking /w/ in /tu wɛgz/, a linking /j/ in /θɹi jɛgz/, and a linking /ɹ/ (a type of liaison) in /fɔ ɹɛgz/. Note that linking /ɹ/ is only relevant for analysis in non-rhotic dialects of English [16]. Open juncture includes assimilatory and elision behaviours that appear at word junctures in connected speech (cf. [13, 17, 18]). Assimilation across syllables occurs such that a neighbouring phoneme influences the sound (e.g., “handbag” /hændbæg/ → [hæmbæg]). Elision refers to the omission of one or more sounds in a word at the boundary with an adjacent word (e.g., “curved round” /kɜvd raʊnd/ → [kɜv raʊnd]). Additionally, there is evidence to suggest that coarticulatory factors may be present in connected speech. Coarticulation (a form of assimilation) describes the impact of preceding and following phonemes on the articulatory characteristics of a segment. These changes may occur across syllable and word boundaries [19]. Coarticulatory characteristics of connected speech were highlighted by Daniloff and Moll [20], who proposed that the articulatory requirements for a given segment affected production of up to four preceding phonemes irrespective of word position.
Segmental and single-word analyses (such as those used by Morrison and Shriberg [8]) may not provide sufficient information regarding the characteristics unique to connected speech. Thus, it follows that single-word transcription methods may be insufficient for providing information on connected speech characteristics. Instead, transcription and analysis which considers the interaction of phonemes across word boundaries may provide a richer documentation of connected speech and allow utterances to be viewed as a series of non-linear tiers. Such analysis would involve transcription of words as connected (forming a string of phonemes). Considering speech as connected in this way reflects the perspectives proposed in the articulatory gesture literature (e.g., [21]), which suggests that the onset/conclusion of a connected string of phonemes can only be defined by a breath pause. Consistent with the literature (e.g., [22]), this paper will use the term “speech string” to describe connected phonemes which, in this case, span word boundaries. Selection of speech strings may be undertaken similarly to utterance segmentation, with non-hesitation pauses forming their boundaries [23]. A shift away from single-word-focussed transcription to a procedure involving transcription of speech strings is thus indicated. To guide this transition, a connected speech transcription protocol is required.
Considering connected speech characteristics is particularly important for clinical populations displaying difficulty with speech tasks of increasing complexity. Children with the motor planning disorder childhood apraxia of speech (CAS) are one such population [24]. Three key features of CAS were reported in the 2007 ASHA technical report [24]: (1) inconsistency of errors on phonemes (demonstrated on repeated productions of sounds, syllables, or words), (2) irregular lexical stress patterning that results in inappropriate prosody, and (3) coarticulatory transitions between syllables that are disrupted or elongated. Such disrupted coarticulatory transitions may be present between words (as characterised by Shriberg’s pause marker [25]). However, there are limited data to quantify what the presence of these features, or others, may mean for connected speech in children with CAS. Perceptual analysis of the speech features of CAS is the current gold standard for diagnosis and is often undertaken primarily in a single-word context (e.g., [26-28]). Some studies have also utilised the Prosody-Voice Screening Profile Tool [29] or acoustic analysis (e.g., [30]) to describe the speech of children with CAS. The perceptual characteristics of connected speech in children with CAS have been described (e.g., [25, 31-35]). However, no quantitative connected speech profile has been developed for these children or others with more severe speech disorders. We may therefore learn more about the speech of children with CAS and others from a transcription method which allows for more detailed connected speech analysis.
Aims
This study had three inter-related aims: (1) to develop a transcription protocol for the connected speech of children with speech disorders, with potential use for typically developing populations; (2) to follow the newly formulated transcription protocol to transcribe the connected speech of children with CAS, thus determining the practicality of the protocol’s use; and (3) to analyse the connected speech transcriptions to describe the speech of the children with CAS.
To address these aims, this research was conducted in two parts: Study 1 was the development of the Connected Speech Transcription Protocol (CoST-P) [36] and Study 2 was the implementation of the CoST-P with a population of children with CAS. These studies will be described in detail below.
Materials and Methods
Participants
A convenience sample of participants recruited for a previous study [37] were selected to determine the feasibility of the transcription protocol and to subsequently describe the children’s connected speech. Twelve children (nine male; three female) aged 6–13 years with CAS were recruited from the Sydney metropolitan area (Table 1). All participants met the following three criteria: (1) diagnosis of CAS as determined through consensus of two speech-language pathologists (authors two and four) with expertise in the clinical diagnosis of CAS using the current gold standard [31]. For a child to be diagnosed with CAS, their speech needed to include the three core features of CAS as identified in the ASHA technical report [24], and four of the ten features of CAS described in Shriberg et al. [38]. (2) English as their primary language and at least one parent with English as their primary language (note that all children used a non-rhotic dialect of English). (3) Non-verbal intelligence and receptive language within the normal range as determined through the Test of Non-Verbal Intelligence – third edition [39], and the Clinical Evaluation of Language Fundamentals – fourth edition [40]. Children were excluded from the study if craniofacial abnormalities were noted in an oral musculature assessment or if they had other motor speech disorders as assessed through the Goldman-Fristoe Test of Articulation – second edition [41] and a connected speech sample. Children were also excluded from the study if they had hearing disabilities as determined through a hearing screener or if they had other developmental or genetic diagnoses.
All assessments were undertaken at The University of Sydney and were audio and video recorded. Informed consent was provided by the parent/caregiver of all participants before inclusion in the study. The study was approved by The University of Sydney Ethics Committee (2015/516).
Fifty spontaneous connected speech utterances were extracted from each participant’s initial assessment session recording. These utterances were produced during free conversational speech with the assessing speech-language pathologist, during a deliberately collected connected speech sample based on description of the Park Play Scene [42] or during connected speech elicited during completion of the expressive component of the Word Classes subtest of the Clinical Evaluation of Language Fundamentals – fourth edition (CELF-4) [40]. That is, they were not contiguous. To be considered as connected speech, the utterance needed to include a minimum of two words.
Study 1: Protocol Development and Validation
Materials and Methods
A literature review was undertaken by searching relevant databases (e.g., Scopus, Education Resources Information Centre, and Linguistic and Language Behaviour Abstracts) with key terms including “transcription,” “connected speech,” “spontaneous speech,” “conversational speech,” and “conversational sample.” This literature review informed the development of an initial version of the CoST-P [36], which provided a list of 25 recommendations for the transcription of the connected speech of children with speech disorders. The initial version did not comprehensively explain the preferred ways to capture connected speech, nor did it include examples or explanations of terminology and recommendations. A series of trial-and-revise cycles using the connected speech samples from the participants of Study 2 was then undertaken to develop the protocol iteratively. The final version of the protocol is provided in Appendix 1. Following the procedures laid out in the CoST-P, utterances from each child were transcribed by the first author directly into Phon 3.0 [43], a freely available speech analysis program.
Transcription of connected speech samples involved broad phonemic transcription with use of additional diacritics. Sennheiser HD280 Pro headphones were used for all transcriptions. A limit of five listening repetitions of each audio segment was imposed for transcription purposes (to eliminate the effects of auditory-visual illusions [44]). Once complete, all transcriptions were reviewed to ensure consistency with the final protocol.
A protocol fidelity measure was developed to evaluate implementation of the CoST-P. The fidelity (i.e., a transcriber’s adherence to the protocol) was assessed using a transcriber who had been trained in phonetic and phonemic transcription but had not been involved in the development of the protocol. For the purposes of determining the protocol fidelity, participant samples were purposively selected to capture a range of speech disorder severity. Thus, the protocol fidelity measure was used to determine an independent transcriber’s application of the CoST-P with ten utterances from three participants: (1) one child with a low PCC (i.e., considered too unintelligible for relational analysis), (2) one child with a moderate PCC (between 70 and 80% of consonants correct), and (3) one child with a relatively high PCC (above 90%). The fidelity rater chose to transcribe by hand, and the entire transcription process was filmed using an iPhone camera. Following the completion of the three sets of sample transcriptions, the first author viewed the footage and completed the protocol fidelity measure for each sample transcribed. No changes to the protocol were made following completion of the fidelity assessment.
Intra- and inter-rater reliability checks of the connected speech transcriptions were conducted on a randomly selected 20% of the samples. Intra-rater reliability transcription was completed 4 weeks after the initial transcription. The second author transcribed the selected samples for inter-rater reliability. The transcription protocol remained constant, with each utterance listened to a maximum of five times. Both point-to-point reliability of phonemes and total point-to-point reliability (including diacritics and notes) were calculated.
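The point-to-point calculation described above can be illustrated with a minimal Python sketch. It assumes the two transcriptions have already been aligned position by position (alignment itself is not shown), and the function name, the example symbols, and the pause marker "(.)" are hypothetical choices for illustration rather than part of the CoST-P or Phon.

```python
def point_to_point_agreement(first, second):
    """Return the percentage of aligned data points on which two transcriptions agree."""
    if len(first) != len(second):
        raise ValueError("transcriptions must be aligned to the same length")
    if not first:
        raise ValueError("no data points to compare")
    matches = sum(a == b for a, b in zip(first, second))
    return 100.0 * matches / len(first)


# Hypothetical aligned transcriptions of one speech string: one rater marks
# an inappropriate pause "(.)" where the other hears a glottal stop.
rater_a = ["t", "u", "(.)", "ɛ", "g", "z"]
rater_b = ["t", "u", "ʔ", "ɛ", "g", "z"]
agreement = point_to_point_agreement(rater_a, rater_b)  # 5 of 6 points agree
```

Running the same function once on phonemes only and once on phonemes plus diacritics and notes would give the two kinds of score reported in this study.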
Results
A protocol containing six broad recommendations, ten sub-recommendations, and 32 considerations was developed after eight trial-and-revise cycles. The recommendations and considerations were then operationalised into the tasks included in the CoST-P (Appendix 1). Multiple versions of the CoST-P were developed to account for considerations which arose during the transcription process (Table 2). Although the CoST-P may be used for either paper-based or computer-based transcription, some of the revisions were prompted by challenges completing the computer-based transcription using Phon [43]. The CoST-P provides recommendations regarding the transcription of the children’s target and actual productions. Broad phonemic transcription should be used, with additional diacritics to be included when transcribing distortions and errors in the children’s actual productions. A key feature of the CoST-P is the recommendation for words to be transcribed as connected where they run together naturally in connected speech (as this allows for the identification and transcription of inter-word features such as inappropriate pause). Other important considerations relate to management of contextually and/or dialectally appropriate reduced forms, assimilation, consonant deletion, between-syllable epenthesis, and juncture (linking /ɹ, w, j/). There is also a section of the CoST-P which relates to revision of the target transcription based on the context of the actual production to allow for correct alignment (matching) between the orthographic, target, and actual transcriptions.
Total protocol fidelity based on the transcription of ten utterances from three samples yielded a score of 88%. Transcription of features such as dysfluencies had high fidelity (100%). However, transcribing the utterances as speech strings had low fidelity (55%). The fidelity rater commented that she had difficulty remembering to transcribe the speech as connected, and frequently forgot to include the juncture phonemes. Intra- and inter-rater reliability scores of 92 and 86%, respectively, were achieved regarding broad phonemic transcription (excluding diacritics) on 3,205 data points. Scores of 86% (intra-rater reliability) and 79% (inter-rater reliability) were achieved when diacritics were included. There was no substantial pattern to the disagreement between raters. However, one potential source of difference was the frequency of one transcriber using a glottal stop phoneme where the other transcriber marked an inappropriate pause. Following the CoST-P, transcription of a child’s production of 50 utterances of connected speech into the three tiers (“Orthography,” “IPA Target,” and “IPA Actual”) took researchers between 5 and 7 hours.
Study 2: CAS Connected Speech Characteristics: Protocol Implementation
Materials and Methods
All transcriptions were completed using the CoST-P and entered into Phon 3.0 [43]. Within Phon, data entry consisted of the orthography of the phrase, target production of the phrase (IPA Target tier), and the realised or actual production of the participant (IPA Actual tier). It is also possible to document notes for each utterance within the Notes tier in Phon [45]. Once completed, these connected speech samples were analysed to address the third aim of the study: description of the connected speech of children with CAS.
Speech Analysis
Speech characteristics identified in the independent and relational phonological analyses originally described by Stoel-Gammon and Dunn [11] and modified by Baker [46] were analysed using Phon [43]. The independent and relational analyses which were conducted are outlined below.
Independent Analyses. Independent analyses included calculations of the number of phonemes, syllables, and connected speech strings (as opposed to words), as well as analysis of the types of stress patterns and syllable shapes. The output consisted of counts of primary, secondary, and unstressed syllables, as well as the frequency of occurrences of inter- and intra-word segregation. Counts of instances of juncture (i.e., the presence of linking /ɹ, w/, and /j/), dysfluency, and voice (i.e., counts of resonance/voice diacritics) were also undertaken.
Relational Analyses. The relational analysis included calculation of the percentage of correct phonemes (consonants and vowels), syllables, connected speech strings, stress patterns, and instances of juncture. Note that close juncture behaviours will be referred to as juncture in this paper. Additionally, the PCC-revised (PCC-R), which does not account for distortions [47], was calculated. Also, the proportion of whole-word proximity [48] – a measure of the actual phonological mean length of utterance (PMLU) divided by the target PMLU – was calculated.
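As an illustration of the whole-word proximity ratio, the following Python sketch divides an actual PMLU by a target PMLU. The scoring rule used to obtain each PMLU (one point per segment plus one extra point per correct consonant) is an assumption based on how the measure is commonly described; it is not spelled out in this paper, and the function names and example counts are hypothetical.

```python
def pmlu_points(segments, correct_consonants):
    """Assumed PMLU scoring: one point per segment, plus one per correct consonant."""
    if correct_consonants > segments:
        raise ValueError("correct consonants cannot exceed the number of segments")
    return segments + correct_consonants


def whole_word_proximity(actual_pmlu, target_pmlu):
    """Proportion of whole-word proximity: actual PMLU divided by target PMLU."""
    if target_pmlu <= 0:
        raise ValueError("target PMLU must be positive")
    return actual_pmlu / target_pmlu


# Hypothetical item: the target /tu wɛgz/ has 6 segments, 4 of them consonants
# (counting the linking /w/); the child produces 5 segments, 2 consonants correct.
target = pmlu_points(6, 4)
actual = pmlu_points(5, 2)
pwp = whole_word_proximity(actual, target)
```

In this hypothetical case the target scores 10 points and the production 7, giving a proximity of 0.7.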
Both built-in analyses (e.g., “Phonemic Inventory” and “Word Level Analysis of Polysyllables” [49]) and custom queries (e.g., frequency counts of stress patterns) in Phon 3.0 [43] were used to extract the required data from the connected speech samples. The data produced by Phon were summarised using Microsoft Excel. This involved summing values (e.g., number of phonemes correct) to produce totals for each participant. Additionally, totals across participants, means, and standard deviations were calculated. Note that some analyses at word level were not undertaken as speech strings rather than words were transcribed.
In addition to the data extracted from children’s connected speech, single-word data for the participants were sourced from the original study file. The PCC for single words was calculated based on the participants’ raw scores on the Goldman-Fristoe Test of Articulation – second edition [41]. PCC was obtained by first subtracting the raw score (number of errors) from the total number of target phonemes, then dividing by the total number of targets, and finally multiplying by 100.
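The single-word PCC calculation described above can be sketched in Python as follows. The function name and the example counts are hypothetical; only the arithmetic (targets minus errors, divided by targets, multiplied by 100) comes from the text.

```python
def pcc_from_errors(total_targets, errors):
    """Percentage of consonants correct from a raw error count."""
    if total_targets <= 0:
        raise ValueError("total_targets must be positive")
    if not 0 <= errors <= total_targets:
        raise ValueError("errors must lie between 0 and total_targets")
    return (total_targets - errors) / total_targets * 100


# Hypothetical raw score: 16 errors on 80 target phonemes.
single_word_pcc = pcc_from_errors(80, 16)
```

With these illustrative counts, 64 of 80 targets are correct, which gives a PCC of 80%.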
Statistical Analysis
Statistical analysis was conducted using Microsoft Excel. For the descriptive component of the study, the analysis involved calculations including means and standard deviations. Pearson’s correlation coefficient (r) was used to evaluate the relationship between the children’s PCC in single words and PCC in connected speech. Paired two-tailed t tests were performed to determine whether the differences between single-word and connected speech PCCs were significant. A significance level of α = 0.05 was used.
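The study ran these tests in Microsoft Excel; the Python sketch below is only an equivalent illustration of the same statistics using the standard library. The function names and the small data sets are hypothetical. Because the standard library has no t-distribution function, the sketch stops at the test statistics and does not compute p values.

```python
import math
from statistics import mean, stdev


def pearson_r(x, y):
    """Pearson's correlation coefficient for two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def r_to_t(r, n):
    """t statistic (df = n - 2) for testing whether a correlation differs from zero."""
    return r * math.sqrt((n - 2) / (1 - r * r))


def paired_t(x, y):
    """Paired t statistic (df = n - 1) computed from the pairwise differences."""
    diffs = [a - b for a, b in zip(x, y)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))


# Hypothetical paired scores, for illustration only.
r_example = pearson_r([1, 2, 3], [1, 2, 4])
t_example = paired_t([80, 70, 90], [78, 71, 85])

# A correlation of r = 0.62 across ten children corresponds to t of about 2.24 on 8 df.
t_from_r = r_to_t(0.62, 10)
```

The last line shows how a reported r value and sample size map onto the t statistic that underlies its p value.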
Results
Independent Analysis
Independent analyses were conducted on all twelve participants. The samples consisted of multiple connected speech tasks. All samples contained free connected speech. Six samples contained connected utterances based on the Park Play Scene [42] and six contained connected speech elicited during the completion of the CELF-4 [40]. Across the 12 samples of 50 utterances each, the participants produced a mean of 1,028 (SD = 491) phonemes per sample. Participants used more syllables with primary stress (mean = 258, SD = 77) than syllables with secondary stress (mean = 33, SD = 15) or unstressed syllables (mean = 28, SD = 39). Proportionally, syllables with primary stress comprised an average of 81% of total syllables, syllables with secondary stress 10%, and unstressed syllables 9%. Participants used a mean of 14 (SD = 6) orthographic words containing three or more syllables. There was a mean of 29 (SD = 21) instances of inter-word segregation in the samples, compared with a mean of 4 instances of intra-word segregation. Table 3 provides a summary of the independent analyses completed.
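As a quick arithmetic check, the reported stress proportions can be approximated from the reported mean syllable counts. The Python sketch below assumes the percentages can be recovered from the group means; the study may instead have averaged each child's own proportions, which would not necessarily give identical figures.

```python
# Mean syllable counts per sample, as reported above.
mean_counts = {"primary": 258, "secondary": 33, "unstressed": 28}

total = sum(mean_counts.values())  # 319 syllables on average
shares = {stress: round(100 * count / total) for stress, count in mean_counts.items()}
```

Under this assumption the rounded shares come out at 81%, 10%, and 9%, matching the proportions given in the text.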
Relational Analysis
Relational analysis was conducted on the speech of ten participants. The speech of two participants was considered too unintelligible to reliably transcribe the target productions (consistent with the protocol instructions).
The mean PCC was 80% (SD = 10) and the mean percentage of vowels correct was 88% (SD = 9). The mean accuracy of the speech strings (i.e., the percentage of correct phonemes and stress patterns) was 35% (SD = 9). The mean accuracy of juncture use (i.e., the percentage of appropriate use of the linking /ɹ, w/, and /j/) was 54% (SD = 34). The mean percentage of correct productions of pre-tonic unstressed syllables was 73% (SD = 17), whereas the mean for stressed initial syllables was 94% (SD = 5). Figure 1 provides a summary of the relational analyses. Supplementary Material A (available online at www.karger.com/doi/10.1159/000500664) provides individual participant data for a range of independent and relational measures.
The summary of phonological process analysis is provided in Table 4. Speech processes which were present on less than 5% of relevant phonemes (for each child) are not included (to account for typical errors or “slips of the tongue” during speech production). Excluded segmental processes therefore included velar fronting, coronal backing, and lateralisation, as well as the structural processes of metathesis, epenthesis, and assimilation. The process which occurred on the largest percentage of targets was deaffrication (i.e., affricates replaced with fricatives), affecting a mean of 32% of affricates (SD = 39). However, deaffrication was only present in the speech of six out of the ten children. All participants had deletion errors and deleted a mean number of 55 phonemes (SD = 34). The most common deletion occurred on final consonants (defined in this study as deletion of some or all coda consonants), with a mean of 32 instances per child (SD = 22), resulting in a mean of 17% (SD = 15) of total word-final consonants being deleted.
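The 5% inclusion rule for phonological processes can be sketched as a simple filter in Python. The function name and the per-child percentages below are hypothetical values for illustration only; they are not participant data from this study.

```python
def reportable_processes(rates, threshold=5.0):
    """Keep processes affecting at least `threshold` percent of relevant phonemes."""
    return {name: pct for name, pct in rates.items() if pct >= threshold}


# Hypothetical percentages of relevant targets affected, for one child.
child_rates = {
    "deaffrication": 40.0,
    "final consonant deletion": 17.0,
    "velar fronting": 2.0,
    "metathesis": 0.5,
}
kept = reportable_processes(child_rates)
```

Here only deaffrication and final consonant deletion would be retained, while the two processes below 5% would be treated as occasional slips and left out.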
Relationships between Speech Production Variables
Relationships between variables were examined using Pearson’s correlations. Most correlations were non-significant, with small r values. However, several results which may warrant further study are reported here for the purposes of discussion. The correlation between juncture accuracy and PCC was not significant (r = 0.62, p = 0.056). However, there was a significant positive correlation between juncture accuracy and percentage of phonemes correct (PPC) (r = 0.66, p = 0.038). The correlations between instances of inter-word segregation and PCC (r = 0.07, p = 0.85) and PPC (r = 0.1, p = 0.78) were not significant. The correlation between PCC and final consonant deletion was also not significant (r = 0.58, p = 0.08).
There was a significant positive correlation between the children’s PCC in single words and PCC in connected speech (r = 0.75, p = 0.01). Results of a two-tailed paired t test indicated no significant difference between participants’ single-word and connected speech PCC scores (t(9) = 0.8, p = 0.4).
Discussion
This study has resulted in the development of the CoST-P [36], which provides an evidence-based method for consistent connected speech transcription. During the CoST-P development phase, multiple versions were trialled, with the final protocol providing six broad recommendations and ten sub-recommendations. The sub-recommendations indicate the process which should be undertaken to complete the transcription in the most efficient and effective manner. Additionally, 32 considerations are documented which provide instruction regarding specific contingencies that may be required when transcribing the connected speech of children (e.g., how to manage the occurrence of contextually/linguistically appropriate reduced forms).
The CoST-P was applied to the transcription of the connected speech of children with CAS. The CoST-P was used with adequate reliability and fidelity scores (consistent with Landis and Koch [50]) and can therefore be considered for use in the study of children’s connected speech. Transcription of 50 utterances using the CoST-P (including the orthography, the target output, and the actual production) took between 5 and 7 h per child. This is a substantial time commitment, which will limit use of the current CoST-P in daily speech-language pathology practice. Further study, focussed on the formulation of a quick version of the protocol, may help address this issue.
CAS Characteristics
Transcription of the speech of children with CAS using the CoST-P allowed for the quantitative description of connected speech characteristics of this population. These descriptive results provide clinicians and researchers with an initial profile of the connected speech of children with CAS. Overall, the independent and relational analysis data appear to be consistent with the CAS speech characteristics reported in the single-word literature. For example, inappropriate prosody (in relation to lexical/phrasal stress) is reported to be a core feature of the speech of children with CAS [24]. The connected speech results here supported this finding, with a mean of only 62% of the speech string stress patterns produced correctly. Additionally, there was a relatively large standard deviation to this accuracy measure, reflecting reported heterogeneity in children with CAS [24]. These findings are of particular note in light of typically developing children’s innate receptivity to the lexical stress patterning of the ambient language [51] (e.g., natural production of trochaic disyllabic nouns and iambic productions of disyllabic verbs [52]).
Although not specific to inter-word juncture, another core CAS feature is lengthened/disrupted coarticulatory transitions between sounds and syllables [24]. This was present in connected speech, represented by the high mean frequency of inappropriate inter-word segregation (pauses). However, the count of these instances varied substantially across participants: one participant’s sample had only five instances of inappropriate inter-word segregation, whilst another had 67. The accuracy of use of juncture (linking) phonemes was 54%; however, the variation was again large for this measure (SD = 34). These results suggest that inappropriate inter-word segregation (and analysis of juncture) may not serve as the most effective means of diagnosis of CAS, given the wide variation in participant presentation of this feature in free connected speech. Also, no significant correlation was found between the instances of inappropriate inter-word segregation (pauses) and PPC or PCC. This suggests that the occurrence of this potential CAS marker is not closely linked to the traditional metrics of speech disorder severity. This finding is consistent with the results of Shriberg et al. [25], who reported that the pause marker scores (which provide a percentage of the instances of inter-word pauses) were not associated with other measures of speech precision or competence for children with CAS (e.g., phoneme accuracy, lexical stress, and speech rate). However, the present study identified a moderate, significant correlation between accuracy of juncture and PPC (the correlation with PCC was of similar size but did not reach significance), indicating that juncture accuracy may be a beneficial metric for determining disorder severity amongst children with CAS.
Interestingly, whilst all children displayed inappropriate inter-word segregation (with a mean of 29 instances), only seven participants had inappropriate intra-word segregation, with a mean of four instances per child. One potential source of this discrepancy is the phenomenon of avoidance reported by Morrison and Shriberg [8] whereby children do not produce certain challenging phonemes, syllable shapes, and word shapes. This may be a factor as children with CAS are known to have increasing difficulties with increasing word length [24]. The results support this, with the children producing a mean of only 14 words containing three or more syllables across all 50 utterances (4% of all words). This result contrasts with Stoel-Gammon’s study of typically developing children aged 30 months, which found that polysyllabic words constituted 8% of total words produced [53]. Furthermore, the count of polysyllabic words in this study may be inflated, as six of the participants discussed the “Park Play Scene” [42], which includes multiple polysyllabic items. The overall lower frequency of polysyllabic words could explain the presence of fewer instances of intra-word segregation and would also partially account for the relatively high proportion of syllables with primary stress. However, further research would be needed to ascertain whether a phenomenon such as avoidance is involved in word choice for these children.
Although CAS is a disorder of motor planning, it may be of benefit to place the results in phonological context, considering that these children are in a period of global development (e.g., [54, 55]). Furthermore, the intersection between phonology and motor speech has been discussed in the literature (e.g., [56–58]), with children with reduced motor speech success demonstrating difficulty learning expressive phonological contrast (e.g., [59, 60]). All participants demonstrated the structural phonological process of deletion, suggesting that this may be a common feature amongst children with CAS. The presence of deletion may have resulted from difficulties with coarticulatory transitions (a core diagnostic feature of CAS) [24]. Deletion of word codas (partial or complete) was the most common process, reflecting the literature which indicates that this is common amongst less intelligible children (e.g., [61]). This was further supported by the moderate correlation found between PCC and instances of word coda deletion. The segmental phonological process of deaffrication was present on 32% of the total affricates. However, as only six of the ten children produced this error and two of the children had 100% deaffrication, the results were skewed. While structural processes such as metathesis, epenthesis, and assimilation have been described as common in the speech of children with CAS (due to phoneme and syllable sequencing difficulties [62]), it is interesting to note these structural processes were present on less than 5% of total phonemes. It is possible that avoidance of complex words affected these results [8].
Unstressed initial syllables in iambic (wS) stress patterns (considered “un-footed” in the non-linear phonology literature) are more vulnerable to omission than trochaic patterns (i.e., words containing “footed”/stressed initial syllables) [63]. This was reflected by the greater accuracy of stressed versus unstressed initial syllables in the sample. The bias towards productions of trochaic patterns is a feature of the speech of children with general speech disorders, and thus, these results may not reflect the irregular lexical stress patterns identified as a core feature of CAS [24]. However, it is possible that whilst the patterns themselves are not irregular, the distinguishing factor is the continued presence of these errors past the age at which this issue typically resolves.
There was a strong correlation between PCC in single words and PCC in connected speech. Five of the ten participants had a higher PCC in connected speech, while three had a higher PCC in the single-word task. The two remaining children had the same PCC across both tasks. The children’s ability to control the content of their connected speech output may have led to the higher average PCC in the connected speech productions, as discussed in the literature (e.g., [8, 64]). Further, the sample size of 10 children with 50 utterances may have limited our ability to identify a more prominent correlation between the data sets and draw more definitive conclusions regarding the impact of avoidance or other factors.
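As an illustration of the consonant accuracy measure compared above, the following minimal sketch computes PCC as the number of target consonants produced correctly divided by the total number of target consonants, multiplied by 100. The consonant set, the function name, and the example pairs are assumptions made for illustration; they are not taken from the study data, and any mismatch (including a deletion) is simply treated as incorrect.

```python
# Hypothetical consonant inventory, for illustration only; extend it for the
# dialect being transcribed.
CONSONANTS = set("pbtdkgmnŋfvθðszʃʒhlɹwj") | {"tʃ", "dʒ"}


def percent_consonants_correct(aligned_pairs):
    """Return PCC for a list of (target, actual) segment pairs.

    `actual` is None when the target segment was deleted. Only pairs whose
    target is a consonant contribute to the score.
    """
    consonant_pairs = [(t, a) for t, a in aligned_pairs if t in CONSONANTS]
    if not consonant_pairs:
        raise ValueError("no target consonants in sample")
    correct = sum(1 for t, a in consonant_pairs if t == a)
    return 100.0 * correct / len(consonant_pairs)


# Invented example: target /kæt ɪz/ with the word-final /t/ deleted.
pairs = [("k", "k"), ("æ", "æ"), ("t", None), ("ɪ", "ɪ"), ("z", "z")]
pcc = percent_consonants_correct(pairs)  # 2 of 3 target consonants correct
```

The same function could be applied separately to the single-word and connected speech transcriptions of one child to obtain the two values being compared.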
We note that independent analysis of the connected speech of children with severe speech disorders (who have very poor intelligibility) provides information on inventories of phonemes, syllable shapes, and stress patterns. However, the unstructured connected speech context meant that transcription of target utterances was not possible and therefore, relational analyses and some independent analyses (such as identification of instances of juncture or dysfluency) could not be captured. More controlled assessment tasks, such as imitated sentence tasks, with known speech targets may be of greater value for both study and therapy goal planning for children with severe speech disorders. Such tasks may also reduce transcription time substantially.
CoST-P
Although there was no significant difference in single-word versus connected speech consonant accuracy measures, transcription following the CoST-P allows for consideration of connected speech in context (rather than consonant accuracy alone). Thus, the CoST-P captures a number of speech characteristics which may otherwise not be considered by speech-language pathologists and others interested in child speech development and disorder.
Two artificial utterance samples have been provided in Table 5 to assist with description of these features. The first displays transcription following a word-by-word approach; the second displays transcription using the CoST-P. Although some similarities can be observed, the connected speech transcription yields a different description and interpretation of the child’s speech.
Both methods allow for the development of a phonemic inventory (e.g., consonants: /k, l, w, d, m, n, t, f, s, h, r/, and vowels: /aɪ, ʌ, ə, a, ɛ, ɒ, ɪ, u, æ/) and analysis of phonological processes (e.g., stopping of fricative /ð/ and affricate /dʒ/). Also, calculation of other features (such as nasality) is possible using both methods. However, the context of within-speech string position may provide more information than within-word position. For example, word-by-word transcription may note inappropriate hypernasality on the /ɒ/ in /ɒ̃ɹəndʒ/ as there is no adjacent nasal consonant affecting its production. In contrast, transcription using the CoST-P highlights the proximity of the /ɒ/ to a nasal consonant (/m/), demonstrating the appropriateness of hypernasality in this instance [65]. Similarly, connected speech transcription provides a context for the presence of appropriate aspiration of voiceless plosives, whereas single-word transcription can highlight this as an error due to word boundaries (e.g., hætʰ ɒn). Both methods can capture additional speech characteristics such as dysfluencies (no instances present in the sample). Also, while both methods allow for counts of intra-word segregations (one instance in this example), transcription using the CoST-P enables documentation of frequency of inter-word segregations (two instances displayed in the example). The CoST-P also allows for analysis of instances of inter-word juncture behaviours. Further research is required to understand the extent of the implications of use of the CoST-P in assessment, diagnosis, and treatment. Using the CoST-P to formulate connected speech norms (and identify disorder-specific connected speech characteristics) may result in connected speech transcription becoming more important as a contributor to accurate and detailed diagnosis.
Furthermore, the additional speech characteristics which can be observed through connected speech transcription may assist with the development of more meaningful therapy approaches, goals, and therapy targets.
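The counts of inter-word and intra-word segregation discussed above can be tallied directly from a string transcription once a consistent marker has been chosen for each. The sketch below is a minimal illustration: the two marker symbols, the function name, and the sample string are all assumptions (the string is invented and is not the sample from Table 5).

```python
# Assumed markers (hypothetical, chosen for this sketch only):
#   "‖" = inappropriate inter-word segregation (lengthened pause)
#   "|" = intra-word (within-word syllable) segregation
INTER_WORD_MARK = "‖"
INTRA_WORD_MARK = "|"


def count_segregations(actual_transcription):
    """Count both segregation markers in one transcribed speech string."""
    return {
        "inter_word": actual_transcription.count(INTER_WORD_MARK),
        "intra_word": actual_transcription.count(INTRA_WORD_MARK),
    }


# Invented speech string with two inter-word pauses and one intra-word pause.
sample = "ðə ‖ kæt ‖ ɪz ɒn ðə bə|nanə"
counts = count_segregations(sample)
```

A word-by-word transcription has no place to record the inter-word marker, which is why only the string-based transcription yields the first count.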
The reliability score was acceptable (i.e., over 85%). Of note was the difference in the raters’ perceptions of pause, where one rater transcribed glottal stops and the other marked inappropriate inter-word segregation. Both replacement of word codas by glottal stops and consonant deletion (resulting in inappropriate inter-word pauses) have been reported in the literature as strategies to minimise articulation effort [18]. However, it was noted by Newton [66] that glottal stops do not occur on instances where the first consonant within the cluster is a nasal or approximant. This suggests that the protocol may benefit from further refinement to include specific recommendations regarding appropriate use of glottal stops and pause markers.
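To show how a glottal stop versus pause disagreement affects an agreement score, the sketch below computes simple point-by-point percent agreement between two raters. This section does not state how the reliability score was calculated, so the method, the function name, the pause marker "‖", and the example transcriptions are assumptions for illustration only.

```python
def percent_agreement(rater_a, rater_b):
    """Point-by-point percent agreement between two aligned transcriptions.

    Both inputs are lists of segments that have already been aligned to the
    same length (an assumption of this sketch).
    """
    if len(rater_a) != len(rater_b):
        raise ValueError("transcriptions must be aligned to the same length")
    if not rater_a:
        raise ValueError("empty transcription")
    matches = sum(1 for a, b in zip(rater_a, rater_b) if a == b)
    return 100.0 * matches / len(rater_a)


# Invented example: one rater hears a glottal stop, the other marks a pause
# ("‖" is a hypothetical marker for inappropriate inter-word segregation).
rater_1 = ["k", "æ", "ʔ", "ɪ", "z"]
rater_2 = ["k", "æ", "‖", "ɪ", "z"]
agreement = percent_agreement(rater_1, rater_2)  # 4 of 5 points agree
```

If such disagreements recur across a sample, a single perceptual ambiguity can lower the score noticeably, which is consistent with the suggestion that the protocol specify when each symbol should be used.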
The fidelity was in the desirable range, indicating that the areas of difference were not of significant concern. However, the poor fidelity results regarding transcribing connected speech in strings (as opposed to word by word) highlight that a change in conceptualisation of connected speech will be required before the CoST-P can be regularly used. That is, transcription of speech strings and inclusion of features such as juncture (e.g., a linking /ɹ/) are not the current transcription default in speech-language pathology research or practice. Therefore, time and training will be required to facilitate the shift and establish more regular use of the CoST-P. However, this change may be justified in light of the preliminary results of this study, which indicate that speech string transcription assists in capturing the unique characteristics of connected speech.
Study Limitations and Future Research Directions
There are several factors which have limited the scope of this study in relation to both the CoST-P and the description of CAS. The use of a convenience sample of pre-recorded spontaneous productions may have resulted in the capturing of speech which did not contain a complete segmental or structural inventory (as the children were not required to produce specific phonemes). Studies containing structured and unstructured speech tasks transcribed using the CoST-P may provide information on the effects of avoidance, especially on unique connected speech features such as juncture. Additionally, a comparison sample from a normative, typically developing population may assist in placing the results regarding CAS speech characteristics in context. Future studies which assess the use of the CoST-P for children with other speech disorders, such as dysarthria and more severe phonological disorders, are required. This will help determine the protocol’s reliability and fidelity more broadly, thus ascertaining the suitability of its use with a wide range of clinical and developmentally typical populations. Additionally, further studies are required to determine the CoST-P’s applicability for children speaking different dialects of English (including non-rhotic dialects) as well as its use for other languages.
An important next step in increasing the practicality of the CoST-P is the development of a version which reduces the time required to complete a transcription. Additionally, further training resources and examples for transcribers, as well as alterations to the CoST-P based on the results of the fidelity assessment, may increase the protocol’s value for researchers and clinicians.
Further study is needed to determine the most appropriate tools and analyses to use to glean the most meaningful information following connected speech transcription. Direct comparison between single words and connected speech (transcribed using the CoST-P) is another avenue for future research. This may highlight the discrepancies between the two output forms previously identified in the literature (e.g., [8]) and may display the benefit of use of the CoST-P to identify the elements that are not present in the transcription of single words. Additionally, studies to determine the extent to which connected speech features such as juncture may be of benefit in assessment and diagnosis are an important step. Further insights into the foundations of connected speech may be obtained by considering this data through other theoretical lenses.
Conclusion
The transcription and analysis of the connected speech of children with speech disorders poses multiple challenges to clinicians and researchers. This study has discussed the development of a transcription protocol for connected speech and its subsequent use with a sample of children with CAS. Conceptualising connected speech as a string of sounds rather than as individual word units assists in developing a deeper understanding of a child’s speech capabilities and limitations. Given the preliminary nature of the results, limited generalisation of these results is recommended. However, it can be seen that transcription using the CoST-P provides information on features unique to connected speech, such as the presence of inappropriate inter-word segregation and juncture. The CoST-P may be a meaningful resource for speech-language pathologists, promoting increased use and awareness of the benefits of connected speech sampling and transcription.
Acknowledgement
Natalie Cavagnino completed the fidelity test. Yvan Rose assisted with use of Phon to complete analyses. Rob Heard assisted with statistical analysis.
Statement of Ethics
Subjects (or their parents/guardians) have given their written informed consent. The study protocol has been approved by the University of Sydney committee on human research.
Disclosure Statement
The authors have no conflicts of interest to declare.
Funding Sources
The participant data was sourced from a previous study, which was funded by a grant from the Childhood Apraxia of Speech Association of North America (CASANA).
Author Contributions
The concept for the paper was conceived by Catherine Barrett and Patricia McCabe with substantial input from Sarah Masso. Jonathan Preston and Patricia McCabe supervised collection of the participants’ data. Catherine Barrett completed all tiers of transcription and completed the intra-rater reliability. Sarah Masso assisted with the use of Phon to complete analyses. Patricia McCabe completed inter-rater reliability. Catherine Barrett wrote the initial draft of the manuscript. Patricia McCabe and Sarah Masso provided conceptual advice and multiple edits. Jonathan Preston read and commented on two drafts of the manuscript.
Appendix
Connected Speech Transcription Protocol (CoST-P) (Barrett et al., 2018)
Before phonemic transcription
1. Gather information about the child’s dialect/languages spoken at home (other than the target language). Phonemes, syllable/word structures, and stress patterns will vary across languages/dialects. The Multilingual Children’s Speech Sound Disorders website (www.csu.edu.au/research/multilingual-speech/home) is a suitable starting place.
2. Document the conversational context.
3. Obtain a minimum of 50 of the child’s connected utterances, and orthographically transcribe as many of the productions as possible.
During phonemic transcription
1. Transcribe the target production using IPA (where possible)
(a) Use broad phonemic transcription.
(b) Transcribe words as connected (i.e., without boundaries) where words run together naturally in connected speech.
General considerations
i. Transcribe contextually appropriate and/or dialectically appropriate:
reduced forms
For example, /gɒnə/ may be an appropriate shortening of /gɒʊɪŋ tu/.
assimilation of consonants within/between words
For example, “sandwich” may be produced as /sæmwitʃ/.
consonant deletion
For example, deletion of /t/ or /d/, such as /doʊnt noʊ/ → /doʊn noʊ/.
between-syllable epenthesis
For example, “hamster” is commonly produced as /hæmpstə/.
juncture (i.e., a linking /ɹ, w/, or /j/ should be present on all occasions between two adjacent vowels within a speech string)
For example, “car is” = /kaɹɪz/; “cow is” = /kaʊwɪz/; ‘they are’ = /ðeɪja/.
ii. Do not use diacritics to transcribe allophonic variation where variation is linguistically appropriate. For example, do not mark hypernasality on vowels preceding or following a nasal consonant (e.g., /ænt/ or /mæn/) or aspiration on voiceless plosives before a stressed vowel (e.g., /ti/).
Considerations when unintelligible
Where an utterance is considered unintelligible on first inspection:
i. Move to the next utterance and return to the difficult section once you are more familiar with the child’s speech output.
ii. Transcribe as much information for the target as possible (consider context).
iii. If the utterance is completely unintelligible and the target is not possible to ascertain, leave the target transcription blank. Independent analysis may still be possible using these utterances.
(c) Note that you may need to return to “target” productions after transcription of “actual” productions (see considerations for revision of target form [3a]).
2. Transcribe actual productions using IPA (where possible)
(a) Use broad phonemic transcription with phonetic transcription of phonemes where required (i.e., use additional diacritics to represent distortions/errors). For example, use approximation, lateralisation, dentalisation, or vowel lengthening diacritics.
(b) If the error cannot be described with the use of a diacritic, search the full IPA chart to find a matching phoneme (i.e., do not limit transcription to use of English phonemes).
(c) Do not alter the audio file used for transcription (e.g., segmentation, pitch shifts, or speed reduction) – altering an utterance may result in loss of phonemic information.
(d) Transcribe all:
i. moments of within-word syllable segregation using a symbol or marker (for example, a pause marker if using Phon).
ii. moments of inappropriate between-word segregation as a lengthened pause (note that appropriate pauses between words can be marked with a space).
iii. moments of stereotyped speech (e.g., counting). However, it is important to note “stereotyped utterance” below.
iv. moments of dysfluency phonemically (transfer into IPA target and orthography). Note each occurrence of dysfluency with a code (e.g., “$”).
v. reduced forms of words (see target transcription general considerations for an example).
vi. consonant assimilation within/between words (see target transcription general considerations for an example).
vii. consonant deletion between words (see target transcription general considerations for an example).
viii. between-sound/between-syllable epenthesis (see target transcription general considerations for an example).
ix. instances of juncture (linking /ɹ, r, w, j/) (see target transcription general considerations for an example).
x. moments of children whispering or talking in a sing-song prosody (note with an appropriate symbol).
General considerations
i. Where vowel elongation occurs, and this would alter the vowel transcription, transcribe as the new vowel (e.g., /ɪ/ → /i/). If the vowel is lengthened further, transcribe as the new phoneme with the relevant diacritic.
ii. Apply relevant diacritics for distorted consonants (e.g., approximation, lateralisation, dentalisation, or vowel/consonant lengthening diacritics).
iii. Mark voice change using diacritics (e.g., resonance or ejective diacritics) when linguistically inappropriate. Do not mark where allophonic variation is expected (see example in target transcription general consideration ii). If there is consistent inappropriate voice, mark this in notes, as this may be indicative of a voice disorder rather than phoneme-associated voice change.
iv. Where there is lack of audible release on a final consonant (/p, b, t, d, k, g/), but the consonant is marked, transcribe as target consonant (e.g., “map,” “jog”).
v. Where a pause is audible, make a consistent transcription decision regarding whether the child is producing a glottal stop or an inappropriate pause.
Considerations when unintelligible
i. Transcribe all transcribable phonemes (following protocol above).
ii. If the manner of consonants is audible, but the place of consonant is unidentifiable, transcribe using a generic marker from the extended IPA (e.g., X).
iii. If the place and manner of a consonant are not audible, transcribe with a generic consonant marker (e.g., C).
iv. If a vowel’s place is unknown, transcribe with a generic vowel marker (e.g., V).
3. Match the orthographic transcription to the IPA Target and the IPA Actual where possible
(a) Revision of the Target form
i. If the Target is transcribed as an altered form (i.e., as a reduced form, with consonant assimilation or deletion, or with between-sound/syllable epenthesis), but is produced as the citation form by the child, determine the appropriateness based on the linguistic context/ambient language. If citation form is appropriate in the Actual, alter the Target to match.
ii. If contextually/linguistically appropriate juncture (e.g., linking /ɹ, r, w, j/) has been transcribed in the Actual but has not been included in the transcription of the Target, alter target to match.
iii. If contextually/linguistically appropriate altered forms are present in the Actual production, but are not transcribed in the Target, alter Target to match.
iv. If a between-word segregation marker is present in the Actual, but juncture has been transcribed in the Target, delete the juncture phoneme from the Target, keeping the words connected. This ensures that the error is being marked by one (rather than two) differences between the Target and Actual (essential if using transcription software).
(b) Where alignment decisions are not clear, refer to an evidence-based protocol for syllabification rules (e.g., Grunwell [1987]).
(c) Once alignment is completed, the next steps depend on the goals of transcription. You may:
i. Informally study the actual transcription and compare to the target transcription to understand the child’s connected speech competence (with regard to key features/areas of difference) OR
ii. Study each disparity between the target and actual on a point-by-point basis to identify each error/difference, allowing for more precise study of the child’s connected speech.
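The comparison of Target and Actual described in step 3(c) can be sketched with a generic sequence alignment. The example below uses Python's standard `difflib.SequenceMatcher` to list each disparity between two segment lists. This is only an illustration under stated assumptions: the alignment is not phonologically informed, it is not the method used by Phon or by the authors, and the function name and example transcriptions are hypothetical.

```python
import difflib


def list_differences(target, actual):
    """Return (operation, target_segments, actual_segments) for each disparity.

    `target` and `actual` are lists of transcribed segments for one speech
    string. Operations are "delete", "insert", or "replace".
    """
    matcher = difflib.SequenceMatcher(None, target, actual, autojunk=False)
    return [
        (tag, target[i1:i2], actual[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Invented example: "car is" with the linking /ɹ/ in the Target but absent
# from the child's Actual production.
target = ["k", "a", "ɹ", "ɪ", "z"]
actual = ["k", "a", "ɪ", "z"]
differences = list_differences(target, actual)
```

Each entry in `differences` corresponds to one disparity, so the omitted juncture phoneme is reported as a single deletion. This is in keeping with the aim in step 3(a)iv that an error be marked by one, rather than two, differences between the Target and Actual.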