Who countest the steps of the Sun:
Seeking after that sweet golden clime
Where the traveller’s journey is done.
Importantly, the modifications of the various parallelistic features did not affect several other features that are likewise characteristic of the type of poetry used in our study. For instance, a low degree of narrative content, unmediated evocations of highly personal and highly emotional speech situations, and the frequent addressing of an absent person/agent who is or was highly significant for the lyrical speaker are found across all versions of the poems. Moreover, non-parallelistic features of poetic diction (such as metaphor, ellipsis, etc.) were also kept as constant as possible. Finally, non-metered and non-rhymed poems account for a substantial share of 20th century poetry. For all these reasons, the modified versions that were relatively low in parallelistic features were also readily accepted as poems. Table 1 illustrates the modification steps we applied to our set of 40 German poems. In order to make these steps intelligible to a broader readership, we illustrate them using an English analogue based on the first stanza of the poem “Ah Sun-flower!” by William Blake. (For a detailed German example of all differences between versions A and E, see Supplementary Materials in Menninghaus et al., 2017.)
Professional speaker.
The 200 stimuli overall (40 poems in five versions each) were recited by a professional speaker who is a trained actor, certified voice actor, and speech trainer. The digital recordings (sampling rate 44.1 kHz, amplitude resolution 16 bits) were made in a sound-attenuated studio. The speaker was instructed to recite all poem versions with a relatively neutral expression and without placing too much emphasis on a personal interpretation. Errors during reading were corrected by repeating the respective parts of the poems.
In order to obtain speaker-independent evidence of poem-based pitch and duration recurrences, we opted for several control conditions. One control condition involved computer-generated voices in a text-to-speech application (natively implemented in Mac OS X 10.11). All five versions of the 40 poems were synthesized using a male and a female voice at a syllable rate of ~4 syllables/s. We used the voices called ANNA (female) and MARKUS (male) in their standard settings. The algorithm first translates text input into a phonetic script and then synthesizes each word using pre-recorded speech templates. Global prosodic features are applied by default and triggered by punctuation (e.g., question intonation is triggered by a question mark). We decided to use these voices on the basis of their overall acceptable voice quality, which was judged to be superior to that of many other text-to-speech synthesis applications, including some that allow for more detailed control over acoustic-phonetic properties.
Another control involved 10 nonprofessional native German speakers (4 males, 6 females, mean age 29 ± 7 years) who read a feasible subset of 8 original poems (version A) and 8 modified versions (version E) of these poems. Participants in this production study were recruited from the participant pool of the Max Planck Institute for Empirical Aesthetics and received monetary compensation. They were asked to read the 16 poems in randomized order, naturally and in a standing position in a sound-attenuated recording studio (sampling rate 44.1 kHz, amplitude resolution 32 bits). They were also asked to avoid strong expressivity in their renditions. Errors during reading were corrected by repeating the respective parts of the poems. Prior to acoustic analyses, the recorded poems were modified in order to match the written versions (replacement of erroneous passages, removal of non-text-based additions). All experimental procedures were ethically approved by the Ethics Council of the Max Planck Society and were undertaken with the written informed consent of each participant. On average, the recording session took about one hour per participant.
Our acoustic analyses focused on the primary acoustic cue of linguistic pitch, i.e., the fundamental frequency of sonorous speech parts (F0, [ 29 , 30 ]), and on syllable duration.
Digitized poem renditions were automatically annotated using the Munich Automatic Segmentation system (MAUS), and annotation grids were imported into the phonetic software application PRAAT ([ 31 ]). Annotation was based on syllabic units; this was motivated by the observation that the syllable is a core linguistic unit in poetry ([ 32 ]) and the minimal unit in the prosodic hierarchy ([ 33 ]). The annotation was manually inspected and corrected by a trained phonetician and native speaker of German. Corrections included marking silent periods larger than 200 ms as pauses and shifting syllabic boundaries to zero crossings in order to arrive at consistent cuttings, as is usual in phonetic annotation.
In principle, the measure of recurrence we used is entirely independent of manual chunking or manual pre-processing. However, we decided for an approach that includes a “hand-made” control and fine-tuning of the exact syllable boundaries; we expected that this additional effort could improve the correlations between our statistical textual measure and the data for subjective perception. (A quantification of the degree to which our manual intervention actually improved the correlations was, however, beyond the scope of our paper.)
For all analyses, our syllable-based pitch estimation followed the approach in Hirst [ 34 ]. In a first pass, the fundamental frequency of the entire poem was calculated using an autocorrelation approach, with the pitch floor at 60 Hz and the pitch ceiling at 700 Hz. The 25% and 75% quartiles (Q25 and Q75) of pitches were identified and used for determining the pitch floor and ceiling in the second pass of fundamental frequency estimation. The second-pass pitch floor was 0.25 * Q25, while the pitch ceiling was 2.5 * Q75. Pitch extraction is illustrated in Fig 1 . Averaged pitch and duration values for each speaker/voice are provided in Table 2 .
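The quartile-based logic of the second pass can be sketched as follows (a minimal numpy sketch with variable names of our choosing; the actual F0 extraction used Praat's autocorrelation method, which is not reproduced here):

```python
import numpy as np

def second_pass_bounds(first_pass_f0):
    """Derive the second-pass pitch floor/ceiling from first-pass F0
    estimates: floor = 0.25 * Q25, ceiling = 2.5 * Q75."""
    f0 = np.asarray(first_pass_f0, dtype=float)
    f0 = f0[~np.isnan(f0)]                      # keep voiced frames only
    q25, q75 = np.percentile(f0, [25, 75])
    return 0.25 * q25, 2.5 * q75

# e.g., first-pass estimates clustered around 100 Hz
floor, ceiling = second_pass_bounds([90, 95, 100, 105, 110])
```

Deriving the second-pass bounds from the first pass in this way adapts the analysis range to each recording, making the estimation robust against octave errors without per-speaker hand-tuning.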
A. Top: The digitized speech signal was annotated, using syllabic units (example poem: August von Platen [1814], Lass tief in dir mich lesen). Bottom: The pitch contour was obtained by a two-pass fundamental frequency (F0) estimation. From sonorous parts, the mean pitch at three measurement positions was calculated. B. Mean pitch values were mapped onto semitones, using the MIDI convention, and syllable duration was mapped onto musical length. For illustration purposes, the resulting notation was shifted two octaves up. C. Discrete pitch and duration values were subjected to autocorrelation analyses. Apart from an overall measure of autocorrelation strength, the study focused on autocorrelation values at lags that correspond to poetic structure, such as (all) individual lines, rhyming lines, and stanzas.
Each speaker in the control group of 10 speakers produced a subset of 16 poems (8 original [A] versions, 8 modified [E] versions). The syllable rate is computed as the number of non-silent syllabic units per time unit.
| Speaker | Mean syllable pitch [Hz ± SD] | Mean syllable rate [1/s ± SD] | Number of poems |
|---|---|---|---|
| Professional speaker | 99 ± 6 | 3.0 ± 0.2 | 200 (40 × each of versions A–E) |
| Synthetic voice MARKUS (male) | 102 ± 2 | 4.1 ± 0.2 | 200 |
| Synthetic voice ANNA (female) | 173 ± 3 | 4.1 ± 0.2 | 200 |
| Nonprofessional speaker 1 | 196 ± 3 | 4.2 ± 0.3 | 16 (8 × A, 8 × E) |
| Nonprofessional speaker 2 | 217 ± 5 | 2.9 ± 0.2 | 16 |
| Nonprofessional speaker 3 | 218 ± 6 | 3.3 ± 0.2 | 16 |
| Nonprofessional speaker 4 | 129 ± 3 | 3.1 ± 0.2 | 16 |
| Nonprofessional speaker 5 | 214 ± 7 | 3.4 ± 0.3 | 16 |
| Nonprofessional speaker 6 | 242 ± 3 | 2.9 ± 0.2 | 16 |
| Nonprofessional speaker 7 | 227 ± 5 | 3.6 ± 0.2 | 16 |
| Nonprofessional speaker 8 | 141 ± 2 | 3.6 ± 0.2 | 16 |
| Nonprofessional speaker 9 | 137 ± 3 | 4.0 ± 0.2 | 16 |
| Nonprofessional speaker 10 | 126 ± 2 | 3.2 ± 0.2 | 16 |
Following Patel et al. [ 23 ], we computed the mean pitch for each syllable across the three measurement positions beginning, middle, and end. In rare cases (<0.2% of all data points), pitch could not be determined and was interpolated based on the pitches of neighboring syllables. In addition to syllable-based pitch information, we also calculated the physical duration of each syllable. We excluded pauses from the pitch and duration analyses.
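This step can be sketched as follows (function and variable names are ours; linear interpolation from neighboring syllables is one plausible reading of the interpolation described above):

```python
import numpy as np

def syllable_pitch(pitch_at_positions):
    """Mean pitch per syllable across the three measurement positions
    (beginning, middle, end); syllables with no measurable pitch
    (all-NaN rows) are linearly interpolated from their neighbors."""
    arr = np.asarray(pitch_at_positions, dtype=float)   # shape (n_syll, 3)
    per_syll = np.nanmean(arr, axis=1)
    idx = np.arange(len(per_syll))
    missing = np.isnan(per_syll)
    per_syll[missing] = np.interp(idx[missing], idx[~missing], per_syll[~missing])
    return per_syll
```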
Pitch and duration values were discretized, using MIDI conventions, in order to arrive at a more music-analogous basis for subsequent pitch analyses. This implied that raw pitch values (in Hz) were transformed into semitones on the MIDI scale, with numeric values ranging from 21 to 108, using the following formula:

d = 69 + 12 · log₂(F / 440)
with d = MIDI pitch value and F = raw pitch (in Hz).
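In code, the standard MIDI convention (note 69 = A4 = 440 Hz, twelve semitones per octave, piano range 21 to 108) can be sketched as:

```python
import math

def hz_to_midi(f_hz):
    """Map a raw pitch in Hz to the nearest MIDI note number,
    clipped to the range 21 (A0) to 108 (C8)."""
    d = 69 + 12 * math.log2(f_hz / 440.0)
    return min(108, max(21, round(d)))
```

For example, A4 (440 Hz) maps to MIDI note 69, and one octave up (880 Hz) maps to 81.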
Syllable durations were mapped onto musical note durations, with the simplifying assumption that a whole note corresponds to 1 s. The smallest note duration values were mapped to a 16th note, which thus corresponded to a minimal syllable duration of 62.5 ms. For illustration purposes only, the MIDI pitch values were transposed two octaves up (see Fig 1B ).
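A minimal sketch of this duration quantization (rounding to the nearest multiple of a 16th note, with one 16th as the floor; the study's exact rounding rule may differ in detail):

```python
def duration_to_sixteenths(dur_s):
    """Quantize a syllable duration in seconds to a musical length in
    16th notes (whole note = 1 s, so one 16th = 0.0625 s = 62.5 ms);
    durations below one 16th are floored to a single 16th."""
    return max(1, round(dur_s / 0.0625))
```

A 250 ms syllable thus becomes four 16ths, i.e., a quarter note.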
We performed autocorrelation analyses on the time series of syllable/note pitches and syllable/note durations. The discrete autocorrelation R at lag L for the signal y(n) of length N was calculated as

R(L) = [ Σ_{n=1}^{N−L} y(n) · y(n+L) ] / [ Σ_{n=1}^{N} y(n)² ]
Autocorrelations were determined for lags L = 0 up to 90% of the length of the respective time series.
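A direct implementation of this discrete autocorrelation might look as follows (a sketch; we mean-center the series before correlating, a common choice that the text does not spell out):

```python
import numpy as np

def autocorr(y, max_lag_fraction=0.9):
    """Normalized discrete autocorrelation R(L) for lags
    L = 0 .. 90% of the series length."""
    y = np.asarray(y, dtype=float)
    y = y - y.mean()                 # mean-centering (our assumption)
    n = len(y)
    denom = np.sum(y * y)
    max_lag = int(max_lag_fraction * n)
    return np.array([np.sum(y[:n - L] * y[L:]) / denom
                     for L in range(max_lag + 1)])
```

With this normalization, R(0) = 1 by construction, and a perfectly periodic series yields values near 1 at multiples of its period.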
The significance of each autocorrelation value was estimated using a permutation analysis. For this purpose, autocorrelations were computed for 10,000 randomly shuffled syllable sequences. For each time lag, the absolute value of the autocorrelation was compared to the autocorrelation value for the original syllable sequence of each poem. If the autocorrelation value for the shuffled sequence was smaller than the autocorrelation value for the original sequence in more than 95% of all 10,000 permutations (α < 0.05), the hypothesis that the autocorrelation value of the original sequence equaled the autocorrelation value of the random sequence at the corresponding lag was rejected, and the autocorrelation value in question was then considered “significant”. Only significant autocorrelations were used for subsequent analyses.
For the main study, we determined the average distances (in number of syllables) between syllables of successive stanzas, rhyming lines only, and all verse lines, as we hypothesized that pitch and duration contours yield recurrent patterns across the main compositional building blocks of the poems (most prominently the stanza). For the modified poem versions from which rhymes were removed, distances (in number of syllables) between rhyming lines were determined based on where the rhyming word would have been found, had it not been replaced by a non-rhyming counterpart. Subsequently, we determined the mean autocorrelation value of each poem rendition for these three textual units (all individual lines, rhyming lines only, stanzas).
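As a hypothetical illustration, the mean autocorrelation for a textual unit is simply the mean of R at the lags corresponding to that unit, where the lag values follow from the poem's syllable counts (the poem structure below is invented for the example):

```python
import numpy as np

def mean_autocorr_at_unit(autocorr_values, unit_lags):
    """Mean autocorrelation across the lags of one textual unit
    (all lines, rhyming lines only, or stanzas); autocorr_values[L]
    holds the autocorrelation at lag L."""
    return float(np.mean([autocorr_values[L] for L in unit_lags]))

# hypothetical quatrain poem with 8 syllables per line
line_lags = [8, 16, 24]      # distances between successive lines
stanza_lags = [32]           # 4 lines x 8 syllables per stanza
```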
Ratings of subjectively perceived melodiousness (given on a 7-point scale) were collected for all poem versions (A-E). Because our hypothesis considers the original version A as the most melodious one and we were interested in the hypothetical decrease of perceived melodiousness relative to this version, we consistently used version A as anchor version against which we separately compared all other versions.
Procedure and experimental setup were similar to the study reported in [ 8 ]. Overall, 320 students (224 women and 96 men) participated (80 for each of the individual comparisons of version A with versions B, C, D, and E), with a mean age of 23.6 years (SD = 4.3, min = 18, max = 42). Each participant listened to two versions (the original and one of the modified versions) of four poems recited by the professional speaker and was subsequently asked to rate these two versions on several 7-point scales capturing emotional responses (positive and negative affect, sadness, joy, and being moved) and dimensions of aesthetic evaluation (melodiousness, beauty, and liking). For the present study, we exclusively focus on the melodiousness ratings.
The mean melodiousness ratings obtained for the original version (A) across the four different data sets did not significantly differ from one another ( F (3,117) = 2.05, p = 0.110; see Fig 2 ). We therefore collapsed the ratings for version A across the four data sets.
Notably, melodiousness ratings for the original poems (version A) did not differ between experiments.
Correlation analyses.
Correlation analyses were all based on Spearman’s rank correlation coefficients (carried out in the statistical software package R, Version 3.4, The R Foundation, Vienna, 2017).
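The analyses were carried out in R; for illustration, an equivalent dependency-free rank correlation in Python (no-ties case, with invented example data) can be sketched as:

```python
def spearman_rho(x, y):
    """Spearman's rank correlation for series without ties, via the
    classic formula rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    n = len(x)
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# hypothetical per-poem autocorrelation scores vs. mean ratings
rho = spearman_rho([0.10, 0.25, 0.18, 0.40, 0.33],
                   [3.1, 4.2, 3.8, 5.0, 4.6])   # perfectly monotone -> 1.0
```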
We report all results from mixed-effects ANOVAs with F-values estimated by the lmerTest package ([ 35 ]). Post hoc analyses were calculated with the multcomp package in R ([ 36 ]) and consisted of Bonferroni-adjusted t-tests with z-transformed t-values.
For the dependent variable mean autocorrelation (for the three textual units of line, rhyme, and stanza), we calculated separate models for (a) the 200 renditions by the professional speaker and the two synthetic voices and (b) the 16 renditions by the 10 nonprofessional speakers. All models included poem as a random variable and the fixed effects poem version (A to E for the professional speaker and computer voice models, and A vs. E for the nonprofessional speaker model), acoustic measure (pitch or duration), and speaker (the professional speaker vs. the two synthetic voices for model (a), and the 10 nonprofessional speakers for model (b)), as well as all possible interactions. Finally, the models also included the fixed effect textual unit (all lines, rhyming lines, stanzas).
In order to further examine the relationship between autocorrelations and melodiousness ratings, we calculated additional models for all poem versions as recited by the professional speaker, with mean ratings as the dependent variable. The model included the covariate mean autocorrelation (together with the fixed effects acoustic measure and textual unit ).
Finally, we were interested in whether autocorrelations would vary depending on whether the respective poems had been set to music or not. For this reason, we calculated a model for the original poems with the dependent variable mean autocorrelation (across stanzas) and the fixed effect set to music (1 = yes, 0 = no). We additionally ran mixed-effects logistic regressions ([ 37 ]) with set to music as the dependent variable and autocorrelation (across stanzas) as the fixed effect.
Focusing on the original poems only, we first correlated the mean melodiousness ratings obtained from the four different data sets involving four different groups of participants (see “Participants”) with the mean autocorrelation scores of pitch and duration across stanzas, as extracted from the renditions of these poems by the professional speaker. There was a significant correlation for pitch-based autocorrelations (rho = 0.31, t = 2.00, p < 0.05) but not for duration-based autocorrelations (rho = 0.19, t = 1.32, p = 0.20; see Fig 3 ). Thus, our statistical measure of the melodiousness of speech captures objective differences in the acoustic renditions of different poems that are predictive of the subjective impressions of melodiousness when listening to these poems.
Pitch-based autocorrelations were significantly correlated with melodiousness ratings.
We further analyzed whether pitch- and duration-based autocorrelation values varied depending on the respective meter of the original poems in our corpus. Performing a two-sample t-test, we first compared the mean ratings obtained for the iambic poems (N = 27) with those obtained for the trochaic ones (N = 13). The test revealed no significant difference in melodiousness ratings for the two groups of poems ( t = 0.61, p = 0.54). Next, we examined whether there was an interaction between meter in general (be it iambic or trochaic) and acoustic property (pitch, duration) with respect to the mean autocorrelations across stanzas. The corresponding model showed neither a significant interaction ( F (1,38) = 0.74, p = 0.39) nor an effect of meter ( F (1,38) = 0.01, p = 0.92). That is, neither melodiousness ratings nor autocorrelation scores depended on the meter of the original poems.
Professional speaker and synthetic voices.
Mean autocorrelations decreased as a function of poem version ( F (4,3471) = 71.65, p < 0.001): values were highest for the original versions (A) and lowest for the most modified versions (E). This effect interacted with textual unit ( F (8,3471) = 7.98, p < 0.001): modifications most strongly affected autocorrelations computed across stanzas ( Fig 4B ). The analysis also yielded a main effect of speaker ( F (2,3471) = 25.97, p < 0.001), with higher autocorrelations for the professional speaker than for either of the synthetic voices. The main effect of textual unit ( F (2,3471) = 55.12, p < 0.001) reflected the following scaling of autocorrelations: all lines<rhyming lines only<stanzas. This effect crucially depended on the acoustic measure ( textual unit x acoustic measure : F (2,3471) = 16.78, p < 0.001) and was further influenced by speaker ( speaker x acoustic measure x textual unit : F (4,3471) = 14.12, p < 0.001). Notably, the scaling all lines<rhyming lines only<stanzas held particularly for pitch and for the professional speaker ( Fig 4A ). No other main effects or interactions depended on speaker (all F s < 2, p > 0.19).
A. The scaling of autocorrelations with textual unit (all lines<rhyming lines only<stanzas) crucially depended on acoustic dimension and on speaker (PS: professional speaker, SV: synthetic voice). B. Illustration of the poem version effect for the professional speaker. The strongest effect is seen for pitch-based autocorrelations across stanzas. Error bars indicate standard errors of the mean.
Poem modification led to decreased autocorrelation values ( F (1,833) = 191.40, p < 0.001), and autocorrelation values scaled with textual unit as in the previous analysis (i.e., all lines<rhyming lines only<stanzas; main effect textual unit : F (1,833) = 49.67, p < 0.001). The significant textual unit x acoustic measure interaction ( F (2,833) = 14.87, p < 0.001) revealed that this scaling order only held for pitch. Post hoc analyses showed that, for pitch, the stanzas showed larger autocorrelation values than the rhyming lines ( z = 3.96, p < 0.01). Inversely, for duration, the stanzas showed smaller autocorrelation values than the rhyming lines ( z = -3.46, p < 0.01). This effect further depended on poem version and was driven by the original poems (significant interaction textual unit x acoustic measure x poem version : F (2,833) = 7.16, p < 0.001). The stanza>rhyming lines relation held for pitch ( z = 4.53, p < 0.001), and the rhyming lines>stanzas relation held for duration ( z = -3.60, p < 0.001), but no differences were found for the modified poems (pitch: z = 0.75, p = 0.91; duration: z = −1.12, p = 0.53; Fig 5 ). The main effect of speaker ( F (9,833) = 2.93, p < 0.05) revealed speaker-dependent differences in the autocorrelations. Importantly, the effect of speaker did not show significant interactions with any of the other effects (all F s < 2, p > 0.12).
Overall, autocorrelations are higher for original than for modified poems, but differ depending on acoustic dimension (pitch or duration) and textual unit (all lines, rhyming lines only, stanzas). Error bars indicate standard errors of the mean.
Correlating the melodiousness ratings obtained for all 200 poem versions (40 poems in five variants each) with the autocorrelation scores of each of these poem versions revealed that melodiousness ratings strongly depended on the poems’ modification ( F (4,156) = 42.41, p < 0.001), with decreasing melodiousness ratings for increasing levels of poem modification (see Fig 2 ). The linear decrease of melodiousness ratings with increasing modification [coding version A as 0, versions B and C as 1, and versions D and E as 2 and 3, respectively] is substantiated by a significant Spearman correlation (rho = −0.59, p < 0.001; based on the mean melodiousness ratings per poem version).
In the model comprising the mean autocorrelation values of all poem versions as fixed effect, there was a significant interaction of textual unit and mean autocorrelation ( F (2,1104) = 2.32, p < 0.05) that depended on poem modification, as seen in the three-way interaction of textual unit x poem modification x mean autocorrelation ( F (8,1104) = 2.23, p < 0.05). The decomposition of these interactions revealed that mean autocorrelations at line-lags never correlated with melodiousness ratings (i.e., independent of modification; rho = 0.06, t = 1.22, p = 0.23), whereas mean autocorrelations at rhyme-lags (rho = 0.11, t = 2.04, p < 0.05) and even more so at stanza-lags (rho = 0.21, t = 4.34, p < 0.01) correlated positively with melodiousness ratings, with the original poem version showing by far the strongest correlation with the autocorrelation measure. Thus, we observe an overall positive correlation of mean autocorrelations and melodiousness ratings relatively independent of modification. This finding indicates that our statistical measure of melodiousness captures statistical differences of the phonetic signal that correlate with perceptual differences not just for the prototypical rhymed and metered poems but likewise for their far less prototypical versions. Although not substantiated by a significant interaction, we looked at correlations between ratings and autocorrelations at stanza-lags separately for pitch- and duration-based autocorrelations ( Fig 6 ). These correlations proved to be significant for both pitch (rho = 0.10, p < 0.05) and duration (rho = 0.14, p < 0.01).
These correlations involve all poem versions.
Whereas the removal of ongoing meter required systematic changes of the wording or at least the word order throughout all lines of the poems, the other three modifications affected only individual words and were altogether very subtle. Given that our data also show substantial individual variance regarding the ratings for all versions, it is fairly remarkable that we did find a significant correlation between autocorrelation scores and melodiousness ratings for all poem versions. Anticipating that this correlation should be far more pronounced when looking at the end points of the experimental modifications only, we computed an additional correlation analysis for versions A and E only. Results strongly confirmed this expectation: The correlations for both pitch (rho = 0.22, p < 0.05) and duration (rho = 0.36, p < 0.05) were significant when looking at the pooled data from versions A and E. By contrast, when looking at the pooled data from versions B, C, and D, correlations did not reach significance for either pitch (rho = 0.05, p = 0.62) or duration (rho = −0.03, p = 0.74).
Whether or not a poem has been set to music ( musical setting 1 or 0) correlates with mean autocorrelations ( F (1,399) = 18.68, p < 0.001). This effect differs depending on the textual unit (all individual lines, rhyming lines only, or stanzas; interaction musical setting x textual unit : F (2,427) = 2.90, p = 0.05). Poems set to music particularly show higher autocorrelations across stanzas ( z = 4.50, p < 0.001, Fig 7 ; all other comparisons z < 2, p > 0.19). A mixed-effects logistic model further confirmed that musical settings are predicted by overall autocorrelation across stanzas ( z = 3.40, p < 0.001). Again, we found a stronger predictive effect for pitch ( z = 2.84, p < 0.01) than for duration ( z = 2.27, p < 0.05). Notably, this finding was obtained based solely on the original poems, and hence independent of any experimental modification.
Autocorrelation values are higher for poems that have been set to music than for poems that have not been set to music. This particularly holds for autocorrelations across stanzas. Error bars indicate the standard error of the mean.
The central finding of our study is that pitch contours of original poems show a highly recurrent and largely speaker-independent structure across stanzas . These recurrent pitch contours are an important higher-order parallelistic feature that had previously escaped attention both in linguistics and in literary scholarship. Crucially, the quantitative measure of pitch recurrence across stanzas correlated significantly with listeners’ melodiousness ratings, lending strong support to our hypothesis that relevant and distinctive dimensions of melodic contour in spoken poetry can indeed be approximately captured by a fairly simple and abstract autocorrelation measure. Moreover, the fact that mean subjective melodiousness ratings for the 40 original poems nearly converged for four independent groups of participants is already in itself a remarkable finding that strongly hints at some objective correlate of these ratings.
As anticipated, pitch and duration autocorrelations decreased as other parallelistic properties of the poems were experimentally removed. Importantly, across all poem versions, higher degrees of pitch recurrences predict higher subjective melodiousness ratings, and they do so already for the original (unmodified) poems only, independent of any experimental modification we performed. Duration recurrences also predict melodiousness ratings when analyzed across all poem versions, but not when only the original poems are considered. The absence of an effect of duration autocorrelation for the original poems may suggest that the construct of melody in spoken poetry is mainly based on discrete pitches, in close resemblance to music.
Overall, this pattern of findings suggests that our measure of melodic recurrence––autocorrelations of pitch and duration––is not only suited for analyzing the classical metered and rhymed poems of the type that was preeminent in 19th-century Europe. From a technical point of view, the measure can easily be applied to all types of speech. It is widely acknowledged that every type of speech has an inherent rhythm and beat (e.g. [ 20 , 21 ]), i.e., a (quasi-)regular distribution of phonetic (speech sound and prosodic) features in time. Our measure is well suited to capture these distributions in future research. Given that we found an overall positive correlation of melodiousness ratings with autocorrelations across different levels of modification (i.e., relatively independent of meter and rhyme), we expect that the perceptual consequences of melodic recurrence also hold beyond poetry, albeit to a lesser degree.
Furthermore, pitch recurrences were also predictive of whether or not specific poems were set to music. Our study is thus the first to operationalize the phantom of a “song”-like poetic speech melody of spoken poems by recourse to a measure that can quantify it. It is also the first to empirically illustrate the powerful effects poetic speech melody can exert on the aesthetic evaluation of poetry by nonprofessional listeners as well as on decisions of composers to set particular poems to music.
At first sight, consistencies in syllable pitch (and duration) structure within larger constituents of spoken language may not seem surprising, as previous research on intonation contours and linguistic rhythm (e.g. [ 38 , 39 ]) has revealed that phrase endings are prosodically marked, in that, for instance, pitches yield a downward movement or a falling contour, and that this prosodic marking may co-occur with phrase-final lengthening ([ 40 , 41 ]).
Furthermore, prosody is differently treated in trochaic and iambic meter. The inverse strong-weak and weak-strong patterning of syllables characterizing trochees and iambs is accompanied by prosodic cues that are analogously interpreted across different languages and even mark an important distinction for a nonhuman species ([ 42 ]). Stated in the so-called iambic/trochaic law ([ 43 , 44 ]), the strong-weak patterning in trochees is brought about by higher intensity and pitch in strong and lower intensity and pitch in weak syllables. On the other hand, the weak-strong patterning in iambs corresponds to a difference in syllable duration, with a relatively short syllable in weak position and a relatively long syllable in strong position.
Thus, to a certain degree, metrical structure alone already supports a regular patterning of pitches and durations. This may certainly be one factor that explains why original poems show high autocorrelations based on these measures. However, this explains neither the correlations of pitch (and partly also of duration) autocorrelations with melodiousness ratings, nor the relationship between pitch structure and the likelihood of a poem being set to music. Since a bit more than two thirds of our original poems feature iambic and the remaining ones have trochaic meter, the aforementioned duration emphasis of iambs should have prevailed over the pitch emphasis of trochees and should in sum total have resulted in a stronger duration than pitch effect. However, we here report precisely the opposite, namely, a stronger predictive power of the pitch autocorrelations. Moreover, we did not find any significant correlations between autocorrelations values and meter (be it iambic or trochaic).
We therefore suggest that the construct of melody in spoken poetry is neither a mere phantom implicitly endorsed by the longstanding tradition to call poems “songs” nor a mere side effect of metrical structure. Rather, it is a measurable, quantifiable entity of its own that explains effects that are not otherwise predictable by existing paradigms and methods of analyzing linguistic prosody.
To be sure, we are fully aware that our analyses cannot provide a full theory of melody or melodic features (for research in this direction, see e.g. [ 45 , 46 ]). Clearly, the autocorrelations scores are exclusively linked to degrees of repetition and not to specific harmonic qualities of the tone sequences. However, for all its abstractness, the predictive power of this statistical measure both for subjectively perceived melodiousness and for decisions of composers to set specific poems to music strongly suggests that the measure does have a bearing on genuine aesthetic perception.
We certainly acknowledge that melody in music and melody in speech differ in certain aspects. For instance, the pitch range in speech is far narrower than in musical melody ([ 47 , 48 ]). Nevertheless, the proposed measure of pitch- and duration-based autocorrelations appears to be a fruitful measure that at least approximately captures melodic properties of both music and language.
As predicted by our theoretical considerations, similar pitch sequences were most prominently found across stanzas. After all, it is primarily the stanza pattern that is consistently repeated in poems, whereas individual lines vary frequently in the number of syllables. Since recurrent meter and rhyme patterns have been shown to enhance prosodic fluency ([ 14 ]), it is likely that recurrent melodic contours also contribute to such parallelism-driven fluency effects which, in turn, enhance aesthetic appreciation ([ 49 ]). In fact, we propose that our results can largely be explained by reference to the ease-of-processing hypothesis of aesthetic liking.
Crucially, the stanza effect in our study turned out to be consistently independent of the speaker. It was most pronounced when poems were recited by humans (professional or nonprofessional). Surprisingly––and highlighting the independence of poetic speech melody from the actual rendition by any speaker and hence its strong reliance on an inherent textual property––, even the poem versions that were recited by synthetic voices confirmed the melody effect at the stanza level. The ratings of perceived melodiousness likewise correlated most strongly with stanza-based pitch and duration autocorrelations.
The strong effect of the poem modifications on melodiousness as spontaneously rated by non-expert listeners suggests that poetry recipients are highly sensitive to perceiving multiple co-occurring and strongly interacting parallelistic patterns at a time and that they are also capable of rapidly integrating these patterns into a complex percept. Such automatic detection and integration of multiple optional patterns of poetic parallelism can be conceived as an analogue to the low-level perception of multiple symmetries and other autocorrelations in complex visual aesthetics ([ 50 ]).
Moreover, parallelistic patterning has been shown to enhance the memorability of poetic language ([ 51 ]). As genuine musical melodies clearly support the memorability of the lyrics underlying them, it is worth investigating the extent to which poetic speech melody also increases verbatim recall and potentially also the privileged storage of poems or other texts in memory ([ 51 ]).
Finally, our analyses of poetic speech melody reveal a hitherto unknown reason for why some poems have been set to music while others have not: the higher the degree of pitch recurrences of corresponding syllables across the stanzas of a given poem, the higher the likelihood that it has been set to music. Thus, our findings provide an empirical basis for the view that melodic aspects of poetry are inherent properties of the verbal material itself ([ 52 ]), and that an intuitive awareness of these properties seems to guide composers in finding the “right” musical melody ([ 53 ]).
Regarding the relationship of linguistic prosody to music, our methods advance attempts by Halle and Lerdahl [ 54 ] to introduce generative methods for “text setting” (i.e., setting texts to music) by capitalizing on the fact that linguistic and musical prosody are similar ([ 9 , 55 , 56 ]). Our findings––particularly the predictive power of stanza-related autocorrelations for musical settings––suggest that composers are not only aware of the relationship between linguistic prosody and music, but also have the skills to implement the transformation that Halle and Lerdahl [ 54 ] described.
In his novel War and Peace ([ 57 ]), Tolstoy evocatively refers to this notion of a genuine inherent melodiousness of poetic language: “The uncle sang […] in the full and naive conviction that the whole meaning of a song lies in the words only and that the melody comes of itself, and that […it] exists only as a bond between the words.” In the end, this is exactly our finding: there is, indeed, a melody emerging from the words only––from the process of selecting and combining them––, and this auto-emergent melody bestows an additional musical coherence on the entire word sequence.
We certainly acknowledge that the relation between language and music is unlikely always to be as straightforward as our analyses of a specific type of poetry suggest. For instance, prose, too, can be set to music (operatic recitative), and some texts set to music may not show pronounced musical contours (monotonic chanting in certain religious traditions). Furthermore, many poems do not feature any sustained rhyme, metrical pattern, and/or stanza structure (e.g. [ 24 ]). However, even in these cases, our measure of pitch and duration recurrence may well be able to shed light on the relations between intrinsic language-dependent intonation and musical melody as well as between intrinsic linguistic rhythm and musical beat.
Summing up, our data strongly support the notion that the spontaneous recognition of recurrent melodic patterns extends well beyond music proper and the expectations of tonal harmony with which they are associated in music. Our study shows that spoken texts show, in their compositional entirety, a genuine and consistent patterning of recurrent pitch and duration contours, that a melodiousness of this type can be captured statistically using the same metrics across acoustic domains, and that recipients readily project their intuitive percept of inherent language-based melody onto melodiousness ratings in a way that is highly consistent with our statistical measure of melodiousness. Thus, our study turns the phantom of a poetic speech melody in spoken texts into a non-metaphorical, unquestionably real and measurable entity.
Finally, both classical ([ 58 ]) and modern poetics ([ 13 ]) suggest that poetic language is only gradually and not categorically different from ordinary language. Seen in this light, the perceptual sensitivity to poetic speech melody that we report in this study is unlikely to be exclusively acquired through repeated exposure to poetry. Therefore, we expect the construct we introduce in this study to be helpful in making progress on other issues that have thus far remained fairly elusive, specifically melody-like structures in rhetorical speeches, spoken religious liturgy and other types of ritual language that are rich in parallelistic structures. The measure may also be helpful for making progress on the difficult issue of “prose rhythm” ([ 59 ]), provided that in prose, too, higher order recursive structures can be identified that are analogues, if only less rigid ones, to the lines and stanzas of poetry, and hence can serve as reference units for more fine-grained autocorrelation analyses.
We wish to express our thanks to Alexander Lindau for his help with the recordings of the nonprofessional speakers and Andreas Pysiewicz for his support in recording the entire poem corpus. We also thank Anna Roth, Miriam Riedinger, and Julia Vasilieva for their support during data preprocessing and text annotation, and Julia Merrill for her support on questions regarding the music-language interface. Finally, we are grateful to Nicola Leibinger-Kammüller for directing our attention to Tolstoy’s remark about an auto-emergent melody of verbal sequences and to all colleagues who shared inspirations and comments regarding our methods: Eugen Wassiliwizky, Michaela Kaufmann, Pauline Larrouy-Maestri, Elke Lange, Alessandro Tavano, and David Poeppel.
The research presented here was funded by the Max Planck Society.
Research project
All languages use melody in speech, primarily via rises and falls of the pitch of the voice. Such pitch variation is pervasive, offering a wide spectrum of nuance to sentences – an additional layer of meaning. For example, saying “yes” with a rising pitch implies a question (rather than an affirmation). Melody is essential for communication in social interaction.
Languages employ melody in diverse ways to convey different layers of meaning in speech. In languages like Standard Dutch, a wide range of sentence-level meanings is conveyed by pitch variation, such as asking questions, highlighting important information, signaling intention, and conveying attitude or emotion.
This, however, is not the only level at which melody is used to convey meaning. The majority of the world’s languages (60-70%) are tone languages, which use pitch variation to distinguish individual word meanings (e.g. shi means ‘yes’ with a falling tone but ‘stone’ with a rising tone). Tone language speakers are nevertheless able to use pitch variation to express sentence-level meanings, too.
Worldwide, tone languages vary widely. Two well-recognized differences are particularly relevant. One concerns the form of word-level melody; some languages mainly use pitch levels (high, mid, low) to distinguish words, while others employ pitch contours as well (e.g., rising vs. falling). Another concerns the function of word-level melody; in some languages, tones not only distinguish words, but also identify grammatical function (e.g. different tenses of a verb or different cases of a noun). Thus, in a tone language, multiple layers of information, both at the word and sentence level, are conveyed in the same melodic signal in speech.
While we recognize that these different layers of meaning as well as the behavioral and neuro-cognitive aspects of producing and interpreting them are tightly intertwined, how they are connected and how those connections may be manifested differently in the world's languages remains poorly understood. This project therefore proposes to address the following significant yet unresolved questions.
To address these questions, this interdisciplinary proposal includes well-controlled systematic comparisons of the way pitch variations relate to word- and sentence-level meanings in typologically different tonal systems. The aim of this research program is to understand the general and language-specific mechanisms that guide the production, comprehension, and neural processing of pitch variation in tone languages.
Intonation, the rise and fall of pitch in speech melody, is crucial in American English. It’s intertwined with rhythm, defining the language’s character.
Video Text:
Today I’m going to talk about intonation. I’ve touched on this subject in various other videos without ever explicitly defining it. And today, that’s what we’re going to do. But I’m also going to reference these other videos, and I really encourage you to go watch those as well.
If you’ve seen my videos on word stress, then you’ve already heard me talk a little about pitch. Stressed syllables will be higher in pitch, and often a little longer and a little louder than unstressed syllables. And there are certain words that will have a stress within a sentence, content words. And certain words that will generally be unstressed, and those are function words. For information on that, I invite you to watch those videos.
Intonation is the idea that these different pitches across a phrase form a pattern, and that those patterns characterize speech. In American English, statements tend to start higher in pitch and end lower in pitch. You know this if you’ve seen my video questions vs. statements. In that video, we learned that statements, me, go down in pitch. And questions, me?, go up in pitch at the end. So these pitch patterns across a phrase that characterize a language are little melodies; consider, for example, the melodies of Chinese. If you haven’t already seen the blog I did on the podcast Musical Language, I encourage you to take a look at that. It talks about the melody of speech.
Understanding and using correct intonation is a very important part of sounding natural. Even if you’re making the correct sounds of American English, if you’re speaking in the speech patterns, or intonation, of another language, it will still sound very foreign.
Intonation can also convey meaning or an opinion, an attitude. Let’s take for example the statement ‘I’m dropping out of school’ and the response ‘Are you serious?’ Are you serious? A question going up in pitch conveys, perhaps, an open attitude, concern for the person. Are you serious? But, are you serious? Down in pitch, more what you would expect of a statement, are you serious? The same words, but when it is intoned this way, it is conveying a judgment, a negative one: I don’t agree that you should be dropping out of school. I’m dropping out of school. Are you serious? I’m dropping out of school. Are you serious? With the same words, very different meanings can be conveyed. So intonation is the stress pattern, the pitch pattern, of speech. The melody of speech. If you’ve read my bio on my website, you know melody is something I’m especially keen on, as I studied music through the master’s level. Yes, that was yours truly, thinking a lot about melody. Now, you know that in American English, statements will tend to go down in pitch.
Let’s look at some examples. Here we see two short sentences. Today it’s sunny. I wish I’d been there. And you can see for both of them, that the pitch goes down throughout the sentence. Here we have two longer sentences, and though there is some up and down throughout the sentences, for both sentences, the lowest point is at the end. I’m going to France next month to visit a friend who’s studying there. It’s finally starting to feel like spring in New York.
The software I used to look at the pitch of those sentences is called Praat, and there’s a link in the footer of my website, so it’s at the very bottom of every page. I hope you’re getting a feel for how important intonation is to sounding natural and native in American English. I hope you’ll listen for this as you listen to native speakers, and that, if you haven’t already done so, you’ll go to my website so you can hear them several times to get the melody. That’s it, and thanks so much for using Rachel’s English.
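For the curious: Praat’s default pitch analysis is autocorrelation-based. As a toy illustration of that core idea only (not Praat’s actual windowed, interpolated algorithm), one can estimate the fundamental frequency of a signal by finding the lag at which it best correlates with itself:

```python
import math

def estimate_f0(samples, sr, fmin=75.0, fmax=500.0):
    """Crude autocorrelation pitch estimator: pick the lag (within a
    plausible pitch range) whose autocorrelation is largest. This is
    only the basic idea behind methods like Praat's default analysis."""
    lag_min = int(sr / fmax)
    lag_max = int(sr / fmin)
    best_lag, best_r = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        r = sum(samples[i] * samples[i + lag]
                for i in range(len(samples) - lag))
        if r > best_r:
            best_lag, best_r = lag, r
    return sr / best_lag

# Synthetic 200 Hz tone sampled at 8 kHz (0.1 s)
sr = 8000
tone = [math.sin(2 * math.pi * 200 * t / sr) for t in range(sr // 10)]
f0 = estimate_f0(tone, sr)  # close to 200 Hz
```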
The Melody of Speech: What the Melodic Perception of Speech Reveals about Language Performance and Musical Abilities
1.1. Assessing Musical Abilities
1.2. Assessing Pronunciation Skills and the Melodic Perception of Speech
2. Materials and Methods
2.1. Participants
2.2. Educational Status
2.3. Musical Measurements
2.3.1. Musical Background
2.3.2. Musical Aptitude: Advanced Measures of Music Audiation
2.3.3. Singing Ability
2.4. Language Measurements
2.4.1. Language Background
2.4.2. Language Performance
2.4.3. Melodic Language Ratings
2.5. Short-Term Memory Measurement
2.6. Testing Procedure
2.7. Statistical Analysis and Procedure
3.1. Descriptives of the Measurements
3.2. Statistical Results 1: Relationships among the Selected Variables (Correlations and Regression Models)
3.3. Statistical Results 2: Group Differences for High vs. Low Melodic Language Perceivers (t-Tests for Independent Samples)
3.4. Statistical Results 3: Interactions between the Musical Status and the High and Low Melodic Language Perceivers on the Language Performance Tasks (Two-Way ANOVA)
4. Discussion
4.1. Correlational Analysis and Regression: Pronunciation
4.2. Melodic Perception of Languages and Performance
4.3. Musical Abilities, Musical Status, and the Melodic Perception of Speech
5. Conclusions
Supplementary Materials
Author Contributions
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviation | Meaning |
---|---|
AMMA | Advanced Measures of Music Audiation |
AMMA rhythm | Rhythmic AMMA score |
AMMA tonal | Tonal AMMA score |
ES | Educational status |
High melodic LP | High melodic language perceivers |
Low melodic LP | Low melodic language perceivers |
Melodic P | Mean of the composite score of all five melodic ratings |
No of FL | Number of foreign languages spoken |
P | Perception |
PR | Pronunciation score |
PR total | Mean composite score of all five language performance measurements |
STM | Short-term memory |
Variables | Mean (M) | Standard Deviation (SD) |
---|---|---|
Melodic ratings for Chinese | 5.85 | 2.39 |
Melodic ratings for Japanese | 5.83 | 2.45 |
Melodic ratings for Russian | 5.73 | 2.24 |
Melodic ratings for Tagalog | 6.91 | 1.92 |
Melodic ratings for Thai | 4.70 | 2.11 |
Melodic P (composite of the five ratings) | 5.80 | 1.37 |
Chinese pronunciation (PR) | 2.37 | 0.83 |
Japanese pronunciation (PR) | 4.82 | 1.38 |
Russian pronunciation (PR) | 3.60 | 1.35 |
Tagalog pronunciation (PR) | 2.36 | 1.18 |
Thai pronunciation (PR) | 1.64 | 0.73 |
PR total (composite of the five scores) | 2.96 | 0.89 |
AMMA rhythm | 28.70 | 4.26 |
AMMA tonal | 25.86 | 5.10 |
Melodic singing ability | 5.98 | 1.50 |
Rhythmic singing ability | 6.77 | 1.18 |
Short-term memory (STM) | 15.23 | 3.84 |
Variable | Melodic P | Melodic Singing Ability | Rhythmic Singing Ability | AMMA Tonal | AMMA Rhythm | STM | ES | No. of FL |
---|---|---|---|---|---|---|---|---|
PR total | 0.466 ** | 0.512 ** | 0.501 ** | 0.401 ** | 0.324 ** | 0.503 ** | 0.231 * | 0.503 ** |
Melodic P | 0.168 | 0.181 | 0.203 | 0.225 * | 0.235 * | 0.309 ** | 0.304 ** | |
Melodic singing ability | 0.964 ** | 0.434 ** | 0.446 ** | 0.244 * | 0.283 ** | 0.370 ** | ||
Rhythmic singing ability | 0.419 ** | 0.417 ** | 0.259 * | 0.254 * | 0.370 ** | |||
AMMA tonal | 0.789 ** | 0.120 | 0.127 | 0.208 | ||||
AMMA rhythm | 0.194 | 0.048 | 0.227 * | |||||
STM | 0.079 | 0.201 | ||||||
ES | 0.367 ** |
Predictor | Partial Correlation (pr) | p-Value |
---|---|---|
Step 1: R = 0.52, F(1, 80) = 30.25, p < 0.001 | ||
No. of FL (foreign lang.) | 0.52 | <0.001 |
Step 2: R = 0.65, F(1, 79) = 19.73, p < 0.001 | ||
No. of FL (foreign lang.) | 0.49 | <0.001 |
STM | 0.45 | <0.001 |
Step 3: R = 0.71, F(1, 78) = 12.41, p < 0.001 | ||
No. of FL (foreign lang.) | 0.46 | <0.001 |
STM | 0.44 | <0.001 |
AMMA tonal | 0.37 | <0.001 |
Step 4: R = 0.74, F(1, 77) = 8.79, p = 0.004 | ||
No. of FL (foreign lang.) | 0.41 | <0.001 |
STM | 0.42 | <0.001 |
AMMA tonal | 0.34 | 0.002 |
Melodic P. total | 0.32 | 0.004 |
Step 5: R = 0.77, F(1, 76) = 6.9, p = 0.010 | ||
No. of FL (foreign lang.) | 0.33 | |
STM | 0.40 | <0.001 |
AMMA tonal | 0.24 | 0.031 |
Melodic P. total | 0.34 | 0.002 |
Melodic singing ability | 0.29 | 0.010 |
Dependent variable: pronunciation (PR) total
Variables | Low Melodic LP: Mean | Low Melodic LP: SE | High Melodic LP: Mean | High Melodic LP: SE | t | df | p | r |
---|---|---|---|---|---|---|---|---|
Chinese PR * | 2.11 | 0.12 | 2.62 | 0.12 | −3.02 | 84 | p < 0.003 | r = 0.31 |
Japanese PR * | 4.30 | 0.21 | 5.32 | 0.18 | −3.68 | 84 | p < 0.001 | r = 0.37 |
Russian PR | 3.20 | 0.19 | 3.97 | 0.21 | −2.75 | 84 | p < 0.007 | r = 0.29 |
Tagalog PR * | 1.98 | 0.15 | 2.72 | 0.19 | −3.02 | 84 | p < 0.003 | r = 0.31 |
Thai PR * | 1.33 | 0.09 | 1.93 | 0.11 | −4.21 | 84 | p < 0.001 | r = 0.42 |
PR total * | 2.58 | 0.12 | 3.33 | 0.13 | −4.24 | 84 | p < 0.001 | r = 0.42 |
Melodic singing ability | 5.72 | 0.24 | 6.23 | 0.21 | −1.60 | 84 | p = 0.11 | r = 0.17 |
Rhythmic singing ability | 6.57 | 0.18 | 6.96 | 0.18 | −1.53 | 84 | p = 0.13 | r = 0.16 |
AMMA tonal | 24.95 | 0.72 | 26.73 | 0.81 | −1.63 | 84 | p = 0.11 | r = 0.18 |
AMMA rhythm | 27.83 | 0.68 | 29.52 | 0.60 | −1.87 | 84 | p = 0.07 | r = 0.20 |
STM | 14.45 | 0.58 | 15.98 | 0.58 | −1.87 | 84 | p = 0.07 | r = 0.20 |
Christiner, M.; Gross, C.; Seither-Preisler, A.; Schneider, P. The Melody of Speech: What the Melodic Perception of Speech Reveals about Language Performance and Musical Abilities. Languages 2021 , 6 , 132. https://doi.org/10.3390/languages6030132
Supplementary material: ZIP document (233 KiB)
Leoš Janáček based vocal melodies in his operas on the concept of nápěvky mluvy (speech melodies)—patterns of speech intonation as they relate to psychological conditions—rather than on a strictly musical basis. He used such melodic motives, characterizing a specific person in a specific dramatic situation, in both vocal and orchestral parts, enabling him to integrate the two parts into a compact unit for the utmost dramatic effect.
This according to “ Význam nápěvků pro Janáčkovu operní tvorbu ” (The significance of speech melodies in Janáček’s operas) by Milena Černohorská, an essay included in Leoš Janáček a soudobá hudba (Leoš Janáček and contemporary music; Praha: Panton, 1963, pp. 77–80).
Janáček found the source of speech melodies in spoken phrases of people of various social and cultural backgrounds, recorded in real-life situations. During his ethnomusicological research in Moravia and Slovakia in the 1920s, Janáček not only recorded songs and music, but also wrote down the melodies of dialogue fragments and of singers’ comments on specific songs.
Recently discovered autographs of Janáček’s fieldwork notes in the collection of the Etnologický ústav AV ČR, pracoviště Brno with transcriptions of nápěvky mluvy were published in Janáčkovy záznamy hudebního a tanečního folkloru. I: Komentáře (Janáček’s records of traditional music and dances. I: Commentaries) by Jarmila Procházková (Brno: Etnologický ústav AV ČR, 2006).
Today is Janáček’s 160th birthday! Above, examples of nápěvky mluvy that he transcribed in Čičmany , Slovakia, on 20 August 1911; below, the finale of his Jenůfa , a work often cited for its use of the speech-melody concept.
It is generally accepted that speech and melody are distinctive perceptual categories, and that one is able to overcome perceptual ambiguity to categorize acoustic stimuli as either of the two. This article investigates the speech-melody experience of listening to Cantonese popular songs (henceforth Cantopop songs), a relatively uncharted territory in musicological studies. It proposes a speech-melody complex that embraces native Cantonese speakers’ experience of the potentialities of speech and melody before they come into being. Speech-melody complex, I argue, does not stably contain the categories of speech or melody in their full-blown, asserted form, but concerns the ongoingness of the process of categorial molding, which depends on how much contextual information the listeners value in shaping and parsing out the complex. It follows, then, that making a categorial assertion implies a breakthrough of the complex. I then complicate speech-melody complex with the concept of “anamorphosis” borrowed from the visual arts, a concept that calls into question the signification of the perceived object by perspectival distortion. When reconfigured in the sonic dimension, “anamorphic listening,” I suggest, is less about at which point one listens to some “distorted” sonic object but more about one’s processual experience of negotiating the hermeneutic values in their different listening-ases. The listener engages, then, in the process of molding and remolding, creating and negating, the two enigmatic categories, creating new sonic objects along the way. Through my analysis of two Cantopop songs and interviews with native Cantonese speakers, I suggest that Cantopop may invite an anamorphic listening, and that more broadly, it serves as an important, yet thus far under-explored, genre to theorize about the relationships between music and language.
A Musical Approach to Speech Melody
We present here a musical approach to speech melody, one that takes advantage of the intervallic precision made possible with musical notation. Current phonetic and phonological approaches to speech melody either assign localized pitch targets that impoverish the acoustic details of the pitch contours and/or merely highlight a few salient points of pitch change, ignoring all the rest of the syllables. We present here an alternative model using musical notation, which has the advantage of representing the pitch of all syllables in a sentence as well as permitting a specification of the intervallic excursions among syllables and the potential for group averaging of pitch use across speakers. We tested the validity of this approach by recording native speakers of Canadian English reading unfamiliar test items aloud, spanning from single words to full sentences containing multiple intonational phrases. The fundamental-frequency trajectories of the recorded items were converted from hertz into semitones, averaged across speakers, and transcribed into musical scores of relative pitch. Doing so allowed us to quantify local and global pitch-changes associated with declarative, imperative, and interrogative sentences, and to explore the melodic dynamics of these sentence types. Our basic observation is that speech is atonal. The use of a musical score ultimately has the potential to combine speech rhythm and melody into a unified representation of speech prosody, an important analytical feature that is not found in any current linguistic approach to prosody.
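The conversion from hertz to semitones mentioned in the abstract follows the standard formula st = 12 · log2(f / f_ref). A minimal sketch (the 100 Hz reference frequency here is an arbitrary assumption of ours; any reference, e.g. a speaker's mean F0, works, since only intervals matter):

```python
import math

def hz_to_semitones(f_hz, ref_hz=100.0):
    """Convert a fundamental frequency in Hz to semitones relative to
    a reference frequency: st = 12 * log2(f / ref)."""
    return 12.0 * math.log2(f_hz / ref_hz)

# A doubling of frequency is one octave, i.e. 12 semitones:
delta = hz_to_semitones(200.0) - hz_to_semitones(100.0)
```

Working in semitones rather than hertz is what makes pitch excursions comparable across speakers with different voice registers and allows averaging contours across speakers.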
It is common to refer to the pitch properties of speech as “speech melody” in the study of prosody ( Bolinger, 1989 ; Nooteboom, 1997 ; Ladd, 2008 ). However, is this simply a metaphorical allusion to musical melodies, or does speech actually have a similar system of pitch relations as music? If it does not, what is the nature of speech’s melodic system compared to that of music? A first step toward addressing such questions is to look at speech and music using the same analytical tools and to examine speech as a true melodic system comprised of pitches (tones) and intervals. This is the approach that we aim to implement and test in the present study. In fact, it was the approach that was adopted in the first theoretical treatise about English intonation, namely Joshua Steele’s An Essay Toward Establishing the Melody and Measure of Speech to be Expressed and Perpetuated by Peculiar Symbols , published in 1775. Steele laid out a detailed musical model of both the melody and rhythm of speech (we will only concern ourselves with the melodic concepts here). He represented syllabic pitch as a relative-pitch system using a musical staff and a series of “peculiar symbols” that would represent the relative pitch and relative duration of each spoken syllable of an utterance. The key innovation of Steele’s approach from our standpoint is that he attempted to represent the pitches of all of the syllables in the sentences that he analyzed. Another advantage of his approach is that his use of the musical score allowed for both the rhythm and melody of speech to be analyzed, both independently of one another and interactively.
This is in stark contrast to most contemporary approaches to speech melody in linguistics that highlight a subset of salient syllabic pitches and thereby ignore all the rest of the melodic signal in a sentence, assuming a process of interpolation between those salient pitches. Many such approaches are based on qualitative labeling of pitch transitions, rather than acoustic quantification of actual pitch changes occurring in an utterance. At present, no musical elements are incorporated into any of the dominant phonetic or phonological models of speech melody. These models include autosegmental metrical (AM) theory ( Bruce, 1977 ; Pierrehumbert, 1980 ; Beckman and Pierrehumbert, 1986 ; Gussenhoven, 2004 ; Ladd, 2008 ), the command-response (CR) model ( Fujisaki and Hirose, 1984 ; Fujisaki et al., 1998 ; Fujisaki and Gu, 2006 ), and the “parallel encoding and target approximation” model ( Xu, 2005 ; Prom-on et al., 2009 ). Perhaps the closest approximation to a musical representation is Mertens’ (2004) Prosogram software, which automatically transcribes speech melody and rhythm into a series of level and contoured tones (see also Mertens and d’Alessandro, 1995 ; Hermes, 2006 ; Patel, 2008 ). Prosogram displays pitch measurements for each syllable by means of a level, rising, or falling contour, where the length of each contour represents syllabic duration ( Mertens, 2004 ). However, this seems to be mainly a transcription tool, rather than a theoretical model for describing the melodic dynamics of speech.
Before comparing the three dominant models of speech melody with the musical approach that we are proposing (see next section), we would like to first define the important terms “prosody,” “speech melody,” and “intonation,” and discuss how they relate to one another, since these terms are often erroneously taken to be synonymous. “Prosody” is an umbrella term that refers to variations in all suprasegmental parameters of speech, including pitch, but also duration and intensity. On the other hand, “speech melody” and “intonation” refer strictly to the pitch changes associated with speech communication, where “intonation” is a more restrictive term than “speech melody”. “Speech melody” refers to the pitch trajectory associated with utterances of any length. This term does not entail a distinction as to whether pitch is generated lexically (tone) or post-lexically (intonation), or whether the trajectory (or a part thereof) serves a linguistic or paralinguistic function.
While “speech melody” refers to all pitch variations associated with speech communication, “intonation” refers specifically to the pitch contour of an utterance generated post-lexically and that is associated with the concept of an “intonational phrase” ( Ladd, 2008 ). Ladd (2008) defines intonation as a linguistic term that involves categorical discrete-to-gradient correlations between pattern and meaning. Intonation differs from pitch changes associated with “tones” or “accents”, which are determined lexically and which are associated with the syllable. By contrast, paralinguistic meanings (e.g., emotions and emphatic force) involve continuous-to-gradient correlations ( Ladd, 2008 ). For example, the angrier someone is, the wider is the pitch range and intensity range of their speech ( Fairbanks and Pronovost, 1939 ; Murray and Arnott, 1993 ).
In this section, we review three dominant models of speech melody: AM theory, the CR model, and the parallel encoding and target approximation (PENTA) model. Briefly, AM theory only highlights phonologically salient melodic excursions associated with key elements in intonational phrases, including pitch accents and boundary tones ( Pierrehumbert, 1980 ; Liberman and Pierrehumbert, 1984 ). On the other hand, CR imitates speech melody by mathematically generating pitch contours, and connecting pitch targets so as to create peaks and valleys along a gradually declining line ( Cohen et al., 1982 ; Fujisaki, 1983 ). Finally, PENTA assigns a pitch target to each and every syllable of an intonational phrase. Each target is mathematically derived from a number of factors, including lexical stress, narrow focus, modality, and position of the syllable within an intonational phrase. The final pitch contour is then generated as an approximation of the original series of pitch targets, in which distance between pitch targets is reduced due to contextual variations ( Xu, 2005 , 2011 ).
The ToBI (Tone and Break Index) system of prosodic notation builds on assumptions made by AM theory ( Pierrehumbert, 1980 ; Beckman and Ayers, 1997 ). Phonologically salient prosodic events are marked by pitch accents (represented in ToBI as H ∗ , where H means high) at the beginning and middle of an utterance; the end is marked by a boundary tone (L–L%, where L means low); and the melodic contour of the entire utterance is formed by interpolation between pitch accents and the boundary tone. Under this paradigm, pitch accents serve to mark local prosodic events, including topic word, narrow focus, and lexical stress. Utterance-final boundary tones serve to convey modality (i.e., question vs. statement; continuity vs. finality). Pitch accents and boundary tones are aligned with designated stressed syllables in the utterance and are marked with a high (H) or low (L) level tone. In addition, pitch accents and boundary tones can further combine with a series of H and L tones to convey different modalities, as well as other subtle nuances in information structure ( Hirschberg and Ward, 1995 ; Petrone and Niebuhr, 2014 ; German and D’Imperio, 2016 ; Féry, 2017 ). Consequently, the melodic contour of an utterance is defined by connecting pitch accents and boundary tones, whereas strings of syllables between pitch accents are unspecified with regard to tone and are treated as transitions. AM is considered to be a “compositional” method that looks at prosody as a generative and combinatorial system whose elements consist of the abovementioned tone types. This compositionality might suggest a mechanistic similarity to music, with its combinatorial system of scaled pitches. However, the analogy does not ultimately work, in large part because the tones of ToBI analyses are highly underspecified at the pitch level; the directionality of pitch movement is marked, but not the magnitude of the change.
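The interpolation assumption at the heart of AM theory can be sketched in a few lines of Python. All pitch values and accent positions below are hypothetical: only the pitch-accented syllables and the final boundary tone carry a specified pitch (in Hz); the intervening syllables are left unspecified and are filled in by straight-line interpolation, which is how the contour between tone targets is treated under this paradigm.

```python
# Hypothetical sparse AM-style specification of an 8-syllable utterance:
# pitch accents on syllables 1 and 4, a boundary tone on the last
# syllable; all other syllables are unspecified (None).
syllable_pitches = [220.0, None, None, 250.0, None, None, None, 180.0]

def interpolate_contour(sparse):
    """Linearly interpolate unspecified syllabic pitches between
    the specified tone targets."""
    anchors = [(i, p) for i, p in enumerate(sparse) if p is not None]
    contour = list(sparse)
    for (i0, p0), (i1, p1) in zip(anchors, anchors[1:]):
        for j in range(i0 + 1, i1):
            contour[j] = p0 + (p1 - p0) * (j - i0) / (i1 - i0)
    return contour

print(interpolate_contour(syllable_pitches))
# → [220.0, 230.0, 240.0, 250.0, 232.5, 215.0, 197.5, 180.0]
```

Note that the syllables between targets simply lie on straight lines between the specified tones; their actual produced pitches play no role in the phonological analysis.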
Fujisaki and Hirose (1984) and Fujisaki and Gu (2006) proposed the CR model based on the physiological responses of the human vocal organ. In this model, declination is treated as the backbone of the melodic contour ( Cohen et al., 1982 ; Fujisaki, 1983 ). Declination is a reflection of the physiological conditions of phonation: speech occurs during exhalation. As the volume of air in the lungs decreases, the amount of air passing through the larynx also decreases, as does the driving force for vocalization, namely subglottal pressure. This results in a decrease in the frequency of vocal-fold vibration. CR replicates this frequency change by way of a gradual melodic downtrend as the utterance progresses. In this model, the pitch range of the melodic contour is defined by a topline and a baseline. Both lines decline as the utterance progresses, although the topline declines slightly more rapidly than the baseline, making the overall pitch range gradually narrower (i.e., more compressed) over time. In addition to declination, tone commands introduce localized peaks and valleys along the global downtrend. Although tone commands do not directly specify the target pitch of the local peaks and valleys, they are expressed as mathematical functions that serve to indicate the strength and directionality of these localized pitch excursions. Both AM and CR are similar in that pitch contours are delineated by sparse tonal specifications, and that syllables between tone targets are treated as transitions whose pitches are unspecified. However, the two models differ in that tone commands in the CR model are not necessarily motivated by phonologically salient communicative or linguistic functions. These commands are only used to account for pitch variations in order to replicate the observed pitch contours. This difference thus renders the CR model largely descriptive (phonetic), rather than interpretive (phonological), as compared with AM theory.
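The command-response idea can be illustrated with a toy sketch: a phrase command produces a gradual downtrend in the log-frequency domain, and accent commands superimpose localized peaks on that downtrend. This is a simplified, assumption-laden rendering of the Fujisaki model; all parameter values and accent timings below are invented for illustration, not taken from the published implementation.

```python
import math

def phrase_component(t, alpha=2.0):
    # Impulse response of the phrase-control mechanism: a rapid rise
    # followed by a slow decay, yielding global declination.
    return alpha**2 * t * math.exp(-alpha * t) if t >= 0 else 0.0

def accent_component(t, beta=20.0):
    # Step response of the accent-control mechanism (with a ceiling).
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), 0.9)

def f0(t, fb=100.0, ap=0.5, accents=((0.3, 0.8, 0.4), (1.2, 1.6, 0.3))):
    """F0 (Hz) at time t (s); accents = (onset, offset, amplitude) triples.
    Components are summed in the log domain, as in the CR model."""
    ln_f0 = math.log(fb) + ap * phrase_component(t)
    for t1, t2, aa in accents:
        ln_f0 += aa * (accent_component(t - t1) - accent_component(t - t2))
    return math.exp(ln_f0)

# Sample a 2.5-s contour: two local peaks riding on a global downtrend.
contour = [f0(t / 100.0) for t in range(0, 250)]
```

The key property visible in such a contour is the one described above: localized excursions are specified only by the strength and timing of the commands, not by explicit target pitches.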
PENTA ( Xu, 2005 ; Prom-on et al., 2009 ) takes an articulatory-functional approach to representing speech melody. It aims to explain how speech melody works as a system of communication. Recognizing the fact that different communicative functions are simultaneously conveyed by the articulatory system, PENTA begins with a list of these functions and encodes them in a parallel manner. Each syllable obligatorily carries a tone target. The resulting melodic movement for each syllable is generated as an approximation of a level or dynamic tone-target. The pitch target of each syllable is derived based on its inherent constituent communicative functions that coexist in parallel (e.g., lexical, sentential, and focal). Pitch targets are then implemented in terms of contextual distance, pitch range, strength, and duration. The implementation of each pitch target is said to be approximate, as pitch movements are subject to contextual variations. According to Xu and Xu (2005) , the encoding process can be universal or language-specific. In addition, this process can vary due to interference between multiple communicative functions when it comes to the rendering of the eventual melodic contour. In other words, how well the resulting contour resembles the target depends on factors such as contextual variation (anticipatory or carry-over, assimilatory or dissimilatory) and articulatory effort. PENTA is similar to the CR model in that the fundamental frequency ( F 0 ) trajectory of an utterance is plotted as “targets” based on a number of parameters. Such parameters include directionality of the pitch changes, slope of the pitch target, and the speed at which a pitch target is approached. Nonetheless, PENTA sets itself apart from CR and AM in that it establishes a tone target for every syllable, whereas CR and AM only assign pitch accents/targets to syllables associated with localized phonologically-salient events (e.g., pitch accents, boundary tones).
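The target-approximation mechanism can be sketched minimally as follows (this is not the published qTA implementation; all targets, durations, and rate constants are invented for illustration): within each syllable, F0 moves exponentially toward that syllable's pitch target, and the value reached at the syllable's end carries over as the starting point of the next syllable, which is how contextual (carry-over) variation arises.

```python
import math

def approximate_targets(targets_hz, syll_dur=0.2, rate=25.0,
                        steps=50, f0_start=120.0):
    """Piecewise-exponential approach to one static pitch target per
    syllable, with cross-syllable carry-over of the attained pitch."""
    contour = []
    current = f0_start
    for target in targets_hz:
        for k in range(steps):
            t = syll_dur * k / (steps - 1)
            # exponential decay of the distance to the current target
            contour.append(target + (current - target) * math.exp(-rate * t))
        current = contour[-1]  # carry-over into the next syllable
    return contour

# Four syllables, each with its own (hypothetical) static target in Hz.
contour = approximate_targets([180.0, 140.0, 160.0, 110.0])
```

Unlike the AM and CR sketches above, every syllable here obligatorily carries a target, and the observed contour is only an approximation of the target sequence.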
Perhaps the only contemporary system that combines rhythm and melody in the same analysis is Rhythm and Pitch, or RaP ( Dilley and Brown, 2005 ; Breen et al., 2012 ). While based largely on AM’s system of H’s and L’s to represent tones, Breen et al. (2012 , p. 277) claim that RaP differs from ToBI in that it “takes into account developments in phonetics, phonology and speech technology since the development of the original ToBI system.” Instead of using numbers to represent relative boundary strength on the “Breaks” tier in ToBI, RaP uses “X” and “x” to mark three degrees of prominence (strong beat, weak beat, and no beat), as well as “))” and “)” to mark two degrees of boundary strength. On the “rhythm” tier, strong beats are assigned to lexically stressed syllables based on metrical phonology ( Nespor and Vogel, 1986 ; Nespor and Vogel, 2007 ). In addition, the assignment of prominence follows the “obligatory contour principle” ( Leben, 1973 ; Yip, 1988 ) by requiring that prominent syllables be separated from one another by at least one non-prominent syllable, as well as by differences in the phonetic realization of content vs. function words. Although RaP sets itself apart from other systems by acknowledging the existence of rhythm and beats (i.e., pockets of isochronous units in lengthy syllable strings) as perceptual parameters, it still treats rhythm and pitch as completely separate, rather than integrated, parameters, and makes no provision for analyzing or accounting for potential interactions between syllabic duration and pitch.
While all of the linguistic models discussed here claim to represent speech prosody, they largely ignore the fact that speech rhythm is integral to speech prosody and that rhythm and melody interact. As such, these models succeed at representing some aspects of speech prosody but are limited in capturing the larger picture. The use of musical notation to represent speech prosody offers several advantages over AM theory and PENTA. First, the semitone-based chromatic scale provides a more precise characterization of speech melody than the impoverished system of acoustically unspecified H and L tones found in ToBI transcriptions. As pointed out by Xu and Xu (2005) , AM theory is strictly a linear model in that the marking of a tone as H or L essentially depends on the pitch of its adjacent syllables (tones); it is hence impossible to analyze speech melody beyond the scope of three syllables under the AM paradigm. In addition, the use of semitones may in fact be superior to plotting pitch movements in hertz, since semitones correspond to the logarithmic manner in which pitches (and by extension intervals) are perceived by the human ear, although the auditory system clearly has a much finer pitch-discrimination accuracy than the semitone ( Oxenham, 2013 ). Furthermore, musical notation can simultaneously represent both the rhythm and melody of speech using a common set of symbols, a feature that no current linguistic model of speech prosody can aspire to. As such, the use of musical notation not only provides a new and improved paradigm for modeling speech melody in terms of intervals, but also a more precise and user-friendly approach that can be readily integrated into current prosody research to further our understanding of the correspondence between prosodic patterns and their communicative functions.
Speech melody denoted by musical scores can be readily learned and replicated by anyone trained in reading such scores. As a result, transcribing speech prosody with musical notation could ultimately serve as an effective teaching tool for learning the intonation of a foreign language.
Finally, with regard to the dichotomy in linguistics between “phonetics” and “phonology” ( Pierrehumbert, 1999 ), we believe that the use of musical notation to represent speech melody should first and foremost be tested as a phonetic system guided by the amount of acoustic detail present in the observed melodic contours. These details presumably serve to express both linguistic and paralinguistic functions. To further understand the communicative functions of speech melody, the correspondence between specific prosodic patterns and their meanings would then fall under the category of phonological research, using the musical approach as a research tool. For example, the British school of prosodic phonology has historically taken a compositional approach to representing speech melody and its meaning, where melody is composed of tone-units. Each tone-unit contains one of six possible tones ( Halliday, 1967 , 1970 ; O’Connor and Arnold, 1973 ; among others) – high-level, low-level, rise, fall, rise-fall and fall-rise – each of which conveys a specific type of pragmatic information. For example, the fall-rise often suggests uncertainty or hesitation, whereas the rise-fall often indicates that the speaker is surprised or impressed. A tone-unit can span anywhere from a single word to a complete sentence. The “tonic syllable” is the essential part of the tone-unit that carries one of the six abovementioned tones. Stressed syllables preceding the tonic are referred to as “heads”; unstressed syllables preceding the head are referred to as “pre-heads.” Finally, unstressed syllables following the tonic are referred to as the “tail.”
The principal aim of the current study is to examine the utility of using a musical approach to speech melody and to visualize the results quantitatively as plots of relative pitch using musical notation. In this vocal-production study, we had 19 native speakers of Canadian English read aloud a series of 19 test items, spanning from single words to full sentences containing multiple intonational phrases. These sentences were designed to examine declination, modality, narrow focus, and utterance-final boundary tones. We decided to analyze these particular features because their correspondence to linguistic meaning is relatively well-defined and because their implementation is context-independent. In other words, melodic patterns associated with the test sentences remain stable when placed within various hypothetical social contexts ( Grice and Baumann, 2007 ; Prieto, 2015 ). We transcribed participants’ melodic contours into relative-pitch representations down to the level of the semitone using musical notation. The aim was to provide a detailed quantitative analysis of the relative-pitch properties of the test items, demonstrate mechanistic features of sentence melody (such as declination, pitch accents, and boundary effects), and highlight the utility of the method for averaging productions across multiple speakers and visualizing the results on a musical staff. In doing so, this analysis would help revive the long-forgotten work of Steele (1775) and his integrative method of representing both speech rhythm and melody using a common system of musical notation. A companion musical model of speech rhythm using musical notation is presented elsewhere ( Brown et al., 2017 ).
Participants.
Nineteen participants (16 females; mean age 19.8 years) were recruited from the introductory psychology mass-testing pool at McMaster University. Eighteen of them were paid a nominal sum for their participation, while one was given course credit. All were native speakers of Canadian English. Two thirds of the participants had school training or family experience in a second language. Participants gave written informed consent for taking part in the study, which was approved by the McMaster Research Ethics Board.
Participants were asked to read a test corpus of 19 test items ranging from single words to various types of sentences, as shown in Table 1 . This corpus included declarative sentences, interrogatives, an imperative, and sentences with narrow focus specified at different locations. The purpose of using this broad spectrum of sentences was to analyze different prosodic patterns in order to construct a new model of speech melody based on a syllable-by-syllable analysis of pitch.
TABLE 1. Sentences in the test corpus.
In addition to examining the melody of full sentences, we used a building-block approach that we call a “concatenation” technique in order to observe the differences in the pitch contours of utterances between (1) citation form (i.e., a single word all on its own), (2) phrase form, and (3) a full sentence, which correspond, respectively, to the levels of prosodic word, intermediate phrase, and intonational phrase in the standard phonological hierarchy ( Nespor and Vogel, 1986 ). For example, the use of the concatenation technique resulted in the generation of corpus items that spanned from the single words “Yellow” and “Telephone,” to the adjectival phrase “The yellow telephone,” to the complete sentences “The yellow telephone rang” and “The yellow telephone rang frequently.” This allowed us to compare the tune of “yellow” in citation form to that in phrases and sentences. Gradually increasing the length of the sentences allowed us to observe the corresponding pitch changes for all the words in the sentences.
Before the experiment began, participants filled out questionnaires. They were then brought into a sound-attenuated booth and seated in front of a computer screen. Test sentences were displayed using Presentation® software (Neurobehavioral Systems, Albany, CA, United States). All vocal recordings were made using a Sennheiser tabletop microphone and recorded at a 44.1 kHz sampling rate as 16-bit WAV files with Presentation’s internal recording system. Before the test sentences were read, warm-up tasks were performed in order to assess each participant’s vocal range and habitual vocal pitch. These included throat clears, coughs, sweeps to the highest and lowest pitches, and the reading of the standard “Grandfather” passage.
Participants were next shown the items of the test corpus on a computer screen and were asked to read them aloud in an emotionally neutral manner, as if they were engaging in a casual conversation. The 19 items were presented in a different random order for each participant. Each item was displayed on the screen for a practice period of 10 s, during which the participant could practice saying it out loud. After this, a 10 s recording period began, during which the participant was asked to produce the utterance fluently twice without error. The second production was analyzed. In the event of a speech error, participants were instructed to simply repeat the item. For words placed under narrow focus, the stressed word or syllable was written in capital letters (e.g., “My ROOMmate had three telephones”).
In order to transcribe the pitch contour of the recorded speech, we analyzed the F 0 trajectory of the digitized speech signal using Praat ( Boersma and Weenink, 2015 ), an open-source program for the acoustic analysis of speech. Steady-state parts of the voiced portion of each syllable – including the vowel and any preceding voiced consonants – were manually delineated, and the average pitch (in Hz) was extracted. This was done manually for all 2,337 syllables (123 syllables × 19 participants) in the dataset. In some cases, the terminal pitch of a test item was spoken in creaky voice such that a reliable pitch measurement was not obtainable for that syllable. This affected either the last syllable of a single word spoken in citation form or the last syllable of the final word of a multi-word utterance. In both cases, the entire item was discarded from the dataset: while the preceding syllabic pitches could be estimated with accuracy, the absence of the last syllable would have made the last interval in the group analysis inaccurate had the other syllables been included. Such discards affected 13% of the 361 test items (19 items × 19 participants).
Pitch changes (intervals) were converted from Hz into “cents change” using the participant’s habitual pitch as the reference for the conversion, where 100 cents is equal to one equal-tempered semitone in music. Conversion from Hz to semitones allows for a comparison of intervals across gender and age ( Whalen and Levitt, 1995 ), as well as for group averaging of productions. In order to estimate a participant’s habitual pitch, we took the mean frequency of the productions of all the items in the test corpus, excluding entire items that were discarded due to creaky voice. Musical intervals were assigned after the group averaging had been completed. Intervals were assigned to the nearest semitone, assuming the 12-tone chromatic scale, where a ±50-cent criterion separated adjacent chromatic pitches. It is important to note that our transcriptions are no more accurate than the semitone level and that we did not attempt to capture microtonality in the speech signal. Hence, it sufficed for us to assign an interval to the closest reasonable semitone. For example, a major second, which is a 200-cent interval, was defined by pitch transitions occurring anywhere in the span from 150 to 249 cents. It is also important to note that “quantization” to the nearest interval was only ever done with the group data, and that all single-subject data were kept in their raw form in cents throughout all analyses. For the full-corpus analysis of intervals presented in Figure 8 , intervals are shown in raw form without any rounding to semitone categories.
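The Hz-to-cents conversion and the semitone quantization described above can be expressed compactly. The habitual-pitch value in the example is hypothetical, and Python's `round` is used here as a convenient stand-in for the ±50-cent criterion:

```python
import math

def hz_to_cents(f_hz, habitual_hz):
    # 1200 cents per octave; 100 cents per equal-tempered semitone
    return 1200.0 * math.log2(f_hz / habitual_hz)

def quantize_to_semitone(cents):
    # e.g., anything from roughly +150 to +249 cents maps to +2 semitones
    return round(cents / 100.0)

habitual = 200.0                      # hypothetical habitual pitch (Hz)
cents = hz_to_cents(224.5, habitual)  # ≈ +200 cents: a major second up
print(quantize_to_semitone(cents))    # → 2
```

As in the procedure above, quantization would be applied only to group-averaged data; single-subject values would stay in raw cents.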
As a normalization procedure for the group results, the intervals were averaged across the 19 speakers and then placed onto a treble clef for visualization, with middle G arbitrarily representing the mean habitual pitch of the speakers. Transcriptions were made with Finale PrintMusic 2014.5. Note that this approach presents a picture of the relative pitch – but not the absolute pitch – of the group’s productions, where the absolute pitch was approximately an octave (females) or two (males) lower than what is represented. Virtually all of the single-participant productions fit within the range of a single octave, represented in our transcriptions as a span from middle C to the C one octave above, resulting in roughly equal numbers of semitones in either direction from the G habitual pitch. For the transcriptions presented in Figures 1 – 7 , only sharps are used to indicate non-diatonic pitches in a C major context. In addition, sharps only apply to the measure they are contained in and do not carry over to the next measure of the transcription.
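The visualization convention can be sketched as follows: a group-averaged pitch, expressed as a signed semitone offset from the habitual pitch, is written as the chromatic note that many semitones away from middle G (taken here as G4), using sharps only for non-diatonic pitches, as in the transcriptions. The helper below is an illustrative construction, not software used in the study:

```python
# Chromatic pitch classes with sharps only (C major context).
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def semitones_to_note(offset, habitual_note="G", habitual_octave=4):
    """Map a semitone offset from the habitual pitch (placed on G4)
    to a note name with octave number."""
    base = NOTE_NAMES.index(habitual_note) + 12 * habitual_octave
    target = base + offset
    return f"{NOTE_NAMES[target % 12]}{target // 12}"

print(semitones_to_note(0))    # → G4 (the habitual pitch)
print(semitones_to_note(-2))   # → F4
print(semitones_to_note(3))    # → A#4
```

Since the habitual pitch sits on G, the octave span from C4 to C5 leaves roughly equal numbers of semitones available in either direction, matching the range observation above.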
FIGURE 1. Concatenation of “yellow” and “telephone” to form “The yellow telephone.” The figure shows the average relative-pitch changes associated with the syllables of the test items. For Figures 1 – 7 , the transcriptions are shown using a treble clef, with the habitual pitch arbitrarily assigned to middle G. All intervals are measured in semitones with reference to participants’ habitual pitch, down to the nearest semitone, as based on a 50-cent pitch window around the interval. (A) Citation form of “yellow.” (B) Citation form of “telephone.” (C) Concatenation to form the adjectival phrase “The yellow telephone.” Notice the contour reversal for “llow” (red box) compared to citation form in (A) . The curved line above the staff indicates a melodic arch pattern in Figures 1 – 7 .
FIGURE 2. Concatenation of “Saturday” and “morning” to form “Saturday morning”. (A) Citation form of “Saturday.” (B) Citation form of “morning.” (C) Concatenation to form the phrase “Saturday morning.” Notice the contour reversal for “tur” and “day” (red box) compared to the citation form in (A) .
FIGURE 3. Expansion of “Alanna” to “Alanna picked it up.” (A) Citation form of “Alanna.” (B) Expansion to create the sentence “Alanna picked it up.” Notice the contour reversal for “nna” (red box) compared to the citation form in (A) .
FIGURE 4. Expansion to form longer sentences starting with “The yellow telephone”. As words are added to the end of the sentence, declination is suspended on the last syllable of the previous sentence such that the characteristic drop between the penultimate and final syllables can serve to mark the end of the sentence at the newly added final word. (A) Melodic pattern for “The yellow telephone.” (B) Melodic pattern generated by adding the word “rang” to the end of the sentence in (A) . The red box highlights the difference in pitch height between the syllables “le-phone” in (A,B) , demonstrating the suspension of declination occurring on these syllables in (B) . (C) Melodic pattern generated by adding the word “frequently” to the end of the sentence in (B) . The red box around “phone rang” highlights the difference in pitch height between these syllables in (B,C) . The point of suspension of declination has moved from “phone” to “rang” in (C) .
FIGURE 5. Melodic contours for long sentences consisting of two intonational phrases, characterized by two melodic arches. Both sentences in this figure contain phrases based on items found in Figures 2 – 4 . The sentence in (A) combines sentence B in Figure 4 and sentence B in Figure 3 . The sentence in (B) combines sentence C in Figure 2 and sentence B in Figure 4 . The melodic contour of a long sentence consisting of two intermediate phrases shows two arched patterns, similar to those in the sentences presented in Figure 4 . These sentences provide further evidence of contour reversals, melodic arches, suspensions of declination, and terminal drops. See text for details.
FIGURE 6. Identical sentences but with narrow focus placed sequentially on different words. All four panels have the same string of words, but with narrow focus placed on either (A) my, (B) roommate, (C) three, or (D) telephone. Pitch rises are observed on the focus word in all instances but the last one. The symbol “>” signifies a point of focus or accent. For ease of presentation, only the stressed syllable of roommate and telephone is shown in block letters in the transcription.
FIGURE 7. Sentence modality. A comparison is made between (A) an imperative sentence, (B) a yes–no interrogative, and (C) a WH-interrogative. Of interest here is the absence of declination for the imperative statement in (A) , as well as the large terminal rise at the end of the yes–no interrogative in (B) .
All of the transcriptions in Figures 1 – 7 are shown with proposed rhythmic transcriptions in addition to the observed melodic transcriptions. While the purpose of the present study is to quantify speech melody, rather than speech rhythm, the study is in fact a companion to a related one about speech rhythm ( Brown et al., 2017 ). Hence, we took advantage of the insights of that study to present approximate rhythmic transcriptions of the test items in all of the figures. However, it is important to keep in mind that, while the melodic transcriptions represent the actual results of the present study, the rhythmic transcriptions are simply approximations generated by the second author and are in no way meant to represent the mean rhythmic trend of the group’s productions as based on timing measurements (as they do in Brown et al., 2017 ). In other words, the present study was not devoted to integrating our present approach to speech melody with our previous work on speech rhythm, which will be the subject of future analyses.
The results shown here are the mean pitches relative to each participant’s habitual pitch, where the habitual pitch is represented as middle G on the musical staff. While we are not reporting variability values, we did measure the standard deviation (SD) for each syllable. The mean SD across the 123 syllables in the dataset was 132 cents or 1.32 semitones. For all 19 test items, the last syllable always had the largest SD value. When the last syllable of the test items was removed from consideration, the SD decreased to 120 cents or 1.2 semitones.
Figures 1A,B show the citation forms of two individual words having initial stress, namely yellow (a two-syllable trochee) and telephone (a three-syllable dactyl). As expected for words with initial stress, there is a pitch rise on the first syllable ( Lieberman, 1960 ; Cruttenden, 1997 ; van der Hulst, 1999 ), followed by a downtrend of either two semitones (yellow) or three semitones (telephone). Figure 1C shows the union of these two words to form the adjectival phrase “The yellow telephone.” Contrary to a prediction based on a simple concatenation of the citation forms of the two words (i.e., two sequential downtrends), there is instead a contour reversal for yellow such that there is now a one-semitone rise in pitch between the two syllables (red box in Figure 1C ), rather than a two-semitone drop. “Telephone” shows a slight compression of its pitch range compared to citation form, but no contour reversal. The end result of this concatenation to form an adjectival phrase is a melodic arch pattern (shown by the curved line above the staff in Figure 1C ), with the pitch peak occurring, paradoxically, on the unstressed syllable of yellow. The initial and final pitches of the phrase are nearly the same as those of the two words in citation form.
Figures 2A,B show a similar situation, this time with the initial word having three syllables and the second word having two syllables. As in Figure 1 , the citation forms of the words show the expected downtrends, three semitones for Saturday and two semitones for morning. Similar to Figure 1 , the joining of the two words to form a phrase results in a contour change, this time a flattening of the pitches for Saturday (Figure 2C , red box), rather than the three-semitone drop seen in citation form. A similar type of melodic arch is seen here as for “The yellow telephone.” As with that phrase, the initial and final pitches of the phrase are nearly the same as those of the two contributing words in citation form. “Morning” shows a small compression, as was seen for “telephone” in Figure 1C .
Figure 3 presents one more example of the comparison between citation form and phrasal concatenation, this time where the word of interest does not have initial stress: the proper name Alanna (an amphibrach foot). Figure 3A demonstrates that, contrary to expectations, there is not a pitch rise on the second (stressed) syllable of the word, but that the syllable was spoken with the identical pitch as the first syllable. This is followed by a two-semitone downtrend toward the last syllable of the word. Adding words to create the sentence “Alanna picked it up” again produces a contour reversal to create a melodic arch centered on the unstressed terminal syllable of Alanna (Figure 3B , red box).
Figure 4 picks up where Figure 1 left off. Figure 4A recopies the melody of the phrase “The yellow telephone” from Figure 1C . The next two items create successively longer sentences by adding words to the end, first adding the word “rang” and then adding the word “frequently” to the latter sentence. Figure 4B shows that the downtrend on “telephone” that occurred when “telephone” was the last word of the utterance is minimized. Instead, there is a suspension of declination by a semitone (with reference to the absolute pitch, even though the interval between “le” and “phone” is the same in relative terms). The downtrend then gets shifted to the last word of the sentence, where a terminal drop of a semitone is seen. Figure 4C shows a similar phenomenon, except that the word “rang” is part of the suspended declination. The downtrend in absolute pitch now occurs on “frequently,” ending the sentence slightly below the version ending in “rang.” Overall, we see a serial process of suspension of declination as the sentence gets lengthened. One observation that can be gleaned from this series of sentences is that the longer the sentence, the lower the terminal pitch, suggesting that longer sentences tend to have a larger pitch range than shorter sentences. This is also shown by the fact that “yellow” attains a higher pitch in this sentence than in the shorter sentences, resulting in an overall range of five semitones, compared to three semitones for “the yellow telephone.” Hence, for longer sentences, expansions occur at both ends of the pitch range, not just at the bottom.
Figure 5 compounds the issue of sentence length by now examining sentences with two distinct intonational phrases, each sentence with a main clause and a subordinate clause. The transcriptions now contain two melodic arches, one for each intonational phrase. For illustrative purposes, the phrases of these sentences were all designed to contain components that are found in Figures 1 – 4 . For the first sentence (Figure 5A ), the same suspension of declination occurs on the word “rang” as was seen in Figure 4C . That this is indeed a suspension process is demonstrated by the fact that the second intonational phrase (the subordinate clause) starts on the last pitch of the first one. The second phrase shows a similar melody to that same sentence in isolation (Figure 3B ), but the overall pattern is shifted about two semitones downward and the pitch range is compressed, reflecting the general process of declination. Finally, as with the previous analyses, contour reversals are seen with both “yellow” and “Alanna” compared to their citation forms to create melodic arches.
A very similar set of melodic mechanisms is seen for the second sentence (Figure 5B ). A suspension of declination occurs on “morning” (compared to its phrasal form in Figure 2C ), and the second intonational phrase starts just below the pitch of “morning.” The phrase “On Saturday morning” shows an increase in pitch height compared to its stand-alone version (Figure 2B ). In the latter, the pitches for Saturday are three unison pitches, whereas in the longer sentence, the pitches for Saturday rise two semitones, essentially creating an expansion of the pitch range for the sentence. This suggests that longer sentences map out larger segments of pitch space than shorter sentences and that speakers are able to plan ahead by creating the necessary pitch range when a long utterance is anticipated. The second phrase, “the yellow telephone rang,” has a similar, though compressed, intervallic structure compared to when it was a stand-alone sentence (Figure 4B ), indicating declination effects. In addition, the phrase occurs lower in the pitch range (1–2 semitones) compared to both the stand-alone version and its occurrence in the first phrase of Figure 5A , as can be seen by the fact that the transition from “phone” to “rang” is G to F# in the first sentence and F to E in the second. Overall, for long sentences consisting of two intonational phrases, the melody of the first phrase seems to be located in a higher pitch range and shows larger pitch excursions compared to the second intonational phrase, which is both lower in range and compressed in interval size. In other words, more melodic movement happens in the first phrase. As was seen for the set of sentences in Figure 4 , expansions in pitch range for longer sentences occur at both ends of the range, not just at the bottom.
Figure 6 examines the phenomenon of narrow focus, where a given word in the sentence is accented in order to place emphasis on its information content. Importantly, the same string of words is found in all four sentences in the figure. All that differs is the locus of narrow focus, which was indicated to participants using block letters for the word in the stimulus sentences. Words under focus are well-known to have pitch rises, and this phenomenon is seen in all four sentences, where a pitch rise is clearly visible on the word under focus, and more specifically its stressed syllable in the case of polysyllabic words “roommate” and “telephone.” All sentences showed terminal drops between “le” and “phones,” although this drop was largest in the last sentence, where the pitch rise occurred on “telephone” and thereby led to an unusual maintenance of high pitch at the end of a sentence. Perhaps the major point to be taken from the results in Figure 6 is that each narrow-focus variant of the identical string of words had a different melody. Another interesting effect is the contour inversion for “roommate” that occurs when this word precedes the pitch accent (the 1-semitone rise in Figures 6C,D ), compared to when it follows it (Figure 6A ) or is part of it (Figure 6B ). This suggests that, in the former cases, speakers maintain their pitch in the high range in preparation for an impending pitch accent later in the sentence.
Figure 7 looks beyond declaratives to examine both an imperative statement and two types of interrogatives, namely a yes–no and a WH question (where WH stands for question-words like what, where, and who). Figure 7A presents a basic command: “Telephone my house!”. The sentence shows a compressed pitch pattern at a relatively high part of the range, but with a small melodic arch to it, perhaps indicative of the high emotional intensity of an imperative. One noticeable feature here is the loss of the terminal drop that is characteristic of declarative sentences and even citation forms. Instead, pitch is maintained at one general level, making this the most monotonic utterance in the dataset. Perhaps the only surprising result is that the initial stressed syllable “Te” has a slightly lower pitch than the following syllable “le” (79 cents in the raw group data), whereas we might have predicted a slightly higher pitch for the first syllable of a dactyl, as seen in the citation form of “telephone” in Figure 1B . Hence, a small degree of arching is seen with this imperative sentence. This stands in contrast to when the first word of a sentence is under narrow focus, as in Figure 6A (“MY roommate has three telephones”), where that first word clearly shows a pitch rise.
Figures 7B,C present a comparison between the two basic types of questions. The results in Figure 7B conform with the predicted pattern of a yes–no question in English, with its large pitch rise at the end (Bolinger, 1989; Ladd et al., 1999; Ladd, 2008; Féry, 2017). The terminal rise of 4 semitones is one of the largest seen in the dataset. The melodic pattern preceding the terminal rise is nearly flat, hence directing all of the melodic motion to the large rise itself. Two features of this sentence are interesting to note. First, whereas long declarative sentences tend to end about three semitones below the habitual pitch, the yes–no question ended a comparable number of semitones above the habitual pitch. Hence, the combination of a declarative sentence and a yes–no interrogative maps out the functional pitch range of emotionally neutral speech, which is approximately eight semitones or the interval of a minor 6th. Second, the melodic pattern for “telephone” during the terminal rise is opposite to that in citation form (Figure 1B). Next, Figure 7C presents the pattern for the WH question “Whose telephone is that?”. The melody is nearly opposite in form to the yes–no question, showing a declining pattern much closer to a declarative sentence, although it lacks the arches seen in declaratives. In this regard, it is closer to the pattern seen with the imperative, although with a larger pitch range and a declining contour. Potential variability in the intonation of this question is discussed in the “Limitations” section below. Overall, the yes–no question and the WH question show strikingly different melodies, as visualized here with notation.
Figure 8 looks at the occurrence of interval categories across all productions of the 19 test-items by the 19 participants. A total of 1700 intervals was measured after discarding items having creaky voice on the terminal syllable. Among the intervals, 37% were ascending intervals (0 cents is included in this group), while 63% were descending intervals. The mean interval size was -45 cents. Fully 96% of the intervals sit in the range of -400 to +400 cents. In other words, the majority of intervals are between a descending major third and an ascending major third, spanning a range of eight semitones or a minor 6th. The figure shows that speech involves small intervallic movements, predominantly unisons, semitones and whole tones, or microtonal intervals in between them. A look back at the transcriptions shows that speech is quite chromatic (on the assumption that our approximation of intervals to the nearest semitone is valid). It is important to point out that the continuous nature of the distribution of spoken intervals shown in Figure 8 is quite similar to the continuous nature of sung intervals for the singing of “Happy Birthday” found in Pfordresher and Brown (2017) . Hence, spoken intervals appear to be no less discrete than sung intervals.
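The interval statistics reported above can be illustrated with a short sketch. The code below is illustrative only: the per-syllable pitch track (Hz values) is invented rather than taken from the corpus. It shows how successive intervals in cents, the ascending/descending split, and the proportion falling in the ±400-cent band could be computed.

```python
import math

def intervals_in_cents(freqs_hz):
    """Convert a sequence of syllable pitches (Hz) to successive
    melodic intervals in cents (1200 cents = one octave)."""
    return [1200 * math.log2(b / a) for a, b in zip(freqs_hz, freqs_hz[1:])]

# Hypothetical pitch track for one short utterance (Hz per syllable).
pitches = [196.0, 220.0, 207.6, 196.0, 185.0]

cents = intervals_in_cents(pitches)

# Group unisons (0 cents) with ascending intervals, as in the analysis above.
ascending = sum(1 for c in cents if c >= 0)
descending = sum(1 for c in cents if c < 0)
mean_interval = sum(cents) / len(cents)
in_band = sum(1 for c in cents if -400 <= c <= 400) / len(cents)
```

For this invented track, the first interval is a rising whole tone (about +200 cents) followed by three descents of roughly a semitone each, so the proportions and the small negative mean mirror the pattern described for the corpus.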
FIGURE 8. Frequency distribution of interval use in the test corpus. This figure presents the relative frequency of pitch-intervals across the 19 test items and the 19 participants. The y -axis represents the absolute frequency of each interval from a pool of 1700 intervals. Along the x -axis are the intervals expressed as cents changes, where 100 cents is one equal-tempered semitone. Descending intervals are shown in red on the left side, and ascending intervals are shown in blue on the right side, where the center of the distribution is the unison interval having no pitch change (i.e., two repeated pitches), which was color coded as blue. 96% of the intervals occur in the span of –400 to +400 cents.
Large intervals were rare. They were only seen in situations of narrow focus (Figure 6) and the yes–no interrogative (Figure 7B), in both cases as ascending intervals. Large descending intervals were especially rare. A look at the ranges of the sentences across the figures shows that the longest sentences had the largest ranges. Expansion of the range occurred at both the high and low ends, rather than solely through a deeper declination at the bottom, suggestive of phonatory planning by speakers. However, even the longest sentences sat comfortably within the span of about a perfect fifth (seven semitones), with roughly equal sub-ranges on either side of the habitual pitch.
It is difficult to address the question of whether there are scales in speech, since even our longest sentences had no more than 15 pitches, and the constituent intonational phrases had only about 8 pitches. If scaling is defined by the recurrence of pitch classes across a melody, then the overall declination pattern that characterizes the melody of speech does not favor the use of scales. If nothing else, there seems to be a coarse type of chromaticism to the pitch pattern of speech, with semitones (or related microtonal variants) being the predominant interval type beyond the unison. Our working hypothesis is that scaling is a domain-specific feature of music, and that speech is basically an atonal phenomenon by comparison, which makes use of a weak type of chromaticism, operating within the compressed pitch range of standard speech production.
We have presented an analysis of speech melody that differs from all contemporary approaches in linguistics but that has similarities to Joshua Steele’s 1775 attempt to capture the melody of speech using symbols similar to musical notation on a musical staff. Compared to current approaches that merely indicate points of salience or transition in the speech signal, our method permits a quantification of all of the relevant pitch events in a sentence, and does so in a manner that allows for both comparison among speakers and group averaging. This permits a global perspective on speech melody, in addition to simply considering pitch changes between adjacent syllables/tones. We have used this method to analyze a number of key phonetic and phonological phenomena, such as individual words, intonational phrases, narrow focus, and sentence modality. In all cases, the results have provided quantitative insight into these phenomena in a manner that approaches relying on qualitative graphic markers like H(igh) and L(ow) cannot match.
The general method that we are presenting here consists of three major components: (1) a method for transcribing and thus visualizing speech melody, ultimately uniting melody and rhythm; (2) use of the transcriptions to analyze the structural dynamics of speech melody in terms of intervallic changes and overall pitch movement; and (3) a higher-level interpretation of the pitch dynamics in terms of the phonological meaning of intonation as well as potential comparisons between language and music (e.g., scales, shared prosodic mechanisms). Having used Figures 1 – 7 to demonstrate the visualization capability of musical transcription, we will now proceed to discuss the results in terms of the dynamics of speech melody.
Figure 9 attempts to summarize the major findings of the study by consolidating the results into a generic model of sentence melody for a long declarative sentence containing two principal intonational phrases (as in Figure 5 ). Before looking at the full sentences in the corpus, we first consider the citation form of the individual polysyllabic words that were analyzed. All of them showed the expected phenomenon of a pitch rise on the stressed syllable. This was seen with the words yellow, telephone, Saturday, and morning in Figures 1 – 4 , but only minimally with Alanna, which showed a pitch drop on the last syllable but not a pitch rise on the stressed syllable.
FIGURE 9. Generic model of the melody of a long declarative sentence. The left side of the figure shows the approximate pitch range for the emotionally neutral intonation of a standard speaker, with the habitual pitch marked in the center, and the functional pitch range mapped out as pitch space above and below the habitual pitch. See text for details about the mechanisms shown. P4, perfect 4th.
Looking now to the melodic dynamics of phrases and full sentences, we noted a number of reproducible features across the corpus of test items, as summarized graphically in Figure 9 .
(1) The starting pitch of a sentence tended to be mid-register, at or near a person’s habitual vocal pitch (represented in our transcriptions as middle G). An analysis of the pitch-range data revealed that the habitual pitch was, on average, five semitones or a perfect 4th above a person’s lowest pitch.
(2) Sentences demonstrated an overall declination pattern, ending as much as four semitones below the starting pitch, in other words very close to participants’ low pitch. Much previous work has demonstrated declination of this type for English intonation ( Lieberman et al., 1985 ; Ladd et al., 1986 ; Yuan and Liberman, 2014 ). The exception in our dataset was the yes–no interrogative, which instead ended at a comparable number of semitones above the habitual pitch. The combination of a declarative and a yes–no interrogative essentially mapped out the functional pitch range of the speakers’ productions in the dataset.
(3) That pitch range tended to span about 4–5 semitones in either direction from the habitual pitch for the emotionally neutral prosody employed in the study, hence close to an octave range overall.
(4) Longer sentences tended to occupy a larger pitch range than single words or shorter phrases. The expansion occurred at both ends of the pitch range, rather than concentrating all of the expansion as a greater lowering of the final pitch.
(5) Sentences tended to be composed of one or more melodic arches , corresponding more or less to intonational phrases.
(6) Paradoxically, the peak pitch of such arches often corresponded with an unstressed syllable of a polysyllabic word, typically the pitch that followed the stressed syllable.
(7) This was due to the contour reversal that occurred for these words when they formed melodic arches, as compared to the citation form of these same words, which showed the expected pitch rise on the stressed syllable.
(8) The pitch peak of the arch was quantified intervallically as spanning anywhere from 1 to 3 semitones above the starting pitch of the sentence.
(9) However, melodic arches and other types of pitch accents (like narrow focus) underwent both a pitch lowering and compression when they occurred later in the sentence, such as in the second intonational phrase of a multi-phrase sentence. In other words, such stress points showed lower absolute pitches and smaller pitch excursions compared to similar phenomena occurring early in the sentence. Overall, for long sentences consisting of two intonational phrases, the melodic contour of the first phrase tended to be located in a higher part of the pitch range and showed larger pitch excursions compared to the second intonational phrase, which was both lower and compressed.
(10) For sentences with two intonational phrases, there was a suspension of declination at the end of the first phrase, such that it tended to end at or near the habitual pitch. This suggests that speakers were able to plan out long sentences at the physiological level and thereby create a suitable pitch range for the production of the long utterance. It also suggests that the declarative statement is a holistic formula, such that changes in sentence length aim to preserve the overall contour of the formula.
(11) Sentences tended to end with a small terminal drop , on the order of a semitone or two. The exceptions were the imperative, which lacked a terminal drop, and the yes–no interrogative, which instead ended with a large terminal rise .
(12) The terminal pitch tended to be the lowest pitch of a sentence, underlining the general process of declination. Again, the major exception was the yes–no interrogative.
(13) For declarative sentences, there was a general pattern such that large ascending intervals occurred early in the sentence (the primary melodic arch, Figure 9 ), whereas the remainder of the sentence showed a general process of chromatic descent. This conforms with an overarching driving mechanism of declination.
(14) The overall pitch range tended to be larger in longer sentences, and the terminal pitches tended to be lower as well, by comparison to single words or short phrases.
(15) Speech seems to be dominated by the use of small melodic intervals , and hence pitch proximity. Unisons were the predominant interval type, followed by semitones and whole tones, a picture strikingly similar to melodic motion in music ( Vos and Troost, 1989 ; Huron, 2006 ; Patel, 2008 ).
(16) Our data showed no evidence for the use of recurrent scale patterns in speech. Instead, the strong presence of semitones in the pitch distribution suggested that a fair degree of chromaticism occurs in speech. Hence, speech appears to be atonal.
Having summarized the findings of the study according to the musical approach, we would like to consider standard linguistic interpretations of the same phenomena.
When pronounced in isolation, the stressed syllables of polysyllabic words such as “yellow” and “Saturday” were aligned with a high pitch. The melodic contour then dropped two semitones for the second syllable, resembling an utterance-final drop. On the other hand, when “yellow” and “Saturday” were followed by additional words to form short phrases, the melodic contour seen in citation form was inverted, resulting in pitch peaks on the unstressed syllables of these words. Figure 10 presents a direct comparison between a ToBI transcription and a musical transcription for “The yellow telephone.” AM theory postulates that the pitch drop in the citation forms of “yellow” and “telephone” represents the transition between the pitch accent (on the stressed syllable) and the boundary tone. In “The yellow telephone,” the 1-semitone rise would be treated as a transition between the first H∗ pitch accent on “yel-” and the H of the H-L-L% boundary tone. But this rise is never treated as a salient phonological event. This variability motivates AM theory to consider the “H∗-L-L%” tune as compositional, such that it can be associated with utterances of different lengths (Beckman and Pierrehumbert, 1986). Nonetheless, it is not clear why H∗ entails a 1-semitone rise, whereas H-L% is manifested by a two-semitone drop.
FIGURE 10. Comparing ToBI transcription and musical transcription. The figure presents a comparison between a ToBI transcription (A) and a musical transcription (B) of “The yellow telephone.” (B) Is a reproduction of Figure 1C .
The observed phrase-level arches – with their contour reversals on polysyllabic words – ultimately lead to the formation of sentence arches in longer sentences. The results of this study indicate that, in general, the melodic contours of utterances consisting of simple subject–verb–object sentences tend to be characterized by a series of melodic arches of successively decreasing pitch height and pitch range. Paradoxically, these arches very often peaked at a non-nuclear syllable, as mentioned above. Additional arches were formed as the sentence was lengthened by the addition of more words or intonational phrases. Moreover, declination was “suspended” when additional syllables were inserted between the pitch accent and the boundary tone (Figure 4) or when intonational phrases were added as additional clauses to the sentence (Figure 5). To the best of our knowledge, no linguistic theory of speech melody accounts for this suspension. In addition, speakers adjusted their pitch range at both ends when producing a long utterance consisting of two intonational phrases. Again, as far as we know, the ability to represent pitch-range changes across a sentence is unique to our model. With the two phrases and the boundary tone sharing the same overall pitch range, the range occupied by each phrase becomes narrower: the first phrase starts at a higher pitch than normal and occupies the upper half of the shared pitch range, while the second phrase begins at a lower pitch and occupies the lower half. At the end of the second phrase, the phrase-final drop is reduced.
To a first approximation, our melodic arches map onto the intonational phrases of phonological theory, suggesting that these arches constitute a key building block of speech melody. For standard declarative sentences, the arches show a progressive lowering in absolute pitch and a narrowing in relative pitch over the course of the sentence, reflecting the global process of declination. Melodic contours of English sentences have been characterized as consisting of different components when comparing the British school with AM theory ( Cruttenden, 1997 ; Gussenhoven, 2004 ; Ladd, 2008 ). Despite this, Collier (1989) and Gussenhoven (1991) described a “hat-shape” melodic pattern for Dutch declarative sentences that might be similar to what we found here for English. Whether we are describing the sentence’s speech melody holistically as an arch or dividing the melodic contour into components, we are essentially describing the same phenomenon.
Comparing the different readings of “My roommate had three telephones” when narrow focus was placed on “my,” “roommate,” “three,” and “telephones” (see Figure 6), the results revealed that the stressed syllable of the word under focus was generally marked by a pitch rise of as much as three semitones, except when it occurred on the last word of a sentence, where this pitch jump was absent. Pitch peaks were aligned to the corresponding segmental locations. Both observations are consistent with current research on narrow focus and pitch-segmental alignment in spoken English (Ladd et al., 1999; Atterer and Ladd, 2004; Dilley et al., 2005; Xu and Xu, 2005; Féry, 2017). Xu and Xu’s (2005) prosodic study of narrow focus in British English indicated that, when a word is placed under narrow focus, the pre-focus part of the sentence remains unchanged. This effect is observed in the initial part of the sentences in Figures 6C,D, in which the melody associated with “my roommate had” remains unchanged in the pre-focus position. Second, Xu and Xu (2005) and Féry (2017) reported that the word under narrow focus is pronounced with a raised pitch and expanded range, whereas the post-focus part of the sentence is pronounced with a lower pitch and a more restricted pitch range. These effects were also observed by comparing the sentences in Figures 6C,D with those in Figures 6A,B: the latter part of the sentence, “had three telephones,” was pronounced in a lower and more compressed part of the pitch range when it was in the post-focus position. Overall, the use of a musical approach to describe narrow focus not only allows us to observe previously reported effects on the pre-, in-, and post-focus parts of the sentence, but also provides a means of quantifying these effects in terms of pitch changes.
Research in intonational phonology in English indicates that imperative and declarative sentences, as well as WH-questions, are generally associated with a falling melodic contour, whereas the correspondence between speech melody and yes–no (“polar”) questions is less straightforward. Yes–no questions with syntactic inversion (e.g., “Are you hungry?”) are generally associated with a falling melodic contour, whereas those without inversion (e.g., “You are hungry?”) are associated with a rising contour ( Crystal, 1976 ; Geluykens, 1988 ; Selkirk, 1995 ). In addition, questions typically involve some element of high pitch ( Lindsey, 1985 ; Bolinger, 1989 ), whereas such features are absent in statements. While our results are in line with these observations, the comparison of statement and question contours using melodic notation allows us to pinpoint the exact amplitude of the final rises and falls associated with each type of question. Furthermore, it allows us to represent and quantify the difference in global pitch height associated with questions as opposed to statements. This phonologically salient feature is missing in AM and CR, which only account for localized pitch excursions.
The Introduction presented a detailed analysis of the dominant approaches to speech melody in the field of phonology. We would now like to consider the advantages that a musical approach offers over those linguistic approaches.
Many analyses of speech melody in the literature are based on qualitative representations that show general trajectories of pitch movement in sentences (e.g., Cruttenden, 1997 ). While useful as heuristics, such representations are inherently limited in scope. Our method is based first and foremost on the acoustic production of sentences by speakers. Hence, it is based on quantitative experimental data, rather than qualitative representations.
This is in contrast to the use of qualitative labels like H and L in ToBI transcriptions. The musical approach quantifies and thus characterizes the diversity of manners of melodic movement in speech in order to elucidate the dynamics of speech melody. In ToBI, an H label suggests a relative rise in pitch compared to preceding syllables, but that rise is impossible to quantify with a single symbol. The conversion of pitch changes into musical intervals permits a precise specification of the types of pitch movements that occur in speech. This includes both local effects (e.g., syllabic stress, narrow focus, terminal drop) and global effects (e.g., register use, size of a pitch range, melodic arches, intonational phrases, changes with emotion). Ultimately, this approach can elucidate the melodic dynamics of speech prosody, both affective prosody and linguistic prosody.
This is again in contrast to methods like ToBI that only mark salient pitch events and ignore the remaining syllables. Hence, the musical method can provide a comprehensive analysis of the pitch properties of spoken sentences, including the melodic phenomena analyzed here, such as pitch-range changes, post-focus compression, lexical stress, narrow focus, sentence modality, and the like. This is a feature that the musical model shares with PENTA.
The use of relative pitch to analyze melodic intervals provides a means of normalizing the acoustic signal and comparing melodic motion across speakers. Hence, normalization can be done across genders (i.e., different registers) and across people having different vocal ranges. In fact, any two individual speakers can be compared using this method. Using relative pitch eliminates many of the problems associated with analyzing speech melody using absolute pitch in Hz. No contemporary approach to speech melody in linguistics provides a reliable method of cross-speaker comparison.
Along the lines of the last point, converting Hz values into cents or semitones opens the door to group averaging of production. Averaging is much less feasible using Hz due to differences in pitch range, for example between women and men. Group averaging using cents increases the statistical power and generalizability of the experimental data compared to methods that use Hz as their primary measurement.
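As a minimal sketch of this normalization step (the habitual-pitch values and pitch tracks below are invented for illustration, not drawn from our data), converting each speaker’s Hz values into cents relative to his or her habitual pitch places speakers with different registers on a common scale, after which contours can be averaged syllable by syllable:

```python
import math

def cents_re_habitual(freqs_hz, habitual_hz):
    """Express each pitch in cents relative to the speaker's habitual
    pitch, removing register differences between speakers."""
    return [1200 * math.log2(f / habitual_hz) for f in freqs_hz]

# Hypothetical productions of the same sentence by two speakers
# whose registers differ by roughly an octave.
female = {"habitual": 196.0, "track": [196.0, 233.1, 220.0, 185.0]}
male = {"habitual": 98.0, "track": [98.0, 116.5, 110.0, 92.5]}

norm_f = cents_re_habitual(female["track"], female["habitual"])
norm_m = cents_re_habitual(male["track"], male["habitual"])

# After normalization, the contours can be averaged syllable by syllable.
group_mean = [(a + b) / 2 for a, b in zip(norm_f, norm_m)]
```

Because the two invented contours differ mainly in register, their normalized versions nearly coincide, which is what makes group averaging in cents meaningful where averaging in Hz would not be.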
A transcriptional approach can be used to capture pitch-based variability in production, as might be associated with regional dialects, foreign accents, or even speech pathology (e.g., hotspots for stuttering, Karniol, 1995). As we will argue in the Limitations section below, it can also capture the variability in the intonation of a single sentence across speakers, much as our analysis of narrow focus did in Figure 6, showing that each variant was accompanied by a distinct melody.
Virtually all approaches to speech prosody look at either melody or rhythm alone. Following on the landmark work of Joshua Steele in 1775, we believe that our use of musical notation provides an opportunity to unify the two. We have developed a musical model of speech rhythm elsewhere (Brown et al., 2017). That model focuses on meters and the relative duration values of syllables within a sentence. We used approximate rhythmic transcriptions in the current article (Figures 1–7) to demonstrate the potential of employing a combined analysis of melody and rhythm in the study of speech prosody. We hope to conduct a combined rhythm/melody study as the next phase of our work on the musical analysis of speech.
As mentioned in the Introduction, RaP (Rhythm and Pitch) is perhaps the one linguistic approach that takes into account both speech rhythm and melody, albeit as completely separate parameters ( Dilley and Brown, 2005 ; Breen et al., 2012 ). RaP considers utterances as having “rhythm,” which refers to pockets of isochronous units in lengthy strings of syllables (at least 4–5 syllables, and up to 8–10 syllables). In addition, “strong beats” associate with lexically stressed syllables based on metrical phonology. RaP is the first recent model to make reference to the musical element of “beat” in describing speech rhythm, implying that some isochronous units of rhythm exist at the perceptual level. However, the assignment of rhythm and prominence relies heavily on transcribers’ own perception, rather than on empirical data.
The use of musical notation for speech provides a means of effecting comparative analyses of speech and music. For example, we explored the question of whether speech employs musical scales, and concluded provisionally that it does not. There are many other types of questions about the relationship between speech prosody and music that can be explored using transcription and musical notation. This is important given the strong interest in evolutionary models that relate speech and music ( Brown, 2000 , 2017 ; Mithen, 2005 ; Fitch, 2010 ), as well as cognitive and neuroscientific models that show the use of overlapping resources for both functions ( Juslin and Laukka, 2003 ; Besson et al., 2007 ; Patel, 2008 ; Brandt et al., 2012 ; Bidelman et al., 2013 ; Heffner and Slevc, 2015 ). For example, it would be interesting to apply our analysis method to a tone language and attempt to quantify the production of lexical tones in speech, since lexical tone is thought of as a relative-pitch system comprised of contrastive level tones and/or contour tones.
The use of a musical score is the only visual method that allows a person to reproduce the prosody of another speaker. Hence, the score can be “sung” much the way that music is. While this is certainly an approximation of the pitch properties of real speech, it is a substantial improvement over any existing method in linguistics, including Prosogram. A system integrating speech rhythm and melody could enable the development of more effective pedagogical tools for teaching intonation to non-native language learners. Moreover, knowledge gleaned from this research can be applied to improve the quality and naturalness of synthesized speech.
In addition to the advantages of the musical approach, there are also a number of limitations of our study and its methods. First, we used a simple corpus with relatively simple sentences. We are currently analyzing a second dataset that contains longer and more complex sentences than the ones used in the present study. These include sentences with internal clauses, for example. Second, our pitch analysis is very approximate and is no more fine-grained than the level of the semitone. All of our analyses rounded the produced intervals to the nearest semitone. If speech uses microtonal intervals and scales, then our method at present is unable to detect them. Likewise, our association of every syllable with a level tone almost certainly downplays the use of contour tones (glides) in speech. Hence, while level tones should be quite analyzable with our method, our approach does not currently address the issue of intra-syllable pitch variability, which would be important for analyzing contour tones in languages like Mandarin and Cantonese. Prosogram permits syllabic pitches to be contoured, rather than level, but our approach currently errs on the side of leveling out syllabic pitches. In principle, contour tones could be represented as melismas in musical notation by representing the two pitches that make up the endpoints of the syllable and using a “portamento” (glide) symbol to suggest the continuity of pitch between those endpoints. A similar approach could even be used to represent non-linguistic affective vocalizations.
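The semitone rounding described above can be sketched as follows (the measured interval values are invented for illustration). The residuals make explicit the microtonal detail that the current method discards:

```python
def quantize_to_semitones(cents_intervals):
    """Round each interval (in cents) to the nearest equal-tempered
    semitone; the residual is the microtonal detail that the
    quantization discards."""
    quantized = [round(c / 100) for c in cents_intervals]  # in semitones
    residual = [c - 100 * q for c, q in zip(cents_intervals, quantized)]
    return quantized, residual

# Hypothetical measured intervals for one phrase (cents).
measured = [45, -130, -210, 160, -80]
semis, lost = quantize_to_semitones(measured)
# semis -> [0, -1, -2, 2, -1]: unison, semitone down, whole tone down, ...
# lost  -> [45, -30, -10, -40, 20]: detail below the semitone grid
```

If speech did make systematic use of microtonal intervals, the pattern would show up in the residuals rather than in the quantized transcription, which is why the present method cannot detect such intervals.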
The current approach requires that users be familiar with musical notation and the concept of musical intervals. Will this limit the adoptability of the approach? In our opinion, it is not much more difficult to learn how to read musical notation than it is to learn how to read ToBI notation, with its asterisks and percentage signs. In principle, pitch contours should be easily recognizable in musical notation, even for people who cannot read it. Hence, the direction and size of intervals should be easy to detect, since musical notation occurs along a simple vertical grid, and pitch changes are recognizable as vertical movements, much like lines representing F0 changes. In contrast to this ease of recognition, ToBI notation can be complex. The fact that H* H means a flat tone is completely non-intuitive for people not trained in ToBI. The most "musical" part of musical notation relates to the interval classes themselves. This type of quantification of pitch movement is not codable at all with ToBI and thus represents a novel feature that is contributed by musical notation.
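The interval-based quantification described here can be sketched in a few lines of Python. The helper below is hypothetical (not part of any published toolkit): given a sequence of level-tone syllable pitches in semitones, it recovers the size and direction of each melodic interval, the two quantities that musical notation makes visually explicit on its vertical grid:

```python
def interval_sequence(semitone_pitches):
    """Given level-tone pitches (in semitones) for successive
    syllables, return (size, direction) pairs for each melodic
    interval between adjacent syllables."""
    intervals = []
    for prev, curr in zip(semitone_pitches, semitone_pitches[1:]):
        step = curr - prev
        direction = "up" if step > 0 else "down" if step < 0 else "flat"
        intervals.append((abs(step), direction))
    return intervals

# Hypothetical 4-syllable contour: a rise of 2 semitones,
# a flat tone, then a fall of 3 semitones.
print(interval_sequence([0, 2, 2, -1]))
# -> [(2, 'up'), (0, 'flat'), (3, 'down')]
```

In this representation a "flat tone" is simply an interval of size 0, which is arguably more transparent than its ToBI encoding.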
Our sample showed a gender bias in that 16 of the 19 participants were female. The literature suggests that females show greater F0 variability than males (Puts et al., 2011; Pisanski et al., 2016) and that they have a higher incidence of creaky voice (Yuasa, 2010). Creaky voice was, in fact, a problem in our analysis, and this may well have been due to the high proportion of females in our sample. Future studies should aim for a more balanced gender representation than we were able to achieve in this study.
Finally, while our normalization of the speech signal into semitones provides a strong advantage in that it permits group averaging, such averaging also comes at the cost of downplaying individual-level variability. Perhaps instead of averaging, it would be better to look at the family of melodies for a single sentence produced by a group of speakers, and put more focus on individual-level variability than on group trends. In order to illustrate the multiple ways that a single sentence can be intoned, we revisit the WH-question that was analyzed in Figure 7C: "Whose telephone is that?". Figure 11 uses rhythmic transcription to demonstrate three different manners of intoning this question, the first of which was used in Figure 7C (for simplicity, a single G pitch is used in all transcriptions). Each variant differs based on where the point of focus is, as shown by the word in block letters in each transcription. We chose the version in Figure 11A for our group analysis in Figure 7C, since the melodic pattern of the group average best fit that pattern, with its high pitch on "whose," rather than on "telephone" or "is." Hence, while the examination of group averages might tend to downplay inter-participant variability, the transcriptional approach is able to capture the family of possible variants for a given sentence and use them as candidates for the productions of individuals and groups.
FIGURE 11. Variability in the intonation of a single sentence. The figure uses rhythmic notation to demonstrate three possible ways of intoning the same string of words, where stress is placed on either whose (A) , telephone (B) , or is (C) . It is expected that different melodic patterns would be associated with each rendition, based on where the point of focus is, which would be an attractor for a pitch rise. The proposed point of narrow focus is represented using bold text in each sentence. The symbol “>” signifies a point of focus or accent. The pattern in (A) was most consistent with the analyzed group-average melody shown in Figure 7C .
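The group-averaging procedure discussed above can be sketched as follows. This is an illustrative Python fragment with invented numbers, not our experimental data: per-syllable semitone values are averaged across speakers, yielding a group-mean melody while smoothing over the individual variants:

```python
def average_contour(speaker_contours):
    """Average per-syllable semitone values across speakers.
    All contours must cover the same syllable string; the result
    is the group-mean melody, which smooths over the variability
    of individual speakers."""
    return [round(sum(vals) / len(vals), 1)
            for vals in zip(*speaker_contours)]

# Three hypothetical speakers intoning a 5-syllable question,
# each contour expressed in semitones:
speakers = [
    [4, 2, 1, 1, 0],
    [5, 3, 1, 0, -1],
    [3, 2, 2, 1, 0],
]
print(average_contour(speakers))  # -> [4.0, 2.3, 1.3, 0.7, -0.3]
```

Note that the averaged contour preserves the shared initial high pitch and overall declination but erases any speaker who placed focus elsewhere in the sentence, which is precisely the trade-off at issue.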
The musical method that we are presenting here consists of three major components: (1) a method for transcribing and thus visualizing speech melody, ultimately uniting melody and rhythm into a single system of notation; (2) use of these transcriptions to analyze the structural dynamics of speech melody in terms of intervallic changes and pitch excursions; and (3) a higher-level interpretation of the descriptive pitch dynamics in terms of the phonological meaning of intonation as well as potential comparisons between speech and music (e.g., scales, shared prosodic mechanisms). Application of this approach to our vocal-production experiment with 19 speakers permitted us to carry out a quantitative analysis of speech melody so as to look at how syntax, utterance length, narrow focus, declination, and sentence modality affected the melody of utterances. The dominant linguistic models of speech melody are incapable of accounting for such effects in a quantifiable manner, whereas such melodic changes can be easily analyzed and represented with a musical analysis. This can be done in a comprehensive manner such that all syllables are incorporated into the melodic model of a sentence. Most importantly, the use of a musical score has the potential to combine speech melody and rhythm into a unified representation of speech prosody, much as Joshua Steele envisioned in 1775 with his use of “peculiar symbols” to represent syllabic pitches. Musical notation provides the only available tool capable of bringing about this unification.
IC and SB analyzed the acoustic data and wrote the manuscript.
This work was funded by a grant from the Natural Sciences and Engineering Research Council (NSERC) of Canada to SB.
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
We thank Jordan Milko for assistance with data collection and analysis, and Nathalie Stearns and Jonathan Harley for assistance with data analysis. We thank Peter Pfordresher and Samson Yeung for critical reading of the manuscript.
Atterer, M., and Ladd, D. R. (2004). On the phonetics and phonology of “segmental anchoring” of F0: evidence from German. J. Phon. 32, 177–197. doi: 10.1016/S0095-4470(03)00039-1
Beckman, M. E., and Ayers, G. (1997). Guidelines for ToBI Labeling, Version 3. Columbus, OH: Ohio State University.
Beckman, M. E., and Pierrehumbert, J. B. (1986). Intonational structure in Japanese and English. Phonology 3, 255–309. doi: 10.1017/S095267570000066X
Besson, M., Schön, D., Moreno, S., Santos, A., and Magne, C. (2007). Influence of musical expertise and musical training on pitch processing in music and language. Restor. Neurol. Neurosci. 25, 399–410.
Bidelman, G. M., Hutka, S., and Moreno, S. (2013). Tone language speakers and musicians share enhanced perceptual and cognitive abilities for musical pitch: evidence for bidirectionality between the domains of language and music. PLoS One 8:e60676. doi: 10.1371/journal.pone.0060676
Boersma, P., and Weenink, D. (2015). Praat: Doing Phonetics By Computer Version 5.4.22. Available at: http://www.praat.org/ [accessed October 8, 2015].
Bolinger, D. (1989). Intonation and its Uses: Melody in Grammar and Discourse. Stanford, CA: Stanford University Press.
Brandt, A., Gebrian, M., and Slevc, L. R. (2012). Music and early language acquisition. Front. Psychol. 3:327. doi: 10.3389/fpsyg.2012.00327
Breen, M., Dilley, L. C., Kraemer, J., and Gibson, E. (2012). Inter-transcriber reliability for two systems of prosodic annotation: ToBI (Tones and Break Indices) and RaP (Rhythm and Pitch). Corpus Linguist. Linguist. Theory 8, 277–312. doi: 10.1515/cllt-2012-0011
Brown, S. (2000). “The ‘musilanguage’ model of music evolution,” in The Origins of Music , eds N. L. Wallin, B. Merker and S. Brown (Cambridge, MA: MIT Press), 271–300.
Brown, S. (2017). A joint prosodic origin of language and music. Front. Psychol. 8:1894. doi: 10.3389/fpsyg.2017.01894
Brown, S., Pfordresher, P., and Chow, I. (2017). A musical model of speech rhythm. Psychomusicology 27, 95–112. doi: 10.1037/pmu0000175
Bruce, G. (1977). Swedish Word Accents in Sentence Perspective. Malmö: LiberLäromedel/Gleerup.
Cohen, A., Collier, R., and ‘t Hart, J. (1982). Declination: construct or intrinsic feature of speech pitch? Phonetica 39, 254–273. doi: 10.1159/000261666
Collier, R. (1989). “On the phonology of Dutch intonation,” in Worlds Behind Words: Essays in Honour of Professor FG Droste on the Occasion of his Sixtieth Birthday , eds F. J. Heyvaert and F. Steurs (Louvain: Leuven University Press), 245–258.
Cruttenden, A. (1997). Intonation, 2nd Edn. Cambridge: Cambridge University Press. doi: 10.1017/CBO9781139166973
Crystal, D. (1976). Prosodic Systems and Intonation in English. Cambridge: CUP Archive.
Dilley, L., and Brown, M. (2005). The RaP (Rhythm and Pitch) Labeling System. Cambridge, MA: Massachusetts Institute of Technology.
Dilley, L., Ladd, D. R., and Schepman, A. (2005). Alignment of L and H in bitonal pitch accents: testing two hypotheses. J. Phon. 33, 115–119. doi: 10.1016/j.wocn.2004.02.003
Fairbanks, G., and Pronovost, W. (1939). An experimental study of the pitch characteristics of the voice during the expression of emotion. Commun. Monogr. 6, 87–104. doi: 10.1080/03637753909374863
Féry, C. (2017). Intonation and Prosodic Structure. Cambridge: Cambridge University Press. doi: 10.1017/9781139022064
Fitch, W. T. (2010). Evolution of Language. Cambridge: Cambridge University Press.
Fujisaki, H. (1983). “Dynamic characteristics of voice fundamental frequency in speech and singing,” in The Production of Speech , ed. P. F. MacNeilage (New York, NY: Springer), 39–55.
Fujisaki, H., and Gu, W. (2006). “Phonological representation of tone systems of some tone languages based on the command-response model for F0 contour generation,” in Proceedings of the Tonal Aspects of Languages , Berlin, 59–62.
Fujisaki, H., and Hirose, K. (1984). Analysis of voice fundamental frequency contours for declarative sentences of Japanese. J. Acoust. Soc. Jpn. E5, 233–242. doi: 10.1250/ast.5.233
Fujisaki, H., Ohno, S., and Wang, C. (1998). “A command-response model for F0 contour generation in multilingual speech synthesis,” in Proceedings of the Third ESCA/COCOSDA Workshop (ETRW) on Speech Synthesis , Jenolan Caves, 299–304.
Geluykens, R. (1988). On the myth of rising intonation in polar questions. J. Pragmat. 12, 467–485. doi: 10.1016/0378-2166(88)90006-9
German, J. S., and D’Imperio, M. (2016). The status of the initial rise as a marker of focus in French. Lang. Speech 59, 165–195. doi: 10.1177/0023830915583082
Grice, M., and Baumann, S. (2007). "An introduction to intonation-functions and models," in Non-Native Prosody , eds J. Trouvain and U. Gut (Berlin: Mouton de Gruyter), 25–52.
Gussenhoven, C. (1991). “Tone segments in the intonation of Dutch,” in The Berkeley Conference on Dutch Linguistics , eds T. F. Shannon, and J.P. Snapper (Lanham, MD: University Press of America), 139–155.
Gussenhoven, C. (2004). The Phonology of Tone and Intonation. Cambridge: Cambridge University Press. doi: 10.1017/CBO9780511616983
Halliday, M. A. K. (1967). Intonation and Grammar in British English. The Hague: Mouton. doi: 10.1515/9783111357447
Halliday, M. A. K. (1970). A Course in Spoken English: Intonation. London: Oxford University Press.
Heffner, C. C., and Slevc, L. R. (2015). Prosodic structure as a parallel to musical structure. Front. Psychol. 6:1962. doi: 10.3389/fpsyg.2015.01962
Hermes, D. J. (2006). Stylization of Pitch Contours. Berlin: Walter de Gruyter. doi: 10.1515/9783110914641.29
Hirschberg, J., and Ward, G. (1995). The interpretation of the high-rise question contour in English. J. Pragmat. 24, 407–412. doi: 10.1016/0378-2166(94)00056-K
Huron, D. (2006). Sweet Anticipation: Music and the Psychology of Expectation. Cambridge, MA: MIT Press.
Juslin, P. N., and Laukka, P. (2003). Communication of emotions in vocal expression and music performance: different channels, same code? Psychol. Bull. 129, 770–814. doi: 10.1037/0033-2909.129.5.770
Karniol, R. (1995). Stuttering, language, and cognition: a review and a model of stuttering as suprasegmental sentence plan alignment (SPA). Psychol. Bull. 117, 104–124. doi: 10.1037/0033-2909.117.1.104
Ladd, D. R. (2008). Intonational Phonology , 2nd Edn. Cambridge: Cambridge University Press. doi: 10.1017/CBO9780511808814
Ladd, D. R., Beckman, M. E., and Pierrehumbert, J. B. (1986). Intonational structure in Japanese and English. Phonology 3, 255–309. doi: 10.1017/S095267570000066X
Ladd, D. R., Faulkner, D., Faulkner, H., and Schepman, A. (1999). Constant “segmental anchoring” of F0 movements under changes in speech rate. J. Acoust. Soc. Am. 106, 1543–1554. doi: 10.1121/1.427151
Leben, W. R. (1973). Suprasegmental Phonology. Cambridge, MA: MIT Press.
Liberman, M., and Pierrehumbert, J. (1984). “Intonational invariance under changes in pitch range and length,” in Language Sound Structure , eds M. Aronoff, and R. T. Oehrle (Cambridge, MA: MIT Press), 157-233.
Lieberman, P. (1960). Some acoustic correlates of word stress in American English. J. Acoust. Soc. Am. 32, 451–454. doi: 10.1121/1.1908095
Lieberman, P., Katz, W., Jongman, A., Zimmerman, R., and Miller, M. (1985). Measures of the sentence intonation of read and spontaneous speech in American English. J. Acoust. Soc. Am. 77, 649–657. doi: 10.1121/1.391883
Lindsey, G. A. (1985). Intonation and Interrogation: Tonal Structure and the Expression of a Pragmatic Function in English and Other Languages. Ph.D. thesis, University of California, Los Angeles, CA.
Mertens, P. (2004). Quelques aller-retour entre la prosodie et son traitement automatique. Fr. Mod. 72, 39–57.
Mertens, P., and d’Alessandro, C. (1995). “Pitch contour stylization using a tonal perception model,” in Proceedings of the 13th International Congress of Phonetic Sciences , 4, Stockholm, 228–231.
Mithen, S. J. (2005). The Singing Neanderthals: The Origins of Music, Language, Mind and Body. London: Weidenfeld and Nicolson.
Murray, I. R., and Arnott, J. L. (1993). Toward the simulation of emotion in synthetic speech: a review of the literature on human vocal emotion. J. Acoust. Soc. Am. 93, 1097–1108. doi: 10.1121/1.405558
Nespor, M., and Vogel, I. (2007). Prosodic Phonology , 2nd Edn. Berlin: Walter de Gruyter. doi: 10.1515/9783110977790
Nespor, M., and Vogel, I. (1986). Prosodic Phonology. Dordrecht: Foris.
Nooteboom, S. (1997). The prosody of speech: melody and rhythm. Handb. Phon. Sci. 5, 640–673.
O’Connor, J. D., and Arnold, G. F. (1973). Intonation of Colloquial English: A Practical Handbook. London: Longman.
Oxenham, A. J. (2013). “The perception of musical tones,” in Psychology of Music , 3rd Edn, ed. D. Deutsch (Amsterdam: Academic Press), 1–33.
Patel, A. D. (2008). Music, Language and the Brain. Oxford: Oxford University Press.
Petrone, C., and Niebuhr, O. (2014). On the intonation of German intonation questions: the role of the prenuclear region. Lang. Speech 57, 108–146. doi: 10.1177/0023830913495651
Pfordresher, P. Q., and Brown, S. (2017). Vocal mistuning reveals the nature of musical scales. J. Cogn. Psychol. 29, 35–52. doi: 10.1080/20445911.2015.1132024
Pierrehumbert, J. (1999). What people know about sounds of language. Stud. Linguist. Sci. 29, 111–120.
Pierrehumbert, J. B. (1980). The Phonology and Phonetics of English Intonation. Ph.D. thesis, Massachusetts Institute of Technology, Cambridge, MA.
Pisanski, K., Cartei, V., McGettigan, C., Raine, J., and Reby, D. (2016). Voice modulation: a window into the origins of human vocal control? Trends Cogn. Sci. 20, 304–318. doi: 10.1016/j.tics.2016.01.002
Prieto, P. (2015). Intonational meaning. Wiley Interdiscip. Rev. Cogn. Sci. 6, 371–381. doi: 10.1002/wcs.1352
Prom-on, S., Xu, Y., and Thipakorn, B. (2009). Modeling tone and intonation in Mandarin and English as a process of target approximation. J. Acoust. Soc. Am. 125, 405–424. doi: 10.1121/1.3037222
Puts, D. A., Apicella, C. L., and Cárdenas, R. A. (2011). Masculine voices signal men’s threat potential in forager and industrial societies. Proc. R. Soc. Lond. B Biol. Sci. 279, 601–609. doi: 10.1098/rspb.2011.0829
Selkirk, E. (1995). “Sentence prosody: intonation, stress, and phrasing,” in Handbook of Phonological Theory , ed. J. Goldsmith (Cambridge: Blackwell), 550–569.
Steele, J. (1775). An Essay Towards Establishing the Melody and Measure of Speech to be Expressed and Perpetuated by Certain Symbols. London: Bowyer and Nichols.
van der Hulst, H. (1999). “Word stress,” in Word Prosodic Systems in the Languages of Europe , ed. H. van der Hulst (Berlin: Mouton de Gruyter), 3–115.
Vos, P. G., and Troost, J. M. (1989). Ascending and descending melodic intervals: statistical findings and their perceptual relevance. Music Percept. 6, 383–396. doi: 10.2307/40285439
Whalen, D. H., and Levitt, A. G. (1995). The universality of intrinsic F0 of vowels. J. Phon. 23, 349–366. doi: 10.1016/S0095-4470(95)80165-0
Xu, Y. (2005). Speech melody as articulatorily implemented communicative functions. Speech Commun. 46, 220–251. doi: 10.1016/j.specom.2005.02.014
Xu, Y. (2011). Speech prosody: a methodological review. J. Speech Sci. 1, 85–115.
Xu, Y., and Xu, C. X. (2005). Phonetic realization of focus in English declarative intonation. J. Phon. 33, 159–197. doi: 10.1016/j.wocn.2004.11.001
Yip, M. (1988). The obligatory contour principle and phonological rules: a loss of identity. Linguist. Inq. 19, 65–100.
Yuan, J., and Liberman, M. (2014). F0 declination in English and Mandarin broadcast news speech. Speech Commun. 65, 67–74. doi: 10.1016/j.specom.2014.06.001
Yuasa, I. P. (2010). Creaky voice: a new feminine voice quality for young urban-oriented upwardly mobile American women? Am. Speech 3, 315–337. doi: 10.1215/00031283-2010-018
Keywords: speech melody, speech prosody, music, phonetics, phonology, language
Citation: Chow I and Brown S (2018) A Musical Approach to Speech Melody. Front. Psychol. 9:247. doi: 10.3389/fpsyg.2018.00247
Received: 03 July 2017; Accepted: 14 February 2018; Published: 05 March 2018.
Copyright © 2018 Chow and Brown. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY) . The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Steven Brown, [email protected]
Phonetics For Dummies
Phoneticians transcribe connected speech by marking how words and syllables go together and by following how the melody of language changes. Many phoneticians indicate pauses with special markings (such as [ǀ] for a short break and [ǁ] for a longer break). You can show changes in pitch with a drawn line called a pitch plot or with more sophisticated schemes, such as the Tones and Break Indices (ToBI) system. Here are some important terms to know when considering speech melody:
Compounding: When two words come together to form a new meaning (such as "light" and "house" becoming "lighthouse"). In such a case, more stress is given to the first part than to the second.
Focus: Also known as emphatic stress . The use of stress to highlight part of a phrase or sentence.
Juncture: How words and syllables are connected in language.
Intonational phrase: Also known as a tonic unit , tonic phrase , or tone group . A pattern of pitch changes that matches up in a meaningful way with a part of a sentence.
Lexical stress: When stress plays a word-specific role in language, such as in English, where you can't put stress on the wrong "syl-LAB-le."
Sentence-level intonation: The use of spoken pitch to change the meaning of a sentence or phrase. For example, an English statement usually has falling pitch (high to low), while a yes/no question has a rising pitch (low to high).
Stress: Relative emphasis given to certain syllables. In English, a stressed syllable is louder, longer, and higher in pitch.
Syllable: Unit of spoken language consisting of a single uninterrupted sound formed by a vowel, diphthong, or syllabic consonant, with optional sounds before or after it.
Tonic syllable: An important concept for many theories of prosody, the syllable that carries the most pitch changes in an intonational phrase.
ToBI: Tones and Break Indices. A set of conventions used for working with speech prosody. Although originally designed for English, ToBI has now been adapted to work with a few other languages.
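The statement/question contrast under "Sentence-level intonation" can be caricatured in code. The following Python toy (all names and example values are invented for illustration) classifies a pitch plot as falling or rising by comparing its endpoints:

```python
def contour_type(pitch_plot):
    """Classify a sentence-level pitch plot (a list of F0 values
    in Hz) as 'falling' (statement-like) or 'rising'
    (yes/no-question-like) by comparing the final pitch to the
    initial pitch. A deliberately crude illustration."""
    if pitch_plot[-1] < pitch_plot[0]:
        return "falling"
    if pitch_plot[-1] > pitch_plot[0]:
        return "rising"
    return "level"

print(contour_type([220, 210, 190, 170]))  # -> falling
print(contour_type([180, 185, 200, 230]))  # -> rising
```

Real intonation analysis looks at the whole contour rather than just the endpoints, but the endpoint comparison captures the basic statement-versus-question intuition described above.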
This article is from the book Phonetics For Dummies by William F. Katz, PhD, Professor of Communication Sciences and Disorders in the School of Behavioral and Brain Sciences at the University of Texas at Dallas, where he teaches and directs research in linguistics, speech science, and language disorders. He has pioneered new treatments for speech loss after stroke, and he studies an unusual disorder known as "foreign accent syndrome."
IMAGES
VIDEO
COMMENTS
Prosody vs. Speech Melody vs. Intonation. Before comparing the three dominant models of speech melody with the musical approach that we are proposing (see next section), we would like to first define the important terms "prosody," "speech melody," and "intonation," and discuss how they relate to one another, since these terms are erroneously taken to be synonymous.
THE PROSODY OF SPEECH: MELODY AND RHYTHM Sieb Nooteboom Research Institute for Language and Speech Utrecht University Trans 10 3512 JK UTRECHT Netherlands 1. INTRODUCTION The word 'prosody' comes from ancient Greek, where it was used for a "song sung with instrumental music". In later times the word was used for the "science of
Despite extensive research over the decades, many issues about speech melody are still far from clear. In this paper, I will argue that a better understanding of speech melody may be achieved by jointly considering two basic facts: that speech conveys communicative information, and that it is produced by an articulatory system.
speech melody: [noun] the intonation of connected speech: the continual rise and fall in pitch of the voice in speech.
Speech is produced by a biomechanical system consisting of articulators and a nervous system that controls them. As in any motor system, its biophysical properties determine its capabilities. In what follows, I will examine some of the properties that have been found critical for our understanding of speech melody.
Scholars have long noted the existence of speech melodies in the composer's operas. However, only John Tyrrell has explored the matter in depth, and many basic questions about Janáček's speech-melody theory and practice remain unanswered. What follows is an attempt to investigate in detail one of the most prominent, and most misrepresented, issues of Janáček opera.
When talking about pronunciation in language learning we mean the production and perception of the significant sounds of a particular language in order to achieve meaning in contexts of language use. This comprises the production and perception of segmental sounds, of stressed and unstressed syllables, and of the 'speech melody', or intonation.
The PENTA model of speech melody makes a clear separation between the meaning-bearing components of intonation and the primitives of speech melody, which are defined purely in form and are readily implementable in articulation.
Variation in speech melody is an essential component of normal human speech. Equipment has long been available which could produce a voicing buzz, but it is only relatively recently that devices have been built which allow the user to come close to realistically mimicking the pitch variation of natural speech. Pitch refers to human perception, that is, whether a sound is heard as high or low.
The present contribution is a tutorial on selected aspects of prosody, the rhythms and melodies of speech, based on a course of the same name at the Summer School on Contemporary Phonetics and ...
Abstract. We present here a musical approach to speech melody, one that takes advantage of the intervallic precision made possible with musical notation. Current phonetic and phonological approaches to speech melody either assign localized pitch targets that impoverish the acoustic details of the pitch contours and/or merely highlight a few ...
Prosody refers to the melodic and rhythmic aspects of speech, including variations in pitch, loudness, duration, and intonation [26]. Jitter and shimmer are acoustic measures that quantify cycle-to-cycle variability in the period and amplitude of the voice signal, respectively.
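As an illustration of these two measures (our own sketch, not the formulation used in [26]): "local" jitter is commonly defined as the mean absolute difference between consecutive glottal periods divided by the mean period, and local shimmer is the same ratio computed over cycle peak amplitudes. A minimal Python version, assuming per-cycle periods and amplitudes have already been extracted from a voiced stretch:

```python
def local_jitter(periods):
    """Mean absolute difference of consecutive periods, relative to the mean period."""
    diffs = [abs(a - b) for a, b in zip(periods, periods[1:])]
    return (sum(diffs) / len(diffs)) / (sum(periods) / len(periods))

def local_shimmer(amplitudes):
    """The same cycle-to-cycle measure applied to peak amplitudes."""
    diffs = [abs(a - b) for a, b in zip(amplitudes, amplitudes[1:])]
    return (sum(diffs) / len(diffs)) / (sum(amplitudes) / len(amplitudes))

# Hypothetical cycle data: a ~100 Hz voice with a slight period/amplitude wobble.
periods = [0.0100, 0.0102, 0.0099, 0.0101, 0.0100]  # seconds per cycle
amps = [0.80, 0.82, 0.79, 0.81, 0.80]               # peak amplitude per cycle
print(local_jitter(periods), local_shimmer(amps))   # roughly 2% jitter, 2.5% shimmer
```

Production tools (e.g., Praat) offer several jitter/shimmer variants; this sketch shows only the simplest "local" form.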
We certainly acknowledge that melody in music and melody in speech differ in certain aspects. For instance, the pitch range in speech is far narrower than in musical melody [47,48]. Nevertheless, the proposed measure of pitch- and duration-based autocorrelations appears to be a fruitful measure that at least approximately captures melodic structure.
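To make the idea concrete, here is a small self-contained sketch (our illustration, not the study's exact pipeline) of a lag-k autocorrelation applied to a per-syllable pitch sequence; the analogous computation over syllable durations yields the duration-based measure:

```python
def autocorr(xs, k):
    """Normalized autocorrelation of sequence xs at lag k
    (covariance averaged over the n - k overlapping pairs, divided by variance)."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    cov = sum((xs[i] - mean) * (xs[i + k] - mean) for i in range(n - k)) / (n - k)
    return cov / var

# Hypothetical mean-F0 values (Hz) for 8 syllables whose melodic contour
# nearly repeats every 4 syllables.
f0 = [180, 220, 200, 160, 182, 221, 199, 161]
print(round(autocorr(f0, 4), 2))  # close to 1: strong recurrence at lag 4
print(round(autocorr(f0, 1), 2))  # near 0: no syllable-to-syllable repetition
```

A high value at some lag thus signals that the pitch (or duration) pattern of one stretch of syllables recurs that many syllables later.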
Melody in speech. All languages use melody in speech, primarily via rises and falls of the pitch of the voice. Such pitch variation is pervasive, offering a wide spectrum of nuance to sentences, an additional layer of meaning. For example, saying "yes" with a rising pitch implies a question (rather than an affirmation). Melody is essential to spoken communication.
Intonation is the idea that these different pitches across a phrase form a pattern, and that those patterns characterize speech. In American English, statements tend to start higher in pitch and end lower in pitch. You know this if you've seen my video on questions vs. statements. In that video, we learned that statements go down in pitch.
Research has shown that melody not only plays a crucial role in music but also in language acquisition processes. Evidence has been provided that melody helps in retrieving, remembering, and memorizing new language material, while relatively little is known about whether individuals who perceive speech as more melodic than others also benefit in the acquisition of oral languages.
Leoš Janáček based vocal melodies in his operas on the concept of nápěvky mluvy (speech melodies), patterns of speech intonation as they relate to psychological conditions, rather than on a strictly musical basis. He used such melodic motives, characterizing a specific person in a specific dramatic situation, in both vocal and orchestral parts, enabling him to integrate the two.
Likewise, since the melody of speech and language is at the crossroads of many domains (cognitive-motor, emotional-motor, and attentional-motor, to name just a few), paying close scrutiny to the manner in which it is disordered will likely be instrumental in the diagnosis and treatment of speech and language disorders.
The speech-melody complex, I argue, does not stably contain the categories of speech or melody in their full-blown, asserted form, but concerns the ongoingness of the process of categorial molding, which depends on how much contextual information the listeners value in shaping and parsing out the complex. It follows, then, that making a categorial judgment between speech and melody is itself context-dependent.
A Musical Approach to Speech Melody. Ivan Chow and Steven Brown, Department of Psychology, Neuroscience & Behaviour, McMaster University, Hamilton, ON, Canada. We present here a musical approach to speech melody, one that takes advantage of the intervallic precision made possible with musical notation. Current phonetic and phonological approaches to speech melody either assign localized pitch targets that impoverish the acoustic details of the pitch contours and/or merely highlight a few ...
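The "intervallic precision" invoked here comes from expressing pitch movements as musical intervals: the distance between two fundamental frequencies in equal-tempered semitones is 12·log2(f2/f1). A brief illustration (our own, not code from the paper):

```python
import math

def semitones(f0_from, f0_to):
    """Interval between two fundamental frequencies, in equal-tempered semitones."""
    return 12 * math.log2(f0_to / f0_from)

# Hypothetical pitch movements within an utterance:
print(round(semitones(200, 300), 1))  # 7.0   -> roughly a perfect fifth upward
print(round(semitones(200, 100), 1))  # -12.0 -> a full octave downward
```

Notating speech pitch this way preserves the size of each rise and fall, rather than reducing the contour to a few abstract high/low targets.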
Phoneticians transcribe connected speech by marking how words and syllables go together and by following how the melody of speech rises and falls.