1- Mommy want cookie.
2- No dinner
3- Drink juice
1- Mommy
2- want
3- Cookie
4- no
5- dinner
6- drink
7- juice
Around the age of 18 months, children’s utterances are usually two-word forms such as “want that,” “mommy do,” or “doll fall” (Vetter) [8] . In English, these forms are dominated by content words such as nouns, verbs and adjectives, and are restricted to concepts that the child is learning during the sensorimotor stage suggested by Piaget (Brown) [31] . Thus, they will express relations between objects, actions and people. This type of speech is called telegraphic speech . During this developmental stage, children are combining words to convey various meanings. They are also displaying evidence of grammatical structure, with consistent word orders and inflections (Behrens & Gut [32] ; Vetter [8] ).
Once the child moves beyond Stage 1, simple sentences begin to form and the child begins to use inflections and function words (Aoyama et al.) [14] . At this time, the child develops grammatical morphemes (Brown) [31] , which are classified into 14 categories ordered by acquisition (see chart below). These morphemes modify the meaning of the utterance, marking tense, plurality, possession, and so on. There are two theories for why this particular order takes place. The frequency hypothesis suggests that children acquire the morphemes they hear most frequently in adult speech. Brown argued against this theory by analyzing adult speech, in which articles were the most common word form, yet children did not acquire articles quickly. He suggested instead that linguistic complexity may account for the order of acquisition, with the less complex morphemes acquired first. Complexity of a morpheme was determined by its semantics (meaning) and/or syntax (rules). In other words, a morpheme with only one meaning, such as the plural (-s), is easier to learn than the copula “is” (which encodes both number and the time the action occurs). Brown also suggested that for a child to have successfully mastered a grammatical morpheme, they must use it properly 90% of the time.
Order | Morpheme | Example |
---|---|---|
1 | Present progressive (-ing) | running |
2-3 | in, on | sit in chair |
4 | Plural (-s) | cookies |
5 | Past irregular | ran, drew |
6 | Possessive ('s) | Daddy's toy |
7 | Uncontractible copula (is, am, are) | That is my cookie. |
8 | Articles (a, the) | the cat ; a dog |
9 | Past regular (-ed) | jumped |
10 | Third person regular (-s) | he cooks |
11 | Third person irregular (has, does) | he has my toy |
12 | Uncontractible auxiliary (is, do) | Do you have one? |
13 | Contractible copula ('s, 're) | You're here |
14 | Contractible auxiliary ('s) | He's coming! |
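Brown's 90% criterion can be stated as a simple proportion over obligatory contexts. A minimal sketch follows; the counts are invented for illustration:

```python
# Sketch of Brown's mastery criterion: a grammatical morpheme counts as
# mastered once the child supplies it in at least 90% of the contexts
# where adult grammar requires it (obligatory contexts).
def mastered(correct_uses, obligatory_contexts, threshold=0.9):
    return correct_uses / obligatory_contexts >= threshold

# Hypothetical counts from a transcript:
print(mastered(19, 20))  # True  (95% of obligatory contexts)
print(mastered(8, 10))   # False (80% of obligatory contexts)
```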
As children begin to develop more complex sentences, they must also learn to use grammar rules appropriately. This is difficult in English because of the prevalence of irregular forms. For example, a child may say, “I buyed my toy from the store.” This is known as an overregularization error . The child has understood that there are syntactic patterns and rules to follow, but overuses them, failing to realize that there are exceptions to the rules. In the previous example, the child applied the regular past tense rule (-ed) to an irregular verb. Why do these errors occur? It may be that the child does not have a complete understanding of the word’s meaning and thus selects it incorrectly (Pinker et al.) [33] . Brooks et al. [34] suggested that these errors may be categorization errors: intransitive and transitive verbs appear in different contexts, so the child must learn that certain verbs appear only in certain contexts. Interestingly, Hartshorne and Ullman [35] found a gender difference for overregularization errors: girls were more than three times more likely than boys to produce overregularizations. They concluded that girls were more likely to overgeneralize associatively, whereas boys overgeneralized only through rule-governed methods. In other words, girls, who remember regular forms better than boys, quickly associated those forms with similar-sounding words (e.g., fold-folded and mold-molded lead them to say that hold becomes “holded”). Boys, on the other hand, use the regular rule when they have difficulty retrieving the irregular form (e.g., the regular past tense -ed is added to the irregular verb run, producing “runed”) (Hartshorne & Ullman) [35] .
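The retrieval account above (use a stored irregular form when it can be retrieved, otherwise fall back on the regular rule) can be sketched as a toy model. The verbs and the "child lexicon" here are illustrative inventions, not data from the cited studies:

```python
# A toy past-tense generator illustrating overregularization: the child
# knows the regular -ed rule but has an incomplete store of irregulars.
def past_tense(verb, known_irregulars):
    """Return a stored irregular past form if available; otherwise
    apply the regular -ed rule (which yields errors like 'buyed')."""
    if verb in known_irregulars:
        return known_irregulars[verb]
    return verb + "ed"

# A child who has not yet stored (or cannot retrieve) "bought":
child_lexicon = {"run": "ran"}
print(past_tense("buy", child_lexicon))   # buyed  (overregularization)
print(past_tense("run", child_lexicon))   # ran    (retrieved irregular)
```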
Another common error committed by children is the omission of words from an utterance. These errors are especially prevalent in early speech production, which frequently lacks function words (Gerken, Landau, & Remez) [36] . For example, a child may say “dog eat bone,” omitting the function words “the” and “a.” This type of error has been frequently studied, and researchers have proposed three main theories to account for omissions. First, it may be that children focus on words that have referents (Brown) [31] . For example, a child may focus on “car” or “ball” rather than “jump” or “happy.” The second theory suggests children simply recognize the content words, which carry greater stress and emphasis (Brown) [31] . The final theory, suggested by Gerken [36] , involves an immature production system. In their study, children could perceive function words and classify them into various syntactic categories, yet still omitted them from their speech production.
In this chapter, the development of speech production was examined in the areas of prelinguistics , phonology , semantics , syntax and morphology . As infants develop, their vocalizations undergo a transition from reflexive vocalizations to speech-like sounds and finally words. However, linguistic development does not end there. Infants’ underdeveloped speech apparatus restricts them from producing all phonemes properly, so they produce errors such as consonant cluster reduction , omission of syllables and assimilation . At 18 months, many children seem to undergo a vocabulary spurt . Even with a larger vocabulary, children may also overextend (calling a horse a doggie) or underextend (not calling the neighbors’ dog doggie) their words. When a child begins to combine words, they are developing syntax and morphology. Syntactic development is measured using mean length of utterance (MLU), which is categorized into five stages (Brown) [31] . After Stage II, children begin to use grammatical morphemes (e.g., -ed, -s, is), which encode tense, plurality, etc. As in other areas of linguistic development, children also produce errors such as overregularization (e.g., “I buyed it”) or omission (e.g., “dog eat bone”). In spite of these early error patterns, children will eventually develop adult-like speech with few errors. Understanding and studying child language development is an important area of research, as it may give us insight into the underlying processes of language, as well as how we might facilitate it or treat individuals with language difficulties.
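MLU, used above to index syntactic development, is the average number of morphemes per utterance. A rough sketch follows; real MLU counting follows Brown's detailed conventions, and the crude suffix-based morpheme counter here is only a stand-in:

```python
# Minimal sketch of mean length of utterance (MLU) in morphemes.
# The splitter below naively treats -ing/-ed/-s endings as a second
# morpheme; proper MLU coding (Brown, 1973) has many more rules.
def count_morphemes(word):
    for suffix in ("ing", "ed", "s"):
        if len(word) > len(suffix) + 1 and word.endswith(suffix):
            return 2
    return 1

def mlu(utterances):
    totals = [sum(count_morphemes(w) for w in u.split()) for u in utterances]
    return sum(totals) / len(utterances)

sample = ["doggie running", "want cookies", "no"]
print(round(mlu(sample), 2))  # 2.33
```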
1. Watch the video clips of a young boy, CC, provided below.
Video 1 Video 2 Video 3 Video 4 Video 5
2. The following is a transcription of conversations between a mother (*MOT) and a child (*CHI) from Brown's (1970) corpus. You can ignore the # symbol as it represents unintelligible utterances. Use the charts found in the section on " Grammatical and Morphological Development " to help answer this question.
3. Below are examples of children's speech. These children display characteristics of terms we have covered in this chapter. The specific terms found in each video are provided. Find examples of these terms within their associated video, and indicate which type of development (phonological, semantic, or syntactic) is associated with each term.
Terms | Video |
---|---|
Dummy Syllable | |
Lexical Innovations | |
Assimilation What kind of learner (conservative or productive)? | |
This child does not produce which two phonemes? ** hint, "camera" and "the" | |
Cluster reduction | |
Overregularization |
5. The following are examples of children’s speech errors. Name the error and the type of development it is associated with (phonological, syntactic, morphological, or semantic). Can you explain why such an error occurs?
The source–filter theory of speech.
In the source-filter theory, the mechanism of speech production is described as a two-stage process: (a) The air flow coming from the lungs induces tissue vibrations of the vocal folds (i.e., two small muscular folds located in the larynx) and generates the “source” sound. Turbulent airflows are also created at the glottis or at the vocal tract to generate noisy sound sources. (b) Spectral structures of these source sounds are shaped by the vocal tract “filter.” Through the filtering process, frequency components corresponding to the vocal tract resonances are amplified, while the other frequency components are diminished. The source sound mainly characterizes the vocal pitch (i.e., fundamental frequency), while the filter forms the timbre. The source-filter theory provides a very accurate description of normal speech production and has been applied successfully to speech analysis, synthesis, and processing. Separate control of the source (phonation) and the filter (articulation) is advantageous for acoustic communications, especially for human language, which requires expression of various phonemes realized by a flexible maneuver of the vocal tract configuration. Based on this idea, articulatory phonetics focuses on the positions of the vocal organs to describe the produced speech sounds.
The source-filter theory elucidates the mechanism of “resonance tuning,” that is, a specialized way of singing. To increase efficiency of the vocalization, soprano singers adjust the vocal tract filter to tune one of the resonances to the vocal pitch. Consequently, the main source sound is strongly amplified to produce a loud voice, which is well perceived in a large concert hall over the orchestra.
It should be noted that the source–filter theory is based upon the assumption that the source and the filter are independent from each other. Under certain conditions, the source and the filter interact with each other. The source sound is influenced by the vocal tract geometry and by the acoustic feedback from the vocal tract. Such source–filter interaction induces various voice instabilities, for example, sudden pitch jump, subharmonics, resonance, quenching, and chaos.
Human speech sounds are generated by a complex interaction of components of human anatomy. Most speech sounds begin with the respiratory system, which expels air from the lungs (figure 1 ). The air goes through the trachea and enters into the larynx, where two small muscular folds, called “vocal folds,” are located. As the vocal folds are brought together to form a narrow air passage, the airstream causes them to vibrate in a periodic manner (Titze, 2008 ). The vocal fold vibrations modulate the air pressure and produce a periodic sound. The produced sounds, when the vocal folds are vibrating, are called “voiced sounds,” while those in which the vocal folds do not vibrate are called “unvoiced sounds.” The air passages above the larynx are called the “vocal tract.” Turbulent air flows generated at constricted parts of the glottis or the vocal tract also contribute to aperiodic source sounds distributed over a wide range of frequencies. The shape of the vocal tract and consequently the positions of the articulators (i.e., jaw, tongue, velum, lips, mouth, teeth, and hard palate) provide a crucial factor to determine acoustical characteristics of the speech sounds. The state of the vocal folds, as well as the positions, shapes, and sizes of the articulators, changes over time to produce various phonetic sounds sequentially.
Figure 1. Concept of the source-filter theory. Airflow from the lung induces vocal fold vibrations, where glottal source sound is created. The vocal tract filter shapes the spectral structure of the source sound. The filtered speech sound is finally radiated from the mouth.
To systematically understand the mechanism of speech production, the source-filter theory divides such process into two stages (Chiba & Kajiyama, 1941 ; Fant, 1960 ) (see figure 1 ): (a) The air flow coming from the lungs induces tissue vibration of the vocal folds that generates the “source” sound. Turbulent noise sources are also created at constricted parts of the glottis or the vocal tract. (b) Spectral structures of these source sounds are shaped by the vocal tract “filter.” Through the filtering process, frequency components, which correspond to the resonances of the vocal tract, are amplified, while the other frequency components are diminished. The source sound characterizes mainly the vocal pitch, while the filter forms the overall spectral structure.
The source-filter theory provides a good approximation of normal human speech, under which the source sounds are only weakly influenced by the vocal tract filter, and has been applied successfully to speech analysis, synthesis, and processing (Atal & Schroeder, 1978 ; Markel & Gray, 2013 ). Independent control of the source (phonation) and the filter (articulation) is advantageous for acoustic communications with language, which requires expression of various phonemes with a flexible maneuver of the vocal tract configuration (Fitch, 2010 ; Lieberman, 1977 ).
There are four main types of sound sources that provide an acoustic input to the vocal tract filter: glottal source, aspiration source, frication source, and transient source (Stevens, 1999 , 2005 ).
The glottal source is generated by the vocal fold vibrations. The vocal folds are muscular folds located in the larynx. The opening space between the left and right vocal folds is called “glottal area.” When the vocal folds are closely located to each other, the airflow coming from the lungs can cause the vocal fold tissues to vibrate. With combined effects of pressure, airflow, tissue elasticity, and collision between the left and right vocal folds, the vocal folds give rise to vibrations, which periodically modulate acoustic air pressure at the glottis. The number of the periodic glottal vibrations per second is called “fundamental frequency ( f o )” and is expressed in Hz or cycles per second. In the spectral space, the glottal source sound determines the strengths of the fundamental frequency and its integer multiples (harmonics). The glottal wave provides sources for voiced sounds such as vowels (e.g., [a],[e],[i],[o],[u]), diphthongs (i.e., combinations of two vowel sounds), and voiced consonants (e.g., [b],[d],[ɡ],[v],[z],[ð],[ʒ],[ʤ], [h],[w],[n],[m],[r],[j],[ŋ],[l]).
In addition to the glottal source, noisy signals also serve as the sound sources for consonants. Here, air turbulence developed at constricted or obstructed parts of the airway contributes to random (aperiodic) pressure fluctuations over a wide range of frequencies. Among such noisy signals, the one generated through the glottis or immediately above the glottis is called “aspiration noise.” It is characterized by a strong burst of breath that accompanies either the release or the closure of some obstruents. “Frication noise,” on the other hand, is generated by forcing air through a supraglottal constriction created by placing two articulators close together (e.g., constrictions between lower lip and upper teeth, between back of the tongue and soft palate, and between side of the tongue and molars) (Shadle, 1985 , 1991 ). When an airway in the vocal tract is completely closed and then released, “transient noise” is generated. By forming a closure in the vocal tract, a pressure is built up in the mouth behind the closure. As the closure is released, a brief burst of turbulence is produced, which lasts for a few milliseconds.
Some speech sounds may involve more than one sound source. For instance, a voiced fricative combines the glottal source and the frication noise. A breathy voice may come from the glottal source and the aspiration noise, whereas voiceless fricatives can combine two noise sources generated at the glottis and at the supralaryngeal constriction. These sound sources are fed into the vocal-tract filter to create speech sounds.
In the source-filter theory, the vocal tract acts as an acoustic filter to modify the source sound. Through this acoustic filter, certain frequency components are passed to the output speech, while the others are attenuated. The characteristics of the filter depend upon the shape of the vocal tract. As a simple case, consider the acoustic characteristics of a uniform tube of length L = 17.5 cm, a standard length for a male vocal tract (see figure 2 ). At one end, the tube is closed (as the glottis), while at the other end it is open (as the mouth). Inside the tube, longitudinal sound waves travel either toward the mouth or toward the glottis. The wave propagates by alternately compressing and expanding the air in the tube segments. By this compression/expansion, the air molecules are slightly displaced from their rest positions. Accordingly, the acoustic air pressure inside the tube changes in time, depending upon the longitudinal displacement of the air along the direction of the traveling wave. The profile of the acoustic air pressure inside the tube is determined by the traveling waves going to the mouth or to the glottis. What is formed here is a “standing wave,” whose peak amplitude profile does not move in space. The locations at which the absolute value of the amplitude is minimum are called “nodes,” whereas the locations at which it is maximum are called “antinodes.” Since the air molecules cannot vibrate much at the closed end of the tube, the closed end becomes a node. The open end of the tube, on the other hand, becomes an antinode, since the air molecules can move freely there. Various standing waves that satisfy these boundary conditions can be formed. In figure 2 , the 1 / 4 (purple), 3 / 4 (green), and 5 / 4 (sky blue) waves indicate the first, second, and third resonances, respectively.
Depending upon the number of nodes in the tube, the wavelengths of the standing waves are determined as λ = 4L, 4L/3, and 4L/5. The corresponding frequencies are obtained as f = c / λ = 490, 1470, and 2450 Hz, where c = 343 m/s is the sound speed. These resonant frequencies are called “formants” in phonetics.
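The resonances above follow the quarter-wavelength pattern for a tube closed at one end, f_n = (2n − 1)c / (4L), which can be checked directly:

```python
# Resonances of a uniform tube closed at the glottis and open at the
# lips: wavelengths λ = 4L/(2n - 1), so f_n = (2n - 1) * c / (4L).
c = 343.0   # speed of sound (m/s)
L = 0.175   # vocal tract length (m), standard male value from the text

formants = [(2 * n - 1) * c / (4 * L) for n in (1, 2, 3)]
print([round(f) for f in formants])  # [490, 1470, 2450]
```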
Figure 2. Standing waves of a uniform tube. For a tube having one closed end (glottis) and one open end (mouth), only odd-numbered harmonics are available. The 1 / 4 (purple), 3 / 4 (green), and 5 / 4 (sky blue) waves correspond to the first, second, and third resonances (“ 1 / 4 wave” means 1 / 4 of one-cycle waveform is inside the tube).
Next, consider that a source sound is input to this acoustic tube. In the source sound (voiced source or noise, or both), acoustic energy is distributed in a broad range of frequencies. The source sound induces vibrations of the air column inside the tube and produces a sound wave in the external air as the output. The strength at which an input frequency is output from this acoustic filter depends upon the characteristics of the tube. If the input frequency component is close to one of the formants, the tube resonates with the input and propagates the corresponding vibration. Consequently, the frequency components near the formant frequencies are passed to the output at their full strength. If the input frequency component is far from any of these formants, however, the tube does not resonate with the input. Such frequency components are strongly attenuated and achieve only low oscillation amplitudes in the output. In this way, the acoustic tube, or the vocal tract, filters the source sound. This filtering process can be characterized by a transfer function, which describes dependence of the amplification ratio between the input and output acoustic signals on the frequency. Physically, the transfer function is determined by the shape of the vocal tract.
Finally, the sound wave is radiated from the lips of the mouth and the nose. Their radiation characteristics are also included in the vocal-tract transfer function.
Humans are able to control phonation (source generation) and articulation (filtering process) largely independently. Speech sounds are therefore considered as the response of the vocal-tract filter, into which a sound source is fed. To model such source-filter systems for speech production, the sound source, or excitation signal x(t), is often implemented as a periodic impulse train for voiced speech, while white noise is used as the source for unvoiced speech. If the vocal-tract configuration does not change in time, the vocal-tract filter becomes a linear time-invariant (LTI) system, and the output signal y(t) can be expressed as a convolution of the input signal x(t) with the impulse response of the system h(t) as

y(t) = x(t) ∗ h(t)   (1)
where the asterisk denotes convolution. Equation ( 1 ), which is described in the time domain, can also be expressed in the frequency domain as

Y(ω) = X(ω) H(ω)   (2)
The frequency-domain formula states that the speech spectrum Y(ω) is modeled as the product of the source spectrum X(ω) and the spectrum of the vocal-tract filter H(ω). The filter spectrum H(ω) is in turn the product of the vocal-tract transfer function T(ω) and the radiation characteristics from the mouth and the nose R(ω), that is, H(ω) = T(ω)R(ω).
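The equivalence between time-domain convolution y(t) = x(t) ∗ h(t) and the frequency-domain product Y(ω) = X(ω)H(ω) can be verified numerically. The impulse-train source and the decaying impulse response below are toy stand-ins, not a realistic glottal pulse or vocal tract:

```python
import numpy as np

# LTI source-filter relation: convolution in time = product in frequency.
N = 512
x = np.zeros(N)
x[::64] = 1.0                  # periodic impulse train ("voiced" source)
h = 0.97 ** np.arange(N)       # toy decaying vocal-tract impulse response
y = np.convolve(x, h)          # time domain: y = x * h (length 2N - 1)

# Zero-pad both DFTs to 2N so the spectral product corresponds to
# linear (not circular) convolution.
Y = np.fft.rfft(y, 2 * N)
XH = np.fft.rfft(x, 2 * N) * np.fft.rfft(h, 2 * N)
print(np.allclose(Y, XH))      # True
```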
There exist several ways to estimate the vocal-tract filter H(ω). The most popular approach is inverse filtering, in which autoregressive parameters are estimated from an acoustic speech signal by the method of least squares (Atal & Schroeder, 1978 ; Markel & Gray, 2013 ). The transfer function can then be recovered from the estimated autoregressive parameters. In practice, however, inverse filtering is limited to non-nasalized or slightly nasalized vowels. An alternative approach is based upon measurement of the vocal tract shape. For a human subject, the cross-sectional area of the vocal tract can be measured by X-ray photography or magnetic resonance imaging (MRI). Once the area function of the vocal tract is obtained, the corresponding transfer function can be computed by the so-called transmission line model, which assumes one-dimensional plane-wave propagation inside the vocal tract (Sondhi & Schroeter, 1987 ; Story et al., 1996 ).
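The least-squares autoregressive estimation behind inverse filtering can be sketched with the autocorrelation method, which reduces to solving a Toeplitz system. This is a minimal stand-in for a full LPC analysis, checked here on a synthetic AR(2) signal rather than real speech:

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc(signal, order):
    """Least-squares (autocorrelation method) estimate of all-pole
    predictor coefficients; the inverse of the resulting filter
    approximates the vocal-tract transfer function."""
    r = np.correlate(signal, signal, mode="full")[len(signal) - 1:]
    return solve_toeplitz((r[:order], r[:order]), r[1:order + 1])

# Sanity check on a synthetic AR(2) signal with known coefficients.
rng = np.random.default_rng(0)
true_a = np.array([0.5, -0.25])
e = rng.standard_normal(5000)           # white-noise excitation
s = np.zeros_like(e)
for n in range(2, len(s)):
    s[n] = true_a[0] * s[n - 1] + true_a[1] * s[n - 2] + e[n]

a_hat = lpc(s, 2)                       # should be close to true_a
```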
Figure 3. (a) Vocal tract area function for a male speaker’s vowel [a]. (b) Transfer function calculated from the area function of (a). (c) Power spectrum of the source sound generated from Liljencrants-Fant model. (d) Power spectrum of the speech signal generated from the source-filter theory.
As an example illustrating the source-filter modeling, a sound of the vowel /a/ is synthesized in figure 3 . The vocal tract area function of figure 3 (a) was measured from a male subject by MRI (Story et al., 1996 ). By the transmission line model, the transfer function H(ω) is obtained as in figure 3 (b) . The first and second formants are located at F1 = 805 Hz and F2 = 1205 Hz. By the inverse Fourier transform, the impulse response of the vocal tract system h(t) is derived. As the glottal source sound, the Liljencrants-Fant synthesis model (Fant et al., 1985 ) is utilized. The fundamental frequency is set to f o = 100 Hz , which gives rise to a sharp peak in the power spectrum in figure 3 (c) . Except for the peaks appearing at higher harmonics of f o , the spectral structure of the glottal source is rather flat. As shown in figure 3 (d) , convolution of the source signal with the vocal tract filter amplifies the higher harmonics of f o located close to the formants.
Since the source-filter modeling captures essence of the speech production, it has been successfully applied to speech analysis, synthesis, and processing (Atal & Schroeder, 1978 ; Markel & Gray, 2013 ). It was Chiba and Kajiyama ( 1941 ) who first explained the mechanisms of speech production based on the concept of phonation (source) and articulation (filter). Their idea was combined with Fant’s filter theory (Fant, 1960 ), which led to the “source-filter theory of vowel production” in the studies of speech production.
So far, the source-filter modeling has been applied only to the glottal source, in which the vocal fold vibrations provide the main source sound. There are other sound sources, such as the frication noise, in which air turbulence is developed at constricted (or obstructed) parts of the airway. Such a random source also excites the resonances of the vocal tract in a similar manner to the glottal source (Stevens, 1999 , 2005 ). Its marked difference from the glottal source is that the filter property is determined by the vocal tract shape downstream from the constriction (or obstruction). For instance, if the constriction is at the lips, there is no cavity downstream from the constriction, and therefore the acoustic source is radiated directly from the mouth opening with no filtering. When the constriction is upstream from the lips, the shape of the airway between the constriction and the lips determines the filter properties. It should also be noted that the turbulent source, generated at the constriction, depends sensitively on the three-dimensional geometry of the vocal tract. Therefore, the three-dimensional shape of the vocal tract (not the one-dimensional area function) should be taken into account to model the frication noise (Shadle, 1985 , 1991 ).
As an interesting application of the source-filter theory, “resonance tuning” (Sundberg, 1989 ) is illustrated. In female speech, the first and second formants lie between 300 and 900 Hz and between 900 and 2,800 Hz, respectively. In soprano singing, the vocal pitch can reach these two ranges. To increase the efficiency of vocalization at high f o , a soprano singer adjusts the shape of the vocal tract to tune the first or second resonance (R1 or R2) to the fundamental frequency f o . When one of the harmonics of f o coincides with a formant resonance, the resulting acoustic power (and musical success) is enhanced.
Figure 4. Resonance tuning. (a) The same transfer function as figure 3 (b). (b) Power spectrum of the source sound, whose fundamental frequency f o is tuned to the first resonance R 1 of the vocal tract. (c) Power spectrum of the speech signal generated from the source-filter theory. (d) Dependence of the amplification rate (i.e., power ratio between the output speech and the input source) on the fundamental frequency f o .
Figure 4 shows an example of the resonance tuning, in which the fundamental frequency is tuned to the first resonance R 1 of the vowel /a/ as f o = 805 Hz . As recognized in the output speech spectrum (figure 4 (c) ), the vocal tract filter strongly amplifies the fundamental frequency component of the vocal source, while the other harmonics are attenuated. Since only a single frequency component is emphasized, the output speech sounds like a pure tone. Figure 4 (d) shows dependence of the amplification ratio (i.e., the power ratio between the output speech and the input source) on the fundamental frequency f o . Indeed, the power of the output speech is maximized at the resonance tuning point of f o = 805 Hz . Without losing the source power, loud voices can be produced with less effort from the singers and, moreover, they are well perceived in a large concert hall over the orchestra (Joliveau et al., 2004 ).
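The amplification effect of resonance tuning can be illustrated with a toy single-resonance filter. The resonance at 805 Hz matches the first formant of /a/ from the text, but the second-order resonator and its quality factor Q are illustrative assumptions, not the measured transfer function of figure 4:

```python
import numpy as np

# Toy one-resonance "vocal tract": gain of a damped second-order
# resonator. The output is loudest when f0 sits on the resonance.
fr, Q = 805.0, 10.0            # resonance (Hz) and assumed quality factor

def gain(f):
    r = np.asarray(f) / fr
    return 1.0 / np.sqrt((1.0 - r**2) ** 2 + (r / Q) ** 2)

f0 = np.arange(100.0, 1500.0, 5.0)      # candidate fundamental frequencies
best_f0 = f0[np.argmax(gain(f0))]       # pitch receiving maximum gain
print(best_f0)                          # 805.0
```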
Despite the significant increase in loudness, comprehensibility is sacrificed. With a strong enhancement of the fundamental frequency f o , its higher harmonics are weakened considerably, making it difficult to perceive the formant structure (figure 4 (c) ). This explains why it is difficult to identify words sung in the high range by sopranos.
The resonance tuning discussed here has been based on the linear convolution of the source and the filter, which are assumed to be independent of each other. In reality, however, the source and the filter interact. Depending upon the acoustic properties of the vocal tract, this interaction can facilitate the vocal fold oscillations and make the vocal source stronger. Consequently, the source-filter interaction can make the output speech even louder, in addition to the linear resonance effect. Such interaction is explained in more detail in section 4 .
It should be of interest to note that some animals such as songbirds and gibbons utilize the technique of resonance tuning in their vocalizations (Koda et al., 2012 ; Nowicki, 1987 ; Riede et al., 2006 ). It has been found through X-ray filming as well as via heliox experiments that these animals adjust the vocal tract resonance to track the fundamental frequency f o . This may facilitate the acoustic communication by increasing the loudness of their vocalization. Again, higher harmonic components, which are needed to emphasize the formants in human language communications, are suppressed. Whether the animals utilize formants information in their communications is under debate (Fitch, 2010 ; Lieberman, 1977 ) but, at least in this context, production of a loud sound is more advantageous for long-distance alarm calls and pure-tone singing of animals.
The linear source–filter theory, under which speech is represented as a convolution of the source and the filter, is based upon the assumption that the vocal fold vibrations as well as the turbulent noise sources are only weakly influenced by the vocal tract. Such an assumption is, however, valid mostly for male adult speech. The actual process of speech production is nonlinear. The vocal fold oscillations are due to the combined effects of pressure, airflow, tissue elasticity, and tissue collision, and it is natural that such a complex system obeys nonlinear equations of motion. Aerodynamics inside the glottis and the vocal tract is also governed by nonlinear equations in a strict sense. Moreover, there exists a mutual interaction between the source and the filter (Flanagan, 1968 ; Lucero et al., 2012 ; Rothenberg, 1981 ; Titze, 2008 ; Titze & Alipour, 2006 ). First, the source sound, which is generated by the vocal folds, is influenced by the vocal tract, since the vocal tract determines the pressure above the vocal folds and thereby changes the aerodynamics of the glottal flow. As described in section 2.3 , the turbulent source is also very sensitive to the vocal tract geometry. Second, the source sound, which then propagates through the vocal tract, is not only radiated from the mouth but is also partially reflected back to the glottis through the vocal tract. Such reflection can influence the vocal fold oscillations, especially when the fundamental frequency or one of its harmonics is located close to one of the vocal tract resonances, for instance, in singing. The strong acoustic feedback makes the interrelation between the source and the filter nonlinear and induces various voice instabilities, for example, sudden pitch jump, subharmonics, resonance, quenching, and chaos (Hatzikirou et al., 2006 ; Lucero et al., 2012 ; Migimatsu & Tokuda, 2019 ; Titze et al., 2008 ).
Figure 5. Example of a glissando singing. A male subject glided the fundamental frequency ( f o ) from 120 Hz to 350 Hz and then back. The first resonance ( R 1 = 270 Hz ) is indicated by a black bold line. The pitch jump occurred when f o crossed R 1 .
Figure 5 shows a spectrogram that demonstrates such a pitch jump. The horizontal axis represents time, while the vertical axis represents the spectral power of a singing voice. In this recording, a male singer glided his pitch over a certain frequency range. Accordingly, the fundamental frequency increases from 120 Hz to 350 Hz and then decreases back to 120 Hz. Around 270 Hz, the fundamental frequency or one of its higher harmonics crosses one of the resonances of the vocal tract (black bold line of figure 5 ), and the pitch jumps abruptly. At such a frequency crossing point, the acoustic reflection from the vocal tract to the vocal folds becomes very strong and non-negligible. The source-filter interaction has two aspects (Story et al., 2000 ). On one side, the vocal tract acoustics facilitates the vocal fold oscillations and contributes to the production of a loud vocal sound, as discussed for resonance tuning (section 3 ). On the other side, the vocal tract acoustics inhibits the vocal fold oscillations and consequently induces a voice instability. For instance, the vocal fold oscillation can stop suddenly or spontaneously jump to another fundamental frequency, as exemplified by the glissando singing of figure 5 . To avoid such voice instabilities, singers must weaken the level of the acoustic coupling, possibly by adjusting the epilarynx, whenever the frequency crossing takes place (Lucero et al., 2012 ; Titze et al., 2008 ).
In summary, the source-filter theory has been described as a basic framework for modelling human speech production. The source is generated by the vocal fold oscillations and/or the turbulent airflows developed above the glottis. The vocal tract functions as a filter that modifies the spectral structure of the source sounds, a filtering mechanism explained in terms of the resonances of the acoustical tube. Independence between the source and the filter is vital for language-based acoustic communication in humans, which requires flexible maneuvering of the vocal tract configuration to express various phonemes sequentially and smoothly (Fitch, 2010; Lieberman, 1977). As an application of the source-filter theory, resonance tuning was explained as a technique utilized by soprano singers and some animals. Finally, the existence of the source-filter interaction has been described. The source sound is inevitably influenced aerodynamically by the vocal tract, since the two are closely located. Moreover, the acoustic pressure wave reflected back from the vocal tract to the glottis influences the vocal fold oscillations and can induce various voice instabilities. The source-filter interaction may become strong when the fundamental frequency or one of its higher harmonics crosses one of the vocal tract resonances, for example in singing.
Printed from Oxford Research Encyclopedias, Linguistics. Under the terms of the licence agreement, an individual user may print out a single article for personal use (for details see Privacy Policy and Legal Notice).
date: 11 July 2024
Nature (2024)
From sequences of speech sounds 1 , 2 or letters 3 , humans can extract rich and nuanced meaning through language. This capacity is essential for human communication. Yet, despite a growing understanding of the brain areas that support linguistic and semantic processing 4 , 5 , 6 , 7 , 8 , 9 , 10 , 11 , 12 , the derivation of linguistic meaning in neural tissue at the cellular level and over the timescale of action potentials remains largely unknown. Here we recorded from single cells in the left language-dominant prefrontal cortex as participants listened to semantically diverse sentences and naturalistic stories. By tracking their activities during natural speech processing, we discover a fine-scale cortical representation of semantic information by individual neurons. These neurons responded selectively to specific word meanings and reliably distinguished words from nonwords. Moreover, rather than responding to the words as fixed memory representations, their activities were highly dynamic, reflecting the words’ meanings based on their specific sentence contexts and independent of their phonetic form. Collectively, we show how these cell ensembles accurately predicted the broad semantic categories of the words as they were heard in real time during speech and how they tracked the sentences in which they appeared. We also show how they encoded the hierarchical structure of these meaning representations and how these representations mapped onto the cell population. Together, these findings reveal a finely detailed cortical organization of semantic representations at the neuron scale in humans and begin to illuminate the cellular-level processing of meaning during language comprehension.
Humans are capable of communicating exceptionally detailed meanings through language. How neurons in the human brain represent linguistic meaning and what their functional organization may be, however, remain largely unknown. Initial perceptual processing of linguistic input is carried out by regions in the auditory cortex for speech 1 , 2 or visual regions for reading 3 . From there, information flows to the amodal language-selective 9 left-lateralized network of frontal and temporal regions that map word forms to word meanings and assemble them into phrase- and sentence-level representations 4 , 5 , 13 . Processing meanings extracted from language also engages widespread areas outside this language-selective network, with diverging evidence suggesting that semantic processing may be broadly distributed across the cortex 11 or that it may alternatively be concentrated in a few semantic ‘hubs’ that process meaning from language as well as other modalities 7 , 12 . How linguistic and semantic information is represented at the basic computational level of individual neurons during natural language comprehension in humans, however, remains undefined.
Despite a growing understanding of semantic processing from imaging studies, little is known about how neurons in humans process or represent word meanings during language comprehension. Further, although speech processing is strongly context dependent 14 , how contextual information influences meaning representations and how these changes may be instantiated within sentences at a cellular scale remain largely unknown. Finally, although our semantic knowledge is highly structured 15 , 16 , 17 , little is understood about how cells or cell ensembles represent the semantic relationships among words or word classes during speech processing and what their functional organization may be.
Single-neuronal recordings have the potential to begin unravelling some of the real-time dynamics of word and sentence comprehension at a combined spatial and temporal resolution that has largely been inaccessible through traditional human neuroscience approaches 18 , 19 , 20 . Here we used a rare opportunity to record from single cells in humans 18 , 19 , 21 and begin investigating the moment-by-moment dynamics of natural language comprehension at the cellular scale.
Single-neuronal recordings were obtained from the prefrontal cortex of the language-dominant hemisphere in a region centred along the left posterior middle frontal gyrus (Fig. 1a and Methods (‘Acute intraoperative single-neuronal recordings’) and Extended Data Fig. 1a ). This region contains portions of the language-selective network together with several other high-level networks 22 , 23 , 24 , 25 , and has been shown to reliably represent semantic information during language comprehension 11 , 26 . Here recordings were performed in participants undergoing planned intraoperative neurophysiology. Moreover, all participants were awake and therefore capable of performing language-based tasks, providing the unique opportunity to study the action potential dynamics of individual neurons during comprehension in humans.
a , Left: single-neuron recordings were obtained from the left language-dominant prefrontal cortex. Recording locations for the microarray (red) and Neuropixels (beige) recordings (spm12; Extended Data Table 1 ) as well as an approximation of language-selective network areas (brown) are indicated. Right: the action potentials of putative neurons. b , Action potentials (black lines) and instantaneous firing rate (red trace) of each neuron were time-aligned to the onset of each word. Freq., frequency. c , Word embedding approach for identifying semantic domains. Here each word is represented as a 300-dimensional (dim) vector. d , Silhouette criterion analysis (upper) and purity measures (lower) characterized the separability and quality of the semantic domains (Extended Data Fig. 2a ). permut., permutations. e , Peri-stimulus spike histograms (mean ± standard error of the mean (s.e.m.)) and rasters for two representative neurons. The horizontal green bars mark the window of analysis (100–500 ms from onset). sp, spikes. f , Left: confusion matrix illustrating the distribution of cells that exhibited selective responses to one or more semantic domains ( P < 0.05, two-tailed rank-sum test, false discovery rate-adjusted). Spatiotemp., spatiotemporal; sig., significant. Top right: numbers of cells that exhibited semantic selectivity. g , Left: SI of each neuron ( n = 19) when compared across semantic domains. The SIs of two neurons are colour-coded to correspond to those shown in Fig. 1e . Upper right: mean SI across neurons when randomly selecting words from 60% of the sentences (mean SI = 0.33, CI = 0.32–0.33; across 100 iterations). Bottom right: probabilities of neurons exhibiting significant selectivity to their non-preferred semantic domains when randomly selecting words from 60% of the sentences (1.4 ± 0.5% mean ± s.e.m. different (diff.) domain).
h , Relationship between increased meaning specificity (by decreasing the number of words on the basis of the words’ distance from each domain’s centroid) and response selectivity. The lines with error bars in d , g , h represent mean with 95% confidence limits.
Altogether, we recorded from 133 well-isolated single units (Fig. 1a , right, and Extended Data Fig. 1a,b ) in 10 participants (18 sessions; 8 male and 2 female individuals, age range 33–79; Extended Data Table 1 ) using custom-adapted tungsten microelectrode arrays 27 , 28 , 29 (microarray; Methods (‘Single-unit isolation’)). To further confirm the consistency and robustness of neuronal responses, an additional 154 units in 3 participants (3 sessions; 2 male individuals and 1 female individual; age range 66–70; Extended Data Table 1 ) were also recorded using silicon Neuropixels arrays 30 , 31 ( Methods (‘Single-unit isolation’) and Extended Data Fig. 1c,d ) that allowed for higher-throughput recordings per participant (287 units across 13 participants in total; 133 units from the microarray recordings and 154 units from the Neuropixels recordings). All participants were right-hand-dominant native English speakers and were confirmed to have normal language function by preoperative testing.
During recordings, the participants listened to semantically diverse naturalistic sentences that were played to them in a random order. This amounted to an average of 459 ± 24 unique words or 1,052 ± 106 word tokens (±s.e.m.) across 131 ± 13 sentences per participant ( Methods (‘Linguistic materials’) and Extended Data Table 1 ). Additional controls included the presentations of unstructured word lists, nonwords and naturalistic story narratives (Extended Data Table 1 ). Action potential activities were aligned to each word or nonword using custom-made software at millisecond resolution and analysed off-line (Fig. 1b ). All primary findings describe results for the tungsten microarray recordings unless stated otherwise for the Neuropixels recordings (Extended Data Fig. 1 ).
A long-standing observation 32 that lies at the core of all distributional models of meaning 33 is that words that share similar meanings tend to occur in similar contexts. Data-driven word embedding approaches that capture these relationships through vectoral representations 11 , 34 , 35 , 36 , 37 , 38 , 39 have been found to estimate word meanings quite well and to accurately capture human behavioural semantic judgements 40 and neural responses to meaning through brain-imaging studies 11 , 26 , 37 , 39 , 41 .
To first examine whether and to what degree the activities of neurons within the population reflected the words’ meanings during speech processing, we used an embedding approach that replaced each unique word heard by the participants with pretrained 300-dimensional embedding vectors extracted from a large English corpus ( Methods (‘Word embedding and clustering procedures’)) 35 , 37 , 39 , 42 . Thus, for instance, the words ‘clouds’ and ‘rain’, which are closely related in meaning, would share a smaller vectoral cosine distance in this embedding space when compared to ‘rain’ and ‘dad’ (Fig. 1c , left). Next, to determine how the words optimally group into semantic domains, we used a spherical clustering and silhouette criterion analysis 34 , 37 , 43 , 44 to reveal the following nine putative domains: actions (for example, ‘walked’, ‘ran’ and ‘threw’), states (for example, ‘happy’, ‘hurt’ and ‘sad’), objects (for example, ‘hat’, ‘broom’ and ‘lampshade’), food (for example, ‘salad’, ‘carrots’ and ‘cake’), animals (for example, ‘bunny’, ‘lizard’ and ‘horse’), nature (for example, ‘rain’, ‘clouds’ and ‘sun’), people and family (for example, ‘son’, ‘sister’ and ‘dad’), names (for example, ‘george’, ‘kevin’ and ‘hannah’) and spatiotemporal relationships (for example, ‘up’, ‘down’ and ‘behind’; Fig. 1c right and Extended Data Tables 2 and 3 ). Purity and d ′ measures confirmed the quality and separability of these word clusters (Fig. 1d and Extended Data Fig. 2a ).
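The core of the embedding logic above is that semantically related words have a smaller cosine distance than unrelated ones. A toy sketch makes this concrete; the 4-dimensional vectors below are invented for illustration (the study used pretrained 300-dimensional embeddings), and only the relative distances matter.

```python
import numpy as np

# Made-up low-dimensional stand-ins for real word embeddings.
embeddings = {
    "clouds": np.array([0.9, 0.8, 0.1, 0.0]),
    "rain":   np.array([0.8, 0.9, 0.0, 0.1]),
    "dad":    np.array([0.0, 0.1, 0.9, 0.8]),
}

def cosine_distance(u, v):
    """1 - cosine similarity between two embedding vectors."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Words related in meaning sit closer together in the embedding space.
d_related = cosine_distance(embeddings["clouds"], embeddings["rain"])
d_unrelated = cosine_distance(embeddings["rain"], embeddings["dad"])
assert d_related < d_unrelated
```

Clustering such vectors on the unit sphere (the spherical clustering the authors describe) then groups words whose pairwise cosine distances are small into the putative semantic domains.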
We observed that many of the neurons responded selectively to specific word meanings. The selectivity or ‘tuning’ of neurons reflects the degree to which they respond to words denoting particular meanings (that is, words that belong to specific semantic domains). Thus, a selectivity index (SI) of 1.0 would indicate that a cell responded to words within only one semantic domain and no other, whereas an SI of 0 would indicate no selectivity (that is, similar responses to words across all domains; Methods (‘Evaluating the responses of neurons to semantic domains’)). Altogether, 14% ( n = 19 of 133; microarray) of the neurons responded selectively to specific semantic domains, indicating that their firing rates distinguished between words on the basis of their meanings (two-tailed rank-sum test comparing activity for each domain to that of all other domains; false discovery rate-corrected for the 9 domains, P < 0.05). Thus, for example, a neuron may respond selectively to ‘food’ items whereas another may respond selectively to ‘objects’ (Fig. 1e ). The domain that elicited the largest change in activity for the largest number of cells was that of ‘actions’, and the domain that elicited changes for the fewest cells was ‘spatiotemporal relations’ (Fig. 1f ). The mean SI across all selective neurons was 0.32 ( n = 19; 95% confidence interval (CI) = 0.26–0.38; Fig. 1g , left) and progressively increased as the semantic domains became more specific in meaning (that is, when removing words that lay farther away from the domain centroid; analysis of variance, F (3,62) = 8.66, P < 0.001; Fig. 1h and Methods (‘Quantifying the specificity of neuronal response’)). Findings from the Neuropixels recordings were similar, with 19% ( n = 29 of 154; Neuropixels) of the neurons exhibiting semantic selectivity (mean SI = 0.42, 95% CI = 0.36–0.48; Extended Data Fig. 3a,b ), in aggregate providing a total of 48 of 287 semantically selective neurons across the 13 participants.
Many of the neurons across the participants and recording techniques therefore exhibited semantic selectivity during language comprehension.
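A selectivity index with the boundary behaviour described above (1.0 for a cell driven by a single domain, 0 for uniform responses) can be sketched as follows. The paper's exact formula is in its Methods; the contrast-style definition used here is a common choice and is our assumption, not the authors' published formula.

```python
import numpy as np

def selectivity_index(rates):
    """Contrast-style selectivity index over per-domain firing rates.

    rates[i] is a neuron's mean firing rate for semantic domain i.
    Returns 1.0 when only the preferred domain drives the cell and
    0.0 when all domains drive it equally.
    """
    rates = np.asarray(rates, dtype=float)
    preferred = rates.max()
    others = np.delete(rates, rates.argmax()).mean()
    return (preferred - others) / (preferred + others)

# A neuron responding to one domain only -> SI of 1.0.
si_pure = selectivity_index([10, 0, 0, 0, 0, 0, 0, 0, 0])
# Equal responses across all nine domains -> SI of 0.0.
si_flat = selectivity_index([5, 5, 5, 5, 5, 5, 5, 5, 5])
```

Intermediate values, such as the reported mean SI of 0.32, then correspond to cells that favour one domain but still respond appreciably to others.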
Most of the neurons that exhibited semantic selectivity responded to only one semantic domain and no other. Of the neurons that demonstrated selectivity, 84% ( n = 16; microarray) responded to one of the nine domains, with only 16% ( n = 3) showing response selectivity to two domains (two-sided rank-sum test, P < 0.05; Fig. 1f , top right). The response selectivity of these neurons was also robust to analytic choice, demonstrating a similarly high degree of selectivity when randomly sub-selecting words (SI = 0.33, CI = 0.32–0.33, rank-sum test when compared to the original SI values, z value = 0.44, P = 0.66, Fig. 1g , top right, and Methods (‘Evaluating the responses of neurons to semantic domains’)) or when selecting words that intuitively fit within their respective domains (SI = 0.30; rank-sum test compared to the original SI values, z value = 0.60, P = 0.55; Extended Data Fig. 2b and Extended Data Table 2 ). Moreover, they exhibited a similarly high degree of selectivity when selecting nonadjacent content words (SI = 0.34, CI = 0.26–0.42; Methods ), further confirming the consistency of neuronal response.
Finally, given these findings, we tested whether the neurons distinguished real words from nonwords (such as ‘blicket’ or ‘florp’, which sound like words but are meaningless), as might be expected of cells that represent meaning. Here we found that many neurons distinguished words from nonwords (27 of 48 neurons; microarray, in 7 participants for whom this control was carried out; two-tailed t -test, P < 0.05; Methods (‘Linguistic materials; Nonwords’)), meaning that they exhibited a consistent difference in their activities. Moreover, the ability to differentiate words from nonwords was not necessarily restricted to semantically selective neurons (Extended Data Fig. 3f , Neuropixels, and Extended Data Fig. 4 , microarray), together revealing a broad mixture of response selectivity to word meanings within the cell population.
Meaning representations by the semantically selective neurons were robust. Training multi-class decoders on the combined response patterns of the semantically selective cells, we found that these cell ensembles could reliably predict the semantic domains of randomly selected subsets of words not used for training (31 ± 7% s.d.; chance: 11%, permutation test, P < 0.01; Fig. 2a and Methods (‘Model decoding performance and the robustness of neuronal response’)). Moreover, similar decoding performances were observed when using a different embedding model (GloVe 45 ; 25 ± 5%; permutation test, P < 0.05; Fig. 2b ) or when selecting different recorded time points within the sentences (that is, the first half versus the second half of the sentences; Extended Data Fig. 5a ). Similar decoding performances were also observed when randomly subsampling neurons from across the population (Extended Data Fig. 5c–e ), or when examining multi-unit activities for which no spike sorting was carried out (permutation test, P < 0.05; Methods (‘Multi-unit isolation’) and Extended Data Fig. 5b ). In tandem, these analyses therefore suggested that the words’ meanings were robustly represented within the population’s response patterns.
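The decoding analysis above amounts to training a multi-class classifier on firing-rate patterns and scoring it on held-out words. A schematic stand-in is sketched below with a simple nearest-centroid classifier and synthetic data (19 cells, 9 domains, matching the study's counts); the actual decoders and data are described in the paper's Methods.

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, n_cells, n_domains = 450, 19, 9

# Synthetic stand-in data: each semantic domain has a tuning profile
# across the cells, and each word evokes that profile plus noise.
labels = rng.integers(0, n_domains, size=n_words)          # domain per word
tuning = rng.normal(0, 1, size=(n_domains, n_cells))       # per-domain tuning
rates = tuning[labels] + rng.normal(0, 0.5, size=(n_words, n_cells))

# Hold out 40% of the words for testing, as in a train/test split.
split = int(0.6 * n_words)
train_x, test_x = rates[:split], rates[split:]
train_y, test_y = labels[:split], labels[split:]

# Train: one mean firing-rate centroid per semantic domain.
centroids = np.stack([train_x[train_y == d].mean(axis=0)
                      for d in range(n_domains)])

# Test: assign each held-out word to the nearest domain centroid.
dists = np.linalg.norm(test_x[:, None, :] - centroids[None], axis=2)
accuracy = (dists.argmin(axis=1) == test_y).mean()
```

With nine balanced domains, chance performance is 1/9 ≈ 11%, which is the baseline the reported 31% decoding accuracy is compared against.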
a , Left: projected probabilities of correctly predicting the semantic domain to which individual words belonged over a representative sentence. Right: the cumulative decoding performance (±s.d.) of all semantically selective neurons during presentation of sentences (blue) versus chance (orange); see also Extended Data Fig. 4b . b , Decoding performances (±s.d.) across two independent embedding models (Word2Vec and GloVe). c , Left: the absolute difference in neuronal responses ( n = 115) for homophone pairs that sounded the same but differed in meaning (red) compared to that of non-homophone pairs that sounded different but shared similar meanings (blue; two-sided permutation test). Right: scatter plot displaying each neuron’s absolute difference in activity for homophone versus non-homophone pairs ( P < 0.0001, one-sided t -test comparing linear fit to identity line). d , Peri-stimulus spike histogram (mean ± s.e.m.) and raster from a representative neuron when hearing words within sentences (top) compared to words within random word lists (bottom). The horizontal green bars mark the window of analysis (100–500 ms from onset). e , Left: SI distributions for neurons during word-list and sentence presentations together with the number of neurons that responded selectively to one or more semantic domains (inset). Right: the SI for neurons (mean with 95% confidence limit, n = 9; excluding zero firing rate neurons) during word-list presentation. These neurons did not exhibit changes in mean firing rates when comparing all sentences versus word lists independently of semantic domains (rank-sum test, P = 0.16).
We also examined whether the activities of the neurons could be generalized to an entirely new set of naturalistic narratives. Here, for three of the participants, we additionally introduced short story excerpts that were thematically and stylistically different from the sentences and that contained new words (Extended Data Table 1 ; 70 unique words of which 28 were shared with the sentences). We then used neuronal activity recorded during the presentation of sentences to decode semantic domains for words heard during these stories ( Methods (‘Linguistic materials; Story narratives’)). We find that, even when using this limited subset of semantically selective neurons ( n = 9; microarray), models that were originally trained on activity recorded during the presentation of sentences could predict the semantic domains of words heard during the narratives with significant accuracy (28 ± 5%; permutation test, P < 0.05; Extended Data Fig. 6 ).
Finally, to confirm the consistency of these semantic representations, we evaluated neuronal responses across the different participants and recording techniques. Here we found similar results across individuals (permutation test, P < 0.01) and clinical conditions ( χ 2 = 2.33, P = 0.31; Methods (‘Confirming the robustness of neuronal response across participants’) and Extended Data Fig. 2c–f ), indicating that the results were not driven by any single participant or a small subset of participants. We also evaluated the consistency of semantic representations in the three participants who underwent Neuropixels recordings and found that the activities of semantically selective neurons in these participants could be used to reliably predict the semantic domains of words not used for model fitting (29 ± 7%; permutation test, P < 0.01; Extended Data Fig. 3c ) and that they were comparable across embedding models (GloVe; 30 ± 6%). Collectively, decoding performance across the 13 participants (48 of 287 semantically selective neurons in total) was 36 ± 7% and significantly higher than expected from chance (permutation test, P < 0.01; Methods ). These findings therefore together suggested that these meaning representations by semantically selective neurons were both generalizable and robust.
An additional core property of language is our ability to interpret words on the basis of the sentence contexts in which they appear 46 , 47 . For example, hearing the sequences of words “He picked the rose…” versus “He finally rose…” allows us to correctly interpret the meaning of the ambiguous word ‘rose’ as a noun or a verb. It also allows us to differentiate homophones—words that sound the same but differ in meaning (such as ‘sun’ and ‘son’)—on the basis of their contexts.
Therefore, to first evaluate the degree to which the meaning representations by neurons are sentence-context dependent, seven of the participants were presented with a word-list control that contained the same words as those heard in the sentences but presented them in random order (for example, “to pirate with in bike took is one”; Extended Data Table 1 ), thus largely removing the influence of context on lexical (word-level) processing. Here we find that the SI of the neurons that exhibited semantic selectivity in the sentence condition dropped from a mean of 0.34 ( n = 9 cells; microarray, CI = 0.25–0.43) to 0.19 (CI = 0.07–0.31) during the word-list presentation (signed-rank test, z (17) = 40, P = 0.02; Fig. 2d,e ), in spite of similar mean population firing rates 48 (two-sided rank-sum test, z value = 0.10, P = 0.16). The results were similar for the Neuropixels recordings, where the SI dropped from 0.39 (CI = 0.33–0.45) during the presentation of sentences to 0.29 (CI = 0.19–0.39) during word-list presentation (Extended Data Fig. 3e ; signed-rank test, z (41) = 168, P = 0.035). These findings therefore suggested that the response selectivity of these neurons was strongly influenced by the word’s context and that these changes were independent of potential variations in attentional engagement, as evidenced by similar overall firing rates between the sentences and word lists 48 .
Second, to test whether the neurons’ activity reflected the words’ meanings independently of their word-form similarity, we used homophone pairs that are phonetically identical but differ in meaning (for example, ‘sun’ versus ‘son’; Extended Data Table 1 ). Here we find that neurons across the population exhibited a larger difference in activity for words that sounded the same but had different meanings (that is, homophones) compared to words that sounded different but belonged to the same semantic domain (permutation test, P < 0.0001; n = 115 cells; microarray, for which data were available; Figs. 2c and 3a and Methods (‘Evaluating the context dependency of neuronal response using homophone pairs’)). These neurons therefore encoded the words’ meanings independently of their sound-level similarity.
a , Differences in neuronal activity comparing homophone (for example, ‘son’ and ‘sun’; blue) to non-homophone (for example, ‘son’ and ‘dad’; red) pairs across participants using a participant-dropping procedure (two-sided paired t -test, P < 0.001 for all participants). b , Left: decoding accuracies for words that showed high versus low surprisal based on the preceding sentence contexts in which they were heard. Words with lower surprisal were more predictable on the basis of their preceding word sequence. Actual and chance decoding performances are shown in blue and orange, respectively (mean ± s.d., one-sided rank-sum test, z value = 26, P < 0.001). Right: a regression analysis on the relation between decoding performance and surprisal.
Last, we quantified the degree to which the words’ meanings could be predicted from the sentences in which they appeared. Here we reasoned that words that were more likely to occur on the basis of their preceding word sequence and context should be easier to decode. Using a long short-term memory model to quantify each word’s surprisal based on its sentence context ( Methods (‘Evaluating the context dependency of neuronal response using surprisal analysis’)), we find that decoding accuracies for words that were more predictable were significantly higher than for words that were less predictable (comparing top versus bottom deciles; 26 ± 14% versus 10 ± 9% respectively, rank-sum test, z value = 26, P < 0.0001; Fig. 3b ). Similar findings were also obtained from the Neuropixels recordings (rank-sum test, z value = 25, P < 0.001; Extended Data Fig. 3d ), indicating that information about the sentences was being tracked and that it influenced neuronal response. These findings therefore together suggested that the activities of these neurons were dynamic, reflecting processing of the words’ meanings based on their specific sentence contexts and independently of their phonetic form.
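Surprisal, as used above, is the standard information-theoretic measure of how unexpected a word is given its context: surprisal(w) = -log2 P(w | preceding words). The probabilities in the sketch below are invented for illustration; the study estimated them with a long short-term memory language model.

```python
import math

def surprisal(p_word_given_context):
    """Surprisal in bits: -log2 of the word's conditional probability."""
    return -math.log2(p_word_given_context)

# A highly predictable word carries little surprisal...
low = surprisal(0.5)    # 1 bit
# ...while an unlikely continuation carries much more.
high = surprisal(0.01)  # about 6.64 bits
```

The finding that low-surprisal (more predictable) words were decoded more accurately is what links the neuronal responses to the running sentence context rather than to isolated word identity.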
The above observations suggested that neurons within the population encoded information about the words’ meanings during comprehension. How they may represent the higher-order semantic relationships among words, however, remained unclear. Therefore, to further probe the organization of neuronal representations of meaning at the level of the cell population, we regressed the responses of the neurons ( n = 133) onto the embedding vectors of all words in the study vocabulary (that is, a matrix of n words × 300 embedding dimensions), resulting in a set of model weights for the neurons (Fig. 4a , left, and Methods (‘Determining the relation between the word embedding space and neural response’)). These model weights were then concatenated (dimension = 133 × 300) to define a putative neuronal–semantic space. Each model weight can therefore be interpreted as the contribution of a particular dimension in the embedding space to the activity of a given neuron, such that the resulting transformation matrix reflects the semantic relationships among words as represented by the population 11 , 34 , 37 .
a , Left: the activity of each neuron was regressed onto 300-dimensional word embedding vectors. A PC analysis was then used to dimensionally reduce this space from the concatenated set of model parameters such that the cosine distance between each projection reflected the semantic relationship between words as represented by the neural population. Right: PC space with arrows highlighting two representative word projections. The explained variance and correlation between cosine distances for word projections derived from the word embedding space versus neural data ( n = 258,121 possible word pairs) are shown in Extended Data Fig. 7a,b . b , Left: activities of neurons for word pairs based on their vectoral cosine distance within the 300-dimensional embedding space ( z -scored activity change over percentile cosine similarity, red regression line; Pearson’s correlation, r = 0.17). Right: correlation between vectoral cosine distances in the word embedding space and difference in neuronal activity across possible word pairs (orange) versus chance distribution (grey, n = 1,000, P = 0.02; Extended Data Fig. 7c ). c , Left: scatter plot showing the correlation between population-averaged neuronal activity and the cophenetic distances between words ( n = 100 bins) derived from the word embedding space (red regression line; Pearson’s correlation, r = 0.36). Right: distribution of correlations between cophenetic distances and neuronal activity across the different participants ( n = 10).
Applying a principal component (PC) analysis to these weights, we find that the first five PCs accounted for 46% of the variance in neural population activity (Fig. 4a right and Extended Data Fig. 7a ) and 81% of the variance for the semantically selective neurons (Extended Data Fig. 3g for the Neuropixels recordings). Moreover, when projecting words back into this PC space, we find that the vectoral distances between neuronal projections significantly correlated with the dimensionally reduced word distances in the original word embeddings (258,121 possible word pairings; r = 0.04, permutation test, P < 0.0001; Extended Data Fig. 7b ). Significant correlations between word similarity and neuronal activity were also observed when using a non-embedding approach based on the ‘synset’ similarity metric (WordNet; r = −0.76, P = 0.001; Extended Data Fig. 7d ) as well as when comparing the vectoral distances in the word embeddings to the raw firing activities of the neurons ( r = 0.17; permutation test, one-sided, P = 0.02, Fig. 4b and Extended Data Fig. 7c for microarray recordings and r = 0.21; Pearson’s correlation, P < 0.001; Extended Data Fig. 3h for Neuropixels recordings). Our findings therefore suggested that these cell populations reliably captured the semantic relationships among words.
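The regression-then-PCA pipeline described above can be sketched in a few lines. Dimensions are shrunk (20 neurons, 30 embedding dimensions) and the data are synthetic, so only the shapes and the sequence of operations mirror the analysis; the actual fitting details are in the paper's Methods.

```python
import numpy as np

rng = np.random.default_rng(1)
n_words, n_dims, n_neurons = 200, 30, 20
embeddings = rng.normal(size=(n_words, n_dims))    # word x embedding dim
responses = rng.normal(size=(n_words, n_neurons))  # word x neuron firing rate

# Regress each neuron's responses onto the embedding dimensions; the
# stacked weight vectors form the putative neuronal-semantic space.
weights, *_ = np.linalg.lstsq(embeddings, responses, rcond=None)
weights = weights.T                                # n_neurons x n_dims

# PCA via SVD of the mean-centred weight matrix.
centred = weights - weights.mean(axis=0)
_, s, _ = np.linalg.svd(centred, full_matrices=False)
explained = s**2 / np.sum(s**2)                    # variance per PC
top5 = explained[:5].sum()                         # share in first five PCs
```

Projecting words back into this PC space and comparing their pairwise distances with those in the original embedding space is what yields the correlation the authors report.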
Finally, to evaluate whether and to what degree neuronal activity reflected the hierarchical semantic relationship between words, we compared differences in firing activity for each word pair to the cophenetic distances between those words 49 , 50 , 51 in the 300-dimensional word embedding space ( Methods (‘Estimating the hierarchical structure and relation between word projections’)). Here the cophenetic distance between a pair of words reflects the height of the dendrogram where the two branches that include these two words merge into a single branch. Using an agglomerative hierarchical clustering procedure, we find that the activities of the semantically selective neurons closely correlated with the cophenetic distances between words across the study vocabulary ( r = 0.38, P = 0.004; Fig. 4c ). Therefore, words that were connected by fewer links in the hierarchy and thus more likely to share semantic features (for example, ‘ducks’ and ‘eggs’) 50 , 51 elicited smaller differences in activity than words that were connected by a larger number of links (for example, ‘eggs’ and ‘doorbell’; Fig. 5 and Methods (‘ t -stochastic neighbour embedding procedure’)). These results therefore together suggested that these cell ensembles encoded richly detailed information about the hierarchical semantic relationship between words.
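The cophenetic-distance computation described above is a standard hierarchical-clustering operation and can be sketched directly. The toy 2-D vectors below stand in for the 300-dimensional embeddings: two "farm" words sit close together and two "object" words sit far away, so the related pair merges low in the dendrogram.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

# Made-up low-dimensional stand-ins for word embedding vectors.
words = ["ducks", "eggs", "doorbell", "hammer"]
vecs = np.array([[0.0, 0.1], [0.1, 0.0],   # related pair, close together
                 [5.0, 5.1], [5.1, 5.0]])  # unrelated pair, far away

pairwise = pdist(vecs)                     # condensed distance matrix
Z = linkage(pairwise, method="average")    # agglomerative clustering
_, coph = cophenet(Z, pairwise)            # condensed cophenetic distances

# Condensed pair order for 4 items: (0,1),(0,2),(0,3),(1,2),(1,3),(2,3).
d_close = coph[0]  # 'ducks'-'eggs': branches merge low in the tree
d_far = coph[1]    # 'ducks'-'doorbell': branches merge only at the top
```

Correlating such cophenetic distances with per-pair differences in firing activity is the comparison behind the reported r = 0.38.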
a , An agglomerative hierarchical clustering procedure was carried out on all word projections in PC space obtained from the neuronal population data. The dendrogram shows representative word projections, with the branches truncated to allow for visualization. Words that were connected by fewer links in the hierarchy have a smaller cophenetic distance. b , A t -stochastic neighbour embedding procedure was used to visualize all word projections (in grey) by collapsing them onto a common two-dimensional manifold. For comparison, representative words are further colour-coded on the basis of their original semantic domain assignments in Fig. 1c .
Neurons are the most basic computational units by which information is encoded in the brain. Yet, despite a growing understanding of the neural substrates of linguistic 4 , 5 , 6 , 7 , 8 , 9 , 10 , 11 , 12 and semantic processing 11 , 37 , 41 , understanding how individual neurons represent semantic information during comprehension in humans has largely remained out of reach. Here, using single-neuronal recordings during natural speech processing, we discover cells in the prefrontal cortex of the language-dominant hemisphere that responded selectively to particular semantic domains and that exhibited preferential responses to specific word meanings. More notably, the combined activity patterns of these neurons could be used to accurately decode the semantic domain to which the words belonged even when tested across entirely different linguistic materials (that is, story narratives), suggesting a process that could allow semantic information to be reliably extracted during comprehension at the cellular scale. Lastly, to understand language, the meanings of words likely need to be robustly represented within the brain, entailing not only similar representations for words that share semantic features (for example, ‘mouse’ and ‘rat’) but also sufficiently distinct representations for words that differ in meaning (for example, ‘mouse’ and ‘carrot’). Here we find a putative cellular process that could support such robust word meaning representations during language comprehension.
Collectively, these findings imply that focal cortical areas such as the one from which we recorded here may be potentially able to represent complex meanings largely in their entirety. Although we sampled cells from a relatively restricted prefrontal region of the language-dominant hemisphere, these cell populations were capable of decoding meanings—at least at a relatively coarse level of semantic granularity—of a large set of diverse words and across independent sets of linguistic materials. The responses of these cell ensembles also harboured detailed information about the hierarchical relationship between words across thousands of word pairs, suggesting a cellular mechanism that could allow semantic information to be rapidly mapped onto the population’s response patterns, in real time during speech.
Another notable observation from these recordings is that the activities of the neurons were highly context dependent, reflecting the words’ meanings based on the specific sentences in which they were heard even when they were phonetically indistinguishable. Sentence context is essential to our ability to home in on the precise meaning or aspects of meaning needed to infer complex ideas from linguistic utterances, and is proposed to play a key role in language comprehension 46 , 47 , 52 . Here we find that the neurons’ responses were highly dynamic, reflecting the meaning of the words within their respective contexts, even when the words were identical in form. Loss of sentence context or less predictive contexts, on the other hand, diminished the neurons’ ability to differentiate among semantic representations. Therefore, rather than simply responding to words as fixed stored memory representations, these neurons seemed to adaptively represent word meanings in a context-dependent manner during natural speech processing.
Taken together, these findings reveal a highly detailed representation of semantic information within prefrontal cortical populations, and a cellular process that could allow the meaning of words to be accurately decoded in real time during speech. As the present findings focus on auditory language processing, however, it is also interesting to speculate whether these semantic representations may be modality independent, generalizing to reading comprehension 53 , 54 , or even to non-linguistic stimuli such as pictures, videos or non-speech sounds. Further, it remains to be discovered whether similar semantic representations would be observed across languages, including in bilingual speakers, and whether accessing word meanings in language comprehension and production would elicit similar responses (for example, whether the representations would be similar when participants understand the word ‘sun’ versus produce the word ‘sun’). It is also unknown whether similar semantic selectivity is present across other parts of the brain such as the temporal cortex, how finer-grained distinctions are represented, and how representations of specific words are composed into phrase- and sentence-level meanings.
Our study provides an initial framework for studying linguistic and semantic processing during comprehension at the level of individual neurons. It also highlights the potential benefit of using different recording techniques, linguistic materials and analytic approaches to evaluate the generalizability and robustness of neuronal responses. In particular, our study demonstrates that findings from the two recording approaches (tungsten microarray recordings and Neuropixels recordings) were highly concordant and suggests a platform from which to begin carrying out similar comparisons (especially in light of the increasing emphasis on robustness and replicability in the field). Collectively, our findings provide evidence of single neurons that encode word meanings during comprehension and a process that could support our ability to derive meaning from speech, opening the door for addressing a multitude of further questions about human-unique communicative abilities.
All procedures and studies were carried out in accordance with the Massachusetts General Hospital Institutional Review Board and in strict adherence to Harvard Medical School guidelines. All participants included in the study were scheduled to undergo planned awake intraoperative neurophysiology and single-neuronal recordings for deep brain stimulation targeting. Consideration for surgery was made by a multidisciplinary team including neurologists, neurosurgeons and neuropsychologists 18 , 19 , 55 , 56 , 57 . The decision to carry out surgery was made independently of study candidacy or enrolment. Further, all microelectrode entry points and placements were based purely on planned clinical targeting and were made independently of any study consideration.
Only after a patient had consented and been scheduled for surgery was their candidacy for participation in the study reviewed with respect to the following inclusion criteria: 18 years of age or older, right-hand dominant, capacity to provide informed consent for study participation and demonstration of English fluency. To evaluate for language comprehension and the capacity to participate in the study, the participants were given randomly sampled sentences and were then asked questions about them (for example, “Eva placed a secret message in a bottle” followed by “What was placed in the bottle?”). Participants not able to answer all questions on testing were excluded from consideration. All participants gave informed consent to participate in the study and were free to withdraw at any point without consequence to clinical care. A total of 13 participants were enrolled (Extended Data Table 1 ). No participant blinding or randomization was used.
Acute intraoperative single-neuronal recordings.
Microelectrode recordings were performed in participants undergoing planned deep brain stimulator placement 19 , 58 . During standard intraoperative recordings before deep brain stimulator placement, microelectrode arrays are used to record neuronal activity; before these clinical recordings and stimulator placement, recordings were transiently made from the cortical ribbon at the planned clinical placement site. These recordings were largely centred along the superior posterior middle frontal gyrus within the dorsal prefrontal cortex of the language-dominant hemisphere. Here each participant’s computed tomography scan was co-registered to their magnetic resonance imaging scan, and a segmentation and normalization procedure was carried out to bring native brains into Montreal Neurological Institute space. Recording locations were then confirmed using SPM12 software and were visualized on a standard three-dimensional rendered brain (spm152). The Montreal Neurological Institute coordinates for recordings are provided in Extended Data Table 1 , top.
We used two main approaches to perform single-neuronal recordings from the cortex 18 , 19 . Altogether, ten participants underwent recordings using tungsten microarrays (Neuroprobe, Alpha Omega Engineering) and three underwent recordings using linear silicon microelectrode arrays (Neuropixels, IMEC). For the tungsten microarray recordings, we incorporated a Food and Drug Administration-approved, biodegradable, fibrin sealant that was first placed temporarily between the cortical surface and the inner table of the skull (Tisseel, Baxter). Next, we incrementally advanced an array of up to five tungsten microelectrodes (500–1,500 kΩ; Alpha Omega Engineering) into the cortical ribbon at 10–100 µm increments to identify and isolate individual units. Once putative units were identified, the microelectrodes were held in position for a few minutes to confirm signal stability (we did not screen putative neurons for task responsiveness). Here neuronal signals were recorded using a Neuro Omega system (Alpha Omega Engineering) that sampled the neuronal data at 44 kHz. Neuronal signals were amplified, band-pass-filtered (300 Hz and 6 kHz) and stored off-line. Most individuals underwent two recording sessions. After neural recordings from the cortex were completed, subcortical neuronal recordings and deep brain stimulator placement proceeded as planned.
For the silicon microelectrode recordings, sterile Neuropixels probes 31 (version 1.0-S, IMEC, ethylene oxide sterilized by BioSeal) were advanced into the cortical ribbon with a manipulator connected to a ROSA ONE Brain (Zimmer Biomet) robotic arm. The probes (width: 70 µm, length: 10 mm, thickness: 100 µm) consisted of 960 contact sites (384 preselected recording channels) that were laid out in a chequerboard pattern. A 3B2 IMEC headstage was connected via a multiplexed cable to a PXIe acquisition module card (IMEC), installed into a PXIe chassis (PXIe-1071 chassis, National Instruments). Neuropixels recordings were performed using OpenEphys (versions 0.5.3.1 and 0.6.0; https://open-ephys.org/ ) on a computer connected to the PXIe acquisition module recording the action potential band (band-pass-filtered from 0.3 to 10 kHz, sampled at 30 kHz) as well as the local field potential band (band-pass-filtered from 0.5 to 500 Hz, sampled at 2,500 Hz). Once putative units were identified, the Neuropixels probe was held in position briefly to confirm signal stability (we did not screen putative neurons for speech responsiveness). Additional description of this recording approach can be found in refs. 20 , 30 , 31 . After completing single-neuronal recordings from the cortical ribbon, the Neuropixels probe was removed, and subcortical neuronal recordings and deep brain stimulator placement proceeded as planned.
For the tungsten microarray recordings, putative units were identified and sorted off-line through a Plexon workstation. To allow for consistency across recording techniques (that is, with the Neuropixels recordings), a semi-automated valley-seeking approach was used to classify the action potential activities of putative neurons and only well-isolated single units were used. Here, the action potentials were sorted to allow for comparable isolation distances across recording techniques 59 , 60 , 61 , 62 , 63 and unit selection with previous approaches 27 , 28 , 29 , 64 , 65 , and to limit the inclusion of multi-unit activity (MUA). Candidate clusters of putative neurons needed to clearly separate from channel noise, display a voltage waveform consistent with that of a cortical neuron, and have 99% or more of action potentials separated by an inter-spike interval of at least 1 ms (Extended Data Fig. 1b,d ). Units with clear instability were removed and any extended periods (for example, greater than 20 sentences) of little to no spiking activity were excluded from the analysis. In total, 18 recording sessions were carried out, for an average of 5.4 units per session per multielectrode array (Extended Data Fig. 1a,b ).
For the Neuropixels recordings, putative units were identified and sorted off-line using Kilosort and only well-isolated single units were used. We used Decentralized Registration of Electrophysiology Data (DREDge; https://github.com/evarol/DREDge ) software and an interpolation approach ( https://github.com/williamunoz/InterpolationAfterDREDge ) to motion correct the signal using an automated protocol that tracked local field potential voltages using a decentralized correlation technique that realigned the recording channels in relation to brain movements 31 , 66 . Following this, we interpolated the continuous voltage data from the action potential band using the DREDge motion estimate to allow the activities of the recorded units to be stably tracked over time. Finally, putative neurons were identified from the motion-corrected interpolated signal using a semi-automated Kilosort spike sorting approach (version 1.0; https://github.com/cortex-lab/KiloSort ) followed by Phy for cluster curation (version 2.0a1; https://github.com/cortex-lab/phy ). Here, an n -trode approach was used to optimize the isolation of single units and limit the inclusion of MUA 67 , 68 . Units with clear instability were removed and any extended periods (for example, greater than 20 sentences) of little to no spiking activity were excluded from analysis. In total, 3 recording sessions were carried out, for an average of 51.3 units per session per multielectrode array (Extended Data Fig. 1c,d ).
To provide comparison to the single-neuronal data, we also separately analysed MUA. These MUAs reflect the combined activities of multiple putative neurons recorded from the same electrodes as represented by their distinct waveforms 57 , 69 , 70 . These MUAs were obtained by separating all recorded spikes from their baseline noise. Unlike for the single units, the spikes were not separated on the basis of their waveform morphologies.
The linguistic materials were given to the participants in audio format using a Python script utilizing the PyAudio library (version 0.2.11). Audio signals were sampled at 22 kHz using two microphones (Shure, PG48) that were integrated into the Alpha Omega rig for high-fidelity temporal alignment with neuronal data. Audio recordings were annotated in semi-automated fashion (Audacity; version 2.3). For the Neuropixels recordings, audio recordings were carried out at a 44 kHz sampling frequency (TASCAM DR-40× 4-channel 4-track portable audio recorder and USB interface with adjustable microphone). To further ensure granular time alignment for each word token with neuronal activity, the amplitude waveform of each session recording and the pre-recorded linguistic materials were cross-correlated to identify the time offset. Finally, for additional confirmation, the occurrence of each word token and its timing was validated manually. Together, these measures allowed for the millisecond-level alignment of neuronal activity with each word occurrence as they were heard by the participants during the tasks.
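The cross-correlation step for aligning the session audio with the pre-recorded materials can be illustrated with a minimal numpy sketch. The signals here are synthetic, and the true offset is known only so that the estimate can be checked:

```python
import numpy as np

rng = np.random.default_rng(1)

# Pre-recorded stimulus waveform and a longer session recording in which the
# stimulus begins at a (here known) offset, on top of background noise.
stimulus = rng.standard_normal(1000)
true_offset = 250
session = 0.1 * rng.standard_normal(2000)
session[true_offset:true_offset + 1000] += stimulus

# Cross-correlate and take the lag of the peak as the estimated time offset.
xcorr = np.correlate(session, stimulus, mode="valid")
est_offset = int(np.argmax(xcorr))
print(est_offset)
```

In practice, the recovered offset would then be applied to shift the word-level annotations so that each word token aligns with the neuronal data, followed by the manual validation described above.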
The participants were presented with eight-word-long sentences (for example, “The child bent down to smell the rose”; Extended Data Table 1 ) that provided a broad sample of semantically diverse words across a wide variety of thematic contents and contexts 4 . To confirm that the participants were paying attention, a brief prompt was used every 10–15 sentences asking them whether we could proceed with the next sentence (the participants generally responded within 1–2 seconds).
Homophone pairs were used to evaluate for meaning-specific changes in neural activity independently of phonetic content. All of the homophones came from sentence experiments in which homophones were available and in which the words within the homophone pairs came from different semantic domains. Homophones (for example, ‘sun’ and ‘son’; Extended Data Table 1 ), rather than homographs, were used as the word embeddings produce a unique vector for each unique token rather than for each token sense.
A word-list control was used to evaluate the effect that sentence context had on neuronal response. These word lists (for example, “to pirate with in bike took is one”; Extended Data Table 1 ) contained the same words as those given during the presentation of sentences and were eight words long, but they were given in a random order, therefore removing any effect that linguistic context had on lexico-semantic processing.
A nonword control was used to evaluate the selectivity of neuronal responses to semantic (linguistically meaningful) versus non-semantic stimuli. Here the participants were given a set of nonwords such as ‘blicket’ or ‘florp’ (sets of eight) that sounded phonetically like words but held no meaning.
Excerpts from a story narrative were introduced at the end of recordings to evaluate for the consistency of neuronal response. Here, instead of the eight-word-long sentences, the participants were given a brief story about the life and history of Elvis Presley (for example, “At ten years old, I could not figure out what it was that this Elvis Presley guy had that the rest of us boys did not have”; Extended Data Table 1 ). This story was selected because it was naturalistic, contained new words, and was stylistically and thematically different from the preceding sentences.
Spectral clustering of semantic vectors.
To study the selectivity of neurons to words within specific semantic domains, all unique words heard by the participants were clustered into groups using a word embedding approach 35 , 37 , 39 , 42 . Here we used 300-dimensional vectors extracted from a pretrained dataset generated using a skip-gram Word2Vec 11 algorithm on a corpus of 100 billion words. Each unique word from the sentences was then paired with its corresponding vector in a case-insensitive fashion using the Python Gensim library (version 3.4.0; Fig. 1c , left). High unigram frequency words (log probability of greater than 2.5), such as ‘a’, ‘an’ or ‘and’, that held little linguistic meaning were removed.
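The word-to-vector pairing and removal of high-frequency function words can be sketched as follows, with toy two-dimensional vectors and hypothetical log-probability values standing in for the pretrained 300-dimensional Word2Vec table (the threshold and its sign below are illustrative assumptions):

```python
import numpy as np

# Toy stand-in for the pretrained Word2Vec table (real vectors are 300-d).
vectors = {"child": np.array([0.2, 0.9]),
           "rose": np.array([0.8, 0.3]),
           "the": np.array([0.5, 0.5])}
# Hypothetical unigram log-probabilities (higher = more frequent).
log_prob = {"child": -4.1, "rose": -5.3, "the": -1.2}

def embed(words, max_log_prob=-2.5):
    """Case-insensitive lookup, dropping very frequent function words."""
    out = {}
    for w in words:
        key = w.lower()
        if key in vectors and log_prob.get(key, -10.0) <= max_log_prob:
            out[key] = vectors[key]
    return out

print(sorted(embed(["The", "child", "rose"])))  # 'the' is filtered out
```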
Next, to group words heard by the participants into representative semantic domains, we used a spherical clustering algorithm (v.0.1.7, Python 3.6) based on the cosine distance between their representative vectors. We then carried out a k -means clustering procedure in this space to obtain distinct word clusters. This approach grouped words on the basis of their vectoral distance, reflecting the semantic relatedness between words 37 , 40 , and has been shown to work well for obtaining consistent word clusters 34 , 71 . Using pseudorandom initial cluster seeding, the k -means procedure was repeated 100 times to generate a distribution of values for the optimal number of clusters. For each iteration, a silhouette criterion was calculated for cluster numbers between 5 and 20. The cluster number with the greatest average criterion value (which was also the most frequent value) was 9, and this was taken as the optimal number of clusters for the linguistic materials used 34 , 37 , 43 , 44 .
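A minimal numpy sketch of spherical k-means, clustering unit-normalized vectors by cosine similarity with several pseudorandom initializations, is shown below. This stands in for the spherical clustering package used in the study, and the data are synthetic:

```python
import numpy as np

def spherical_kmeans(X, k, iters=30, n_init=10, seed=0):
    """Minimal spherical k-means: cluster unit vectors by cosine similarity,
    keeping the best of several pseudorandom initializations."""
    rng = np.random.default_rng(seed)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    best_labels, best_score = None, -np.inf
    for _ in range(n_init):
        centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
        for _ in range(iters):
            labels = np.argmax(X @ centroids.T, axis=1)  # nearest centroid by cosine
            for j in range(k):
                members = X[labels == j]
                if len(members):
                    c = members.sum(axis=0)
                    centroids[j] = c / np.linalg.norm(c)  # re-project to unit sphere
        labels = np.argmax(X @ centroids.T, axis=1)
        score = (X * centroids[labels]).sum()  # total within-cluster cosine similarity
        if score > best_score:
            best_labels, best_score = labels, score
    return best_labels

# Toy check: two well-separated direction clusters.
rng = np.random.default_rng(2)
a = rng.normal([5.0, 0.0, 0.0], 0.1, size=(30, 3))
b = rng.normal([0.0, 5.0, 0.0], 0.1, size=(30, 3))
labels = spherical_kmeans(np.vstack([a, b]), k=2)
```

In the study pipeline, this clustering would be repeated across candidate cluster numbers with a silhouette criterion used to select k.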
Purity measures and d ′ analysis were used to confirm the quality and separability of the semantic domains. To this end, we randomly sampled from 60% of the sentences across 100 iterations. We then grouped all words from these subsampled sentences into clusters using the same spherical clustering procedure described above. The new clusters were then matched to the original clusters by considering all possible matching arrangements and choosing the arrangement with the greatest word overlap. Finally, the clustering quality was evaluated for ‘purity’, which is the percentage of the total number of words that were classified correctly 72 . This is therefore a simple and transparent measure that varies from 0 (bad clustering) to 1 (perfect clustering; Fig. 1d , bottom). The accuracy of this assignment is determined by counting the total number of correctly assigned words and dividing by the total number of words in the new clusters:
\[{\rm{purity}}=\frac{1}{n}\mathop{\sum }\limits_{i=1}^{k}\mathop{\max }\limits_{j}\left|{\omega }_{i}\cap {c}_{j}\right|\]
in which n is the total number of words in the new clusters, k is the number of clusters (that is, 9), \({\omega }_{i}\) is a cluster from the set of new clusters \(\Omega \) , and \({c}_{j}\) is the original cluster (from the set of original clusters \({\mathbb{C}}\) ) that has the maximum count for cluster \({\omega }_{i}\) . Finally, to confirm the separability of the clusters, we used a standard d ′ analysis. The d ′ metric estimates the difference between vectoral cosine distances for all words assigned to a particular cluster compared to those assigned to all other clusters (Extended Data Fig. 2a ).
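The purity computation can be sketched directly from this definition, counting for each new cluster the size of its maximum overlap with any original cluster:

```python
import numpy as np

def purity(new_labels, orig_labels):
    """Fraction of words whose new cluster matches its best-overlap original cluster."""
    new_labels = np.asarray(new_labels)
    orig_labels = np.asarray(orig_labels)
    n = len(new_labels)
    correct = 0
    for w in np.unique(new_labels):
        members = orig_labels[new_labels == w]
        # count of the original cluster with maximum overlap with cluster w
        correct += np.bincount(members).max()
    return correct / n

# Toy check: one word out of eight lands in the "wrong" cluster.
new = [0, 0, 0, 0, 1, 1, 1, 1]
orig = [0, 0, 0, 1, 1, 1, 1, 1]
print(purity(new, orig))  # 7/8 = 0.875
```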
The resulting clusters were labelled here on the basis of the preponderance of words near the centroid of each cluster. Therefore, although not all words may seem to intuitively fit within each domain, the resulting semantic domains reflected the optimal vectoral clustering of words based on their semantic relatedness. To further allow for comparison, we also introduced refined semantic domains (Extended Data Table 2 ) in which the words provided within each cluster were additionally manually reassigned or removed by two independent study members on the basis of their subjective semantic relatedness. Thus, for example, under the semantic domain labelled ‘animals’, any word that did not refer to an animal was removed.
Evaluating the responses of neurons to semantic domains.
To evaluate the selectivity of neurons to words within the different semantic domains, we calculated their firing rates aligned to each word onset. To determine significance, we compared the activity of each neuron for words that belonged to a particular semantic domain (for example, ‘food’) to that for words from all other semantic domains (for example, all domains except for ‘food’). Using a two-sided rank-sum test, we then evaluated whether activity for words in that semantic domain was significantly different from activity for words in all other semantic domains, with the P value being false discovery rate-adjusted using a Benjamini–Hochberg method to account for repeated comparisons across all of the nine domains. Thus, for example, when stating that a neuron exhibited significant selectivity to the domain of ‘food’, this meant that it exhibited a significant difference in its activity for words within that domain when compared to all other words (that is, it responded selectively to words that described food items).
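The false discovery rate adjustment can be sketched as follows. The per-domain rank-sum tests themselves would typically use scipy.stats.ranksums; the Benjamini–Hochberg step below is implemented directly in numpy on illustrative p-values:

```python
import numpy as np

def bh_adjust(pvals):
    """Benjamini-Hochberg FDR adjustment of a vector of p-values."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    # raw adjusted values: p * m / rank, in ascending order of p
    ranked = p[order] * m / np.arange(1, m + 1)
    # enforce monotonicity from the largest p-value downwards
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.minimum(ranked, 1.0)
    return adjusted

# One illustrative p-value per semantic domain comparison.
print(bh_adjust([0.005, 0.04, 0.03, 0.20]))
```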
Next we determined the SI of each neuron, which quantified the degree to which it responded to words within specific semantic domains compared to the others. Here SI was defined by the cell’s ability to differentiate words within a particular semantic domain (for example, ‘food’) compared to all others and reflected the degree of modulation. The SI for each neuron was calculated as
\[{\rm{SI}}=\frac{\left|{{\rm{FR}}}_{{\rm{domain}}}-{{\rm{FR}}}_{{\rm{other}}}\right|}{{{\rm{FR}}}_{{\rm{domain}}}+{{\rm{FR}}}_{{\rm{other}}}}\]
in which \({{\rm{FR}}}_{{\rm{domain}}}\) is the neuron’s average firing rate in response to words within the considered domain and \({{\rm{FR}}}_{{\rm{other}}}\) is the average firing rate in response to words outside the considered domain. The SI therefore reflects the magnitude of effect based on the absolute difference in activity for each neuron’s preferred semantic domain compared to others. Therefore, the output of the function is bounded by 0 and 1. An SI of 0 would mean that there is no difference in activity across any of the semantic domains (that is, the neuron exhibits no selectivity) whereas an SI of 1.0 would indicate that a neuron changed its action potential activity only when hearing words within one of the semantic domains.
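Assuming the absolute-difference-over-sum form implied by the stated bounds, the SI can be computed as:

```python
def selectivity_index(fr_domain, fr_other):
    """SI = |FRdomain - FRother| / (FRdomain + FRother): 0 = no modulation,
    1 = the cell fires only for its preferred semantic domain."""
    return abs(fr_domain - fr_other) / (fr_domain + fr_other)

print(selectivity_index(3.0, 1.0))  # 0.5: partial selectivity
print(selectivity_index(2.0, 2.0))  # 0.0: no selectivity
print(selectivity_index(4.0, 0.0))  # 1.0: fires only for the preferred domain
```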
A bootstrap analysis was used to further confirm the reliability of each neuron’s SI across linguistic materials in two parts. For the first approach, the words were randomly split into 60:40% subsets (repeated 100 times) and the SI of semantically selective neurons was compared in both subsets of words. For the second, instead of using the mean SI, we calculated the proportion of times that a neuron exhibited selectivity for a category other than its preferred domain when randomly selecting words from 60% of the sentences.
The consistency of neuronal response across analysis windows was confirmed in two parts. The average time interval between the beginning of one word and the next was 341 ± 5 ms. For all primary analyses, neuronal responses were analysed in 400-ms windows, aligned to each word, with a 100-ms time-lag to further account for the evoked response delay of prefrontal neurons. To further confirm the consistency of semantic selectivity, we first examined neuronal responses using 350-ms and 450-ms time windows. Combining recordings across all 13 participants, a similar proportion of cells exhibiting selectivity was observed when varying the window size by ±50 ms (17% and 15%, χ 2 (1, 861) = 0.43, P = 0.81), suggesting that the precise window of analysis did not markedly affect these results. Second, we confirmed that possible overlap between words did not affect neuronal selectivity by repeating our analyses but now evaluating only non-neighbouring content words within each sentence. Thus, for example, for the sentence “The child bent down to smell the rose”, we would evaluate only non-neighbouring words (for example, child, down and so on) per sentence. Using this approach, we found that the SI for non-overlapping windows (that is, every other word) was not significantly different from the original SIs (0.41 ± 0.03 versus 0.38 ± 0.02, t = 0.73, P = 0.47), together confirming that potential overlap between words did not affect the observed selectivity.
To evaluate the degree to which semantic domains could be predicted from neuronal activity on a per-word level, we randomly sampled words from 60% of the sentences and then used the remaining 40% for validation across 1,000 iterations. Only candidate neurons that exhibited significant semantic selectivity and for which sufficient words and sentences were recorded were used for decoding purposes (43 of 48 total selective neurons). For these, we concatenated all of the candidate neurons from all participants together with their firing rates as independent variables, and predicted the semantic domains of words (dependent variable). Support vector classifiers (SVCs) were then used to predict the semantic domains to which the validation words belonged. These SVCs were constructed to find the optimal hyperplanes that best separated the data by performing
\[\mathop{\min }\limits_{w,b}\frac{1}{2}{\parallel w\parallel }^{2}+C\mathop{\sum }\limits_{i=1}^{n}{\zeta }_{i}\]
in which \(y\in {\left\{1,-1\right\}}^{n}\) , corresponding to the classification of individual words, \(x\) is the neural activity, and \({{\rm{\zeta }}}_{i}=\max \left(0,\,1-{y}_{i}\left(w{x}_{i}-b\right)\right)\) . The regularization parameter C was set to 1. We used a linear kernel and ‘balanced’ class weight to account for the inhomogeneous distribution of words across the different domains. Finally, after the SVCs were modelled on the bootstrapped training data, decoding accuracy for the models was determined by using words randomly sampled and bootstrapped from the validation data. We further generated a null distribution by calculating the accuracy of the classifier after randomly shuffling the cluster labels on 1,000 different permutations of the dataset. These models therefore together determine the most likely semantic domain from the combined activity patterns of all selective neurons. An empirical P value was then calculated as the percentage of permutations for which the decoding accuracy from the shuffled data was greater than the average score obtained using the original data. The statistical significance was determined at P value < 0.05.
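The decoding-with-permutation-null procedure can be sketched as follows on synthetic data. A nearest-centroid classifier stands in for the SVC here to keep the sketch dependency-free (the study itself used support vector classifiers with balanced class weights):

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic stand-in: 300 "words" x 10 "neurons", 3 semantic domains whose
# mean population firing rates differ.
domains = rng.integers(0, 3, size=300)
rates = rng.standard_normal((300, 10)) + domains[:, None] * 1.5

def decode(train_X, train_y, test_X):
    """Nearest-centroid decoder: assign each test word to the closest class mean."""
    centroids = np.array([train_X[train_y == d].mean(axis=0) for d in range(3)])
    dists = ((test_X[:, None, :] - centroids[None]) ** 2).sum(axis=2)
    return np.argmin(dists, axis=1)

# 60/40 train-validation split, as in the decoding procedure.
idx = rng.permutation(300)
train, test = idx[:180], idx[180:]
pred = decode(rates[train], domains[train], rates[test])
accuracy = (pred == domains[test]).mean()

# Null distribution: repeat the procedure with shuffled domain labels.
null = []
for _ in range(200):
    shuffled = rng.permutation(domains)
    null.append((decode(rates[train], shuffled[train], rates[test]) == domains[test]).mean())
p_value = (np.array(null) >= accuracy).mean()
print(accuracy, p_value)
```

The empirical P value is the fraction of label-shuffled permutations whose decoding accuracy meets or exceeds the accuracy obtained with the true labels.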
To quantify the specificity of neuronal response, we carried out two procedures. First, we reduced the number of words from each domain from 100% to 25% on the basis of their vectoral cosine distance from their respective domain’s centroid. Thus, for each domain, words that were closest to its centroid, and therefore most similar in meaning, were kept whereas words farther away were removed. The SIs of the neurons were then recalculated as before (Fig. 1h ). Second, we repeated the decoding procedure but now varied the number of semantic domains from 2 to 20. Thus, a higher number of domains would mean fewer words per domain (that is, increased specificity of meaning relatedness) whereas a smaller number of domains would mean more words per domain. These decoders used 60% of words for model training and 40% for validation (200 iterations). Next, to evaluate the degree to which neuron and domain number led to improvement in decoding performance, models were trained for all combinations of domain numbers (2 to 20) and neuron numbers (1 to 133) using a nested loop. For control comparison, we repeated the decoding analysis but randomly shuffled the relation between neuronal response and each word as above. The percentage improvement in prediction accuracy (PA) for a given domain number ( d ) and neuronal size ( n ) was calculated as
We compared the responses of neurons to homophone pairs to evaluate the context dependency of neuronal response and to further confirm the specificity of meaning representations. For example, if the neurons simply responded to differences in phonetic input rather than meaning, then we should expect to see smaller differences in firing rate between homophone pairs that sounded the same but differed in meaning (for example, ‘sun’ and ‘son’) compared to non-homophone pairs that sounded different but shared similar meaning (for example, ‘son’ and ‘sister’). Here, only homophones that belonged to different semantic domains were included for analysis. A permutation test was used to compare the distributions of the absolute difference in firing rates between homophone pairs (sample x) and non-homophone pairs (sample y) across semantically selective cells ( P < 0.01). To carry out the permutation test, we first calculated the mean difference between the two distributions (sample x and y) as the test statistic. Then, we pooled all of the measurements from both samples into a single dataset and randomly divided it into two new samples x′ and y′ of the same size as the original samples. We repeated this process 10,000 times, each time computing the difference in the mean of x′ and y′ to create a distribution of possible differences under the null hypothesis. Finally, we computed the two-sided P value as the proportion of permutations for which the absolute difference was greater than or equal to the absolute value of the test statistic. A one-tailed t -test was used to further evaluate for differences in the distribution of firing rates for homophones versus non-homophone pairs ( P < 0.001). Of the 133 neurons, 2 did not have homophone trials and were therefore excluded from this analysis. An additional 16 neurons were excluded for lack of response and/or for firing rates lying outside 2.5 times the interquartile range.
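The permutation test on the difference in means can be sketched as follows. The firing-rate differences below are synthetic; in the study, homophone pairs (same sound, different meaning) were expected to show larger activity differences than meaning-matched non-homophone pairs:

```python
import numpy as np

def permutation_test(x, y, n_perm=10000, seed=4):
    """Two-sided permutation test on the difference in means of samples x and y."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x, float), np.asarray(y, float)
    observed = x.mean() - y.mean()
    pooled = np.concatenate([x, y])
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        diff = perm[:len(x)].mean() - perm[len(x):].mean()
        if abs(diff) >= abs(observed):
            count += 1
    return count / n_perm

# Toy check: homophone-pair versus non-homophone-pair firing-rate differences.
rng = np.random.default_rng(5)
homophone_diffs = rng.normal(2.0, 0.5, size=40)     # larger differences
nonhomophone_diffs = rng.normal(0.5, 0.5, size=40)  # smaller differences
p = permutation_test(homophone_diffs, nonhomophone_diffs)
print(p)
```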
Information-theoretic metrics such as ‘surprisal’ quantify the degree to which a word can be predicted on the basis of its antecedent sentence context. To examine how the preceding context of each word modulated neuronal response on a per-word level, we quantified the surprisal of each word as follows:
\[{\rm{surprisal}}({w}_{i})=-\log P({w}_{i}|{w}_{1}\ldots {w}_{i-1})\]
in which P represents the probability of the current word ( w ) at position i within a sentence. Here, a pretrained long short-term memory recurrent neural network was used to estimate P ( w i | w 1 …w i −1 ) 73 . Words that are more predictable on the basis of their preceding context would therefore have a low surprisal whereas words that are poorly predictable would have a high surprisal.
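Assuming surprisal measured in bits (base-2 logarithm; the logarithm base is a convention, not stated in the text), the computation from a language model's contextual word probability is:

```python
import numpy as np

def surprisal(prob):
    """Surprisal in bits: -log2 of the word's contextual probability."""
    return -np.log2(prob)

# Hypothetical LM probabilities P(w_i | w_1 ... w_{i-1}) for two words.
print(surprisal(0.5))    # a highly predictable word: 1 bit
print(surprisal(0.001))  # a poorly predictable word: ~10 bits
```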
Next we examined how surprisal affected the ability of the neurons to accurately predict the correct semantic domains on a per-word level. To this end, we used SVC models similar to those described above, but now divided decoding performances between words that exhibited high versus low surprisal. Therefore, if the meaning representations of words were indeed modulated by sentence context, words that are more predictable on the basis of their preceding context should exhibit a higher decoding performance (that is, we should be able to predict their correct meaning more accurately from neuronal response).
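The logic of this comparison can be illustrated with a simple decoder. Here a nearest-centroid rule stands in for the SVC models, and the median split on surprisal and the even/odd train/test split are our simplifying assumptions, not the paper's exact procedure:

```python
import numpy as np

def nearest_centroid_accuracy(X_train, y_train, X_test, y_test):
    """Decode semantic-domain labels with a nearest-centroid rule
    (a simple stand-in for the paper's SVC models)."""
    labels = np.unique(y_train)
    centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in labels])
    dists = np.linalg.norm(X_test[:, None, :] - centroids[None, :, :], axis=2)
    return (labels[dists.argmin(axis=1)] == y_test).mean()

def decoding_by_surprisal(X, y, surprisals):
    """Compare decoding accuracy for words with low vs high surprisal,
    split at the median surprisal value.

    X: (n_words, n_neurons) firing rates; y: semantic-domain labels.
    """
    scores = {}
    mask_low = surprisals <= np.median(surprisals)
    for name, m in [("low", mask_low), ("high", ~mask_low)]:
        Xm, ym = X[m], y[m]
        # even trials train the decoder, odd trials test it
        scores[name] = nearest_centroid_accuracy(Xm[::2], ym[::2], Xm[1::2], ym[1::2])
    return scores
```

Under the paper's hypothesis, the 'low' (more predictable) group should decode better than the 'high' group on real data.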
To evaluate the organization of semantic representations within the neural population, we regressed the activity of each neuron onto the 300-dimensional embedded vectors. The normalized firing rate of each neuron was modelled as a linear combination of word embedding elements such that

$${F}_{i,w}={v}_{w}{\theta }_{i}+{\varepsilon }_{i}$$
in which \({F}_{i,w}\) is the firing rate of the i th neuron aligned to the onset of each word w , \({\theta }_{i}\) is a column vector of optimized linear regression coefficients, \({v}_{w}\) is the 300-dimensional word embedding row vector associated with word w , and \({\varepsilon }_{i}\) is the residual for the model. On a per-neuron basis, \({\theta }_{i}\) was estimated using regularized linear regression that was trained using least-squares error calculation with a ridge penalization parameter λ = 0.0001. The model values, \({\theta }_{i}\) , of each neuron (dimension = 1 × 300) were then concatenated (dimension = 133 × 300) to define a putative neuronal–semantic space θ . Together, these can therefore be interpreted as the contribution of a particular dimension in the embedding space to the activity of a given neuron, such that the resulting transformation matrix reflects the semantic space represented by the neuronal population.
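A numpy sketch of the per-neuron ridge fit, using the closed-form solution (names are illustrative):

```python
import numpy as np

def fit_theta(F_i, V, lam=1e-4):
    """Ridge-regress one neuron's normalized firing rates onto
    word embeddings, F_{i,w} ~ v_w . theta_i.

    F_i: (n_words,) firing rates aligned to word onsets.
    V:   (n_words, d) embedding matrix whose rows are the v_w.
    Returns theta_i with shape (d,).
    """
    d = V.shape[1]
    # Closed-form ridge solution: (V^T V + lam * I)^(-1) V^T F
    return np.linalg.solve(V.T @ V + lam * np.eye(d), V.T @ F_i)
```

Stacking the per-neuron solutions row-wise then yields the 133 × 300 transformation matrix θ described in the text.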
Finally, a PC analysis was used to dimensionally reduce θ along the neuronal dimension. This resulted in an intermediately reduced space ( θ pca ) consisting of five PCs, each with dimension = 300, together accounting for approximately 46% of the explained variance (81% for the semantically selective neurons). As this procedure preserved the dimension with respect to the embedding length, the relative positions of words within this space could therefore be determined by projecting word embeddings along each of the PCs. Last, to quantify the degree to which the relation between word projections derived from this PC space (neuronal data) correlated with those derived from the word embedding space (English word corpus), we calculated their correlation across all word pairs. From a possible 258,121 word pairs (the availability of specific word pairs differed across participants), we compared the cosine distances between neuronal and word embedding projections.
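These steps can be sketched with an SVD-based PCA; dimensions and names here are illustrative (the paper retains five PCs, each of length 300):

```python
import numpy as np

def semantic_pcs(theta, n_pc=5):
    """Top principal components of theta (n_neurons x d_embed),
    reduced along the neuronal dimension so each PC keeps the full
    embedding length d_embed."""
    theta_c = theta - theta.mean(axis=0)              # centre across neurons
    _, s, vt = np.linalg.svd(theta_c, full_matrices=False)
    return vt[:n_pc], s                               # PCs and singular values

def project_words(V, pcs):
    """Project word embeddings (n_words x d_embed) onto the PCs,
    giving each word a position in the neuronal-semantic space."""
    return V @ pcs.T
```

Because each PC retains the embedding length, word embeddings can be projected directly onto the PCs, which is what allows the cosine-distance comparison against the original embedding space.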
As word projections in our PC space were vectoral representations, we could also calculate their hierarchical relations. Here we carried out an agglomerative single-linkage (that is, nearest neighbour) hierarchical clustering procedure to construct a dendrogram that represented the semantic relationships between all word projections in our PC space. We also investigated the correlation between the cophenetic distance in the word embedding space and difference in neuronal activity across all word pairs. The cophenetic distance between a word pair is a measure of inter-cluster dissimilarity and is defined as the distance between the largest two clusters that contain the two words individually when they are merged into a single cluster that contains both 49 , 50 , 51 . Intuitively, the cophenetic distance between a word pair reflects the height of the dendrogram where the two branches that include these two words merge into a single branch. Therefore, to further evaluate whether and to what degree neuronal activity reflected the hierarchical semantic relationship between words, as observed in English, we also examined the cophenetic distances in the 300-dimension word embedding space. For each word pair, we calculated the difference in neuronal activity (that is, the absolute difference between average normalized firing rates for these words across the population) and then assessed how these differences correlated with the cophenetic distances between words derived from the word embedding space. These analyses were performed on the population of semantically selective neurons ( n = 19). For further individual participant comparisons, the cophenetic distances were binned more finely and outliers were excluded to allow for comparison across participants.
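With scipy, the dendrogram and cophenetic distances can be computed directly. This is a sketch in which the cosine metric and single linkage follow the text, while the function name is ours:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

def cophenetic_distances(word_vectors):
    """Single-linkage dendrogram over word vectors and the pairwise
    cophenetic distances (condensed form), i.e., the dendrogram
    height at which each word pair first merges into one branch."""
    d = pdist(word_vectors, metric="cosine")   # pairwise cosine distances
    Z = linkage(d, method="single")            # nearest-neighbour clustering
    return Z, cophenet(Z)
```

Words that merge low in the dendrogram (small cophenetic distance) are semantically close; the text correlates these distances against differences in neuronal activity.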
To visualize the organization of word projections obtained from the PC analysis at the level of the population ( n = 133), we carried out a t-distributed stochastic neighbour embedding procedure that transformed each word projection into a new two-dimensional embedding space θ tsne (ref. 74 ). This transformation utilized cosine distances between word projections as derived from the neural data.
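A corresponding sketch with scikit-learn; parameter values such as the perplexity are our choices, not reported by the paper:

```python
import numpy as np
from sklearn.manifold import TSNE

def tsne_map(projections, rng=0):
    """Map word projections into a 2-D embedding (theta_tsne)
    using cosine distances, as in the visualization step."""
    tsne = TSNE(n_components=2, metric="cosine", init="random",
                perplexity=10, random_state=rng)
    return tsne.fit_transform(np.asarray(projections))
```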
To further validate our results using a non-embedding approach, we used WordNet similarity metrics 75 . Unlike embedding approaches, which are based on the modelling of vast language corpora, WordNet is a database of semantic relationships whereby words are organized into ‘synsets’ on the basis of similarities in their meaning (for example, ‘canine’ is a hypernym of ‘dog’ but ‘dog’ is also a coordinate term of ‘wolf’ and so on). Therefore, although synsets do not provide vectoral representations that can be used to evaluate neuronal response to specific semantic domains, they do provide a quantifiable measure of word similarity 75 that can be regressed onto neuronal activity.
Finally, to ensure that our results were not driven by any particular participant(s), we carried out a leave-one-out cross-validation participant-dropping procedure. Here we repeated several of the analyses described above but now sequentially removed individual participants (that is, participants 1–10) across 1,000 iterations. Therefore, if any particular participant or group of participants disproportionally contributed to the results, their removal would significantly affect them (one-way analysis of variance, P < 0.05). A χ 2 test ( P < 0.05) was used to further evaluate for differences in the distribution of neurons across participants.
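The participant-dropping control can be sketched as follows, using scipy's one-way ANOVA; this is a simplified, illustrative version with our own naming:

```python
import numpy as np
from scipy.stats import f_oneway

def participant_dropping(selectivity, participant_ids):
    """Recompute the selectivity distribution after removing each
    participant in turn, then test whether the leave-one-out groups
    differ (one-way ANOVA across the dropped-participant groups)."""
    pids = np.unique(participant_ids)
    dropped = [selectivity[participant_ids != p] for p in pids]
    means = [float(d.mean()) for d in dropped]
    _, p_val = f_oneway(*dropped)
    return means, p_val
```

A large P value indicates that no single participant's removal materially shifts the population result.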
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
All primary data supporting the findings of this study are available online at https://figshare.com/s/94962977e0cc8b405ef3 . Details of the participants’ demographics and task conditions are provided in Extended Data Table 1 .
All primary Python code supporting the findings of this study is available online at https://figshare.com/s/94962977e0cc8b405ef3 . Software packages used in this study are listed in the Nature Portfolio Reporting Summary along with their versions.
Mesgarani, N., Cheung, C., Johnson, K. & Chang, E. F. Phonetic feature encoding in human superior temporal gyrus. Science 343 , 1006–1010 (2014).
Theunissen, F. E. & Elie, J. E. Neural processing of natural sounds. Nat. Rev. Neurosci. 15 , 355–366 (2014).
Baker, C. I. et al. Visual word processing and experiential origins of functional selectivity in human extrastriate cortex. Proc. Natl Acad. Sci. USA 104 , 9087–9092 (2007).
Fedorenko, E., Nieto-Castanon, A. & Kanwisher, N. Lexical and syntactic representations in the brain: an fMRI investigation with multi-voxel pattern analyses. Neuropsychologia 50 , 499–513 (2012).
Humphries, C., Binder, J. R., Medler, D. A. & Liebenthal, E. Syntactic and semantic modulation of neural activity during auditory sentence comprehension. J. Cogn. Neurosci. 18 , 665–679 (2006).
Kemmerer, D. L. Cognitive Neuroscience of Language (Psychology Press, 2014).
Binder, J. R., Desai, R. H., Graves, W. W. & Conant, L. L. Where is the semantic system? A critical review and meta-analysis of 120 functional neuroimaging studies. Cereb. Cortex 19 , 2767–2796 (2009).
Liuzzi, A. G., Aglinskas, A. & Fairhall, S. L. General and feature-based semantic representations in the semantic network. Sci. Rep. 10 , 8931 (2020).
Fedorenko, E., Behr, M. K. & Kanwisher, N. Functional specificity for high-level linguistic processing in the human brain. Proc. Natl Acad. Sci. USA 108 , 16428–16433 (2011).
Hagoort, P. The neurobiology of language beyond single-word processing. Science 366 , 55–58 (2019).
Huth, A. G., de Heer, W. A., Griffiths, T. L., Theunissen, F. E. & Gallant, J. L. Natural speech reveals the semantic maps that tile human cerebral cortex. Nature 532 , 453–458 (2016).
Ralph, M. A., Jefferies, E., Patterson, K. & Rogers, T. T. The neural and computational bases of semantic cognition. Nat. Rev. Neurosci. 18 , 42–55 (2017).
Fedorenko, E., Blank, I. A., Siegelman, M. & Mineroff, Z. Lack of selectivity for syntax relative to word meanings throughout the language network. Cognition 203 , 104348 (2020).
Piantadosi, S. T., Tily, H. & Gibson, E. The communicative function of ambiguity in language. Cognition 122 , 280–291 (2012).
Tenenbaum, J. B., Kemp, C., Griffiths, T. L. & Goodman, N. D. How to grow a mind: statistics, structure, and abstraction. Science 331 , 1279–1285 (2011).
Kemp, C. & Tenenbaum, J. B. The discovery of structural form. Proc. Natl Acad. Sci. USA 105 , 10687–10692 (2008).
Grand, G., Blank, I. A., Pereira, F. & Fedorenko, E. Semantic projection recovers rich human knowledge of multiple object features from word embeddings. Nat. Hum. Behav. 6 , 975–987 (2022).
Jamali, M. et al. Dorsolateral prefrontal neurons mediate subjective decisions and their variation in humans. Nat. Neurosci. 22 , 1010–1020 (2019).
Patel, S. R. et al. Studying task-related activity of individual neurons in the human brain. Nat. Protoc. 8 , 949–957 (2013).
Khanna, A. R. et al. Single-neuronal elements of speech production in humans. Nature 626 , 603–610 (2024).
Jamali, M. et al. Single-neuronal predictions of others’ beliefs in humans. Nature 591 , 610–614 (2021).
Braga, R. M., DiNicola, L. M., Becker, H. C. & Buckner, R. L. Situating the left-lateralized language network in the broader organization of multiple specialized large-scale distributed networks. J. Neurophysiol. 124 , 1415–1448 (2020).
DiNicola, L. M., Sun, W. & Buckner, R. L. Side-by-side regions in dorsolateral prefrontal cortex estimated within the individual respond differentially to domain-specific and domain-flexible processes. J. Neurophysiol. 130 , 1602–1615 (2023).
Blank, I. A. & Fedorenko, E. No evidence for differences among language regions in their temporal receptive windows. Neuroimage 219 , 116925 (2020).
Walenski, M., Europa, E., Caplan, D. & Thompson, C. K. Neural networks for sentence comprehension and production: an ALE-based meta-analysis of neuroimaging studies. Hum. Brain Mapp. 40 , 2275–2304 (2019).
Tang, J., LeBel, A., Jain, S. & Huth, A. G. Semantic reconstruction of continuous language from non-invasive brain recordings. Nat. Neurosci. 26 , 858–866 (2023).
Amirnovin, R., Williams, Z. M., Cosgrove, G. R. & Eskandar, E. N. Visually guided movements suppress subthalamic oscillations in Parkinson’s disease patients. J. Neurosci. 24 , 11302–11306 (2004).
Tankus, A. et al. Subthalamic neurons encode both single- and multi-limb movements in Parkinson’s disease patients. Sci. Rep. 7 , 42467 (2017).
Justin Rossi, P. et al. The human subthalamic nucleus and globus pallidus internus differentially encode reward during action control. Hum. Brain Mapp. 38 , 1952–1964 (2017).
Coughlin, B. et al. Modified Neuropixels probes for recording human neurophysiology in the operating room. Nat. Protoc. 18 , 2927–2953 (2023).
Paulk, A. C. et al. Large-scale neural recordings with single neuron resolution using Neuropixels probes in human cortex. Nat. Neurosci. 25 , 252–263 (2022).
Landauer, T. K. & Dumais, S. T. A solution to Plato’s problem: the latent semantic analysis theory of the acquisition, induction, and representation of knowledge. Psychol. Rev. 104 , 211–240 (1997).
Lenci, A. Distributional models of word meaning. Annu. Rev. Linguist . 4 , 151–171 (2018).
Dhillon, I. & Modha, D. S. Concept decompositions for large sparse text data using clustering. Mach. Learn. 42 , 143–175 (2001).
Mikolov, T., Chen, K., Corrado, G. & Dean, J. Efficient estimation of word representations in vector space. Preprint at https://arxiv.org/abs/1301.3781 (2013).
Řehůřek, R. & Sojka, P. Software framework for topic modelling with large corpora. In Proc. LREC 2010 Workshop on New Challenges for NLP Frameworks 45–50 (2010); https://doi.org/10.13140/2.1.2393.1847 .
Pereira, F. et al. Toward a universal decoder of linguistic meaning from brain activation. Nat. Commun. 9 , 963 (2018).
Nishida, S. & Nishimoto, S. Decoding naturalistic experiences from human brain activity via distributed representations of words. Neuroimage 180 , 232–242 (2018).
Henry, S., Cuffy, C. & McInnes, B. T. Vector representations of multi-word terms for semantic relatedness. J. Biomed. Inform. 77 , 111–119 (2018).
Pereira, F., Gershman, S., Ritter, S. & Botvinick, M. A comparative evaluation of off-the-shelf distributed semantic representations for modelling behavioural data. Cogn. Neuropsychol. 33 , 175–190 (2016).
Wehbe, L. et al. Simultaneously uncovering the patterns of brain regions involved in different story reading subprocesses. PLoS ONE 9 , e112575 (2014).
Mikolov, T., Chen, K., Corrado, G. & Dean, J. Distributed representations of words and phrases and their compositionality. Adv. Neural Inf. Process. Syst. 26 , 3111–3119 (2013).
Rousseeuw, P. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20 , 53–65 (1987).
Wasserman, L. All of Statistics: A Concise Course in Statistical Inference (Springer, 2005).
Pennington J., Socher, R. & Manning C. D. GloVe: global vectors for word representation. In Proc. 2014 Conference on Empirical Methods in Natural Language Processing (eds Moschitti, A. et al.) 1532–1543 (Association for Computational Linguistics, 2014).
Rodd, J. M. Settling into semantic space: an ambiguity-focused account of word-meaning access. Perspect. Psychol. Sci . https://doi.org/10.1177/1745691619885860 (2020).
Schvaneveldt, R. W. & Meyer, D. E. Lexical ambiguity, semantic context, and visual word recognition. J. Exp. Psychol. Hum. Percept. Perform. 2 , 243–256 (1976).
McAdams, C. J. & Maunsell, J. H. Effects of attention on orientation-tuning functions of single neurons in macaque cortical area V4. J. Neurosci. 19 , 431–441 (1999).
Sokal, R. R. & Rohlf, J. The comparison of dendrograms by objective methods. Taxon 11 , 33–40 (1962).
Saraçli, S., Doğan, N. & Doğan, I. Comparison of hierarchical cluster analysis methods by cophenetic correlation. J. Inequalities Appl. 2013 , 203 (2013).
Hoxha, J., Jiang, G. & Weng, C. Automated learning of domain taxonomies from text using background knowledge. J. Biomed. Inform. 63 , 295–306 (2016).
Eddington, C. M. & Tokowicz, N. How meaning similarity influences ambiguous word processing: the current state of the literature. Psychon. Bull. Rev. 22 , 13–37 (2015).
Buchweitz, A., Mason, R. A., Tomitch, L. M. & Just, M. A. Brain activation for reading and listening comprehension: an fMRI study of modality effects and individual differences in language comprehension. Psychol. Neurosci. 2 , 111–123 (2009).
Jobard, G., Vigneau, M., Mazoyer, B. & Tzourio-Mazoyer, N. Impact of modality and linguistic complexity during reading and listening tasks. Neuroimage 34 , 784–800 (2007).
Williams, Z. M., Bush, G., Rauch, S. L., Cosgrove, G. R. & Eskandar, E. N. Human anterior cingulate neurons and the integration of monetary reward with motor responses. Nat. Neurosci. 7 , 1370–1375 (2004).
Sheth, S. A. et al. Human dorsal anterior cingulate cortex neurons mediate ongoing behavioural adaptation. Nature 488 , 218–221 (2012).
Amirnovin, R., Williams, Z. M., Cosgrove, G. R. & Eskandar, E. N. Experience with microelectrode guided subthalamic nucleus deep brain stimulation. Oper. Neurosurg. 58 , ONS-96–ONS-102 (2006).
Caro-Martin, C. R., Delgado-Garcia, J. M., Gruart, A. & Sanchez-Campusano, R. Spike sorting based on shape, phase, and distribution features, and K-TOPS clustering with validity and error indices. Sci. Rep. 8 , 17796 (2018).
Pedreira, C., Martinez, J., Ison, M. J. & Quian Quiroga, R. How many neurons can we see with current spike sorting algorithms? J. Neurosci. Methods 211 , 58–65 (2012).
Henze, D. A. et al. Intracellular features predicted by extracellular recordings in the hippocampus in vivo. J. Neurophysiol. 84 , 390–400 (2000).
Rey, H. G., Pedreira, C. & Quian Quiroga, R. Past, present and future of spike sorting techniques. Brain Res. Bull. 119 , 106–117 (2015).
Oliynyk, A., Bonifazzi, C., Montani, F. & Fadiga, L. Automatic online spike sorting with singular value decomposition and fuzzy C-mean clustering. BMC Neurosci. 13 , 96 (2012).
MacMillan, M. L., Dostrovsky, J. O., Lozano, A. M. & Hutchison, W. D. Involvement of human thalamic neurons in internally and externally generated movements. J. Neurophysiol. 91 , 1085–1090 (2004).
Sarma, S. V. et al. The effects of cues on neurons in the basal ganglia in Parkinson’s disease. Front. Integr. Neurosci. 6 , 40 (2012).
Windolf, C. et al. Robust online multiband drift estimation in electrophysiology data. In Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) https://doi.org/10.1109/ICASSP49357.2023.10095487 (2023).
Schmitzer-Torbert, N., Jackson, J., Henze, D., Harris, K. & Redish, A. D. Quantitative measures of cluster quality for use in extracellular recordings. Neuroscience 131 , 1–11 (2005).
Neymotin, S. A., Lytton, W. W., Olypher, A. V. & Fenton, A. A. Measuring the quality of neuronal identification in ensemble recordings. J. Neurosci. 31 , 16398–16409 (2011).
Oby, E. R. et al. Extracellular voltage threshold settings can be tuned for optimal encoding of movement and stimulus parameters. J. Neural Eng. 13 , 036009 (2016).
Perel, S. et al. Single-unit activity, threshold crossings, and local field potentials in motor cortex differentially encode reach kinematics. J. Neurophysiol. 114 , 1500–1512 (2015).
Banerjee, A., Dhillon, I. S., Ghosh, J. & Sra, S. Clustering on the unit hypersphere using von Mises-Fisher distributions. J. Mach. Learn. Res. 6 , 1345–1382 (2005).
Manning, C. D., Raghavan, P. & Schütze, H. Introduction to Information Retrieval (Cambridge Univ. Press, 2008).
Brennan, J. R., Dyer, C., Kuncoro, A. & Hale, J. T. Localizing syntactic predictions using recurrent neural network grammars. Neuropsychologia 146 , 107479 (2020).
Tenenbaum, J. B., de Silva, V. & Langford, J. C. A global geometric framework for nonlinear dimensionality reduction. Science 290 , 2319–2323 (2000).
Sigman, M. & Cecchi, G. A. Global organization of the Wordnet lexicon. Proc. Natl Acad. Sci. USA 99 , 1742–1747 (2002).
Fedorenko, E. et al. Neural correlate of the construction of sentence meaning. Proc. Natl Acad. Sci. USA 113 , E6256–E6262 (2016).
Willems, R. M., Frank, S. L., Nijhof, A. D., Hagoort, P. & van den Bosch, A. Prediction during natural language comprehension. Cereb. Cortex 26 , 2506–2516 (2016).
M.J. is supported by the Canadian Institutes of Health Research, a Brain & Behavior Research Foundation Young Investigator Grant and the Foundations of Human Behavior Initiative; B.G. is supported by the Neurosurgery Research & Education Foundation and a National Institutes of Health (NIH) National Research Service Award; A.R.K. and W.M. are supported by NIH R25NS065743; A.C.P. is supported by UG3NS123723, Tiny Blue Dot Foundation and P50MH119467; S.S.C. is supported by R44MH125700 and Tiny Blue Dot Foundation; E.F. is supported by U01NS121471 and R01 DC016950; and Z.M.W. is supported by NIH R01DC019653 and U01NS121616. We thank the participants; J. Schweitzer for assistance with the recordings; D. Lee, B. Atwater and Y. Kfir for data processing; and J. Tenenbaum, R. Futrell and Y. Cohen for their valuable feedback and suggestions.
These authors contributed equally: Mohsen Jamali, Benjamin Grannan
Department of Neurosurgery, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA
Mohsen Jamali, Benjamin Grannan, Jing Cai, Arjun R. Khanna, William Muñoz, Irene Caprara & Ziv M. Williams
Department of Neurology, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA
Angelique C. Paulk & Sydney S. Cash
Center for Neurotechnology and Neurorecovery, Department of Neurology, Massachusetts General Hospital, Boston, MA, USA
Department of Brain and Cognitive Sciences and McGovern Institute for Brain Research, Massachusetts Institute of Technology, Cambridge, MA, USA
Evelina Fedorenko
Harvard-MIT Division of Health Sciences and Technology, Boston, MA, USA
Ziv M. Williams
Harvard Medical School, Program in Neuroscience, Boston, MA, USA
M.J., B.G., A.R.K., W.M., I.C. and Z.M.W. carried out the experiments; M.J., B.G. and J.C. carried out neuronal analyses; A.C.P., S.S.C. and Z.M.W. developed the Neuropixels recording approach; M.J., W.M. and I.C. processed the Neuropixels data; E.F. provided linguistic materials and feedback; M.J., B.G., J.C., A.R.K., W.M., I.C., A.C.P., S.S.C. and E.F. edited the manuscript; and Z.M.W. conceived and designed the study, wrote the paper and supervised all aspects of the research.
Correspondence to Ziv M. Williams .
Competing interests.
The authors declare no competing interests.
Peer review information.
Nature thanks Peter Hagoort, Frederic Theunissen and Kareem Zaghloul for their contribution to the peer review of this work.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended Data Fig. 1 Language-related activity, recording stability, waveform morphology and isolation quality across recording techniques.
a , Example of waveform morphologies displaying mean waveform ± 3 s.d. and associated PC distributions used to isolate putative units from the tungsten microarray recordings. The horizontal bar indicates a 500 µs interval for scale. The gray areas in PC space represent noise. All single units recorded from the same electrode were required to display a high degree of separation in PC space. b , Isolation metrics of the single units obtained from the tungsten microarray recordings. c , Left , waveform morphologies observed across contacts in a Neuropixels array. Right , PC distributions used to isolate and cluster single units. d , Isolation distance and nearest neighbor noise overlap of the recorded units obtained from the Neuropixels arrays.
a , The d’ ( d -prime) indices measuring separability between the distribution of the vectoral cosine distances among all words within a cluster (purple) and those among all words across clusters (gray). The d’ indices were all above 2.5 reflecting strong separability. b , Selectivity index of neurons (mean with 95% CL, n = 19) when semantic domains were refined by moving or removing words whose meanings did not intuitively fit with their respective labels (Extended Data Table 2 ). c , There was no significant difference (χ 2 = 2.33, p = 0.31) in the proportions of neurons that displayed semantic selectivity based on the participants’ clinical conditions of essential tremor (ET), Parkinson’s disease (PD) or cervical dystonia (CD). d , Left , the proportional contribution per participant based on the total percentage of neurons contributed. Right , the proportional contribution of semantically selective cells per participant based on the fraction contributed. Participants without selective cells are not shown. e , A leave-one-out cross-validation participant-dropping procedure demonstrated that population results remained similar. Here, we sequentially removed individual participants (i.e., participants #1-10) and then repeated our selectivity analysis. Semantic selectivity across neurons was largely unaffected by removal of any of the participants (one-way ANOVA, F (9, 44) = 0.11, p = 0.99). Here, the mean selectivity indices (± s.e.m.) are separately presented after removing each participant. f , A cross-validation participant-dropping procedure was used to determine whether any of the participants disproportionately contributed to the population decoding. Average decoding results and comparison to the shuffled data are separately presented after removing each participant (permutation test, p < 0.01; #1-10).
a , Coincidence matrix illustrating the distribution of cells obtained from Neuropixels recordings that displayed selective responses to one or more semantic domains (two-tailed rank-sum test, p < 0.05, FDR adjusted). Inset , proportions of cells that displayed selective responses to one or more semantic domains. b , The distributions of SIs are shown separately for semantically-selective ( n = 29, orange) and non-selective ( n = 125, grey) cells. The mean SI of cells that did not display semantic selectivity ( n = 125) was 0.16 (one-sided rank-sum test, z -value = 7.2, p < 0.0001). Inset , selectivity index (SI) of each neuron ( n = 29) when compared across different semantic domains. c , The cumulative decoding performance (± s.d.) of all semantically selective neurons during sentences (blue) versus chance (orange). Inset , decoding performances (± s.d.) across two independent embedding models (Word2Vec and GloVe). d , Decoding accuracies for words that displayed high vs. low surprisal based on the preceding sentence contexts in which they were heard. Actual and chance decoding performances are shown in blue and orange, respectively (mean ± s.d., one-sided rank-sum test z -value = 25, p < 0.001). The inset shows a regression analysis on the relation between decoding performance and surprisal. e , Left , SI distributions for neurons during word list and sentence presentations together with the number of neurons that responded selectively to one or more semantic domains ( Inset ). Right , the SI for neurons (mean with 95% CL, n = 21; excluding zero firing rate neurons) during word-list presentation. The SI dropped from 0.39 (CI = 0.33-0.45) during the sentences to 0.29 (CI = 0.19-0.39) during word list presentation (signed-rank test, z (41) = 168, p = 0.035). f , The selectivity index of neurons for which nonword-list presentation was performed ( n = 26 of 153 cells were selective) when comparing their activities during sentences vs. nonwords (mean SI = 0.34, CI = 0.28-0.40). Here, the selectivity of each neuron reflects the degree to which it differentiates any semantic (meaningful) compared to non-semantic (nonmeaningful) information. g , Contribution to the variance explained in PC space for word projections across participants using a participant-dropping procedure. h , Activities of neurons for word pairs based on their vectoral cosine distance within the 300-dimensional embedding space ( z -scored activity change over percentile cosine similarity; Pearson’s correlation r = 0.21, p < 0.001).
a , The distributions of SIs are shown separately for cells that displayed significance for semantic information ( n = 19, orange) and those that did not ( n = 114, grey). The mean SI of cells that did not display semantic selectivity ( n = 114) was 0.14 (one-sided rank-sum test, z -value = 5.8, p < 0.0001). b , Decoding performances (mean ± s.d.) for cells that were not significantly selective for any particular semantic domain but which had an SI greater than 0.2 ( n = 11) compared to that of shuffled data (21 ± 6%; permutation test, p = 0.046). c , The selectivity index of neurons for which nonword-list presentation was performed ( n = 27 of 48 cells for which this control was performed displayed a significant difference in activity using a two-sided t -test) when comparing their responses to nonwords (i.e., that carried no linguistic meaning) versus sentences (i.e., that carried linguistic meaning; mean SI = 0.43, CI = 0.35-0.51). The semantically selective cells ( n = 6, red) displayed a similar word vs. nonword SI when compared to the non-semantically selective cells ( n = 21, orange; two-sided t -test, df = 26, p = 1.0). d , Peristimulus histograms (mean ± s.e.m.) and rasters of representative neurons when the participants were given words heard within sentences (red) or sets of nonwords (gray). The horizontal green bars display the 400 ms window of analysis.
a , Average decoding performances (± s.d., purple, n = 1000 iterations) were found to be slightly lower for words heard early (first 4 words) vs. late (last 4 words) within their respective sentences (23 ± 7% vs. 29 ± 8% decoding performance, respectively; one-sided rank-sum test, z -value = 17, p < 0.001) 76 , 77 . The orange bars represent control accuracy with shuffling neuronal activities. b , Cumulative mean decoding performance (±s.d., purple) for multi-units (MUs) compared with chance (orange). The mean decoding accuracy for all MUs was 23 ± 6% s.d. (one-sided permutation test, p = 0.02) and reflects the unsorted activities of units obtained through recordings ( Methods ). c , Relationship between the number of neurons considered, the number of word clusters modeled, and prediction accuracy. Here, a lower number of clusters leads to more words per grouping and therefore domains that are not as specific in meaning (e.g., “ sun ”, “ rain ”, “ clouds ”, and “ sky ”) whereas a higher number of clusters means fewer words and therefore domains that are more specific in meaning (e.g., “ rain ” and “ clouds ”). d , The percent improvement in decoding accuracy (mean ± s.e.m) corresponds to decoding performance minus chance probability using 60% of randomly selected sentences for modeling and 40% for decoding ( n = 200 iterations). Inset , relation between log of odds probability (mean ± s.e.m) of predicting the correct semantic domains and number of clusters (i.e., not accounting for chance probability). e , The relation between the number of word clusters modeled and the percent improvement in decoding accuracy (mean ± s.e.m) when considering semantically selective (high SI) and non-selective (low SI) cells separately.
a , Comparison of average decoding performances (± s.d., blue, n = 200 iterations) for sentences and naturalistic story narratives, matched based on the number of neurons ( left : 2 neurons, right : 5 neurons). b , Comparison of average decoding performances (± s.d., blue, n = 200 iterations) for sentences, matched based on the number of single-units or multi-units ( left : 2 units, right : 5 units). Chance decoding performances are given in gray.
a , Contribution to percent variance explained in PC space for word projections across participants using a participant-dropping procedure (first 5-15 PCs; two-sided z -test; p > 0.7). b , Correlation between the vectoral cosine distances between PC-reduced word-projections derived from the neural data and PC-reduced vectors derived from the 300-dimensional word embedding space ( n = 258,121 possible word-pairs; note that not all pairs were used for all recordings per neuron since certain words were not heard by all participants). c , Difference in neuronal activities ( n = 19 neurons, p = 0.048, two-sided paired t -test, t (18) = 2.12) for word pairs whose vectoral cosine distances were far versus near in the word embedding space. d , Relation between neuronal activity and word meaning similarity using a non-embedding based ‘synset’ approach ( n = 100 bins, Pearson’s correlation r = −0.76, p = 0.001). Here, the degree of similarity ranges from 0 to 1.0, with a value of 1.0 indicating that the words are highly similar in meaning (e.g., “ canine ” and “ dog ”) and 0 indicating that their meanings are largely distinct.
Cite this article.
Jamali, M., Grannan, B., Cai, J. et al. Semantic encoding during language comprehension at single-cell resolution. Nature (2024). https://doi.org/10.1038/s41586-024-07643-2
Received: 26 May 2020
Accepted: 31 May 2024
Published: 03 July 2024
DOI: https://doi.org/10.1038/s41586-024-07643-2
Synonyms: Speech system; Sound generation
Speech production is the process of uttering articulated sounds or words, i.e., how humans generate meaningful speech. It is a complex feedback process in which hearing, perception, and information processing in the nervous system and the brain are also involved.
Speaking is in essence the by-product of a necessary bodily process: the expulsion from the lungs of air charged with carbon dioxide after it has fulfilled its function in respiration. Most of the time one breathes out silently, but it is possible, by contracting and relaxing the vocal tract, to change the characteristics of the air expelled from the lungs.
Speech is one of the most natural forms of communication for human beings. Researchers in speech technology are working on developing systems with the ability to understand speech and speak with a human being.
Human–computer interaction is a discipline concerned with the design, evaluation, and implementation of...
Authors and affiliations.
University of Vigo, Vigo, Spain
Laura Docio-Fernandez & Carmen Garcia-Mateo
Editors and affiliations.
Center for Biometrics and Security Research, Chinese Academy of Sciences, Beijing, China
Stan Z. Li (Professor)
Departments of Computer Science & Engineering, Michigan State University, East Lansing, MI, USA
Anil Jain (Professor)
© 2009 Springer Science+Business Media, LLC
Cite this entry.
Docio-Fernandez, L., Garcia-Mateo, C. (2009). Speech Production. In: Li, S.Z., Jain, A. (eds) Encyclopedia of Biometrics. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-73003-5_199
DOI: https://doi.org/10.1007/978-0-387-73003-5_199
Publisher Name: Springer, Boston, MA
Print ISBN: 978-0-387-73002-8
Online ISBN: 978-0-387-73003-5
eBook Packages: Computer Science, Reference Module Computer Science and Engineering
Speech production is the process by which thoughts are translated into speech. This includes the selection of words, the organization of relevant grammatical forms, and then the articulation of the resulting sounds by the motor system using the vocal apparatus. Speech production can be spontaneous such as when a person creates the words of a ...
Speech production refers to the complex process of articulating sounds and words. It involves hearing, perception, and information processing by the brain and nervous system. Phonology and ...
Definition. Speech production is the process of uttering articulated sounds or words, i.e., how humans generate meaningful speech. It is a complex feedback process in which hearing, perception, and information processing in the nervous system and the brain are also involved. Speaking is in essence the by-product of a necessary bodily process ...
Speech production is a highly complex sensorimotor task involving tightly coordinated processing across large expanses of the cerebral cortex. Historically, the study of the neural underpinnings of speech suffered from the lack of an animal model. The development of non-invasive structural and functional neuroimaging techniques in the late 20 ...
How humans produce speech: Phonetics studies human speech. Speech is produced by bringing air from the lungs to the larynx (respiration), where the vocal folds may be held open to allow the air to pass through or may vibrate to make a sound (phonation).
Speech production is one of the most complex human activities. It involves coordinating numerous muscles and complex cognitive processes. The area of speech production is related to Articulatory Phonetics, Acoustic Phonetics and Speech Perception, which are all studying various elements of language and are part of a broader field of Linguistics.
Speech is the faculty of producing articulated sounds, which, when blended together, form language. Human speech is served by a bellows-like respiratory activator, which furnishes the driving energy in the form of an airstream; a phonating sound generator in the larynx (low in the throat) to transform the energy; a sound-molding resonator in ...
Speech production can be considered as a sensorimotor behavior that requires precise control and the dynamic interplay of several parallel processing levels. To produce an utterance, the respective information has to be selected, sequenced, and articulated in an adequate, highly time-sensitive manner.
Single-neuronal recordings have the potential to begin revealing some of the basic functional building blocks by which humans plan and produce words during speech and study these processes at ...
Producing speech takes three mechanisms: respiration at the lungs, phonation at the larynx, and articulation in the mouth. Let's take a closer look. Respiration (at the lungs): the first thing we need to produce sound is a source of energy; for human speech sounds, the air flowing from our lungs provides that energy. Phonation (at the larynx ...
Figure 9.2, The Standard Model of Speech Production: the Standard Model of Word-form Encoding as described by Meyer (2000), illustrating five levels (conceptualization, lemma, morpheme, phoneme, and phonetic levels), using the example word "tiger".
Physiological speech production, overview: when a person has the urge or intention to speak, her or his brain forms a sentence with the intended meaning and maps the sequence of words into the physiological movements required to produce the corresponding sequence of speech sounds.
Speech production is a remarkable process that involves multiple intricate levels. From the initial conceptualization of ideas to their formulation into linguistic forms and the precise articulation of sounds, each stage plays a vital role in effective communication. Understanding these levels helps us appreciate the complexity of human speech and the incredible coordination between the brain ...
The evidence used in psycholinguistics to understand speech production is varied and interesting, including speech errors, reaction-time experiments, neuroimaging, computational modelling, and analysis of patients with language disorders. Until recently, the most prominent set of evidence for understanding how we speak came from ...
Uncover the science behind Speech Production with our in-depth exploration. Learn what speech is, how it works, and its components.
A theory of speech production provides an account of the means by which a planned sequence of language forms is implemented as vocal tract activity that gives rise to an audible, intelligible acoustic speech signal. Such an account must address several issues.
There are 4 stages of speech production that you probably have never thought about. Discover what goes into the words you speak by reading this article!
Models composed of two cavities with a connecting constriction can approximate the formants associated with several consonant sounds. Prosodic features are characteristics of speech that convey meaning, emphasis, and emotion without actually changing the phonemes: pitch, rhythm, and accent.
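Even simpler than the two-cavity models mentioned above, a uniform tube closed at the glottis and open at the lips already predicts schwa-like vowel formants from the quarter-wavelength formula F_n = (2n − 1)·c / (4L). The constants below are approximate textbook values, not measurements:

```python
# Quarter-wavelength resonances of a uniform tube closed at the glottis and
# open at the lips: F_n = (2n - 1) * c / (4 * L).
c = 35000.0  # speed of sound in warm, moist air, in cm/s (approximate)
L = 17.5     # typical adult male vocal-tract length, in cm (approximate)

formants = [(2 * n - 1) * c / (4 * L) for n in (1, 2, 3)]
print(formants)  # → [500.0, 1500.0, 2500.0] Hz, close to measured schwa formants
```

Narrowing or widening cavities along the tube shifts these resonances, which is how the two-cavity constriction models capture consonant formants.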
speech production (noun): the utterance of intelligible speech. Synonym: speaking. Types: speech (the exchange of spoken words); susurration, voicelessness, whisper, whispering (speaking softly without vibration of the vocal cords); stage whisper (a loud whisper that can be overheard; on the stage it is heard by the audience but is supposed to be ...
Speech production is an important part of the way we communicate. We indicate intonation through stress and pitch while communicating our thoughts, ideas, requests or demands, and while maintaining grammatically correct sentences. However, we rarely consider how this ability develops.
Speech is a human vocal communication using language. Each language uses phonetic combinations of vowel and consonant sounds that form the sound of its words, and uses those words in their semantic character as words in the lexicon of a language according to the syntactic constraints that govern lexical words' function in a sentence.
The speech sounds are therefore considered as the response of the vocal-tract filter, into which a sound source is fed. To model such source-filter systems for speech production, the sound source, or excitation signal x_t, is often implemented as a periodic impulse train for voiced speech, while white noise is used as a source for unvoiced speech.
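The source-filter idea can be sketched directly: build a periodic impulse train (voiced source) and white noise (unvoiced source), then pass each through a two-pole resonator standing in for a single vocal-tract formant. The resonator design and all constants are illustrative assumptions; a real synthesizer cascades several resonators, one per formant, or uses LPC-derived coefficients.

```python
import numpy as np

fs = 8000            # sample rate in Hz (illustrative)
f0 = 100             # pitch of the voiced excitation in Hz
n = fs // 2          # half a second of samples

# Source: a periodic impulse train for voiced speech, white noise for unvoiced.
voiced = np.zeros(n)
voiced[:: fs // f0] = 1.0
unvoiced = np.random.default_rng(0).normal(scale=0.1, size=n)

# Filter: a single two-pole resonance near 500 Hz standing in for the vocal tract.
f_res, bw = 500.0, 100.0
r = np.exp(-np.pi * bw / fs)        # pole radius sets the bandwidth
theta = 2.0 * np.pi * f_res / fs    # pole angle sets the center frequency
a1, a2 = -2.0 * r * np.cos(theta), r * r

def all_pole(x, a1, a2):
    """y[t] = x[t] - a1*y[t-1] - a2*y[t-2], a two-pole recursive filter."""
    y = np.zeros_like(x)
    for t in range(len(x)):
        y[t] = x[t]
        if t >= 1:
            y[t] -= a1 * y[t - 1]
        if t >= 2:
            y[t] -= a2 * y[t - 2]
    return y

vowel_like = all_pole(voiced, a1, a2)     # buzzy, vowel-like output
noise_like = all_pole(unvoiced, a1, a2)   # fricative-like output
```

The same filter shapes both sources; only the excitation changes, which is exactly the voiced/unvoiced switch the snippet describes.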
Single-neuronal elements of speech production in humans ... single neurons that encode word meanings during comprehension and a process that could support our ability to derive meaning from speech ...