
Neurocomputational modeling of speech motor development

Published online by Cambridge University Press:  20 June 2023

Andrew M. MEIER*
Affiliation:
Department of Speech, Language and Hearing Sciences, Boston University, Boston, MA 02215
Frank H. GUENTHER
Affiliation:
Department of Speech, Language and Hearing Sciences, Boston University, Boston, MA 02215 Department of Biomedical Engineering, Boston University, Boston, MA 02215
*Corresponding author: Andrew Meier; Email: amsmeier@bu.edu

Abstract

This review describes a computational approach for modeling the development of speech motor control in infants. We address the development of two levels of control: articulation of individual speech sounds (defined here as phonemes, syllables, or words for which there is an optimized motor program) and production of sound sequences such as phrases or sentences. We describe the DIVA model of speech motor control and its application to the problem of learning individual sounds in the infant’s native language. Then we describe the GODIVA model, an extension of DIVA, and how chunking of frequently produced phoneme sequences is implemented within it.

This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2023. Published by Cambridge University Press

The DIVA model of speech motor control

The Directions Into Velocities of Articulators (DIVA) model is an artificial neural network that provides a quantitative account of the computations underlying speech motor control (Guenther, 1995; Tourville & Guenther, 2011; Golfinopoulos, Tourville, & Guenther, 2010; see Guenther, 2016 for a detailed treatment). It contains a network of simulated components that represent brain structures responsible for producing speech. The model includes an articulatory synthesizer that mimics the behavior of the vocal tract, and the neural network learns to control movements of the synthesizer’s articulators in order to produce intelligible speech. We focus herein on a higher-level treatment of the model’s neural computations and developmental processes, avoiding mathematical equations and computer implementation details for tractability.

To understand the model, we will start by defining a speech sound to be a “chunk” of speech that has its own optimized motor program in the brain. These chunks could be phonemes, syllables, and/or words, depending on the age and linguistic experience being considered. In keeping with a number of prior proposals (e.g., Kozhevnikov & Chistovich, 1965; Levelt, 1993; MacNeilage & Davis, 1990) and supported by distributional analyses of phoneme combinations (Sun & Poeppel, 2022; Kessler & Treiman, 1997), we suggest that the syllable is the most typical sound chunk with an optimized motor program. However, motor programs likely also exist for individual phonemes as well as frequently produced multisyllabic utterances, such as common words or names of familiar people and locations. Note that the motor programs can be hierarchical; for example, a syllabic motor program will consist of individual phoneme motor programs along with optimized transitions between these phoneme motor programs.

The model assumes that, in the mature speaker, speech production begins with an intended linguistic message being translated by higher-level brain regions into a sequence of speech sounds. Motor sequencing circuits then activate the appropriate nodes of a speech sound map in ventral premotor cortex (vPMC), which is the highest processing level represented in DIVA. While this model focuses on segmental control (production of phonemes, syllables, and words), it should be noted that prosodic control is also an essential goal of speech motor development (Mattys, Jusczyk, Luce, & Morgan, 1999; Kehoe & Stoel-Gammon, 1997).

Neural components of the DIVA model

The brain structures whose functions are simulated by the DIVA model are illustrated in Figure 1. Each box corresponds to a set of modeled neurons, or nodes, that together form a neural map of some type of speech-relevant information. Larger boxes indicate cortical regions and smaller boxes indicate subcortical nuclei. Arrows represent excitatory projections while circles represent inhibitory projections, with the projection target being the area touching the arrowhead or circle. Production of a speech sound starts with activation of a node representing that particular sound in a speech sound map in the left ventral premotor cortex. Activation of this node leads to motor commands that arrive in motor cortex via two control systems: a feedforward control system and a feedback control system.

Figure 1. Neural correlates of the DIVA model. The main neural output of the model is provided by the vMC Articulator Map, which integrates feedforward commands from VL and the Speech Sound Map with feedback commands from VL and the Feedback Control Map. [Abbreviations: Cb=cerebellum (specific lobule unknown); Cb-VI=cerebellum lobule VI; GP=globus pallidus; MG=medial geniculate nucleus of the thalamus; pAC=posterior auditory cortex; SMA=supplementary motor area; SNr=substantia nigra pars reticulata; VA=ventral anterior nucleus of the thalamus; VL=ventral lateral nucleus of the thalamus; vMC=ventral motor cortex; VPM=ventral posterior medial nucleus of the thalamus; vPMC=ventral premotor cortex; vSC=ventral somatosensory cortex.]

The feedforward control system generates previously learned motor programs for speech sounds. This process involves two components. The first component of feedforward control ensures that the motor program is initiated at the appropriate time. Timing control is carried out by a cortico-basal ganglia loop that includes an initiation map in the supplementary motor area (SMA). This loop identifies the appropriate sensory, motor, and cognitive context for producing the speech sound. We suggest that the input structures of the basal ganglia monitor these contextual cues, with the caudate monitoring cognitive context and the putamen monitoring sensory and motor contexts. When the appropriate context for producing a speech chunk is identified, a corresponding node is activated in the initiation map via the globus pallidus (GP), substantia nigra pars reticulata (SNr), and the ventral anterior (VA) thalamic nucleus. This initiation map node activation triggers the readout (execution) of the learned motor program for the current speech sound.

The second component of the feedforward control system comprises the motor programs themselves, which generate feedforward commands for producing learned speech sounds. These commands are encoded by synaptic projections from the speech sound map to an articulator map in the right and left ventral primary motor cortex (vMC). The cortico-cortical projections from left vPMC to vMC are supplemented by a cerebellar loop passing through the pons, cerebellar cortex lobule VI (Cb-VI), and the ventral lateral (VL) nucleus of thalamus. This division of motor execution between cerebellar and basal ganglia loops was originally proposed in a theory founded on nonhuman primate neurophysiology (Hikosaka, Nakamura, Sakai, & Nakahara, 2002), with later support being provided by human neuroimaging (Doyon et al., 2009). Note that multiple instances of a structure in Figure 1, such as the Cb, are implemented as separate non-overlapping neural populations within that structure. For example, separate Cb networks process feedforward commands, auditory targets, and somatosensory targets.

The auditory feedback control subsystem detects and corrects for mismatches between the auditory target and the current auditory feedback. Axonal projections from speech sound map nodes in vPMC, both directly and via a cortico-cerebellar loop involving the pons, cerebellum (Cb), and medial geniculate (MG) nucleus of the thalamus, arrive at an auditory target map in the higher-order auditory cortical areas in posterior auditory cortex (pAC), including the posterior superior temporal gyrus and sulcus and the planum temporale. These projections signal the expected auditory percept generated by the sound currently being produced.

The auditory target for the current sound is compared to incoming reafferent auditory signals. This information is transmitted to cortical areas via MG and is represented in the model’s auditory state map. If the current auditory state does not match the target, auditory error nodes in the higher-order auditory cortical areas become active. These types of predictive and error-related responses have been localized to auditory cortex by neural recordings in humans (Hashimoto & Sakai, 2003; Okada, Matchin, & Hickok, 2018; Ozker, Doyle, Devinsky, & Flinker, 2022). Auditory error node activities are then transformed into corrective motor commands through projections from the auditory error nodes to the feedback control map in right vPMC, which in turn projects to the articulator map in vMC both directly and via a loop through the pons, Cb, and VL. Auditory error is computed as a simple subtraction of the target from the state. This subtraction is enabled by making the Auditory State, Target, and Error Maps contain identical representations of speech sounds and equalizing the strength of inputs from the Target and State Maps to the Error Map.
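The subtraction described above can be illustrated with a minimal sketch. It is not the DIVA implementation: the function name `auditory_error`, the two weight parameters, and the use of two formant frequencies (in Hz) as the auditory representation are assumptions made purely for illustration. The sketch shows why the state and target inputs need equal strength: with unequal weights, a perfectly on-target production would still register an error.

```python
def auditory_error(state, target, state_weight=1.0, target_weight=1.0):
    """Element-wise error: weighted state minus weighted target.

    `state` and `target` are lists holding the same auditory dimensions
    (here, hypothetical formant frequencies in Hz). The weights stand in
    for the strengths of the two inputs to the error map (an assumption).
    """
    if len(state) != len(target):
        raise ValueError("state and target must share one representation")
    return [state_weight * s - target_weight * t
            for s, t in zip(state, target)]


# Production matches the target: no error nodes become active.
on_target = auditory_error([500.0, 1500.0], [500.0, 1500.0])   # -> [0.0, 0.0]

# First formant too high, second too low: signed error in each dimension.
off_target = auditory_error([520.0, 1480.0], [500.0, 1500.0])  # -> [20.0, -20.0]

# Unequal input strengths produce a spurious error even when on target.
spurious = auditory_error([500.0, 1500.0], [500.0, 1500.0], target_weight=0.9)
```

The sign convention (state minus target) follows the sentence above; a corrective command would then need to move the output in the opposite direction.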

The DIVA model also contains a somatosensory feedback control subsystem, the main components of which are hypothesized to reside in ventral somatosensory cortex (vSC). Projections from the speech sound map to the somatosensory target map encode the expected somatosensory feedback during sound production. These projections include cortico-cortical as well as cortico-cerebellar loop projections via the ventral posterior medial (VPM) thalamic nucleus. The model’s somatosensory state map represents proprioceptive and tactile information from the speech articulators. If the somatosensory state does not match the current target, the somatosensory error map sends a corrective command via the feedback control map to correct subsequent motor commands. Studies in which articulator sensory feedback is perturbed during speaking suggest that the somatosensory error map resides primarily in ventral somatosensory cortex (Golfinopoulos, Tourville, Bohland, Ghosh, Nieto-Castanon, & Guenther, 2011).

The components of the DIVA model are a set of heterogeneous, biophysically realistic neural networks. Different neural network structures were chosen for each component based on the distinct function they serve. For example, different architectures were required for the error maps, which compute differences between two input signals, and the Initiation Map, which controls the timing of activation in a downstream structure. Some components in Figure 1 were not instantiated as full neural networks, such as VA and VL, which serve as simple relays from the basal ganglia to the cortex.

Unlike other models of speech motor control (e.g., Hickok, 2014), feedforward commands in DIVA proceed directly to primary motor cortex, without comparison to an internal model of sensory consequences. The lack of sensorimotor knowledge at this processing stage is not problematic in the scenarios addressed by the model, in which auditory targets have already been well learned. However, this simplification does limit the applicability of DIVA to particular speech phenomena, such as internal error correction (Nozari, Dell, & Schwartz, 2011) and attempts to imitate unfamiliar sounds (e.g., Hao & Jong, 2016).

Because most projections in the model are long-range and originate in the cerebral cortex, they are modeled as excitatory, to match known neuroanatomy (DeFelipe & Fariñas, 1992; see Urrutia-Piñones, Morales-Moraga, Sanguinetti-González, Escobar, & Chiu, 2022 regarding exceptions to this pattern). In the case of error maps, inputs are modeled as inhibitory, which is necessary for detecting differences between sensory states and sensory targets. Correlates in the brain of these projections to error maps likely use feedforward inhibition, in which a source area provides long-range excitatory projections to inhibitory neurons in a target area, effectively inhibiting certain excitatory neurons in that target area (Li, Ji, Liang, Li, Xiao, Tao, & Zhang, 2014; Naskar, Qi, Pereira, Gerfen, & Lee, 2021). All pathways in Figure 1 are assumed to have been established by birth, though the micro-scale patterns and weights of connections maintain plasticity, allowing for further postnatal development (Kostović & Jovanov-Milošević, 2006; Dubois et al., 2014).

Implementation of speech motor learning in DIVA

In order for the DIVA model to produce speech, it must undergo a learning process analogous to what occurs in the developing infant brain. The stages of this process are simplified for the purposes of implementation into a babbling phase and an imitation phase.

The babbling phase involves the generation of semi-random articulator movements through activation of nodes in the model’s articulator map (corresponding to vMC), which drives movements of the speech articulators and the generation of auditory and somatosensory feedback signals. The resulting combination of auditory, somatosensory, and articulatory representations is used to tune inverse models that map somatosensory and auditory errors into corrective motor commands via the feedback control map in Figure 1. The learning in this stage is not phoneme- or syllable-specific; the learned sensory-motor transformations are applicable to all speech sounds that will be learned later.
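As a rough illustration of how babbling could tune an inverse model, the sketch below pairs random motor commands with their auditory consequences and learns a sensory-to-motor mapping from those pairs. Everything specific here is an assumption for illustration: the 2×2 linear `VOCAL_TRACT` matrix stands in for the articulatory synthesizer, the delta-rule (least-mean-squares) update stands in for whatever learning rule the model actually uses, and the names `babble` and `matvec` are hypothetical.

```python
import random

# Hypothetical linear "vocal tract": maps 2 articulator velocities to
# 2 auditory velocities. A real vocal tract is nonlinear.
VOCAL_TRACT = [[1.0, 0.5],
               [0.2, 0.8]]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def babble(n_movements=5000, learning_rate=0.1, seed=0):
    """Tune an auditory-to-motor inverse model from random movements."""
    rng = random.Random(seed)
    inverse_model = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(n_movements):
        # Semi-random articulator movement and its auditory consequence.
        motor = [rng.uniform(-1.0, 1.0) for _ in range(2)]
        auditory = matvec(VOCAL_TRACT, motor)
        # Which movement does the inverse model think caused this sound?
        guessed_motor = matvec(inverse_model, auditory)
        # Delta rule: nudge the weights toward the movement actually made.
        for i in range(2):
            mismatch = motor[i] - guessed_motor[i]
            for j in range(2):
                inverse_model[i][j] += learning_rate * mismatch * auditory[j]
    return inverse_model


inverse_model = babble()
# Given a desired auditory change, recover the movement that produces it.
desired_auditory_change = matvec(VOCAL_TRACT, [0.3, -0.2])
recovered_movement = matvec(inverse_model, desired_auditory_change)
```

Note that no speech sound targets appear anywhere in this loop, which mirrors the point that learning in this stage is not phoneme- or syllable-specific.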

During the imitation phase, the model is presented with sample speech sounds to learn, similar to an infant being exposed to the sounds of their native language. These sounds take the form of time-varying acoustic signals corresponding to phonemes, syllables, or words. Based on these samples, the model first learns an auditory target for each sound. Learning of a sound’s auditory target involves activation of a speech sound map node that will later represent the sound for production. This occurs via a speech recognition system when the model “hears” the sound, which corresponds to a child hearing a new speech sound directed at them by a parent, for example. This in turn leads to adjusting synaptic weights in the projections from that speech sound map node to the auditory cortex to encode the sound’s auditory target.

After an auditory target for a sound has been learned, the model can attempt to produce the sound. The appropriate nodes in the initiation map and speech sound map must first be activated. At first, the model will not have a tuned motor program for producing the sound in a feedforward manner, nor will it have a somatosensory target. Thus, the system will depend primarily on auditory feedback for guidance. On each production attempt, the motor target will be updated to incorporate the commands generated by the auditory feedback control subsystem on that attempt. These commands are generated by first determining the auditory error (i.e., the distance and direction in auditory space between the target and what was produced) in the Auditory Error Map. The auditory error is then sent to the Feedback Control Map, where it is transformed into articulator movements that will reverse the auditory error. This corrective signal is then sent to the Articulator Map, where it adjusts the velocities of articulator movements. Subsequent attempts will then have a more accurate feedforward command to guide production.

Over time, the feedforward commands will become sufficient by themselves for reliably producing the sound. That is, the motor program will have become accurate enough that it generates very few auditory errors, obviating the need for auditory feedback control in most instances. At this point the model can fluently produce the speech sound. As the speech articulators grow, the auditory feedback control subsystem continually corrects for changes in the biomechanics of the vocal tract. These corrective commands are subsumed into the motor program, thus allowing it to stay tuned despite significant changes to the shapes and sizes of the articulators over the course of life.
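The attempt-by-attempt learning described above can be sketched as a simple loop in which each production's corrective command is folded into the feedforward command. This is a toy under stated assumptions, not the model's implementation: the sound is treated as a static two-dimensional auditory target rather than a time-varying trajectory, the vocal tract is a hypothetical linear map, the inverse model is deliberately slightly inaccurate, and `practice` and `learning_gain` are illustrative names.

```python
# Hypothetical linear vocal tract (motor command -> auditory state).
VOCAL_TRACT = [[1.0, 0.5],
               [0.2, 0.8]]

# Imperfect inverse model (auditory error -> motor change), as might be
# left over from babbling. The exact inverse would differ slightly.
INVERSE_MODEL = [[1.1, -0.7],
                 [-0.3, 1.4]]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def practice(auditory_target, n_attempts=30, learning_gain=0.5):
    """Refine a feedforward command over repeated production attempts."""
    feedforward = [0.0, 0.0]          # no tuned motor program yet
    error_sizes = []
    for _ in range(n_attempts):
        # Produce the sound using the current feedforward command.
        auditory_state = matvec(VOCAL_TRACT, feedforward)
        # Auditory error: state minus target.
        error = [s - t for s, t in zip(auditory_state, auditory_target)]
        error_sizes.append(sum(e * e for e in error) ** 0.5)
        # Feedback controller: motor change that reverses the error.
        correction = [-c for c in matvec(INVERSE_MODEL, error)]
        # Incorporate the corrective command into the motor program.
        feedforward = [f + learning_gain * c
                       for f, c in zip(feedforward, correction)]
    return feedforward, error_sizes


feedforward, error_sizes = practice([0.6, -0.3])
```

In this sketch `error_sizes` shrinks on every attempt even though the inverse model is only approximately correct, which is the sense in which feedforward commands eventually become sufficient by themselves. Changing `VOCAL_TRACT` partway through practice would be one way to mimic articulator growth, with the same loop retuning the command.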

As the model repeatedly produces a sound, it also learns a somatosensory target region for that sound, analogous to the auditory target region. The somatosensory target represents the expected proprioceptive and tactile sensations elicited when producing the sound. This target is different from the auditory target in that it cannot be learned from other speakers, as essential information about tactile patterns, tongue shape, etc. is not available to a listener. The somatosensory target must instead be learned through self-monitoring of one’s own correct productions, a process that occurs at a later stage than the learning of auditory targets.
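One way to picture learning a target region from one's own correct productions is sketched below. It assumes, for illustration only, that the region is a per-dimension (low, high) range, that a production counts as "correct" when its auditory error falls under a threshold, and that somatosensory states are short lists of normalized values; the function names and the threshold are hypothetical.

```python
def widen_region(region, sample):
    """Grow per-dimension (low, high) bounds so they include `sample`."""
    if region is None:
        return [(x, x) for x in sample]
    return [(min(low, x), max(high, x))
            for (low, high), x in zip(region, sample)]


def learn_somatosensory_target(productions, max_auditory_error):
    """Build a target region from self-monitored correct productions.

    `productions` is a list of (auditory_error_size, somatosensory_state)
    pairs. Only productions that sounded right contribute to the region.
    """
    region = None
    for auditory_error_size, somatosensory_state in productions:
        if auditory_error_size <= max_auditory_error:
            region = widen_region(region, somatosensory_state)
    return region


# Hypothetical attempts: early ones are inaccurate and are ignored.
attempts = [
    (0.50, [0.90, 0.10]),
    (0.05, [0.40, 0.70]),
    (0.02, [0.45, 0.65]),
    (0.30, [0.10, 0.90]),
]
target_region = learn_somatosensory_target(attempts, max_auditory_error=0.1)
# -> [(0.4, 0.45), (0.65, 0.7)]
```

Because the region is built only from productions that already met the auditory criterion, it cannot start forming until the sound can be produced reasonably well, consistent with the later time course described above.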

The simulation study of Callan, Kent, Guenther, and Vorperian (2000) provides an example of how the DIVA model has been used to investigate speech motor development. This study involved computer simulations of the process of learning and correctly producing English vowels during developmental growth of the vocal tract. The model was grounded in empirical data by including the sizes and shapes of infant vocal tracts measured with magnetic resonance imaging. Vowel formants were successfully produced along a developmental timeline that matched the one observed in real developing infants, showing the feasibility of the model. The simulation provided additional insight into speech development by showing how infants could make use of motor equivalence to produce a sound, even under the constraints of changing articulator shapes and sizes.

Development of speech motor programs

The motor learning process implemented in computer simulations of the DIVA model as described in the previous section is a highly simplified approximation of speech development in children. In the current section, we provide a more detailed account of the stages of speech development in infants and children with reference to components of the DIVA model.

Overview of infant babbling

The first two months of infancy are characterized by a phonation stage (see Oller, 1980, and Stark, 1980, for reviews of infant babbling), during which speech-like vocalizations are only rarely exhibited. The few speech-like sounds that can be observed consist largely of phonation with the mouth closed or nearly closed. The next developmental phase, occurring from 2 to 3 months of age, is known as the “goo” stage and is characterized by the production of crude syllable-like sequences composed mostly of velar consonant-like elements in combination with vowel-like elements. By 4 to 6 months old, most infants enter the expansion stage, characterized by the production of several new sound types, including labiolingual and bilabial trills, growls, and squeals. The expansion stage may also contain some marginal babbling, consisting of vocal tract closures in combination with better-formed vowel-like utterances. Seven months of age sees most infants entering the canonical or reduplicated babbling stage, in which syllables with adult-like timing characteristics emerge. During this stage, many utterances consist of reduplicated syllables such as “bababa”. The nonreduplicated babbling stage follows at around 10 months old; it is characterized by the use of different consonants and vowels within the same babbling sequence (e.g., “dadabi”). It has been suggested (MacNeilage & Davis, 1990) that during the nonreduplicated babbling stage infants begin learning how to produce the phonemes of their native language.

An important feature of this developmental sequence is that many non-speech vocalizations and articulator movements occur well before the onset of frequent speech sounds. It is this observation that motivates the two learning stages of the DIVA model. In the first stage, sensory-motor relationships between the motor, somatosensory, and auditory systems are learned. In a sense, this stage consists of learning about the biophysics of the vocal tract; that is, the infant learns the sensory consequences of various oromotor actions. In the second stage, individual speech sounds from the native language are learned. While these stages are typically carried out sequentially in model simulations for convenience, the real speech motor learning process is not so discrete (e.g., de Boysson-Bardies, Sagart, & Durand, 1984; Boysson-Bardies, Hallé, Sagart, & Durand, 1989; Mitchell & Kent, 1990) and involves processes not addressed in computer simulations of DIVA. Table 1 provides an overview of these processes, which are detailed in the following paragraphs.

Table 1. Time-courses for development of the major capacities of the speech motor system. The estimated amount of learning occurring in a neural system within a given time window is indicated as being Low, Medium, or High. [Abbreviations: Aud.=auditory; Som.=somatosensory.]

Development of auditory and somatosensory maps

The ability to produce the speech sounds of a language depends heavily on the ability to perceive these sounds. Auditory representations of speech signals (corresponding to the DIVA auditory state and auditory error maps) show signs of language specificity in infants as young as 6 months of age (e.g., Kuhl, Williams, Lacerda, Stevens, & Lindblom, 1992). This likely reflects modifications in auditory cortical neuronal responses to optimally capture the auditory signatures of the native language. This developmental process likely does not require knowledge of the phonological units that make up the language, as it occurs at a very early stage of development (see row 1 of Table 1). The shaping of auditory representations can instead be driven by the statistical nature of the acoustic signals experienced by the infant (e.g., Guenther & Gjaja, 1996; Guenther, Husain, Cohen, & Shinn-Cunningham, 1999).

The somatosensory representations of the speech network, corresponding to the somatosensory state map in Figure 1, must also undergo development. Unlike auditory signals for speech, the somatosensory patterns associated with the sounds of a language cannot be learned by listening to native speakers. Thus, development of the somatosensory maps for speech likely lags behind development of auditory maps during the very early stages of infancy, at a time when articulations are limited. Once the infant starts producing more speech-like articulatory movements in the expansion, canonical babbling, and nonreduplicated babbling stages, their somatosensory maps likely become increasingly sensitive to the somatosensory patterns proceeding from these movements (row 2 of Table 1).

Development of sensory-motor transformations

The first movements of speech-related body parts begin almost immediately after birth, when an infant uses their vocal folds and respiratory system to cry and their lips, jaw, and tongue to feed. These movements generate somatosensory feedback and often auditory feedback as well, providing opportunities for the infant’s brain to learn about sensory consequences of oromotor actions. Our motor systems have the ability to anticipate sensory consequences of movements commanded by motor cortical activity. Tuning of these sensory-motor predictions, often referred to as forward models, likely begins with early non-speech actions, then accelerates as the infant creates more and more speech-like utterances as they move through the goo, expansion, canonical, and nonreduplicated babbling stages (rows 3, 4, and 5 in Table 1).

The articulatory movements that occur during infant babbling can also be used to tune transformations in the reverse direction, that is, sensory-to-motor transformations, or inverse models. These transformations consist of learned mappings between auditory and somatosensory representations of ongoing vocalizations and the articulator movements that produce them. Prior to the development of auditory and somatosensory targets for speech sounds, nodes in the auditory and somatosensory error maps are not yet signaling “errors” per se; these nodes instead represent changes (velocities) in the auditory and somatosensory state that occur due to ongoing movements of speech articulators. This combination of motor activations and resulting sensory velocities enables the tuning of auditory-motor and somato-motor transformations well before an infant develops awareness of phonological units such as phonemes and words.

Later, as auditory and somatosensory targets are learned, the nodes in the auditory and somatosensory error maps stop reflecting ongoing changes in the sensory state and begin to reflect desired sensory changes (i.e., sensory errors, which can be thought of as desired sensory velocities for reaching the target). This development, which can be inferred to have occurred when infants begin to produce language-specific speech sounds, is reflected in the DIVA model by the transition from the babbling phase to the imitation phase, though the model does not simulate specific mechanisms for the cause of this transition. Some tuning of sensory-motor transformations likely continues into adulthood; evidence for such plasticity is provided by adaptation to auditory and somatosensory feedback perturbations (e.g., Houde & Jordan, 1998; Golfinopoulos et al., 2011; Lametti, Nasir, & Ostry, 2012).

Speech recognition and phonological target acquisition

The learning processes described thus far do not require any knowledge of the distinct phonemes, syllables, or words of a language. Instead, they tune transformations between the largely continuous motor, somatosensory, and auditory spaces without regard for the discrete phonological units that make up a language. These transformations form the essential elements of the feedback control system schematized in Figure 1.

The ultimate goal of the speech motor system is, however, to produce the discrete speech sounds of the native language. Before a child can learn to articulate these sounds, they must learn how to parse continuous auditory signals into discrete phonological categories such as words, syllables, and phonemes. This learning process corresponds to tuning of the speech recognition system and speech sound map in Figure 1. These learning processes (row 6 in Table 1) fall under the domain of speech perception and are not currently implemented in computer simulations of the DIVA model. Instead, speech sounds are presented to the model for learning; these sounds take the form of time-varying auditory signals (in particular, formant frequencies). Note that conscious awareness of phonemes is not a prerequisite for learning to produce phoneme strings; indeed, infants and children successfully learn words like “cat” and “hat” that differ only by a single phoneme despite not yet being consciously aware of phoneme units.

Development of sensory targets and feedforward control

As infants acquire auditory targets corresponding to phonemes and syllables, their brains store information about the sensory signals making up these objectives of speech motor output (row 7 in Table 1). The infant will then try to replicate these auditory targets. Projections to the auditory target map from the speech sound map encode these time-varying auditory targets for sounds represented in the speech sound map, so that these targets can be activated later during production of the corresponding sounds.

Infants have been reported to imitate caregivers’ vocalizations as early as 2 months old (Kuhl & Meltzoff, 1996; Kokkinaki & Kugiumutzakis, 2000; Gratier & Devouche, 2011), while other accounts argue that this capacity emerges closer to 1 year of age (Jones, 2009). These initial utterances enable the infant to learn feedforward commands for producing these sounds on their own (row 8 in Table 1). Within the DIVA model, these feedforward commands are stored in synaptic projections from the speech sound map to the primary motor cortical areas, both directly and via a cortico-cerebellar loop.

Finally, after an infant can successfully produce speech sounds, the infant’s brain develops a somatosensory target map containing representations of the somatic sensations created by accurately producing the sound (row 9 in Table 1). These targets are used by the somatosensory feedback control system to rapidly detect and correct production errors in ongoing utterances.

Computational modeling of developmental speech disorders

In addition to modeling normal development of speech production, variations of DIVA have also been used to simulate possible mechanisms of childhood disorders that affect speech production. Max, Guenther, Gracco, Ghosh, and Wallace (2004) used mechanisms from DIVA to propose an account of developmental stuttering caused by dysfunctional use of auditory feedback. Subsequent simulation studies implemented this hypothesis (Civier, Tasko, & Guenther, 2010), as well as alternative possible causes of the disorder (Civier, Bullock, Max, & Guenther, 2013). The neural etiology of childhood apraxia of speech has been addressed by DIVA modeling, in a study that simulated the disorder as resulting from impaired feedforward signaling (Terband, Maassen, Guenther, & Brumberg, 2009; Miller & Guenther, 2021). A recent application of the model used it to explore motor and auditory processing in children with autism spectrum disorder (Chenausky, Brignell, Morgan, Norton, Tager-Flusberg, Schlaug, & Guenther, 2021). A promising future direction for similar investigations may be the use of LaDIVA, a modification of the model which incorporates detailed laryngeal physiology, for understanding voice disorders such as pediatric dysphonia (Weerathunge, Alzamendi, Cler, Guenther, Stepp, & Zañartu, 2022).

Sequencing of speech motor programs

The previous sections discussed how the DIVA model simulates production of single speech motor programs and how these programs are learned and refined. Here we describe an extension to the DIVA model called the Gradient Order DIVA (GODIVA) model (Bohland, Bullock, & Guenther, Reference Bohland, Bullock and Guenther2010) that describes the neural processes underlying the buffering and sequential production of longer utterances consisting of multiple speech sounds, such as phrases or sentences. In infancy, the capacity for rudimentary speech sound sequencing begins to manifest during nonreduplicated babbling (Levitt & Utman, Reference Levitt and Utman1992; Nathani, Ertmer, & Stark, Reference Nathani, Ertmer and Stark2006). GODIVA provides a description of the developmental processes underlying the learning of these abilities. Before exploring these mechanisms, we give an overview of the components of the model.

Neural components of the GODIVA model

Figure 2 illustrates a simplified schematic of the GODIVA model. The model consists of two basal ganglia-thalamo-cortical loops (shaded regions in the figure): a motor loop (whose components are shared with the DIVA model) responsible for initiating and terminating speech motor programs, and a planning loop that forms a phonological working memory that buffers upcoming speech sounds. The planning loop involves the posterior inferior frontal sulcus (pIFS) in lateral prefrontal cortex and the presupplementary motor area (preSMA) in the medial premotor cortex working in concert with the basal ganglia via projections to the head of the caudate nucleus, whereas the motor loop involves vPMC and SMA working in concert with the basal ganglia via projections to the putamen.

Figure 2. Simplified schematic of the GODIVA network model for speech sequence production. [Abbreviations: GP, globus pallidus; pIFS, posterior inferior frontal sulcus; preSMA, presupplementary motor area; SMA, supplementary motor area; VA, ventral anterior thalamic nucleus; VL, ventral lateral thalamic nucleus; vPMC, ventral premotor cortex].

The model’s cortical components can also be divided into medial and lateral cortical regions (indicated by dashed boxes in Figure 2), which represent distinct aspects of the speech utterance. One set of structures, the left lateral cortical areas pIFS and vPMC, contains representations of the speech sequence’s phonological content (hypothesized to reside in left pIFS) and corresponding motor programs (hypothesized to reside in left vPMC). A second set, the medial premotor areas preSMA and SMA, is responsible for the metrical structure of the phonological sequence. Specifically, preSMA is hypothesized to contain a representation of syllabic frame structure and metrical patterning for an upcoming utterance, whereas SMA contains an initiation map (as in DIVA) that is responsible for turning on and turning off individual speech motor programs at particular instants in time. The planning loop regions preSMA and pIFS in GODIVA both use a gradient order working memory representation in which nodes representing actions to be produced sooner have higher activation levels than those to be produced later; such a representation has been proposed in prior computational models of working memory and sequencing (e.g., Lashley, Reference Lashley and Jeffress1951; Grossberg, Reference Grossberg1978; Houghton, Reference Houghton, Dale, Mellish and Zock1990; Houghton & Hartley, Reference Houghton and Hartley1996). The following subsections provide further detail regarding the model’s medial and lateral streams.

Processing of sequential structure in medial premotor cortex

The GODIVA model posits that preSMA contains a representation of the global metrical structure of an upcoming speech utterance, whereas SMA is primarily responsible for initiating the motor execution of speech articulations. The SMA and preSMA elements in GODIVA are inspired in part by single unit electrophysiological studies of action sequencing in non-human primates. For example, Shima and Tanji (Reference Shima and Tanji2000) trained macaque monkeys to perform different sequences of three hand/arm movements (e.g., push-pull-turn) while recording from neurons in SMA and preSMA. Broadly speaking, neurons in SMA were more closely tied to particular movements, whereas neurons in preSMA often represented more global aspects of the full sequence, for example neurons that fired at the beginning of only one particular three-movement sequence, or neurons that fired during production of the second (or first, or third) movement of the sequence regardless of whether the movement was a push, pull, or turn. Subsequent human neuroimaging studies found a corresponding association between speech sequence complexity and preSMA activation (Bohland & Guenther, Reference Bohland and Guenther2006; Rong, Isenberg, Sun, & Hickok, Reference Rong, Isenberg, Sun and Hickok2018).

In GODIVA, preSMA nodes represent the syllable frame structure and stress patterning of the utterance, which determine the utterance’s metrical structure. Projections from preSMA nodes to SMA are responsible (in concert with the basal ganglia, as described below) for activating and deactivating the proper SMA initiation map nodes (each of which launches a distinct motor program) in the proper order and with the proper stress. In this way, the medial stream of the GODIVA model dictates the metrical structure/tempo of a multi-sound utterance.

Phonological content buffering in lateral prefrontal cortex

According to GODIVA, pIFS contains a phonological content buffer for temporarily storing the phonological units of an upcoming utterance. This function is assigned to left IFS based on demonstrations of its role in working memory (Kerns, Cohen, Stenger, & Carter, Reference Kerns, Cohen, Stenger and Carter2004; Gabrieli, Poldrack, & Desmond, Reference Gabrieli, Poldrack and Desmond1998; Kumar, Joseph, Gander, Barascud, Halpern, & Griffiths, Reference Kumar, Joseph, Gander, Barascud, Halpern and Griffiths2016), particularly verbal working memory (Rottschy, Langner, Dogan, Reetz, Laird, Schulz, Fox, & Eickhoff, Reference Rottschy, Langner, Dogan, Reetz, Laird, Schulz, Fox and Eickhoff2012), as well as its encoding of phonological identity and complexity (Poldrack, Wagner, Prull, Desmond, Glover, & Gabrieli, Reference Poldrack, Wagner, Prull, Desmond, Glover and Gabrieli1999; Bohland & Guenther, Reference Bohland and Guenther2006; Myers et al., Reference Myers, Blumstein, Walsh and Eliassen2009). Activity in this region is also associated with acquisition of phonetic categories in infants during the first year of life (Imada, Zhang, Cheour, Taulu, Ahonen, & Kuhl, Reference Imada, Zhang, Cheour, Taulu, Ahonen and Kuhl2006).

Each node in the phonological content buffer represents a different phonological unit (e.g., a phoneme or consonant cluster). The order of upcoming speech sounds to be produced is represented by the gradient of activity across these nodes. GODIVA, like the DIVA model, implements speech sound map nodes residing in vPMC. Once pIFS selects the next motor program to execute, as determined by the highest-activity node in its phonological buffer, this selection is transmitted to left vPMC via projections from pIFS. Execution of the motor program begins at the instant the corresponding SMA initiation map node is activated (at which time the sound’s representation is deleted from the pIFS phonological content buffer), and the motor program terminates when the initiation map node activity is extinguished.
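The gradient order representation and the select-then-delete readout described above can be illustrated with a minimal sketch. It is not the GODIVA implementation: the function names `load_buffer` and `read_out`, the linear activation gradient, and the use of plain strings as phonological units are assumptions made for the example.

```python
def load_buffer(units):
    """Store units with a descending activation gradient (assumed linear).

    Units to be produced sooner receive higher activation than units to be
    produced later.
    """
    n = len(units)
    return [(unit, (n - i) / n) for i, unit in enumerate(units)]


def read_out(buffer):
    """Repeatedly select the most active node, then delete it from the buffer.

    Selection stands in for choosing the next motor program; deletion stands
    in for removing the sound's representation once production begins.
    """
    remaining = list(buffer)   # copy so the caller's buffer is left untouched
    produced = []
    while remaining:
        winner = max(range(len(remaining)), key=lambda i: remaining[i][1])
        unit, _activation = remaining.pop(winner)
        produced.append(unit)
    return produced


buffer = load_buffer(["s", "n", "o"])
order = read_out(buffer)
print(order)  # ['s', 'n', 'o']
```

Because order is carried by activation level rather than by storage position, the same readout recovers the intended sequence even if the nodes are listed in a different order.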

Motor sequence chunking and automatization in the basal ganglia loop

We propose that, early in development, the working memory areas preSMA and pIFS must be heavily involved in the speech sequencing process since frequently occurring sequences have not yet been “automated” by transferring control of the sequence to subcortical structures. In GODIVA, if a particular movement sequence is repeated many times, nodes in the basal ganglia learn to recognize the sensorimotor context for initiating the individual items in the sequence. After learning, the sequence is represented by its own speech sound map node, and activating this node leads to readout of the learned movement sequence. The learning process is schematized in Figure 3.

Figure 3. Illustration of speech sequence learning via “chunking” in the GODIVA model. (A) Network involved in producing the word “snow” early in speech motor development. Cortico-cortical projections are indicated by black arrows. (B) Network involved in producing the word “snow” later in development. The development of basal ganglia (red dashed arrows) and cerebellar (green dashed arrows) loops allows for the use of fewer cortical nodes and projections. [Abbreviations: BG, basal ganglia; Cb, cerebellum; G, gestural node; I, initiation map node; pIFS, posterior inferior frontal sulcus; preSMA, presupplementary motor area; S, syllabic structure node; SMA, supplementary motor area; vMC, ventral primary motor cortex; vPMC, ventral premotor cortex].

The cortico-basal ganglia motor loop accomplishes this automation of frequently used speech sequences in early childhood by encoding these sequences as “chunks” with their own optimized motor programs. This chunking would reduce the processing load on prefrontal and premotor cortical areas (Alm, Reference Alm2004; Redgrave et al., Reference Redgrave, Rodriguez, Smith, Rodriguez-Oroz, Lehericy, Bergman and Obeso2010). For example, the speech motor system of a young child might attempt to produce the word “snow” (Figure 3, Panel A). vMC contains nodes encoding articulatory gestures (labeled G) for the phonemes /s/, /n/, and /ō/. Each phonemic gesture has a corresponding cell in the SMA initiation map (labeled I) that is responsible for initiating the gesture via projections to vMC. During this early stage of development, vPMC does not contain a motor program for the entire syllable /snō/. Instead, the syllable is represented by individual motor programs for each phoneme that must be activated independently via inputs from the pIFS phonological buffer. Similarly, preSMA and pIFS contain only phonemic elements, not larger units such as consonant clusters.

At this stage, production of the word requires activation of the nodes /s/, /n/, and /ō/ in the phonological content buffer in pIFS, as well as the structural representation for /snō/ in the sequential structure buffer in preSMA. Projections from pIFS sequentially activate the vPMC nodes corresponding to the motor programs for /s/, /n/, and /ō/. Projections from these vPMC nodes sequentially activate the matching gestural nodes in vMC. The timing of this sequential activation process is determined by the medial premotor areas. PreSMA-to-SMA projections activate nodes in the initiation map for the individual phonemes in the proper order and with the proper timing. Once a motor program has been completed, the pIFS and vPMC nodes for that program’s elements are deactivated, allowing the next motor program to commence.

Panel B of Figure 3 schematizes the production of /snō/ at a more mature stage of development. At this stage, vPMC contains a motor program for the entire syllable /snō/, with subcortical loops through the cerebellum (green dashed arrows) effectively taking over coordination of the individual motor gestures. The importance of the cerebellum for vocal sequence learning has been empirically supported by pediatric clinical studies and animal lesion models (Ziegler & Ackermann, Reference Ziegler and Ackermann2017; Pidoux, Blanc, Levenes, & Leblois, Reference Pidoux, Blanc, Levenes, Leblois, Raymond and King2018; Glickstein, Reference Glickstein1994). Once these cortical-subcortical loops are established, working memory buffers in preSMA and pIFS will contain cluster-sized sub-syllabic units, thereby reducing the number of items that have to be stored in working memory for /snō/. The task of initiating the gesture for /n/ in /snō/ now gets carried out by the basal ganglia motor loop (red dashed arrow) instead of preSMA.

This learning process reduces the number of pIFS, preSMA, and vPMC nodes that must be activated to produce the word. The required number of cortico-cortical connections (black arrows) has decreased substantially, having been replaced by subcortical communications through the cerebellum (green arrows) and basal ganglia (red arrows). Evidence for speech learning-related reductions in processing load has been demonstrated by neuroimaging studies of nonnative consonant cluster learning (Segawa, Tourville, Beal, & Guenther, Reference Segawa, Tourville, Beal and Guenther2015; Masapollo et al., Reference Masapollo, Segawa, Beal, Tourville, Nieto-Castañón, Heyne and Guenther2021).
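The reduction in cortical load through chunking can be illustrated with a toy sketch. It is a deliberate simplification and not the GODIVA learning rule: the class `ChunkLearner`, the repetition `threshold`, and the idea of promoting a sequence to a single node after a fixed number of productions are assumptions made for the example, standing in for the gradual basal ganglia and cerebellar learning described above.

```python
from collections import Counter


class ChunkLearner:
    """Toy model of sequence chunking (hypothetical, for illustration only)."""

    def __init__(self, threshold=3):
        self.threshold = threshold   # assumed number of repetitions needed
        self.counts = Counter()      # how often each sequence has been produced
        self.chunks = set()          # sequences that now have their own node

    def produce(self, phonemes):
        """Return the motor program nodes activated for this production."""
        sequence = tuple(phonemes)
        if sequence in self.chunks:
            nodes = ["".join(sequence)]   # one learned program for the chunk
        else:
            nodes = list(sequence)        # one program per phoneme
        # Practice: count the production and promote frequent sequences.
        self.counts[sequence] += 1
        if self.counts[sequence] >= self.threshold:
            self.chunks.add(sequence)
        return nodes


learner = ChunkLearner(threshold=3)
loads = [len(learner.produce(["s", "n", "o"])) for _ in range(5)]
print(loads)  # [3, 3, 3, 1, 1]
```

In this sketch the first three productions of "snow" each activate three phoneme-level programs, after which a single chunk-level program suffices. The model itself describes an intermediate outcome in which working memory holds cluster-sized units, which this all-or-nothing example does not capture.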

Summary

This review described neuro-computational approaches for modeling infant and child speech motor development. We first provided an overview of the DIVA model, which characterizes feedforward and feedback mechanisms of speech production controlled by a network of cortical and subcortical loops. The feedforward control system is thought to involve cortico-cortical projections from premotor to motor cortex, as well as contributions from the cerebellum. The auditory and somatosensory feedback control systems monitor the perceptual consequences of speech output, which are compared to sensory predictions transmitted from premotor cortex to higher-order sensory areas. These sensory areas compute error signals, which are sent to motor cortex as corrective motor commands.

We described how early stages of speech motor learning can be simulated with the DIVA model. Speech motor development involves a number of learning processes occurring in a quasi-parallel fashion. Infant babbling and other vocalizations begin tuning forward maps which map motor outputs to resulting auditory and somatosensory perceptions. Auditory maps develop in a way that highlights important acoustic distinctions in a language and de-emphasizes irrelevant distinctions. Analogously, somatosensory maps become sensitive to the tactile and proprioceptive feedback patterns that occur when producing sounds from the native language. Auditory targets for speech sound “chunks” such as phonemes, syllables, and words are formed by monitoring the environment for native language samples, and feedforward commands are tuned as a child attempts to produce these sound chunks.

Next, we addressed computational modeling of a more advanced stage of child speech development, in which longer phonological sequences such as phrases or sentences are produced. Modeling of these processes uses the Gradient Order DIVA (GODIVA) model. High-level language processing regions maintain temporary stores of upcoming phonological content and metrical structure in competitive queues. These regions control the output of the downstream initiation maps and speech sound maps to produce sequences of speech sounds. GODIVA also describes a mechanism of speech sequence learning, or chunking, via cortico-basal ganglia loops. Frequently produced motor sequences that formerly required cortical control for every sequential step are automated into syllabic motor programs controlled mostly by the basal ganglia and cerebellum, reducing cortical processing load as the child proceeds through speech development.

Acknowledgements

This research was funded by the following grants from the National Institutes of Health: R01 DC007683 (F. Guenther, PI), R01 DC016270 (C. Stepp and F. Guenther, MPIs), U01 NS117836 (M. Richardson, PI), and R01 019354 (M. Long, PI).

Competing interest

No competing interests to disclose.

Footnotes

1 In model simulations, the speech recognition system is not implemented; instead, sound identity is provided by the modeler, who labels the speech sounds presented to the model for learning.

References

Alm, P. A. (2004). Stuttering and the Basal Ganglia Circuits: A Critical Review of Possible Relations. Journal of Communication Disorders, 37, 325–369. https://doi.org/10.1016/j.jcomdis.2004.03.001
Bohland, J. W., Bullock, D., & Guenther, F. H. (2010). Neural Representations and Mechanisms for the Performance of Simple Speech Sequences. Journal of Cognitive Neuroscience, 22, 1504–1529. https://doi.org/10.1162/jocn.2009.21306
Bohland, J. W., & Guenther, F. H. (2006). An FMRI Investigation of Syllable Sequence Production. NeuroImage, 32, 821–841. https://doi.org/10.1016/j.neuroimage.2006.04.173
Boysson-Bardies, B. de, Hallé, P., Sagart, L., & Durand, C. (1989). A Crosslinguistic Investigation of Vowel Formants in Babbling. Journal of Child Language, 16, 1–17. https://doi.org/10.1017/S0305000900013404
Boysson-Bardies, B. de, Sagart, L., & Durand, C. (1984). Discernible Differences in the Babbling of Infants According to Target Language. Journal of Child Language, 11, 1–15. https://doi.org/10.1017/S0305000900005559
Callan, D. E., Kent, R. D., Guenther, F. H., & Vorperian, H. K. (2000). An Auditory-Feedback-Based Neural Network Model of Speech Production That Is Robust to Developmental Changes in the Size and Shape of the Articulatory System. Journal of Speech, Language, and Hearing Research, 43, 721–736. https://doi.org/10.1044/jslhr.4303.721
Chenausky, K. V., Brignell, A., Morgan, A. T., Norton, A. C., Tager-Flusberg, H. B., Schlaug, G., & Guenther, F. H. (2021). A Modeling-Guided Case Study of Disordered Speech in Minimally Verbal Children With Autism Spectrum Disorder. American Journal of Speech-Language Pathology, 30, 1542–1557.
Civier, O., Bullock, D., Max, L., & Guenther, F. H. (2013). Computational Modeling of Stuttering Caused by Impairments in a Basal Ganglia Thalamo-Cortical Circuit Involved in Syllable Selection and Initiation. Brain and Language, 126, 263–278.
Civier, O., Tasko, S. M., & Guenther, F. H. (2010). Overreliance on Auditory Feedback May Lead to Sound/Syllable Repetitions: Simulations of Stuttering and Fluency-Inducing Conditions with a Neural Model of Speech Production. Journal of Fluency Disorders, 35, 246–279. https://doi.org/10.1016/j.jfludis.2010.05.002
DeFelipe, J., & Fariñas, I. (1992). The pyramidal neuron of the cerebral cortex: Morphological and chemical characteristics of the synaptic inputs. Progress in Neurobiology, 39, 563–607. https://doi.org/10.1016/0301-0082(92)90015-7
Doyon, J., Bellec, P., Amsel, R., Penhune, V., Monchi, O., Carrier, J., … Benali, H. (2009). Contributions of the basal ganglia and functionally related brain structures to motor learning. Behavioural Brain Research, 199, 61–75. https://doi.org/10.1016/j.bbr.2008.11.012
Dubois, J., Dehaene-Lambertz, G., Kulikova, S., Poupon, C., Hüppi, P. S., & Hertz-Pannier, L. (2014). The early development of brain white matter: A review of imaging studies in fetuses, newborns and infants. Neuroscience, 276, 48–71. https://doi.org/10.1016/j.neuroscience.2013.12.044
Gabrieli, J. D. E., Poldrack, R. A., & Desmond, J. E. (1998). The Role of Left Prefrontal Cortex in Language and Memory. Proceedings of the National Academy of Sciences, 95, 906–913.
Glickstein, M. (1994). Cerebellar Agenesis. Brain, 117, 1209–1212.
Golfinopoulos, E., Tourville, J. A., & Guenther, F. H. (2010). The Integration of Large-Scale Neural Network Modeling and Functional Brain Imaging in Speech Motor Control. NeuroImage, Computational Models of the Brain, 52, 862–874.
Golfinopoulos, E., Tourville, J. A., Bohland, J. W., Ghosh, S. S., Nieto-Castanon, A., & Guenther, F. H. (2011). FMRI Investigation of Unexpected Somatosensory Feedback Perturbation during Speech. NeuroImage, 55, 1324–1338.
Gratier, M., & Devouche, E. (2011). Imitation and repetition of prosodic contour in vocal interaction at 3 months. Developmental Psychology, 47, 67.
Grossberg, S. (1978). A theory of human memory: Self-organization and performance of sensory-motor codes, maps, and plans. Progress in Theoretical Biology, 5, 233–374.
Guenther, F. H. (1995). Speech Sound Acquisition, Coarticulation, and Rate Effects in a Neural Network Model of Speech Production. Psychological Review, 102, 594. https://doi.org/10.1037/0033-295X.102.3.594
Guenther, F. H. (2016). Neural Control of Speech. MIT Press. https://doi.org/10.7551/mitpress/10471.001.0001
Guenther, F. H., & Gjaja, M. N. (1996). The Perceptual Magnet Effect as an Emergent Property of Neural Map Formation. The Journal of the Acoustical Society of America, 100, 1111–1121. https://doi.org/10.1121/1.416296
Guenther, F. H., Husain, F. T., Cohen, M. A., & Shinn-Cunningham, B. G. (1999). Effects of Categorization and Discrimination Training on Auditory Perceptual Space. The Journal of the Acoustical Society of America, 106, 2900–2912.
Hao, Y. C., & Jong, K. (2016). Imitation of second language sounds in relation to L2 perception and production. Journal of Phonetics, 54, 151–168. https://doi.org/10.1016/j.wocn.2015.10.003
Hashimoto, Y., & Sakai, K. L. (2003). Brain activations during conscious self-monitoring of speech production with delayed auditory feedback: An fMRI study. Human Brain Mapping, 20, 22–28. https://doi.org/10.1002/hbm.10119
Hickok, G. (2014). The architecture of speech production and the role of the phoneme in speech processing. Language, Cognition and Neuroscience, 29, 2–20. https://doi.org/10.1080/01690965.2013.834370
Hikosaka, O., Nakamura, K., Sakai, K., & Nakahara, H. (2002). Central mechanisms of motor skill learning. Current Opinion in Neurobiology, 12, 217–222. https://doi.org/10.1016/S0959-4388(02)00307-0
Houde, J. F., & Jordan, M. I. (1998). Sensorimotor Adaptation in Speech Production. Science, 279, 1213–1216. https://doi.org/10.1126/science.279.5354.1213
Houghton, G. (1990). The problem of serial order: A neural network model of sequence learning and recall. In Dale, R., Mellish, C., & Zock, M. (Eds.), Current research in natural language generation (pp. 287–319). San Diego: Academic Press.
Houghton, G., & Hartley, T. (1996). Parallel Models of Serial Behaviour: Lashley Revisited. Psyche: An Interdisciplinary Journal of Research on Consciousness, 2.
Imada, T., Zhang, Y., Cheour, M., Taulu, S., Ahonen, A., & Kuhl, P. K. (2006). Infant Speech Perception Activates Broca’s Area: A Developmental Magnetoencephalography Study. NeuroReport, 17, 957–962. https://doi.org/10.1097/01.wnr.0000223387.51704.89
Jones, S. S. (2009). The development of imitation in infancy. Philosophical Transactions of the Royal Society B: Biological Sciences, 364, 2325–2335. https://doi.org/10.1098/rstb.2009.0045
Kehoe, M., & Stoel-Gammon, C. (1997). The Acquisition of Prosodic Structure: An Investigation of Current Accounts of Children’s Prosodic Development. Language, 73, 113–144. https://doi.org/10.2307/416597
Kerns, J. G., Cohen, J. D., Stenger, V. A., & Carter, C. S. (2004). Prefrontal Cortex Guides Context-Appropriate Responding during Language Production. Neuron, 43, 283–291. https://doi.org/10.1016/j.neuron.2004.06.032
Kessler, B., & Treiman, R. (1997). Syllable Structure and the Distribution of Phonemes in English Syllables. Journal of Memory and Language, 37, 295–311. https://doi.org/10.1006/jmla.1997.2522
Kokkinaki, T., & Kugiumutzakis, G. (2000). Basic aspects of vocal imitation in infant-parent interaction during the first 6 months. Journal of Reproductive and Infant Psychology, 18, 173–187.
Kostović, I., & Jovanov-Milošević, N. (2006). The development of cerebral connections during the first 20–45 weeks’ gestation. Seminars in Fetal and Neonatal Medicine, 11, 415–422.
Kozhevnikov, V. A., & Chistovich, L. A. (1965). Speech: Articulation and Perception. Oxford, England: Nauka.
Kuhl, P. K., Williams, K. A., Lacerda, F., Stevens, K. N., & Lindblom, B. (1992). Linguistic Experience Alters Phonetic Perception in Infants by 6 Months of Age. Science, 255, 606–608. https://doi.org/10.1126/science.1736364
Kuhl, P. K., & Meltzoff, A. N. (1996). Infant vocalizations in response to speech: Vocal imitation and developmental change. The Journal of the Acoustical Society of America, 100, 2425–2438. https://doi.org/10.1121/1.417951
Kumar, S., Joseph, S., Gander, P. E., Barascud, N., Halpern, A. R., & Griffiths, T. D. (2016). A Brain System for Auditory Working Memory. Journal of Neuroscience, 36, 4492–4505. https://doi.org/10.1523/JNEUROSCI.4341-14.2016
Lametti, D. R., Nasir, S. M., & Ostry, D. J. (2012). Sensory Preference in Speech Production Revealed by Simultaneous Alteration of Auditory and Somatosensory Feedback. Journal of Neuroscience, 32, 9351–9358. https://doi.org/10.1523/JNEUROSCI.0404-12.2012
Lashley, K. S. (1951). The problem of serial order in behavior. In Jeffress, L. (Ed.), Cerebral mechanisms in behavior (pp. 112–136). New York: Wiley.
Levelt, W. J. M. (1993). Speaking: From Intention to Articulation. https://doi.org/10.7551/mitpress/6393.001.0001
Levitt, A. G., & Utman, J. G. A. (1992). From Babbling towards the Sound Systems of English and French: A Longitudinal Two-Case Study. Journal of Child Language, 19, 19–49.
Li, L. Y., Ji, X. Y., Liang, F., Li, Y. T., Xiao, Z., Tao, H. W., & Zhang, L. I. (2014). A feedforward inhibitory circuit mediates lateral refinement of sensory representation in upper layer 2/3 of mouse primary auditory cortex. Journal of Neuroscience, 34, 13670–13683.
MacNeilage, P. F., & Davis, B. L. (1990). Acquisition of Speech Production: The Achievement of Segmental Independence. In Speech Production and Speech Modelling (pp. 55–68). Springer. https://doi.org/10.1007/978-94-009-2037-8_3
Masapollo, M., Segawa, J. A., Beal, D. S., Tourville, J. A., Nieto-Castañón, A., Heyne, M., … Guenther, F. H. (2021). Behavioral and Neural Correlates of Speech Motor Sequence Learning in Stuttering and Neurotypical Speakers: An FMRI Investigation. Neurobiology of Language, 2, 106–137.
Mattys, S. L., Jusczyk, P. W., Luce, P. A., & Morgan, J. L. (1999). Phonotactic and Prosodic Effects on Word Segmentation in Infants. Cognitive Psychology, 38, 465–494.
Max, L., Guenther, F. H., Gracco, V. L., Ghosh, S. S., & Wallace, M. E. (2004). Unstable or Insufficiently Activated Internal Models and Feedback-Biased Motor Control as Sources of Dysfluency: A Theoretical Model of Stuttering. Contemporary Issues in Communication Science and Disorders, 31, 105–122. https://doi.org/10.1044/cicsd_31_S_105
Miller, H. E., & Guenther, F. H. (2021). Modelling Speech Motor Programming and Apraxia of Speech in the DIVA/GODIVA Neurocomputational Framework. Aphasiology, 35, 424–441. https://doi.org/10.1080/02687038.2020.1765307
Mitchell, P. R., & Kent, R. D. (1990). Phonetic Variation in Multisyllable Babbling. Journal of Child Language, 17, 247–265.
Myers, E. B., Blumstein, S. E., Walsh, E., & Eliassen, J. (2009). Inferior Frontal Regions Underlie the Perception of Phonetic Category Invariance. Psychological Science, 20, 895–903. https://doi.org/10.1111/j.1467-9280.2009.02380.x
Naskar, S., Qi, J., Pereira, F., Gerfen, C. R., & Lee, S. (2021). Cell-type-specific recruitment of GABAergic interneurons in the primary somatosensory cortex by long-range inputs. Cell Reports, 34, 108774. https://doi.org/10.1016/j.celrep.2021.108774
Nathani, S., Ertmer, D. J., & Stark, R. E. (2006). Assessing Vocal Development in Infants and Toddlers. Clinical Linguistics & Phonetics, 20, 351–369.
Nozari, N., Dell, G. S., & Schwartz, M. F. (2011). Is comprehension necessary for error detection? A conflict-based account of monitoring in speech production. Cognitive Psychology, 63, 1–33. https://doi.org/10.1016/j.cogpsych.2011.05.001
Okada, K., Matchin, W., & Hickok, G. (2018). Neural evidence for predictive coding in auditory cortex during speech production. Psychonomic Bulletin & Review, 25, 423–430.
Oller, D. K. (1980). The Emergence of the Speech Capacity in Infancy. In Yeni-Komshian, G. H., Kavanagh, J. F., & Ferguson, C. A. (Eds.), Child Phonology (pp. 93–112). Academic Press.
Ozker, M., Doyle, W., Devinsky, O., & Flinker, A. (2022). A cortical network processes auditory error signals during human speech production to maintain fluency. PLoS Biology, 20, 3001493. https://doi.org/10.1371/journal.pbio.3001493
Pidoux, L., Blanc, P. L., Levenes, C., & Leblois, A. (2018). A Subcortical Circuit Linking the Cerebellum to the Basal Ganglia Engaged in Vocal Learning (Raymond, J. L. & King, A. J., Eds.). https://doi.org/10.7554/eLife.32167CrossRefGoogle Scholar
Poldrack, R. A., Wagner, A. D., Prull, M. W., Desmond, J. E., Glover, G. H., & Gabrieli, J. D. E. (1999). Functional Specialization for Semantic and Phonological Processing in the Left Inferior Prefrontal Cortex. NeuroImage, 10, 1535.CrossRefGoogle ScholarPubMed
Redgrave, P., Rodriguez, M., Smith, Y., Rodriguez-Oroz, M. C., Lehericy, S., Bergman, H., … Obeso, J. A. (2010). Goal-Directed and Habitual Control in the Basal Ganglia: Implications for Parkinson’s Disease. Nature Reviews Neuroscience, 11, 760772.10.1038/nrn2915CrossRefGoogle ScholarPubMed
Rong, F., Isenberg, A. L., Sun, E., & Hickok, G. (2018). The neuroanatomy of speech sequencing at the syllable level. PloS One, 13, 0196381.CrossRefGoogle ScholarPubMed
Rottschy, C., Langner, R., Dogan, I., Reetz, K., Laird, A. R., Schulz, J. B., Fox, P. T., & Eickhoff, S. B. (2012). Modelling neural correlates of working memory: a coordinate-based meta-analysis. NeuroImage, 60(1), 830846. https://doi.org/10.1016/j.neuroimage.2011.11.050CrossRefGoogle ScholarPubMed
Segawa, J. A., Tourville, J. A., Beal, D. S., & Guenther, F. H. (2015). The Neural Correlates of Speech Motor Sequence Learning. Journal of Cognitive Neuroscience, 27, 819831.10.1162/jocn_a_00737CrossRefGoogle ScholarPubMed
Shima, K., & Tanji, J. (2000). Neuronal Activity in the Supplementary and Presupplementary Motor Areas for Temporal Organization of Multiple Movements. Journal of Neurophysiology, 84, 21482160.CrossRefGoogle ScholarPubMed
Stark, R. E. (1980). Stages of Speech Development in the First Year of Life. In Child Phonology (pp. 7392). Elsevier.10.1016/B978-0-12-770601-6.50010-3CrossRefGoogle Scholar
Sun, Y., & Poeppel, D. (2022). Syllables and Their Beginnings Have a Special Role in the Mental Lexicon. PsyArXiv. https://doi.org/10.31234/osf.io/c9tx2CrossRefGoogle Scholar
Terband, H., Maassen, B., Guenther, F. H., & Brumberg, J. (2009). Computational Neural Modeling of Speech Motor Control in Childhood Apraxia of Speech (CAS. Journal of Speech, Language, and Hearing Research, 52, 15951609.10.1044/1092-4388(2009/07-0283)CrossRefGoogle ScholarPubMed
Tourville, J. A., & Guenther, F. H. (2011). The DIVA Model: A Neural Theory of Speech Acquisition and Production. Language and Cognitive Processes, 26, 952981.10.1080/01690960903498424CrossRefGoogle Scholar
Urrutia-Piñones, J., Morales-Moraga, C., Sanguinetti-González, N., Escobar, A. P., & Chiu, C. Q. (2022). Long-range gabaergic projections of cortical origin in brain function. Frontiers in Systems Neuroscience, 16.10.3389/fnsys.2022.841869CrossRefGoogle ScholarPubMed
Weerathunge, H. R., Alzamendi, G. A., Cler, G. J., Guenther, F. H., Stepp, C. E., & Zañartu, M. (2022). LaDIVA: A Neurocomputational Model Providing Laryngeal Motor Control for Speech Acquisition and Production. PLOS Computational Biology, 18, 1010159.10.1371/journal.pcbi.1010159CrossRefGoogle ScholarPubMed
Ziegler, W., & Ackermann, H. (2017). Subcortical Contributions to Motor Speech: Phylogenetic, Developmental, Clinical. Trends in Neurosciences, 40, 458–468. https://doi.org/10.1016/j.tins.2017.06.005

Figure 1. Neural correlates of the DIVA model. The main neural output of the model is provided by the vMC Articulator Map, which integrates feedforward commands from VL and the Speech Sound Map with feedback commands from VL and the Feedback Control Map. [Abbreviations: Cb=cerebellum (specific lobule unknown); Cb-VI=cerebellum lobule VI; GP=globus pallidus; MG=medial geniculate nucleus of the thalamus; pAC=posterior auditory cortex; SMA=supplementary motor area; SNr=substantia nigra pars reticulata; VA=ventral anterior nucleus of the thalamus; VL=ventral lateral nucleus of the thalamus; vMC=ventral motor cortex; VPM=ventral posterior medial nucleus of the thalamus; vPMC=ventral premotor cortex; vSC=ventral somatosensory cortex.]

Table 1. Time-courses for development of the major capacities of the speech motor system. The estimated amount of learning occurring in a neural system within a given time window is indicated as being Low, Medium, or High. [Abbreviations: Aud.=auditory; Som.=somatosensory.]

Figure 2. Simplified schematic of the GODIVA network model for speech sequence production. [Abbreviations: GP, globus pallidus; pIFS, posterior inferior frontal sulcus; preSMA, presupplementary motor area; SMA, supplementary motor area; VA, ventral anterior thalamic nucleus; VL, ventral lateral thalamic nucleus; vPMC, ventral premotor cortex].

Figure 3. Illustration of speech sequence learning via “chunking” in the GODIVA model. (A) Network involved in producing the word “snow” early in speech motor development. Cortico-cortical projections are indicated by black arrows. (B) Network involved in producing the word “snow” later in development. The development of basal ganglia (red dashed arrows) and cerebellar (green dashed arrows) loops allows for the use of fewer cortical nodes and projections. [Abbreviations: BG, basal ganglia; Cb, cerebellum; G, gestural node; I, initiation map node; pIFS, posterior inferior frontal sulcus; preSMA, presupplementary motor area; S, syllabic structure node; SMA, supplementary motor area; vMC, ventral primary motor cortex; vPMC, ventral premotor cortex].