Introduction
Acquiring novel L2 sounds in a foreign language context can be challenging, particularly given the probable scarcity of authentic target language input. Against this background, a possible source of specialized target language experience is high variability phonetic training (HVPT), which exposes learners to highly variable input (i.e., a variety of talkers, stimuli, and phonetic contexts) to provide them with the kind of variability present in real communicative situations. Thus, HVPT allows learners to attend to the aspects of the stimuli that are crucial to identifying and distinguishing L2 categories and to disregard talker-specific or context-specific characteristics (Lively, Logan & Pisoni, 1993). In addition, HVPT is used to draw learners’ attention to particularly challenging target structures through the use of immediate corrective feedback (Logan & Pruitt, 1995; Thomson, 2018). The efficacy of phonetic training is typically assessed by contrasting the trained learners’ performance before and after training and also by comparing it with that of a control group of untrained learners. Furthermore, HVPT is expected to promote the generalization of learning to untrained structures, such as new talkers, sounds, stimuli, and phonetic contexts, which is believed to indicate the formation of robust L2 categories (Logan & Pruitt, 1995). The results of numerous studies generally support the efficacy of HVPT for enhancing the perception and production of L2 sounds and for promoting generalization of knowledge (see Thomson, 2018, for an overview).
For instance, several studies comparing HVPT with low variability phonetic training (LVPT, i.e., stimuli from a single talker) have shown that only the former results in generalization of learning (Lively et al., 1993; Perrachione, Lee, Ha & Wong, 2011), although some recent studies have questioned this advantage of HVPT over LVPT and pointed to other factors in addition to talker variability that may contribute to generalization (Brekelmans, Lavan, Saito, Clayards & Wonnacott, 2022; Zhang, Cheng & Zhang, 2021a; Zhang, Cheng, Qin & Zhang, 2021b). Another measure of robust learning is the degree to which the improvement obtained through training is maintained for some time after training has ended. This is referred to as retention and is typically measured by means of a delayed test identical to the pretest and posttest. HVPT studies have found evidence of retention of learning up to three months after training (Lively, Pisoni, Yamada, Tohkura & Yamada, 1994), and very few report retention after a longer period (Iverson & Evans, 2009; Thomson, 2018). The current study thus explores the potential of HVPT further by contrasting, within a single study and unlike most previous work, the effect of two different perceptual training methods, namely identification and discrimination tasks, on both the ability to identify and the ability to discriminate target language sounds. In addition, the effectiveness of the training tasks is evaluated by analyzing whether the training methods result in generalization and retention of learning.
Hence, the learners’ ability to identify and discriminate target sounds in stimuli not used in training (new nonword and real word stimuli produced by new talkers) is examined right before and after training, and four months after training. The characteristics of the two training methods are described next.
Perceptual training tasks
Most perceptual training studies make use of identification (ID) tasks (e.g., Lengeris & Hazan, 2010; Iverson, Pinet & Evans, 2012), which require listeners to identify or label a given aural stimulus; some use discrimination (DIS) tasks (e.g., Strange & Dittmann, 1984; Georgiou, 2021), in which listeners indicate whether two (or more) aural stimuli belong to the same category or not; and some have used a combination of perceptual tasks (e.g., Shinohara & Iverson, 2018, 2021). Yet, few studies have actually compared the effectiveness of different tasks on different abilities in the same study. Generally, ID tasks have been found to improve identification (e.g., Lambacher, Martens, Kakehi, Marasinghe & Molholt, 2005; Iverson & Evans, 2009) and DIS tasks to improve discrimination (Georgiou, 2021). Some studies examining cross-task effects have reported a greater efficacy of ID training. For instance, Jamieson and Morosan (1986) found that identification training resulted in improved identification and discrimination of trained synthetic stimuli as well as of untrained natural stimuli. By contrast, Strange and Dittmann (1984) found that DIS training improved the identification and the discrimination of the English /r/-/l/ contrast but did not result in the generalization of learning to natural stimuli. However, these early studies used synthetic stimuli and did not involve high variability, which may account for the limited results reported for DIS training.
Some later studies that employed HVPT have reported that ID training successfully improves identification but has little effect on discrimination (Lengeris & Hazan, 2010; Iverson et al., 2012). Lengeris and Hazan (2010) found that identification training with natural stimuli improved the identification but not the discrimination of L2 English vowels by Greek speakers. The fact that identification was tested under different conditions (natural and synthetic stimuli) but discrimination was tested only using synthetic stimuli may explain the lack of an effect of ID training on discrimination. Regarding Iverson et al. (2012), the advantage of ID training over DIS training may be the result of procedural learning for the former, as pre- and posttraining tests examined identification only, giving a task familiarity advantage to ID trainees.
Arguments for the apparent advantage of ID have also been linked to a difference in the nature of the two tasks, as DIS may draw listeners’ attention to variability within the same category and tap into lower levels of phonological processing, while ID focuses on variability between categories and involves higher levels of phonological encoding that may be more relevant for L2 categorization (Jamieson & Morosan, 1986; Logan & Pruitt, 1995; Iverson et al., 2012). This difference is particularly notable when discrimination tasks involve auditory discrimination, that is, when “same” trials contain physically identical stimuli and “different” trials may involve physically different stimuli from the same phoneme category (Polka, 1992; Strange, 1992). Discrimination tasks that are categorical in nature, that is, where “same” trials consist of physically different stimuli representative of the same category (e.g., different productions by the same speaker or by different speakers), may in fact involve a similar level of processing as identification tasks. To explore this further, the current study compares the use of ID tasks and specifically categorical DIS tasks in HVPT.
Few studies have directly compared the effect of ID and DIS training in the same study. Flege (1995) examined whether a categorical AX discrimination task or a two-alternative forced choice identification task was more effective for training Mandarin learners of English to identify English word-final /d/ and /t/. The results indicated that both types of tasks were equally effective and led to the generalization of learning to untrained stimuli, contrary to earlier findings using auditory discrimination tasks (e.g., Strange & Dittmann, 1984). Similarly, other studies have provided evidence for the effectiveness of both ID and categorical DIS tasks in improving the discrimination of Thai tones (Wayland & Li, 2008), the use of cue-weighting in the perception of the English /iː/-/ɪ/ contrast (Wee, Grenon, Sheppard & Archibald, 2019), and the identification and discrimination of the English /r/ and /l/ contrast (Shinohara & Iverson, 2018).
However, divergent results for ID training and categorical DIS training have also been reported. In a study involving Japanese learners of English, Nozawa (2015) compared the effect of ID training and categorical AXB DIS training on the identification of English vowels and coda nasals. In this case, the results showed comparable positive effects of ID and DIS training for coda nasals, but ID training yielded better results for vowel identification. A similar finding was reported by Carlet and Cebrian (2022), who also found a greater benefit of ID training for vowel identification but comparable effects of ID and DIS training on stop identification. Greater improvement with ID tasks has also been reported for the perception of the /z/ vs. /dz/ contrast in coda position (e.g., rose vs. roads; Law, Grenon, Sheppard & Archibald, 2019). On the other hand, Carlet and Cebrian (2022) also report an example of a possible benefit of DIS tasks. Their study explored the effect of training on implicitly exposed but untargeted sounds, in addition to the specifically targeted sounds; two groups were trained on vowels and two other groups were trained on stops, with the same set of CVC stimuli, and all groups were tested on all (i.e., targeted and untargeted) sounds. Interestingly, only the AX DIS training led to an enhanced perception of untargeted L2 sounds. The authors provide several explanations for this difference, including the possibility that, unlike ID training, which directs the listeners’ attention to the sound that is to be identified, DIS training may allow listeners to attend to the whole stimulus, paying attention to other sounds present in the stimulus in addition to the targeted sounds.
In brief, previous studies comparing ID and (categorical) DIS tasks show comparable results (Flege, 1995; Shinohara & Iverson, 2018) or a certain advantage of ID training (Nozawa, 2015; Carlet & Cebrian, 2022). Yet, comparisons across studies are complicated by differences in study design. For instance, neither Nozawa (2015) nor Wee et al. (2019) included a control group or a test of generalization or retention. Further, except for Shinohara and Iverson (2018), who tested both identification and discrimination, and Wayland and Li (2008), who tested discrimination only, most studies used only identification tasks at the pretest and posttest (Flege, 1995; Nozawa, 2015; Carlet & Cebrian, 2022), which may also have contributed to the advantage of ID training due to procedural learning. Finally, while Shinohara and Iverson (2018) compared ID and DIS training on both identification and discrimination abilities, the study did not include a control group, and discrimination training included both auditory discrimination and categorical discrimination tasks. The current study thus contrasts the effect of ID and DIS (specifically, categorical DIS) training on both the ability to identify and the ability to discriminate L2 vowels, includes a control group, and assesses generalization and retention of learning.
Generalization is examined in the current study by evaluating the learners’ ability to identify and discriminate L2 vowels in new nonwords and real words produced by new talkers after undergoing training with nonword stimuli. The use of nonword training stimuli responds to the need to avoid the potential effects of word familiarity and orthographic interference found with real words. In fact, previous works show that phonetically-oriented training using nonwords (as opposed to lexically-oriented training with real words) may be more efficient at forcing the trainees’ attention to the important phonetic details that facilitate the perception of different L2 categories, thus improving L2 perception (Carlet & Cebrian, 2022) and production (Thomson & Derwing, 2016; Ortega, Mora-Plaza & Mora, 2021; Mora, Ortega, Mora-Plaza & Aliaga-García, 2022). On the other hand, studies also indicate that perception of L2 contrasts may be facilitated when sounds are presented in a lexical context. For instance, previous studies reported that adult L2 learners were better at discriminating (Mora, 2005) and identifying (Rato & Carlet, 2020) challenging L2 phones in real words than in nonwords, showing that lexical representations may play a role in the perception of segmental L2 contrasts (Yamada, Tohkura & Kobayashi, 1997).
The present study
The main purpose of the current study is to examine the effectiveness of two perceptual training tasks (identification [ID] and categorical discrimination [DIS]) for training Spanish/Catalan-speaking learners of L2 English to discriminate and identify challenging English vowel sounds. The efficacy of each perceptual task is assessed by comparing trainees with an untrained control group on their ability to identify and discriminate the target vowels in stimuli not present in training, namely in new nonword stimuli and real word stimuli produced by new talkers. In addition, the study examines whether the expected improvement in identification and discrimination as a result of HVPT is retained four months after the completion of the training regime. Based on previous research on HVPT, we expect that identification training will improve identification, and discrimination training will improve discrimination (e.g., Thomson, 2018). Further, given the categorical nature of the discrimination task used, we expect there will be cross-task effects and trainees will improve both the trained and the untrained ability (Flege, 1995; Wayland & Li, 2008; Shinohara & Iverson, 2018), although improvement in the trained ability may be greater due to procedural learning (Iverson et al., 2012; Nozawa, 2015). Finally, improvement is predicted to generalize to perception in new nonwords and in real words following previous studies that show an advantage of using nonword training stimuli, linked to a greater focus on phonetic form (Thomson & Derwing, 2016; Ortega et al., 2021; Carlet & Cebrian, 2022).
Better overall performance with real words than with nonwords may be observed due to the role of lexical representations in L2 segmental perception (Yamada et al., 1997).
Methodology
Participants
Participants in this study were, initially, 44 Spanish/Catalan bilingual speakers (average age 19.4 years, 39 females), who were first-year students of English studies at a public university in Barcelona. Their exposure to English was mostly through their university classes as none had spent more than two months in an English-speaking country. No hearing problems were reported. The 44 participants were randomly distributed into two experimental groups and a control group (CG), although eventually, only 38 participants completed all the tests: 13 in the ID training group (IDG), 14 in the DIS training group (DISG), and 11 in CG. All groups were tested before training (pretest), after training (posttest), and four months after that (delayed posttest). Participants in CG were untrained from the pretest to the posttest, although they were given a combined DIS+ID training after the posttest and completed a second posttest afterward (posttest2). All participants received a small stipend.
Stimuli
The focus of the study was the Southern Standard British English (SSBE) vowels /iː ɪ æ ʌ ɜː/, which are challenging for Catalan/Spanish learners of English, especially the /iː/-/ɪ/ and /æ/-/ʌ/ vowel contrasts (e.g., Cebrian, Gorba & Gavaldà, 2021; Mora et al., 2022). English /ɜː/ was contrasted with two potentially confusable vowels, /ɛ/ and /ɑː/; therefore, within-trial contrasts in discrimination tasks and across-trial contrasts in identification tasks involved the /iː/-/ɪ/ and the /æ/-/ʌ/ vowel pairs, as well as the /ɛ/-/ɜː/ and the /ɑː/-/ɜː/ pairs. Thus, the stimuli consisted of monosyllabic CVC nonwords and real words containing the SSBE vowels /iː ɪ ɛ ɜː æ ʌ ɑː/, where the vowel was preceded and followed by an obstruent. The words were elicited from six talkers who were native speakers of SSBE and had spent most of their lives in the south of England (three females, three males, mean age: 27.8). None reported speaking any other language fluently or having any knowledge of Spanish or Catalan. Stimuli were embedded in a carrier sentence that facilitated the pronunciation of the nonwords (e.g., It rhymes with badge: dadge. I say dadge now. I say dadge again). All recordings took place in a soundproof chamber at a university in London, England, using Cool Edit 2000 software, a Rode Simply NT1-A microphone, and an Edirol UA-25 audio interface, and they were digitized at a 44.1 kHz sampling rate and 16-bit quantization. Two speakers (a male and a female) provided the testing stimuli (new nonwords and real words), and the remaining four (two male, two female) provided the training stimuli (all nonwords). Three native English speakers identified the selected stimuli accurately and consistently in an identification and goodness rating task.
The stimuli used for training were nonwords (e.g., jeet, jit, dadge, dudge; see Table 1 for the complete list of training stimuli). There were 12 words per vowel except for vowel /ɜː/, for which there were four additional words to have enough nonword CVC sequences containing this vowel that could be contrasted with /ɑː/ and /ɛ/ in discrimination training.
* Note: Some /ɜː/ items appear twice as /ɜː/ was contrasted with /ɛ/ in half the trials and with /ɑː/ in the other half.
Testing stimuli, which were used in the pretest, posttest, posttest2, and delayed test, consisted of a new set of nonwords not used in training and a set of real words. Twenty-four real words and 24 nonwords were used in the discrimination test (four words per vowel except for /ɛ/ and /ɑː/, with two words each, contrasting with four /ɜː/ words). Twenty-six real words and nonwords were used in the identification tests, which were basically the same words used in the discrimination tests plus additional /ɑː/ and /ɛ/ words to obtain a balanced number of stimuli per vowel (see the procedure section and Appendix A).
Procedure
The training was carried out by means of a seven-alternative forced-choice identification task (ID) and a categorical same/different AX discrimination task (DIS). Participants in both training regimes (IDG and DISG) were presented with the same number of stimuli; ID involved stimuli being presented individually whereas DIS presented stimuli in pairs. Thus, there were twice as many trials in each ID training session as in each DIS session. The control group was trained between posttest and posttest2 with three DIS sessions followed by three ID sessions. Testing involved the identification and discrimination of the target vowels in nonwords and real words. The pretest, posttest, posttest2 (for CG) and delayed test were exactly the same. The participants also completed a perceptual assimilation task and a production task that are not reported in the current paper.
Training consisted of six 30-min sessions that took place over several weeks at a phonetics laboratory at a Spanish university. The software used was TP (Rauber, Rato, Kluge & Santos, 2011). In identification training (ID), a stimulus (nonword) was delivered through headphones at a comfortable sound level. Seven response options with a phonetic symbol and example words (i.e., /æ/ ash/mass, /ɑː/ arm/palm, /e/ less/west, /ɜː/ earth/first, /ɪ/ fish/his, /iː/ cheese/leaf, /ʌ/ sun/thus) were displayed on the screen. Due to restrictions of the TP software, some of the phonetic symbols were displayed with regular characters (i.e., /a:/ for /ɑː/, /3:/ for /ɜː/, /I/ for /ɪ/, /^/ for /ʌ/). Participants clicked on one of the seven options and received immediate feedback indicating the correct response. At the end of each session, participants were shown a global result (% correct answers). Each identification training session consisted of 480 trials, with a break after 240 trials. For each 240-trial section, there were 36 trials for each of the vowels /iː ɪ æ ʌ ɑː ɜː/ and 24 trials involving /ɛ/ (a smaller number given that this vowel was not expected to pose a problem and was included to be contrasted with /ɜː/ in discrimination). Discrimination training (DIS) was implemented by means of a categorical AX discrimination (same/different) task in which participants had to indicate whether two given stimuli (produced by a female and a male speaker) contained the same or different vowels. Participants responded by clicking on one of two options (same or different) displayed on the screen. The order of the vowels and the talkers was counterbalanced throughout the tasks. There were 120 same-category and 120 different-category trials per DIS session. For each set of 120 different-category trials, there were 40 each for /æ/-/ʌ/ and /ɪ/-/iː/, and 20 each for /ɜː/-/ɛ/ and /ɜː/-/ɑː/.
Regarding the 120 same-category trials, there were 20 each for /æ/-/æ/, /ʌ/-/ʌ/, /iː/-/iː/, /ɪ/-/ɪ/, and /ɜː/-/ɜː/, and 10 each for /ɛ/-/ɛ/ and /ɑː/-/ɑː/. Immediate feedback was provided after each trial (correct or incorrect answer), and a global result was given at the end of the session.
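The per-session trial counts described above can be checked with a short tally. The following is only an illustrative sketch: the dictionary and variable names are hypothetical and do not come from the TP configuration used in the study; the numbers are the ones reported in the text.

```python
# Trial composition of one training session, using the counts given in the text.
# All names are illustrative; they are not taken from the study materials.

# ID training: trials per vowel within each 240-trial block.
ID_TRIALS_PER_BLOCK = {
    "iː": 36, "ɪ": 36, "æ": 36, "ʌ": 36, "ɑː": 36, "ɜː": 36,
    "ɛ": 24,  # fewer trials: not expected to be difficult
}
ID_BLOCKS_PER_SESSION = 2  # a break separated the two blocks

# DIS training: different-category and same-category trials per session.
DIS_DIFFERENT = {("æ", "ʌ"): 40, ("ɪ", "iː"): 40, ("ɜː", "ɛ"): 20, ("ɜː", "ɑː"): 20}
DIS_SAME = {"æ": 20, "ʌ": 20, "iː": 20, "ɪ": 20, "ɜː": 20, "ɛ": 10, "ɑː": 10}

id_block_trials = sum(ID_TRIALS_PER_BLOCK.values())
id_session_trials = id_block_trials * ID_BLOCKS_PER_SESSION
dis_session_trials = sum(DIS_DIFFERENT.values()) + sum(DIS_SAME.values())

# ID presents one stimulus per trial, AX DIS presents two.
id_session_stimuli = id_session_trials * 1
dis_session_stimuli = dis_session_trials * 2

print(id_block_trials, id_session_trials, dis_session_trials)
print(id_session_stimuli == dis_session_stimuli)
```

The tally makes explicit why the two regimes are matched on exposure: an ID session has twice as many trials as a DIS session, but each DIS trial contains two stimuli.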
Regarding the pre- and posttraining tests (pretest, posttest, posttest2 for CG, and delayed test), the ID tests included four words per vowel, each word produced by a male and a female speaker, and repeated twice (except for /ɛ/, which had fewer stimuli as explained above). The total number of trials was 104 (see Appendix A for a list of all the stimuli, number of talkers, repetitions, and total number of trials per test). The response alternatives used in ID testing were has/mass, palm/arch, send/mess, sir/earth, his/lift, cheese/leaf, and sun/thus. Some of these words were different from the options used in training, but they were equivalent in terms of syllabic structure and final consonants, which were different from the ones found in training words. The AX DIS task contained 96 trials (48 same-category and 48 different-category trials). Different-category trials consisted of four pairs of words each for /iː/-/ɪ/ and /æ/-/ʌ/, and two pairs of words each for /ɑː/-/ɜː/ and /ɛ/-/ɜː/. Each pair of words appeared four times to counterbalance the order of the vowels (V1-V2, V2-V1) and the talkers (T1-T2, T2-T1). Same-category trials consisted of four pairs of words for each of the vowels /iː ɪ æ ʌ ɜː/ and two word pairs for each of the vowels /ɑː/ and /ɛ/, and the order of talkers was also counterbalanced. The interstimulus interval was 1.15 s, long enough to prevent reliance on sensory memory and facilitate access to phonetic information stored in long-term memory (e.g., Højen & Flege, 2006). The DIS and ID tests were completed on the same day and were the only tasks completed that day. The order of the tests was the following: DIS real words, ID real words, DIS nonwords, ID nonwords. The first DIS and ID tests were preceded by a short practice session consisting of eight trials to familiarize participants with the task and adjust the volume if necessary. Participants took between 25 and 35 min to complete all four tests.
All tests (pretest, posttest, posttest2, and delayed test) were completed using Praat (Boersma & Weenink, 2018).
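The test-trial totals reported above (104 ID trials, 96 AX DIS trials) follow from the counterbalancing scheme. The sketch below reconstructs them under two assumptions that are not stated verbatim in the text: that the 26 ID words were each presented by two talkers and repeated twice, and that each same-category pair was presented in both talker orders. Function and variable names are hypothetical.

```python
from itertools import product

TALKER_ORDERS = [("T1", "T2"), ("T2", "T1")]

def counterbalance(word_a, word_b):
    """Return the four AX trials for one different-category word pair:
    both vowel orders crossed with both talker orders."""
    vowel_orders = [(word_a, word_b), (word_b, word_a)]
    return [
        ((first, t_first), (second, t_second))
        for (first, second), (t_first, t_second) in product(vowel_orders, TALKER_ORDERS)
    ]

# Number of word pairs per contrast or vowel, as reported in the text.
DIFFERENT_PAIRS = {"iː-ɪ": 4, "æ-ʌ": 4, "ɑː-ɜː": 2, "ɛ-ɜː": 2}
SAME_PAIRS = {"iː": 4, "ɪ": 4, "æ": 4, "ʌ": 4, "ɜː": 4, "ɑː": 2, "ɛ": 2}

trials_per_different_pair = len(counterbalance("word_a", "word_b"))
different_trials = sum(DIFFERENT_PAIRS.values()) * trials_per_different_pair
# Assumption: same-category pairs were counterbalanced for talker order only.
same_trials = sum(SAME_PAIRS.values()) * len(TALKER_ORDERS)
dis_test_trials = different_trials + same_trials

# Assumption: 26 words x 2 talkers x 2 repetitions in each ID test.
id_test_trials = 26 * 2 * 2

print(different_trials, same_trials, dis_test_trials, id_test_trials)
```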
Data analysis
The effects of the two training methods (ID and DIS training) on the identification and discrimination of English vowels presented in nonword and real word stimuli were examined by analyzing participants’ results at pretest, posttest, posttest2 (for CG), and delayed test. Score (correctly or incorrectly identified or discriminated; see footnote 1) was the dependent variable. Two logistic mixed effects models were used (one for identification, and one for discrimination). Group (IDG, DISG, CG), test (pretest, posttest, posttest2, delayed test), word type (real word, nonword), and all possible two-way and three-way interactions were included as fixed factors. Subject-specific random intercepts and random slopes for time as well as word-specific random intercepts and slopes for talker were considered as random effects (Barr, Levy, Scheepers & Tily, 2013; Matuschek, Kliegl, Vasishth, Baayen & Bates, 2017). Random slopes were eliminated in the final model because the model did not converge in the case of the discrimination data, and so as to have comparable models for identification and discrimination. The difference between including or excluding random slopes in the case of the identification data was minimal and did not affect the levels of significance. Tukey’s correction was used in pairwise comparisons. The analyses were performed using the GLIMMIX procedure of the SAS software (SAS Institute Inc., Cary, NC, USA). The significance level was set to 0.05. The results for identification are presented first, followed by the discrimination results.
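For readers who prefer a formula, one way to write the final (random-intercepts-only) model described above is sketched below. The notation is ours and is an assumption, not taken from the paper or from the SAS output: $y_{ijt}$ is the binary score of participant $i$ on word $j$ at test $t$, $g(i)$ is the participant's group, $w(j)$ is the word type of item $j$, and $u_i$ and $v_j$ are the participant and word random intercepts.

```latex
\begin{aligned}
\operatorname{logit}\,\Pr(y_{ijt} = 1) ={}& \beta_0 + \beta^{G}_{g(i)} + \beta^{T}_{t} + \beta^{W}_{w(j)} \\
&+ \beta^{GT}_{g(i),t} + \beta^{GW}_{g(i),w(j)} + \beta^{TW}_{t,w(j)} + \beta^{GTW}_{g(i),t,w(j)} \\
&+ u_i + v_j, \\
u_i \sim \mathcal{N}(0, \sigma_u^2), \quad & v_j \sim \mathcal{N}(0, \sigma_v^2)
\end{aligned}
```

The same structure applies to both the identification and the discrimination model; only the outcome differs.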
Results
Identification results
The results for the pretest, posttest, posttest2 (for CG), and delayed test are presented in Table 2, which shows the mean % correct identification of the target English vowels in nonword and real word stimuli per group and test (see footnote 2). The mean identification accuracy scores and confidence intervals are graphically presented in Figure 1. The outcome of the logistic mixed effects model is given in Table 3. Test yielded a significant main effect (p < .001), explained by the general increase in identification accuracy from pretest (58%) to posttest (70.8%) and delayed test (76.5%), across groups and word type (see footnote 3). Identification scores were numerically higher in real words (74.4%) than in nonwords (65.2%), but the effect of word type did not reach significance. There was no effect of group, but the interaction between test and group was significant (p < .001), as was the interaction between test and word type (p = .005). No other interactions reached significance (see Table 3). Significant interactions were examined through pairwise comparisons with a Tukey correction (the results of all the pairwise comparisons are presented in Appendix B). Regarding the word type by test interaction, vowel identification in real word stimuli was significantly more accurate than in nonwords at posttest (p = .047) and the difference was marginally significant at delayed test (p = .05), but the two word types did not differ at pretest (p = .251) or posttest2 (p = .149). With respect to the test by group interaction, pairwise comparisons indicated that, for the trained groups, the difference between pretest and posttest results was significant at the p < .001 level. CG showed some nonsignificant improvement from the pretest to the posttest (5.7 percent points), which may be the result of continued exposure to the target language and familiarity with the task at the posttest.
However, CG showed a much greater and significant improvement from posttest to posttest2 after undergoing training (15.1 percent points, from 66.8% to 81.9%), p < .001. The identification scores at the delayed test were significantly higher than at the pretest for all groups but did not differ from posttraining scores (posttest for IDG and DISG, posttest2 for CG), showing that the improvement from pre- to posttraining test was maintained at delayed test. The pairwise comparisons also showed that groups did not differ significantly at pretest (IDG: 57.4%, DISG: 55.8%, CG: 61.1%, across word type). At posttest, IDG’s identification scores were significantly higher than those of the other two groups (IDG: 79.7%, DISG: 67.1%, CG: 66.8%), p < .05 in both cases. Finally, as an alternative way of exploring the test by group interaction, group results were compared in terms of the difference between the pretest and the posttest. IDG and DISG were compared on the amount of improvement from the pretest to the posttest, and the two trained groups together were compared to CG. The results indicated that IDG’s improvement was greater than DISG’s (22.3 and 11.3 percent points, respectively, t = 3.29, p = .001) and that trainees’ improvement from pretest to posttest (IDG and DISG together) was significantly greater than CG’s (16.8 and 5.7 percent points, respectively, t = -3.15, p = .0016). Therefore, ID training and DIS training resulted in a significant improvement in identification accuracy, in contrast to the lack of significant improvement for CG, and the improvement was greater for IDG than for DISG, and for the trained groups (IDG and DISG together) than for CG. These outcomes are revisited in the discussion section in light of the study’s predictions. The results of the discrimination tests are presented next.
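The improvement figures quoted above can be recovered from the reported group means. The sketch below uses only the pretest and posttest identification means given in the text; the names are illustrative. Note that this is descriptive arithmetic, whereas the reported t tests came from the mixed model. Averaging the two trained groups' gains without weighting by group size happens to reproduce the reported 16.8 points, but whether the authors computed it that way is an assumption.

```python
# Pretest and posttest identification means (% correct, across word type),
# as reported in the text. Names are illustrative.
ID_MEANS = {
    "IDG": (57.4, 79.7),
    "DISG": (55.8, 67.1),
    "CG": (61.1, 66.8),
}

# Gain in percent points, rounded to one decimal as in the paper.
gains = {group: round(post - pre, 1) for group, (pre, post) in ID_MEANS.items()}

# Assumption: the combined trainee gain is the unweighted mean of the two groups.
trained_gain = round((gains["IDG"] + gains["DISG"]) / 2, 1)

print(gains)
print(trained_gain)
```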
Discrimination results
Table 4 presents the % correct discrimination per group at pretest, posttest, posttest2 for CG, and delayed test, for nonword and real word stimuli (see footnote 1). The results are presented graphically in Figure 2, which includes confidence intervals. Table 5 shows the outcome of the logistic mixed effects model, which mirrors the results obtained in the identification test, with the addition of a significant effect of word type. Thus, test yielded a significant main effect (p < .001), reflecting the increase in discrimination accuracy from pretest (74.9%) to posttest (81.1%) and delayed test (83.5%), across groups and word type (see footnote 4). Discrimination was more accurate in real words (83.1%) than in nonwords (77.2%), reaching significance at the p < .05 level. Group did not yield a significant main effect, but the test by group interaction was significant (p = .028), and there was also a significant interaction between test and word type (p = .018; see Table 5 for details). The results of all the Tukey-corrected pairwise comparisons exploring these interactions are given in Appendix B. With respect to the test by word type interaction, real words obtained significantly higher accuracy scores than nonwords at posttest (p = .019), posttest2 (p = .016), and delayed test (p = .003), but not at pretest (p = .161). The interaction between test and group is explained by several facts. First, as was found for identification, the trained groups’ scores at the posttest were significantly higher than at the pretest (IDG: pretest = 74.2%, posttest = 82.8%; DISG: pretest = 74.3%, posttest = 82.2%; p < .001 in both cases), but there was no significant difference between pre- and posttest results for CG (pretest = 76.5%, posttest = 78.8%, p = .604). CG’s scores improved significantly after training (posttest2 = 83.3%), p = .043.
On the other hand, the results of the delayed test did not differ from posttraining scores, showing that the improvement was maintained four months after training had ended for all three groups. Between-group comparisons showed that there was no significant difference between any of the groups at any test time (see Table B9 in Appendix B). As was done for the identification results, the test by group interaction was explored further by comparing group results in terms of the difference between pretest and posttest. Again, the two trained groups together (IDG+DISG) were compared to CG, and IDG and DISG were also compared. The results revealed that the trainees’ improvement from pretest to posttest (IDG and DISG together) was significantly greater than CG’s (8.2 and 2.4 percentage points, respectively, t = -2.75, p = .0061) and that DISG’s and IDG’s improvement did not differ significantly (8.6 and 7.8 percentage points, respectively, t = 0.37, p = .711). In brief, the results show that IDG and DISG, but not CG, improved significantly from pretest to posttest and that the trained groups outperformed CG, but did not differ from one another.
Finally, to examine whether improvement in discrimination and improvement in identification were related at an individual level, a Pearson’s correlation was conducted on each individual’s improvement in each type of task. Specifically, the difference between pre- and posttraining tests in percentage points was calculated for each participant in each task across nonword and real-word stimuli. For IDG and DISG, improvement reflects the difference between posttest and pretest, while for CG the difference between posttest2 and posttest was calculated. The results indicated that improvement in the two measures was significantly correlated, r = .324, N = 76, p = .004, as illustrated by the scatterplot in Figure 3.
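As an illustration of how such a correlation and its significance can be checked, the sketch below computes Pearson's r for a pair of gain-score lists and derives a test statistic from a reported r and N. It is a hedged example rather than the procedure used in the study: the function names are illustrative, the gain scores are invented, and the p value relies on the Fisher z normal approximation instead of the exact t distribution, so it only approximates the published value.

```python
import math
from statistics import NormalDist


def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t value for testing r against zero, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r * r))


def fisher_z_p(r, n):
    """Two-tailed p value from the Fisher z approximation (not the exact t test)."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Reported values: r = .324, N = 76.
t_reported = t_statistic(0.324, 76)
p_approx = fisher_z_p(0.324, 76)

# Hypothetical gain scores (percentage points) for five listeners, for illustration only.
id_gain = [5.0, 12.0, 20.0, 8.0, 15.0]
dis_gain = [2.0, 6.0, 9.0, 7.0, 4.0]
r_example = pearson_r(id_gain, dis_gain)

print(round(t_reported, 2), round(p_approx, 4), round(r_example, 3))
```

With the reported r and N, the approximation yields a two-tailed p value close to .004, consistent with the result stated above.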
Discussion
Identification and discrimination of L2 vowels
The first goal of this paper was to compare the effect of ID and DIS training on both the discrimination and the identification of L2 vowels. In accordance with our predictions, both ID and DIS training had a positive effect, as shown by the significant differences between the pretest and posttest for IDG and DISG in both identification and discrimination accuracy, and the fact that the trained groups outperformed CG in both perceptual tasks in terms of amount of improvement. The control group improved numerically from the pretest to the posttest but not significantly. Recall that pre- and posttests included new nonwords and real words and thus the improvement from pre- to posttest constitutes a measure of generalization, as discussed below. Thus, the first finding of the current study is that both training methods (ID and categorical DIS) were effective in enhancing not only the object of training (ID for IDG and DIS for DISG) but also the untrained tasks (DIS for IDG and ID for DISG). These results are in line with previous research reporting that ID can improve categorical discrimination (Iverson et al., 2012; Wayland & Li, 2008), and categorical DIS improves identification (Flege, 1995; Nozawa, 2015; Carlet & Cebrian, 2022), and with a previous study that compared identification training with a discrimination training that included both auditory and categorical tasks and found reciprocal effects (Shinohara & Iverson, 2018). Studies that report no effect of DIS on either identification or discrimination (e.g., the lack of generalization effects in Strange & Dittmann, 1984), or of ID on discrimination (Lengeris & Hazan, 2010), made use of DIS tasks that involved synthetic stimuli and short interstimulus intervals.
Thus, no cross-task improvement was found when training relied on sensory information and low-level processing. In this sense, the results of earlier studies support the idea that when listeners perform the same/different discrimination task, they rely on short-lived sensory information that is useful for determining if two stimuli are physically identical or not but is less conducive to the development of long-term memory representations for L2 categories (Flege, 1995). Identification training, particularly when using multiple stimuli from the same category, increases listeners’ sensitivities to the common properties that make a given category distinguishable from other categories and thus promotes the formation of more robust long-term memory representations of those categories (Jamieson & Morosan, 1986). In contrast to an auditory DIS task that relies on sensory information (e.g., Strange & Dittmann, 1984), a categorical DIS task includes multiple tokens from the same category (e.g., from different talkers or different productions by the same talker), and involves evaluating if the stimuli presented are sufficiently different to belong to separate categories. Thus, it is possible that when performing a categorical DIS task, listeners may in fact be identifying each stimulus in the pair individually before determining if they are the same or different sounds. This possibility was suggested by Shinohara and Iverson (2018), the only previous study to our knowledge to have contrasted the effect of ID and DIS training (using a combination of auditory and categorical discrimination tasks) on both measures in a single study and whose results coincide with the current results. These authors explained that DIS trainees may covertly label the phonemes when performing a categorical discrimination task.
Recall that in a categorical DIS task same trials include two physically different stimuli from the same category. Thus, precisely because sameness cannot be judged on the basis of physical identity and stimuli may be identified prior to being compared, a categorical DIS task may encourage a similar level of phonological encoding to that of ID tasks. As previously discussed, ID involves determining to which of a set of internal representations a given stimulus belongs and entails a higher-level phonological encoding (Flege, 1995; Logan & Pruitt, 1995). Hence both ID and categorical DIS tasks may involve similar levels of processing that enhance the formation of more robust L2 categories (Polka, 1992; Flege, 1995; Wayland & Li, 2008). An alternative view may also be considered, that is, the possibility that training identification may enhance discrimination since identifying a given vowel category correctly implies distinguishing it from perceptually close vowels (i.e., tokens of /æ/ and /ʌ/ stimuli cannot be identified correctly unless the listener perceives these two vowels as different). Therefore, ID trainees may have been implicitly trained on discrimination by being presented with confusable categories such as /æ/ and /ʌ/ across trials in the ID task, which can explain why ID training is as effective as DIS training in improving discrimination (see footnote 5). In any event, the significant correlation between identification and discrimination improvement found in the current study (also reported by Shinohara and Iverson, 2018) supports the idea that ID and categorical DIS tasks may involve comparable strategies on the part of the listener and may enhance similar abilities and be mutually beneficial.
The second main finding of the study is that while both IDG and DISG experienced a very similar improvement in discrimination as a result of their respective training regimes, IDG clearly outperformed DISG in identification accuracy after training. This seems to point to an asymmetry between the two tasks, as ID training was found to improve identification accuracy more than DIS training, but DIS did not improve discrimination more than ID. This outcome was not expected, as we predicted similar cross-task effects given the assumption that both tasks involve similar processes. The better results for IDG with identification could be partly explained by procedural learning, that is, the result of task familiarity, as IDG outperformed DISG precisely in identification. However, if differences between the two methods were simply explained by task familiarity, we would also expect DISG to outperform IDG in the discrimination results, and this was not the case. The current results are in fact in agreement with previous research that has compared ID and DIS training within the same study and has shown an advantage of ID for vowel identification (Nozawa, 2015; Carlet & Cebrian, 2022). ID’s superiority in identification may stem from methodological differences between the two tasks. Recall that while the two training regimes used the exact same number and set of stimulus words, ID involved twice the number of trials, as a single word was presented at a time. Thus, given that feedback was provided after each trial, ID offered twice the amount of feedback, and of a more specific kind (the vowel identity). Even if categorical DIS may involve some level of identification to assess whether the two physically different stimuli share an identity, the DIS task itself does not consist of labeling the stimuli using one of several options presented.
Hence, the greater number of trials and consequently greater opportunities for feedback, together with the use of a more explicit type of feedback in ID training, may account for ID’s greater benefit on identification over DIS training.
Regarding the control group, the nonsignificant numerical improvement observed from the pretest to the posttest may be attributed to task familiarity as well as to continuous exposure to the target language as participants were undergraduate students majoring in English. CG’s main and significant improvement occurred after undergoing DIS+ID training, reaching accuracy levels comparable to those of IDG in identification and to both groups in discrimination. This would indicate that combined use of tasks may be as efficient as the use of ID alone, although to fully assess this possibility a different study design would be necessary with DIS+ID training and ID training being implemented in parallel.
Measures of robust learning: generalization and retention
The results of the posttests indicate that both types of perceptual tasks promoted generalization of learning as testing stimuli involved new voices and new words (recall that tests included stimuli and talkers not heard during training). Thus, training with nonwords resulted in an improvement of vowel perception in untrained nonwords as well as real words. This finding is in line with our predictions based on recent studies suggesting that the use of nonwords allows focusing on the phonetic form and avoids word familiarity effects as well as lexical and orthographic biases (Thomson & Derwing, 2016; Fouz-González & Mompeán, 2021; Ortega et al., 2021). This potential advantage of nonword stimuli, however, may not be unconstrained. Mora et al. (2022) reported that the advantage of training with nonwords over real words disappeared when background noise was added: the use of masking noise in an immediate repetition task in production training revealed a detrimental effect of noise with nonword stimuli but not with real word stimuli. According to the authors, the presence of noise hinders the focus on phonetic form, which is precisely what makes nonword stimuli advantageous.
On the other hand, participants were more successful in identifying and discriminating target vowels in real words than in nonwords (74.4% identification accuracy with real words vs. 65.2% for nonwords, and 83.1% discrimination accuracy with real words vs. 77.2% with nonwords, across groups and tests) although the difference reached statistical significance only with discrimination. This general real-word advantage in L2 vowel perception was also expected in light of earlier findings that show better perception of L2 sounds in real words than in nonwords (Mora, 2005; Rato & Carlet, 2020) and that suggest that word knowledge and lexical representations play a role in L2 segmental perception (e.g., Yamada et al., 1997). Nevertheless, given that all stimuli were presented in isolation, and that real word stimuli involved minimal pairs (e.g., bead, bid, bed, bad, bud, bard, bird, see Appendix A), it remains to be assessed how lexical status may be an advantage with words that are likely phonetically confusable and possibly stored with ambiguous or neutralized lexical representations (Darcy & Holliday, 2019). Exploring the relationship between phonological and lexical representations lies beyond the scope of the present paper, but the finding that learning acquired through training with phonetically-oriented stimuli (nonwords) transfers to real word perception underscores the efficacy of the training methodology.
Regarding long-term retention, the results of a delayed test, completed four months after training ended, consistently replicated the posttraining results for all groups, with mean identification and discrimination accuracy values that were always very close to, and never differed significantly from, posttraining scores. These results show evidence of long-term retention after a longer period than in most previous studies (generally up to two or three months, see Thomson, 2018). Lively et al. (1994) reported that Japanese learners of English retained the improvement in their identification of the /r/-/l/ contrast three months after training, but signs of decline were observed after six months. Iverson and Evans (2009) found that L2 learners were able to retain their improvement in vowel identification an average of four months after training. Longer retention has been reported in a study testing American English speakers’ perception of Mandarin tones, where improvement was still evident six months after the posttest (Wang, Spence, Jongman & Sereno, 1999). Longer retention may be more likely with suprasegmental phenomena like tone than with segmental contrasts, an issue that remains to be explored (Thomson, 2018). The evidence of both generalization and retention of learning indicates that both ID and DIS training can successfully trigger the development of L2 categories that are robust enough to perceive L2 sounds accurately, disregarding interstimulus variations resulting from talker or speech rate differences that are irrelevant to category identity (Flege, 1995). Importantly, retention and generalization were not only found for the perceptual task that was the object of training (ID for IDG and DIS for DISG) but equally for both perceptual tasks for all groups.
The current results did not show a tendency for DIS trainees to show more evidence of retention than ID trainees in identification, unlike two previous studies (Flege, 1995; Carlet & Cebrian, 2022). Flege (1995) found that discrimination trainees’ scores in ID accuracy at delayed tests had in fact increased while ID trainees’ scores had decreased a little. In addition, DIS training, but not ID training, was found to improve the identification of sounds that are present in the stimuli but not the focus of training (Carlet & Cebrian, 2022). These outcomes were interpreted to indicate that DIS training is more efficient at consolidating learning. The current study did not find such an advantage for DIS training, but it differs from previous studies in several ways. It tested both ID and DIS, not only ID, and focused on vowels (as opposed to consonants or both vowels and consonants); in addition, in the present study the delayed test was administered at a later time than in those previous studies (four months vs. two months after the posttest).
Final conclusions, limitations, and further research
The current study contrasted the effect of two perceptual training approaches (ID and DIS) on both L2 vowel identification and L2 vowel discrimination within the same study and included measures of robust learning (generalization and retention), thus allowing a thorough examination of the efficacy of the two methods. The results provide evidence of the suitability of both ID and DIS training for improving both the identification and the discrimination of L2 vowels. This finding is supported by strong generalization and retention effects, as training with nonwords generalized to new nonwords and real words, and to new voices, and retention of learning was evident four months after training had finished. ID appeared to be more successful than DIS in improving vowel identification accuracy, a finding that is in agreement with some previous research (Nozawa, 2015; Carlet & Cebrian, 2022). The cross-task effects are explained by the fact that identification and categorical discrimination may involve similar levels of processing given the presence of multiple stimuli, the long interstimulus interval used in DIS and, consequently, the likelihood that categorical DIS involves the identification of each member of the stimulus pair. On the other hand, the fact that both ID and DIS equally improve discrimination but ID training has a greater effect on identification than DIS training can be explained by crucial methodological differences between the two training regimes, namely the type and the amount of feedback obtained with identification. The current study contributes to the line of recent studies showing the efficacy of using nonword stimuli in HVPT for the learning of segmental contrasts, further supported by the transfer of this knowledge to real word stimuli.
The present study had a few limitations. First, although a total of 44 participants were recruited, the fact that participants were distributed among three groups and that only 38 completed all the tasks resulted in a relatively small sample for each training method (13 in IDG, 14 in DISG, and 11 in CG). Although significant effects of training were obtained, group differences were not always observed; such differences might have emerged with a larger sample. In addition, the study was limited to a subset of L2 vowels, which included challenging contrasts for the population under study, but it did not examine the whole vowel inventory, nor did it explore the identification and discrimination of consonant sounds. Regarding the training stimuli, the voicing of the final consonant was not completely controlled, as there were 40 words ending in a voiced obstruent and 48 in a voiceless obstruent. English vowels are known to be shorter preceding a voiceless consonant, which may have made vowels before a voiced obstruent easier to perceive. We expect the impact of this design problem on the overall training regime to have been minimal, given the small difference in the number of tokens per voicing condition and the fact that testing stimuli were appropriately balanced, but future research should address this and the previously mentioned limitations.
The current study adds to the wealth of research that generally supports the effectiveness of HVPT. These studies generally provide evidence of what can be achieved through phonetic training, but the question that remains is how training specifically affects the process of L2 category formation (Iverson & Evans, 2009; Shinohara & Iverson, 2018). In other words, it is unclear if improvement from pre- to posttraining actually reflects real changes in L2 categorization. It has been proposed that training may help learners to be more consistent and successful in using their existing categories to perceive their L2 sounds without necessarily altering the learners’ internal representations of L2 categories (Iverson, Hazan & Bannister, 2005; Shinohara & Iverson, 2018). Iverson and colleagues found that Japanese learners of English became more consistent at using not only the primary acoustic cue (F3) but also an irrelevant or secondary cue for native speakers (F2) in their perception of the English /r/-/l/ contrast (e.g., Shinohara & Iverson, 2021). This is in line with Polka’s (1992) observation that learners who undergo identification training may learn to identify L2 sounds accurately by paying attention to characteristics that may help differentiate nonnative categories, but which may not be the properties attended to by native speakers. More research is needed to fully evaluate what truly changes as a result of phonetic training and to investigate if and how internal representations of L2 categories can be altered through phonetic training. For instance, some recent research points to the use of tasks aimed at changing cue-weighting, e.g., through cue enhancement or exaggeration, as possible inducers of actual category changes (Zhang et al., 2021b).
Finally, another aspect to be considered is the pedagogical potential of HVPT for pronunciation teaching and learning (Thomson, 2011). The current study shows that ID and categorical DIS tasks are successful methods of improving L2 vowel perception, although ID may be more suitable for training identification. Logan and Pruitt (1995) indicate that categorical DIS tasks might be more effective than identification tasks in the early stages of learning, when identification labels may not be fully understood and reliable. Carlet (2017) and Shinohara and Iverson (2018) suggest that a combination of both ID and DIS tasks might be beneficial as they may add variation and flexibility to the training regimes. In fact, a complete evaluation of training methods should also consider the learners’ reactions to the training methods. Flege (1995) reported that ID trainees felt that training was more enjoyable, interesting, and beneficial than DIS trainees did. Similar impressions are reported by Carlet (2017). Possibly, a full examination of training methodologies should take into consideration not only the objective efficacy of the method but also the subjective impressions of the learners undergoing training.
Supplementary material
The supplementary material for this article can be found at https://doi.org/10.1017/S0272263124000408.
Acknowledgements
This work was supported by Research Grant Nos. FFI2017-88016-P and PID2021-122396NB-I00 to the first author, No. PID2019-107814GB-I00 from the Spanish Ministries of Economy and Competitiveness and Science and Innovation, and research grant 2021SGR00544 to the Experimental Phonetics research group from the Catalan Agency for Management of University and Research Grants (AGAUR).
Competing interest
The authors declare none.
Appendix A Testing stimuli
Note: Two different productions per talker for bard and pet were used.
Note: * Four possible combinations: two talker orders T1-T2, T2-T1, and two vowel orders: V1-V2, V2-V1.
Appendix B Tukey pair-wise comparison results for each significant interaction.