1. Introduction
People use language to communicate and express meaning, and this meaning is essentially conveyed by vocabulary. Thus knowledge of vocabulary is fundamental to all language use, and so must be learned in some manner in order for learners to become communicative in a new language. However, the lexicons of most languages are very large. For example, Goulden, Nation & Read (1990) estimated there are 54,000 word families in English. Given that most word families have several members (e.g. stimulate, stimulated, stimulating, stimulates, stimulation, stimulative), this translates into many hundreds of thousands of individual word forms.[1] Even very proficient speakers will not know all of these words,[2] and Goulden et al. found that their New Zealand university undergraduates had an English vocabulary size of about 17,000 word families. This is still far out of reach for most second language (L2) learners, and it is not surprising that L2 teachers and textbook writers struggle with the sheer number of words that could be taught. What is required for pedagogical purposes is a description of the amount of vocabulary necessary to be functional in specific communicative contexts, rather than to match the language level of native speakers.
In order to generate such descriptions, two things are required. First, one must know what percentage of the vocabulary in a stretch of spoken or written discourse needs to be known by a learner in order for him or her to understand the discourse. This is known as lexical coverage. People can usually understand speech or writing even if there are a few unknown words, so 100% coverage is not typically necessary. But if too many words are unknown, comprehension is compromised and listening or reading becomes a chore. What percentage of words should be known? Most research suggests that coverage in the range of 95–98% is adequate for acceptable comprehension, or in other words, that acceptable comprehension can be achieved with 2–5% of the words unknown (e.g. Hu & Nation 2000; Laufer & Ravenhorst-Kalovski 2010; van Zeeland & Schmitt 2012). Second, when a coverage figure is established, one must determine how many specific words this corresponds to. For example, a typical finding is that 98% coverage in written texts corresponds to knowledge of about 8,000 word families (e.g. Nation 2006). We will refer to the number of words needed to meet a lexical coverage percentage in various communicative contexts as vocabulary size.
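The notion of lexical coverage defined above can be made concrete with a short calculation. The following Python sketch is only an illustration under simplifying assumptions: the function names, the regular-expression tokenizer, and the toy sentence are hypothetical, and words are matched as individual word forms rather than as word families, which is cruder than the family-based counting used in the studies discussed in this paper.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into simple word tokens (an assumed, crude tokenizer)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def lexical_coverage(text, known_words):
    """Return the percentage of running words (tokens) in `text` that appear in `known_words`."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token in known_words)
    return 100.0 * known / len(tokens)

# Toy example: ten running words, of which only "slept" is unknown.
sample = "The cat sat on the mat while the dog slept"
known = {"the", "cat", "sat", "on", "mat", "while", "dog"}
print(lexical_coverage(sample, known))  # prints 90.0
```

In this toy case one token in ten is unknown, so coverage is 90%, which is below the 95–98% range reported as adequate for comprehension.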
Another factor is pedagogical practicality. Since the number of words that can be taught explicitly in language classes is limited, studies would do well to include an empirical consideration of learners’ capacity to acquire new vocabulary incidentally, through exposure to reading and listening input. Cobb (2007), in a study using corpus coverage to calculate learning opportunities, showed that learners’ typically small vocabulary sizes of 2,500–3,000 word families can be partially explained by the very low rate at which words at subsequent frequency levels (i.e. 3,000+) occur in texts. Words must normally be met a certain number of times (a figure of 10 is often cited), but Cobb argues that this number of repetitions is typically not available for less frequent words. If his analysis is correct, then the notion that written language comprehension depends on lexicons on the scale of 8,000 word families amounts to a discouraging picture for learners, since relatively few of them will arrive at that figure. However, learners may well be able to cope with lower levels of lexical coverage/vocabulary size than the research suggests. They may be able to do this through using resources like dictionaries and various forms of online support. If so, the coverage/size figures may be set higher than necessary for real-world use, and new ‘resource aided’ figures need to be developed.
It is crucial to have good estimates of the vocabulary sizes necessary to be functional in specific contexts and uses of a language, because these estimates form learning targets for language students. An estimate that is too low could lead to a lowering of pedagogical goals such that learners would not acquire a vocabulary large enough to make competent language use possible. An estimate that is too high would be unnecessarily demotivating for learners, and may include words that are so infrequent that they have little practical utility in normal language use. There are a limited number of studies informing these essential size targets, and so it is vital to replicate and expand upon the ones we have. This paper will suggest replication of studies of lexical coverage (van Zeeland & Schmitt 2012), vocabulary size (Nation 2006), and plausible learnability (Cobb 2007) in order to develop a more reliable, nuanced, and ecologically valid understanding of the amount of vocabulary learners need to acquire in order to become proficient language users in their chosen domains.
2. The original studies and suggested approaches to replication
2.1 Lexical coverage of spoken discourse (van Zeeland & Schmitt 2012)
Most research on lexical coverage in relation to L2 comprehension has been conducted on reading, and we now have a fairly good idea of the percentage of vocabulary that needs to be known to allow comprehension of written text. The earliest research in this area indicated that 95% lexical coverage was needed (Laufer 1989). In real terms, 5% unknown vocabulary equates to about one unknown word in roughly every two lines of text, and therefore over 15 unknown words on every page. Thus, it is perhaps not surprising that subsequent research by Hu & Nation (2000) suggested a higher coverage figure closer to 98%. More recently, Schmitt, Jiang & Grabe (2011) investigated each percentage point of coverage between 90 and 100%, in an attempt to describe the overall relationship between coverage and comprehension. This revealed a linear relationship between the two, which suggests that the coverage level required depends on the degree of comprehension aimed for. Based on their data, if 60% comprehension is the goal, 98% lexical coverage is needed. Laufer & Ravenhorst-Kalovski (2010) support the idea of basing the required coverage level on the reading comprehension wished for (e.g. if a learner's goal is to read a second language novel for pleasure, then 100% comprehension may not be needed or worth the learning investment). These authors suggest two lexical coverage thresholds, depending on the definition of ‘adequate’ comprehension: 98% as the sufficient and 95% as the minimal. Based on the performance of their Israeli participants, the authors conclude that 95% coverage enables acceptable comprehension and is probably viable with some support (e.g. teacher or learner resources such as dictionaries) and that 98% coverage leads to successful comprehension by most learners, and is likely to enable independent reading.
Overall, the consensus is that about 98% is the lexical coverage which is most appropriate for most purposes involving written text.[3]
In contrast, there has been very little research into the lexical coverage required for listening. The main study to date has been van Zeeland & Schmitt (2012). They had their ESL (mixed L1) participants listen to four anecdotes told in the first person about people getting into unusual situations. The stories had various percentages of words replaced with nonwords (0, 2, 5, and 10%), so that percentages of known vocabulary in the stories were precisely 100, 98, 95, and 90% respectively. Participants’ comprehension was measured by a ten-item multiple choice test for each anecdote. The researchers found that the participants who knew 100% of the words in a story had a mean score of 9.62, those knowing 98% of the words scored 8.22, those with 95% had 7.65, and those with 90% knowledge scored 7.35. Overall, knowledge of greater percentages of the vocabulary in the stories led to better comprehension, and thus test scores. However, even at the 95% and 90% knowledge levels, the comprehension was still quite good in absolute terms and good enough for many practical purposes. There was no statistical difference in the test scores between the 95% and 90% knowledge levels, but the standard deviations were large at the 90% level, indicating that there was real variability in learners’ ability to cope with this low amount of lexical coverage. Van Zeeland & Schmitt thus concluded that 95% lexical coverage was the more reasonable criterion for adequate comprehension, because at this level, the performances were much more consistent among the participants.
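For readers planning a replication, the idea of fixing coverage by substituting nonwords can be sketched in a few lines of Python. This is a hypothetical illustration, not the procedure used by van Zeeland & Schmitt: the function name, the random choice of positions, the seed, and the nonwords themselves are all assumptions made here, and a real study would presumably select the replaced words with more care (for example, avoiding function words).

```python
import random

def replace_with_nonwords(tokens, percent_unknown, nonwords, seed=0):
    """Return a copy of `tokens` in which `percent_unknown` percent of the
    positions (chosen at random, an assumption of this sketch) hold nonwords."""
    rng = random.Random(seed)  # fixed seed so the manipulation is reproducible
    n_replace = round(len(tokens) * percent_unknown / 100)
    positions = rng.sample(range(len(tokens)), n_replace)
    result = list(tokens)
    for i, position in enumerate(positions):
        # Cycle through the supplied nonwords if there are fewer of them than positions.
        result[position] = nonwords[i % len(nonwords)]
    return result

# Toy passage of 100 tokens; the four conditions mirror 100, 98, 95, and 90% coverage.
passage = ["word%d" % i for i in range(100)]
hypothetical_nonwords = ["blurg", "fint", "narp"]
for percent in (0, 2, 5, 10):
    version = replace_with_nonwords(passage, percent, hypothetical_nonwords)
    unknown = sum(1 for original, new in zip(passage, version) if original != new)
    print(percent, 100 - unknown)
```

Each line of output pairs the percentage of nonwords with the resulting percentage of known tokens for that version of the toy passage.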
The van Zeeland & Schmitt study is a good start, but replications could usefully address its inevitable limitations. We suggest approximate replications of this study (Porte 2012), where certain variables might be changed to determine how generalizable the original paper's results are and to either strengthen or challenge the conclusions of that paper. Van Zeeland & Schmitt used informal narratives of about two minutes length in their study, with repeated listenings, and they acknowledge that their results might be viewed as ‘best-case’ performance. The most obvious variable to start exploring is the type of listening. Narratives typically have a straightforward chronological structure, which should make listening easier than, say, a lecture or a detailed explanation. This is especially true because listeners rely more on top-down processing than readers (Lund 1991; Park 2004). This suggests that listening comprehension may be largely based on factors such as world knowledge and topic familiarity. Such top-down information is believed to be compensatory in use, in the sense that it can be employed strategically by listeners to compensate for inadequate knowledge of the L2 or an inability to recognize words in continuous speech (Field 2004; Vandergrift 2011). Thus passages with more obvious organization (such as narratives) should be easier to comprehend when listening, and indeed narratives have been found to be the most comprehensible genre for listeners (Rubin 1994). It is an open question whether types of discourse with a less obvious organization (e.g. everyday chat, jokes, political speeches) can also be comprehended with 95% coverage, and these types should be explored. It is not always obvious a priori whether they might require higher or lower lexical percentages for comprehension.
For example, everyday chat, with its numerous digressions and topic changes, might require a higher lexical coverage for comprehension. Conversely, the greater opportunities for questions and clarification might allow comprehension with a lower percentage of coverage.
Another variable which could be usefully explored is the length of listening. Van Zeeland & Schmitt's passages were relatively long, in experimental terms, at about two minutes. However, many real-world listening contexts, such as attending to academic lectures, political speeches, and radio talk shows require much longer periods of concentration. It would be interesting to determine whether truly extended listening contexts (20+ minutes) would be comprehensible with 95% coverage, or whether the unknown words would eventually begin to affect comprehension, or at least make listening onerous. On the other hand, it might be that as the sense of the message begins to accumulate, more top-down processing can come into play and listening becomes easier.
Another decision made by the researchers was to allow the participants to listen to the passages twice, in order to avoid memory affecting the comprehension results. However, as most listening is a one-off affair, it would be interesting to know whether single listenings would also be consistently comprehensible at 95% coverage. A straightforward way of assessing this would be to repeat the van Zeeland & Schmitt study, but have some participants listen to the passages twice, as in the original study, and others only once. If the single listeners do much worse, then this might indicate that the 95% coverage figure is too optimistic, and the coverage figures from the single listenings might be a more appropriate indication of the necessary lexical coverage required for non-interactive listening.
Finally, the researchers decided not to allow participants to ask questions. This is convenient for research (as a way of equalizing the participants, since some will ask more questions than others due to personality factors) but it comes at some cost in ecological validity. Only a small proportion of real-life interpersonal listening takes place with no option to interact or ask questions (although this clearly is the case with media exposure such as listening to radio, TV, movies, online lectures, YouTube videos, etc.). In any case, the study should be replicated under more typical interpersonal conditions, which could be predicted to lower the coverage needed to a point below 95%.
2.2 Vocabulary size (Nation 2006)
Perhaps the most important vocabulary size figures to establish definitively are those relating to L2 learners’ ability to be functional in general English in both the written and spoken modes. Once established, these figures should inform all non-specialized (e.g. non-English for Special Purposes (ESP)) English teaching pedagogy and materials design. To date the most influential paper in this area is undoubtedly Nation's 2006 study. Using a mini-corpus of five English novels (Lord Jim, Lady Chatterley's Lover, The Turn of the Screw, The Great Gatsby, and Tono-Bungay), Nation calculated that a learner would be required to know about 4,000 of the most frequent word families plus proper nouns to reach 95% lexical coverage, and around 8,000–9,000 families plus proper nouns to reach 98% coverage. He found similar figures for a corpus of newspapers. Turning to unscripted spoken English, Nation used two parts of the Wellington Corpus of Spoken English (n.d.). One part included talk-back radio, where listeners phone in with their spontaneous comments on the issue being discussed, and the other was made up of friendly conversation between family members and friends. Nation found that about 3,000 word families plus proper nouns provided more than 95% coverage, but that it took 6,000–7,000 word families to reach 98% coverage. He also investigated the movie Shrek, for which it took 4,000 word families plus proper nouns to reach 95% lexical coverage, and 7,000 to reach the 98% level. These figures are broadly in line with Webb & Rodgers’ findings based on the scripted talk in movies (2009a) and television (2009b).[4]
To be clear, this research did not involve actual learners with knowledge of the 4,000 or 8,000 highest frequency word families; nor did it involve measuring learners’ comprehension. Instead, it used frequency profiling of Nation's target text collections and corpora to determine the number of word families that learners would hypothetically need to know to achieve a particular level of known word coverage (e.g. 98%). Nevertheless, Nation's vocabulary size figures (based on 98% coverage) of 6,000–7,000 word families for spoken English and 8,000–9,000 for written English are very widely cited. Given the impact of his study, it is important to replicate it to confirm (or revise) those figures.
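The logic of frequency profiling described above can be sketched as follows. This is a minimal Python illustration and not Nation's actual software or lists: the function names and the toy data are hypothetical, each frequency band is represented simply as a set of word forms (standing in for a 1,000-family level), and the treatment of proper nouns as known words is omitted for brevity.

```python
def cumulative_coverage(tokens, bands):
    """Return cumulative coverage percentages, one per frequency band.

    `bands` is a list of sets of word forms ordered from most to least
    frequent; bands[0] stands in for the first 1,000 word families, etc.
    """
    if not tokens:
        return [0.0] * len(bands)
    # Map each word form to the first (most frequent) band that lists it.
    band_of = {}
    for level, band in enumerate(bands):
        for form in band:
            band_of.setdefault(form, level)
    counts = [0] * len(bands)
    for token in tokens:
        level = band_of.get(token)
        if level is not None:
            counts[level] += 1
    coverage, running = [], 0
    for count in counts:
        running += count
        coverage.append(100.0 * running / len(tokens))
    return coverage

def bands_needed(coverage, target):
    """Return how many bands are needed to reach `target` percent coverage, or None."""
    for index, percent in enumerate(coverage):
        if percent >= target:
            return index + 1
    return None

# Toy corpus of ten tokens and three hypothetical one-word "bands".
toy_tokens = ["the"] * 6 + ["house"] * 2 + ["gazebo", "zyx"]
toy_bands = [{"the"}, {"house"}, {"gazebo"}]
toy_coverage = cumulative_coverage(toy_tokens, toy_bands)
print(toy_coverage)                    # prints [60.0, 80.0, 90.0]
print(bands_needed(toy_coverage, 80))  # prints 2
```

In a replication, the vocabulary size figure is the number of bands returned for the 95% or 98% target multiplied by the band size (1,000 families in Nation's lists). A return value of None signals that the lists supplied do not reach the target for that corpus.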
An approximate replication approach also seems appropriate to address Nation's study as there are a number of variables that could usefully be manipulated. For initial replications, we propose to leave unchanged Nation's methodology for deriving vocabulary size, which uses his British National Corpus (BNC)-based word family lists as counting units, as it is well established and has proven its usefulness. What is needed are replications of Nation that use this same methodology but test much larger corpora of general English. Nation used rather small data sets in his influential study: the single novel Lady Chatterley's Lover (121,000 words), the five novels mentioned above taken together (474,000 words), a newspaper corpus (440,000 words), the script of the movie Shrek (10,000 words), and two parts of the Wellington Corpus of Spoken English (around 200,000 words). Nation's purpose was to determine the vocabulary sizes necessary to read and listen to general English in various contexts of use, so testing these various small sample corpora made sense. However, he conflated the individual results in order to come up with the overall vocabulary size figures discussed above. It is these global figures (6,000–7,000 word families for listening; 8,000–9,000 word families for reading) which now need to be checked with larger, more comprehensive corpora.
Two large current corpora against which the coverages of Nation's lists could usefully be tested are the Corpus of Contemporary American English (COCA) and the Corpus of Global Web-based English (GloWbE). The COCA was developed by Mark Davies and currently contains more than 450 million words, including 20 million words gathered each year from 1990 to 2012 (as of 22 August 2014). It is a balanced corpus, being equally divided among five genres/registers: spoken, fiction, popular magazines, newspapers, and academic journals. Importantly, it is not static, as it is updated at least twice each year, which promises to keep it current, instead of being a ‘snapshot’ of English at a single point in time like the BNC. The COCA is thus an excellent corpus from which to derive vocabulary size information. It is now available to be fully downloaded onto one's personal computer at http://corpus.byu.edu/coca/, which makes the suggested replications eminently feasible.
The GloWbE is a brand-new corpus (also created by Mark Davies and released in April 2013) consisting of 1.9 billion words from 1.8 million web pages in 20 different English-speaking countries. Given the importance of the internet for global communication and information transfer, it would be very interesting to determine how much vocabulary knowledge is necessary to comprehend this resource at the 95% and 98% levels of coverage, especially as it is so diverse and dynamic. It can be accessed at http://corpus2.byu.edu/glowbe/ and is also fully downloadable.
The BNC is another corpus possibility. At 100 million words, it is smaller than either the COCA or GloWbE. It is also becoming dated, as it was compiled from a range of sources in the latter part of the 20th century. However, its ten-million word spoken component is relatively large for an unscripted spoken corpus, and this could be useful for exploring the requirements for spoken English. The BNC can be consulted at www.natcorp.ox.ac.uk/ or http://corpus.byu.edu/bnc/.
Another variable that could be explored is the word lists which are used to interrogate the target corpora. Nation (2006) used a word list based on the BNC, but this British-English based metric may be due for updating and revision. Nation recently took a step in this direction by updating his original BNC-based frequency lists using COCA frequency information. His goal was to further increase the generality of his original lists by reducing their British bias and making them more applicable to both British and American contexts. (See Nation 2012a, available online, for details of the procedure.) The differences between the new and old lists are extensive and this has implications for the previously established coverage levels. Assuming the new combined BNC-COCA lists are a better indication of word frequency, then everything that has been done using the original BNC-based lists is ripe for replication using these new lists. Such replications may well change the established picture considerably. For instance, applying these more American lists could result in a downwards revision of the figure of 8,000, the vocabulary size needed to achieve the 98% coverage level according to Nation's (2006) investigation of materials that included the American novels The Turn of the Screw and The Great Gatsby and the American film script Shrek. In other words, in these replications, American-English texts would no longer be analysed using a British-English frequency scheme, and this might well reveal that the levels of coverage reported in Nation (2006) can be reached with smaller vocabulary sizes.
Another proposal pertaining to the counting unit used in investigations of coverage is to replace word families altogether, and use lemmas instead. Although lemma-based studies would not be comparable with the range of family-based studies the field is built upon, there would be a number of advantages. In particular, lemmas might be a suitable counting unit for research focusing on vocabulary pedagogy, as learners do not typically know all word family members (e.g. Schmitt & Zimmerman 2002; see Schmitt 2010, Section 5.2.1, for a full discussion of pros and cons of different counting units). A lemmatized frequency list of the complete COCA has recently been made available (up to the first 100,000 words) by Mark Davies on his COCA website (www.wordfrequency.info/intro.asp). A comparison study of vocabulary size using one of Nation's word family wordlists and Davies's lemma wordlist would be interesting indeed, and would help in interpreting how generalizable Nation's word family figures are for pedagogical purposes. Since lemmatized lists are more easily made automatically via alphabetical grouping than family lists (which require manual work, e.g. to add unhappy to the happy family), this avenue of research could also identify important efficiencies in creating new lists from ever-evolving and dynamic corpora.[5]
An extension to the vocabulary size issue would be to determine whether Nation's vocabulary coverage and size figures as calculated for general English also pertain to more specialized domain-specific contexts. That is, can we generalize his vocabulary requirements for general English to more specific domains, e.g. within Academic English or Professional English? A small number of studies (e.g. Hsu 2011; Dang & Webb 2014) suggest that this is not particularly straightforward. The studies show that size figures for the same levels of coverage differ depending on the degree of specificity. For example, Dang & Webb (2014) found that while 8,000 word families achieved 98% coverage of a general academic corpus, the subdiscipline requirements ranged from 5,000 (social sciences) to 13,000 (life and medical sciences). This implies that it is necessary to develop specific size requirements tailored to various domains and contexts. It is beyond the remit of this article on replication to give details about how to operationalize this extension of Nation's research. Instead, we direct readers’ attention to Hsu (2014), who provides a useful model of how this type of research might be carried out. Focusing on engineering, she profiled a 4.57 million-word corpus of English engineering textbooks. Using Nation's general BNC/COCA 25K lists, she found that knowledge of the first 2K plus proper nouns provided just 80.7% coverage of the corpus, and that 5,000 word families were needed to reach 95% coverage. Given that in Taiwan knowledge of 2,000 general service words is a high school graduation threshold, Hsu's aim was to develop a word list that would fill the 14.3% gap in coverage between the 2K level and 95% coverage. The resulting Engineering English Word List consists of just 729 word families, considerably fewer than the 3,000 general word families needed to achieve similar coverage.
This study is a good example of how the principles of coverage and size discussed in our article can be used to deliver very focused and pedagogically useful vocabulary learning targets for particular domains and learning contexts.
2.3 Coverage, size, and learning (Cobb 2007)
Size and coverage can be used to calculate not just comprehension, but also how much vocabulary it is possible or probable to learn from a particular text. Comprehension and learnability are interrelated, with the 95–98% coverage figures commonly cited as ‘lexical thresholds’ for both, although there is only one empirical finding that we know of to support the learning aspect. Swanborn & De Glopper (1999) undertook a meta-analysis of 20 studies of word learning in first language (L1) reading. They determined that known-unknown word ratios were a significant predictor of successful inferencing, and located some evidence for a threshold at one unknown word in 37 known (or, when 97% of the words are known). This coverage figure for successful inferencing is remarkably close to the 98% figure identified for L2 comprehension. However, beyond Swanborn & De Glopper, the coverage-inferencing-learning link is largely a common sense intuition at this point, and would be a fruitful area for research. But what seems sure is that learners with a typical vocabulary size of 2,000 word families will not comprehend or learn much new vocabulary from a text with more than 10% of its vocabulary beyond the most frequent 3,000 word families in English (3K), as is typically the case with all but simplified texts. Nor will they consolidate any correct inferences they do manage to make with texts that recycle these words only once or twice, again as is typically the case.
Cobb's study looked at this learnability issue and started from the question, ‘Can an adequate L2 reading lexicon be built from reading alone?’ His methodology made the following choices:
• He used Nation's (2006) BNC-based frequency lists as a source of learning objectives, in this case the third 1,000 word families.
• He hypothesized as the participant of his study a typical academic ESL learner with a vocabulary size of just over 2,000 words (2,112 word families, sd = 1,036, is a rough international average for academic ESL learners, according to Laufer's (2000) census of English for academic purposes (EAP) and ESP instructors in eight countries).
• He assumed a year of study as the learning period, one year being the typical allowance in ESP-EAP situations, or two in rare situations.
• He assumed a maximum total yearly reading diet of either the Press, Academic, or Fiction divisions of the Brown corpus (179,000; 163,000; and 175,000 word tokens respectively, any of which equals about six stories the size of Alice in Wonderland or about 25 academic studies). The nature of the sub-corpora was intended to represent roughly the types of texts academic learners might be assigned to read, and the amount to be a generous estimate of what such students actually would read (remembering that with lexicons of 2,500 words, such texts would be presenting at least one unknown word in ten and hence be rather arduous to get through).
The specific research question, then, was how many of the most frequent 3,000 word families in English (3K) are present in each of these collections, and in what coverage proportions, from the perspective of a learner with knowledge of just over 2,000 words. Random samples from each of these first, second, and third 1,000 levels of the BNC lists were matched against the contents of the three hypothesized reading diets. The finding was that while the first and second 1,000 word families are well represented in any of the diets, the third 1,000 families thin out rather dramatically in all of them, with only about half appearing even six times. (Eight to ten times seems to be the minimum figure for reliable incidental learning indicated by the research, e.g. Horst, Cobb & Meara 1998.) In other words, not much progress with the third 1,000 words could be expected from this presumably substantial exposure to natural (ungraded) text, at least not in the year or sometimes two that are normally available. By implication, then, pedagogies other than reading alone are needed to assure adequate progress toward the coverage objectives discussed in earlier parts of this paper.
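The core of this kind of learnability analysis, counting how often each target word family recurs in a given reading diet, can be sketched in Python. This is an illustrative approximation only and not Cobb's actual procedure or tooling: the function name, the toy families, and the toy token list are hypothetical, and the default threshold of six encounters simply echoes the figure mentioned above.

```python
from collections import Counter

def share_met_often_enough(tokens, families, min_encounters=6):
    """Return the share of target families met at least `min_encounters` times
    in `tokens`, together with the per-family encounter counts.

    `families` maps a headword to the set of word forms counted as that family.
    """
    form_to_family = {form: head for head, forms in families.items() for form in forms}
    counts = Counter(form_to_family[token] for token in tokens if token in form_to_family)
    met = [head for head in families if counts[head] >= min_encounters]
    share = len(met) / len(families) if families else 0.0
    return share, counts

# Toy reading diet: "stimulate" forms occur six times in total, "thin" forms three times.
toy_families = {
    "stimulate": {"stimulate", "stimulated", "stimulating"},
    "thin": {"thin", "thins", "thinned"},
}
toy_diet = ["stimulate"] * 4 + ["stimulated"] * 2 + ["thin"] * 3 + ["other"]
share, counts = share_met_often_enough(toy_diet, toy_families)
print(share)                # prints 0.5
print(counts["stimulate"])  # prints 6
```

Raising `min_encounters` to 8 or 10, the range associated with reliable incidental learning, would lower the share further for the same diet. Replications could also vary the diet itself or the sampled frequency level.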
This somewhat pessimistic finding about vocabulary growth from reading is controversial to say the least. First, it goes against the view, once almost universally held and still common amongst practitioners, that contextual inference from self-selected input is sufficient for all levels of vocabulary development (e.g. Krashen 1985). Second, it is a type of study that has not been undertaken before. Nation (2014: 1) called it a ‘notable exception’ to a general lack of corpus-based studies of the feasibility of learning large amounts of foreign language vocabulary through reading, and, as already mentioned, it is novel in attempting to extend coverage analysis from comprehension to acquisition. Thus, for reasons of both controversy and originality, this study's findings will need substantial further investigation before they are accepted, ideally in the form of replication studies rather than just commentary and discussion. The replication should ideally be of two types: one that varies the data of Cobb (2007) and another that varies the assumptions about how or how much learners can read. Varying the data might involve, for example, pitching larger samples of second- and third-thousand words or other levels of words against different and possibly more representative corpora; varying the assumptions might involve basing calculations on different ideas about the amount of reading the hypothesized learners are able to perform in a year, or even empirical evidence, which would amount to replacing hypothetical with real learners. In fact, both types of replications of Cobb (2007) have already been undertaken, and these will now be summarized and any remaining questions identified.
On the corpus side, Nation in a series of conference presentations (2012b), a working paper entitled ‘How much input do you need to learn the most frequent 9,000 words?’, and a published paper with the same title (2014), vastly extended the scale of Cobb's (2007) corpus analysis, pitching each of the first nine k-levels against a 3-million token corpus of novels and other types of corpora, discovering in some detail the number of each level of words that would be encountered per unit of time assuming different reading rates. For example, if 300,000 words of the novels corpus were read, then 830 of the third 1,000 most frequent words would be met an average of 12.6 times (and so on up to just under 3 million words for most of the ninth 1,000), which is roughly in line with Cobb's proposal that with 167,000 words an insufficient number of third 1,000 words would be met for systematic acquisition. Nation, however, went on to propose that 300,000 words of reading is, in fact, possible, provided that (1) the reading goals were set higher than they are now (although ‘there is no published research to support [these proposed reading figures] for learners of English as a foreign language’, p. 7); and (2) the texts involved ‘were at the right level for [the learners] so that the target words would make up around 2% or less of the running words in the text’ (p. 7) with unknown words therefore met in a ratio of about one unknown in 50 known. Natural text, however, in most cases will not provide unknown words in such a friendly ratio, so the point is moot. For a learner who knows 2,000 word families, even a novel such as Lady Chatterley will comprise almost 7% post-fourth 1,000 items, such that unknown words will be met in a ratio of at least one unknown in 15 (roughly one per two lines of text).
Further, many types of texts are even more lexically sophisticated than the novels on which these figures are based, and others come in formats less inference-friendly than the chronological flow of everyday events which typifies novels. To summarize, then, the corpus part of Nation's replication confirms and extends Cobb's initial and, in retrospect, pilot-level work, but the proposal part simply highlights the need for empirical work on how much of what type of texts learners can actually read.
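The unknown-word ratios quoted in this discussion follow from simple arithmetic on the share of unknown running words. The sketch below is illustrative only: the function name is hypothetical, and it assumes that unknown words are spread evenly through a text, which natural text will not guarantee.

```python
def running_words_per_unknown(unknown_share: float) -> float:
    """Average number of running words per unknown word.

    `unknown_share` is the proportion of running words assumed to be
    unknown to the reader (e.g. 0.02 for 2%). Illustrative helper, not
    taken from any of the studies discussed.
    """
    if not 0 < unknown_share <= 1:
        raise ValueError("unknown_share must be in the interval (0, 1]")
    return 1 / unknown_share


# Shares taken from the text: about 2% for levelled texts,
# almost 7% for an unsimplified novel read with a 2,000-family lexicon.
for label, share in [("2% unknown", 0.02), ("7% unknown", 0.07)]:
    ratio = running_words_per_unknown(share)
    print(f"{label}: about one unknown word in every {ratio:.1f} running words")
```

On these assumptions, 2% unknown corresponds to one unknown word in 50 running words, and 7% to one in a little over 14, which is in line with the approximate ratios given above.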
This second type of replication, on the learner side, is the topic of McQuillan & Krashen's (2008) direct response to Cobb (2007). The form of the replication began with a review of existing literature on real rather than hypothetical L2 readers and their reading rates. These researchers questioned whether Cobb's proposed reading diet was actually particularly large for typical L2 learners, citing 11 reading-rate studies showing that real learners with lexicons of about 2,000 words can read a lot more than Cobb's 179,000 words of unsimplified text over a year, indeed rather more like 517,000 words. This would then mean that such learners would meet most of the third 1,000 target words enough times for learning and consolidation to occur.
Yet on actual inspection of McQuillan & Krashen's sources (Cobb 2008), the posited larger amount of reading turns out to be the reading of simplified materials, not of academic or otherwise authentic texts. Thus, while more can undoubtedly be read using simplified materials, this is no guarantee that the particular target lexis (the third 1,000 word families, or beyond) is actually present in such materials. Most existing simplified texts focus on the first 1,000–2,000 word families, although other targets would be possible (see below). In other words, reading rates of these learners for authentic texts are presently unknown, and a true empirical replication of the Cobb (2007) study with valid information about reading diet remains to be done.
Both the counting-up and the empirical types of replications are worth doing. Indeed, the studies by Nation and by McQuillan & Krashen are the two parts of the single replication that is needed: a corpus-based work-out of the approximate learning opportunities, across the mid-frequency zones, in a range of relevant text types, based on a validated assessment of how much and what type of text learners at different levels are actually able to read. It is quite likely that the two parts will be engaged separately and assembled subsequently, although they could be done together, perhaps in a new research paradigm that might be called corporo-empirical research. It is important, however, that this work should be performed. The question of L2 vocabulary growth, particularly with regard to the demands of advanced level reading, has been unresolved for decades, but has now come within range and is answerable, in principle, through research-informed corpus analysis.
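The counting-up half of such a replication can be sketched in a few lines. The example below is a minimal illustration, not the procedure used by Cobb or Nation: the function name, the tokenizer, and the `min_encounters` threshold are all hypothetical, and it matches individual word forms rather than word families, which a real analysis would need to group.

```python
import re
from collections import Counter


def band_encounters(text, target_words, reading_budget, min_encounters=6):
    """Count how often target-band words are met within a reading budget.

    text            -- the corpus as one string
    target_words    -- set of lower-case word forms in the target band
                       (e.g. the third 1,000); forms, not families
    reading_budget  -- number of running words the learner is assumed to read
    min_encounters  -- hypothetical threshold for 'enough' meetings
    """
    # Simple tokenizer (an assumption): lower-case alphabetic strings,
    # optionally with an internal apostrophe.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())[:reading_budget]
    counts = Counter(t for t in tokens if t in target_words)
    met = len(counts)
    return {
        "target_words_met": met,
        "met_often_enough": sum(1 for c in counts.values() if c >= min_encounters),
        "mean_encounters": sum(counts.values()) / met if met else 0.0,
    }


# Toy usage with made-up data; a real run would load a corpus and a band list.
sample = "the cat sat on the mat the cat ran"
print(band_encounters(sample, {"cat", "mat", "dog"}, reading_budget=9, min_encounters=2))
```

Varying `reading_budget` corresponds to varying the assumptions about how much learners can read, while swapping the corpus or the band list corresponds to varying the data.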
Within these two main lines of replication, a number of variants could be usefully incorporated within an approximate replication methodology that might modify or even reverse the findings of Cobb (2007). One possibility would be to vary the type of texts in the corpus. For example, a study might look at the learning opportunities in not just academic or other authentic texts but also in pedagogically modified texts. Indeed, Nation, in response to some of the issues raised above, has recently begun producing a complete set of graded readers that specifically target ‘mid-frequency’ vocabulary (that of the third to eighth 1,000 levels, as defined by Schmitt & Schmitt 2014), such that significant numbers of word families in target k-level zones are met frequently, and mainly in environments of 98% known words. This work is described in Nation & Anthony (2012), and in an undated information document by Nation (‘About mid-frequency readers,’ www.victoria.ac.nz/lals/about/staff/paul-nation). The first 13 of a projected 50 modified texts (nine fiction and four non-fiction), each simplified to a fourth, sixth, and eighth 1,000 word families target level, have been completed and are available on Nation's website. When completed, these 50 texts will form a corpus that will be eminently suitable for a combined corpus and empirical replication of Cobb (2007) or indeed of Nation (2014).
Another variation on the text/corpus side that could form the basis of a useful replication would be to vary both text genre and degree of prior familiarity with the topic area. Neither Cobb's choice of Press, Academic, or Fiction sections of Brown, nor Nation's choice of out-of-copyright novels, even with lexical redesign, is typical reading for today's academic ESL learner. A corpus of what such learners are in fact reading could provide a very interesting replication of Cobb (2007). A researcher might well find that academic ESL learners, reading in their domains, where they are building an accumulating knowledge base revolving around a limited number of themes (doing narrow reading), can indeed read larger amounts than those proposed by Cobb and can indeed build their lexicons substantially through reading. Variations on the learner/empirical side could also be undertaken, separately or in conjunction with variations on the corpus side. A particularly important element of the original study that could be usefully varied in an approximate replication is reading conditions. The assumption of convenience in Cobb (2007), as in all coverage studies that we know of including those discussed in earlier parts of this paper, is that the reading occurs in unassisted conditions. But this is no longer how very many people actually read, particularly young people and students, given the click-on definitions, text-to-speech renditions, and Google searches beckoning within the web pages and PDF files on their computers and iPads. It is almost certain that more difficult texts can be comprehended and more vocabulary learned from resource-integrated texts than from self-contained texts, but how much more?
Changing the condition from unassisted to resource-assisted reading, perhaps in potential interaction with various kinds or degrees of training, one might well find that learners with emergent lexicons of about 2,000 word families can indeed read texts like Lady Chatterley with comprehension, enjoyment, and substantial vocabulary growth. Indeed, in the information document accompanying these designed-for-learning readers, Nation (n.d.) proposes precisely this method of engaging with his learning-enhanced texts.
The interesting questions then become, ‘How many words does one need to know to read texts connected to online resources?’ and ‘How many words can one learn from reading texts connected to online resources?’ Cobb (2007) proposed that 170,000 words a year was a lot of reading for learners with 2,000 word families and that inferences of new word meanings would be hard going; but with resources, Nation's 300,000 or McQuillan & Krashen's 517,000 might well come in range, and easy look-ups make inferences straightforwardly confirmable. In other words, through these replications, Cobb's gloomy prognosis might be shown to be an artefact of a now largely defunct unassisted reading paradigm.
3. Conclusion
Knowing how much vocabulary is necessary to be functional in a language is crucial for setting vocabulary learning goals and designing syllabuses. Setting vocabulary size goals in which the English Language Teaching (ELT) community can be confident requires replication of both lexical coverage and vocabulary size studies, with related research into learning potential a useful adjunct. We can probably use the 95% and 98% coverage figures for written language, but for spoken discourse, there is a clear need to build upon the initial findings of van Zeeland & Schmitt (2012). Nation (2006) provides clear learning targets for vocabulary size, but given the very considerable teaching and learning effort these substantial figures entail (6,000–7,000 word families for spoken discourse and 8,000–9,000 for written discourse), it is important to determine if these figures still hold true for other corpora, or if perhaps additional research will point to somewhat lower requirements (if we are lucky). Finally, we need to move beyond a merely corpus-driven discussion of the lexical needs for using language, and start looking into what learners can actually do with various vocabulary sizes, and how this affects their further learning, as Cobb (2007) suggests. All of these replications should lead to a firmer establishment of vocabulary size requirements necessary to inform language pedagogy and assessment.
Acknowledgements
We wish to thank Mélodie Garnier, Benjamin Kremmel, Marijana Macis, Kholood Saigh, Michael Rodgers, Hilde van Zeeland, Laura Vilkaite and Paul Nation for their insightful comments on earlier versions of this paper.
Norbert Schmitt is Professor of Applied Linguistics at the University of Nottingham. He is interested in all aspects of second language vocabulary. He has published eight books (the latest being Researching vocabulary: A vocabulary research manual (2010, Palgrave Macmillan)), 50 journal articles, and 23 book chapters on various vocabulary topics. He currently sits on the editorial board of Language Testing. His personal website (www.norbertschmitt.co.uk) gives much more information about his research, and also provides a wealth of vocabulary resources for research and teaching.
Tom Cobb is Professor of Didactique des langues (or Applied Linguistics) at the University of Quebec at Montreal. He is interested in all aspects of second language vocabulary, focusing mainly on those that can be investigated or learned with the help of a computer. His Compleat Lexical Tutor website (www.lextutor.ca) has a host of resources for learners, teachers, and researchers as well as links to his research studies.
Marlise Horst is an associate professor in the Department of Education at Concordia University in Montreal, where she teaches courses in L2 vocabulary acquisition and the history of English for language teachers. Her current research explores opportunities to learn new L2 vocabulary via exposure to classroom input.
Diane Schmitt is a senior lecturer in EFL/TESOL at Nottingham Trent University and Chair of BALEAP. She teaches on the MA in English Language Teaching and also on a range of EAP courses. She has co-authored two textbooks on teaching vocabulary. Her areas of interest include: academic writing, plagiarism, vocabulary acquisition, language testing, materials development, and the international student experience.