How much vocabulary is needed to use English? Replication of van Zeeland &amp; Schmitt (2012), Nation (2006) and Cobb (2007)

Norbert Schmitt; Tom Cobb; Marlise Horst; Diane Schmitt

doi:10.1017/S0261444815000075

How much vocabulary is needed to use English? Replication of van Zeeland & Schmitt (2012), Nation (2006) and Cobb (2007)

Published online by Cambridge University Press: 02 March 2015

Norbert Schmitt ,

Tom Cobb ,

Marlise Horst and

Diane Schmitt

Show author details

Norbert Schmitt: Affiliation:
University of Nottinghamnorbert.Schmitt@nottingham.ac.uk
Tom Cobb: Affiliation:
University of Quebec at Montrealcobb.tom@uqam.ca
Marlise Horst: Affiliation:
Concordia Universitymarlise@education.concordia.ca
Diane Schmitt: Affiliation:
Nottingham Trent Universitydiane.schmitt@ntu.ac.uk

Article contents

Abstract
Introduction
The original studies and suggested approaches to replication
Conclusion
Footnotes
References

Rights & Permissions

Abstract

There is current research consensus that second language (L2) learners are able to adequately comprehend general English written texts if they know 98% of the words that occur in the materials. This important finding prompts an important question: How much English vocabulary do English as a second language (ESL) learners need to know to achieve this crucial level of known-word coverage? A landmark paper by Nation (2006) provides a rather daunting answer. His exploration of the 98% figure through a variety of spoken and written corpora showed that knowledge of around 8,000–9,000 word families is needed for reading and 6,000–7,000 for listening. But is this the definitive picture? A recent study by van Zeeland & Schmitt (2012) suggests that 95% coverage may be sufficient for listening comprehension, and that this can be reached with 2,000–3,000 word families, which is much more manageable. Getting these figures right for a variety of text modalities, genres and conditions of reading and listening is essential. Teachers and learners need to be able to set goals, and as Cobb's study of learning opportunities (2007) has shown, coverage percentages and their associated vocabulary knowledge requirements have important implications for the acquisition of new word knowledge through exposure to comprehensible L2 input. This article proposes approximate replications of Nation (2006), van Zeeland & Schmitt (2012), and Cobb (2007), in order to clarify these key coverage and size figures.

Type: Replication Studies
Information: Language Teaching , Volume 50 , Issue 2 , April 2017 , pp. 212 - 226

DOI: https://doi.org/10.1017/S0261444815000075 [Opens in a new window]
Copyright: Copyright © Cambridge University Press 2015

1. Introduction

People use language to communicate and express meaning, and this meaning is essentially conveyed by vocabulary. Thus knowledge of vocabulary is fundamental to all language use, and so must be learned in some manner in order for learners to become communicative in a new language. However, the lexicons of most languages are very large. For example, Goulden, Nation & Read (Reference Goulden, Nation and Read1990) estimated there are 54,000 word families in English. Given that most word families have several members (e.g. stimulate, stimulated, stimulating, stimulates, stimulation, stimulative), this translates into many hundreds of thousands of individual word forms.Footnote ¹ Even very proficient speakers will not know all of these words,Footnote ² and Goulden et al. found that their New Zealand university undergraduates had an English vocabulary size of about 17,000 word families. This is still far out of reach for most second language (L2) learners, and it is not surprising that L2 teachers and textbook writers struggle with the sheer number of words that could be taught. What is required for pedagogical purposes are descriptions of the amount of vocabulary which is necessary to be functional in specific communicative contexts, not to match the language level of native speakers.

In order to generate such descriptions, two things are required. First, one must know what percentage of the vocabulary in a stretch of spoken or written discourse needs to be known by a learner in order for him or her to understand the discourse. This is known as lexical coverage. People can usually understand speech or writing even if there are a few unknown words, so 100% coverage is not typically necessary. But if too many words are unknown, comprehension is compromised and listening or reading becomes a chore. What percentage of words should be known? Most research suggests that coverage in the range of 95–98% is adequate for acceptable comprehension, or in other words, that acceptable comprehension can be achieved with 2–5% of the words unknown (e.g. Hu & Nation Reference Hu and Nation2000; Laufer & Ravenhorst-Kalovski Reference Laufer and Ravenhorst-Kalovski2010; van Zeeland & Schmitt Reference van Zeeland and Schmitt2012). Second, when a coverage figure is established, one must determine how many specific words this corresponds to. For example, a typical finding is that 98% coverage in written texts corresponds to knowledge of about 8,000 word families (e.g. Nation Reference Nation2006). We will refer to the number of words needed to meet a lexical coverage percentage in various communicative contexts as vocabulary size.

Another factor is pedagogical practicality. Since the number of words that can be taught explicitly in language classes is limited, studies would do well to include an empirical consideration of learners’ capacity to acquire new vocabulary incidentally, through exposure to reading and listening input. Cobb (Reference Cobb2007), in a study using corpus coverage to calculate learning opportunities, showed that learners’ typically small vocabulary sizes of 2,500–3,000 word families can be partially explained by the very low rate at which words at subsequent frequency levels (i.e. 3,000+) occur in texts. Words must normally be met a certain number of times (a figure of 10 is often cited), but Cobb argues that this number of repetitions is typically not available for less frequent words. If his analysis is correct, then the notion that written language comprehension depends on lexicons on the scale of 8,000 word families amounts to a discouraging picture for learners, since relatively few of them will arrive at that figure. However, learners may well be able to cope with lower levels of lexical coverage/vocabulary size than the research suggests. They may be able to do this through using resources like dictionaries and various forms of online support. If so, the coverage/size figures may be set higher than necessary for real-world use, and new ‘resource aided’ figures need to be developed.

It is crucial to have good estimates of the vocabulary sizes necessary to be functional in specific contexts and uses of a language, because these estimates form learning targets for language students. An estimate that is too low could lead to a lowering of pedagogical goals such that learners would not acquire a vocabulary large enough to make competent language use possible. An estimate that is too high would be unnecessarily demotivating for learners, and may include words that are so infrequent that they have little practical utility in normal language use. There are a limited number of studies informing these essential size targets, and so it is vital to replicate and expand upon the ones we have. This paper will suggest replication of studies of lexical coverage (van Zeeland & Schmitt Reference van Zeeland and Schmitt2012), vocabulary size (Nation Reference Nation2006), and plausible learnability (Cobb Reference Cobb2007) in order to develop a more reliable, nuanced, and ecologically valid understanding of the amount of vocabulary learners need to acquire in order to become proficient language users in their chosen domains.

2. The original studies and suggested approaches to replication

2.1 Lexical coverage of spoken discourse (van Zeeland & Schmitt Reference van Zeeland and Schmitt2012)

Most research on lexical coverage in relation to L2 comprehension has been conducted on reading, and we now have a fairly good idea of the percentage of vocabulary that needs to be known to allow comprehension of written text. The earliest research in this area indicated that 95% lexical coverage was needed (Laufer Reference Laufer, Lauren and Nordman1989). In real terms, 5% unknown vocabulary equates to about one unknown word in roughly every two lines of text, and therefore over 15 unknown words on every page. Thus, it is perhaps not surprising that subsequent research by Hu & Nation (Reference Hu and Nation2000) suggested a higher coverage figure closer to 98%. More recently, Schmitt, Jiang & Grabe (Reference Schmitt, Jiang and Grabe2011) investigated each percentage point of coverage between 90 and 100%, in an attempt to describe the overall relationship between coverage and comprehension. This revealed a linear relationship between the two, which suggests that the coverage level required depends on the degree of comprehension aimed for. Based on their data, if 60% comprehension is the goal, 98% lexical coverage is needed. Laufer & Ravenhorst-Kalovski (Reference Laufer and Ravenhorst-Kalovski2010) support the idea of basing the required coverage level on the reading comprehension wished for (e.g. if a learner's goal is to read a second language novel for pleasure, then 100% comprehension may not be needed or worth the learning investment). These authors suggest two lexical coverage thresholds, depending on the definition of ‘adequate’ comprehension: 98% as the sufficient and 95% as the minimal. Based on the performance of their Israeli participants, the authors conclude that 95% coverage enables acceptable comprehension and is probably viable with some support (e.g. teacher or learner resources such as dictionaries) and that 98% coverage leads to successful comprehension by most learners, and is likely to enable independent reading. Overall, the consensus is that about 98% is the lexical coverage which is most appropriate for most purposes involving written text.Footnote ³

In contrast, there has been very little research into the lexical coverage required for listening. The main study to date has been van Zeeland & Schmitt (Reference van Zeeland and Schmitt2012). They had their ESL (mixed L1) participants listen to four anecdotes told in the first person about people getting into unusual situations. The stories had various percentages of words replaced with nonwords (0, 2, 5, and 10%), so that percentages of known vocabulary in the stories were precisely 100, 98, 95, and 90% respectively. Participants’ comprehension was measured by a ten-item multiple choice test for each anecdote. The researchers found that the participants who knew 100% of the words in a story had a mean score of 9.62, those knowing 98% of the words scored 8.22, those with 95% had 7.65, and those with 90% knowledge scored 7.35. Overall, knowledge of greater percentages of the vocabulary in the stories led to better comprehension, and thus test scores. However, even at the 95% and 90% knowledge levels, the comprehension was still quite good in absolute terms and good enough for many practical purposes. There was no statistical difference in the test scores between the 95% and 90% knowledge levels, but the standard deviations were large at the 90% level, indicating that there was real variability in learners’ ability to cope with this low amount of lexical coverage. Van Zeeland & Schmitt thus concluded that 95% lexical coverage was the more reasonable criterion for adequate comprehension, because at this level, the performances were much more consistent among the participants.

The van Zeeland & Schmitt study is a good start, but replications could usefully address its inevitable limitations. We suggest approximate replications of this study (Porte Reference Porte and Porte2012), where certain variables might be changed to determine how generalizable the original paper's results are and to either strengthen or challenge the conclusions of that paper. Van Zeeland & Schmitt used informal narratives of about two minutes length in their study, with repeated listenings, and they acknowledge that their results might be viewed as ‘best-case’ performance. The most obvious variable to start exploring is the type of listening. Narratives typically have a straightforward chronological structure, which should make listening easier than, say, a lecture or a detailed explanation. This is especially true because listeners rely more on top-down processing than readers (Lund Reference Lund1991; Park Reference Park2004). This suggests that listening comprehension may be largely based on factors such as world knowledge and topic familiarity. Such top-down information is believed to be compensatory in use, in the sense that it can be employed strategically by listeners to compensate for inadequate knowledge of the L2 or an inability to recognize words in continuous speech (Field Reference Field2004; Vandergrift Reference Vandergrift and Hinkel2011). Thus passages with more obvious organization (such as narratives) should be easier to comprehend when listening, and indeed narratives have been found to be the most comprehensible genre for listeners (Rubin Reference Rubin1994). It is an open question whether types of discourse with a less obvious organization (e.g. everyday chat, jokes, political speeches) can also be comprehended with 95% coverage, and these types should be explored. It is not always obvious a priori whether they might require higher or lower lexical percentages for comprehension. For example, everyday chat, with its numerous digressions and topic changes, might require a higher lexical coverage for comprehension. Conversely, the greater opportunities for questions and clarification might allow comprehension with a lower percentage of coverage.

Another variable which could be usefully explored is the length of listening. Van Zeeland & Schmitt's passages were relatively long, in experimental terms, at about two minutes. However, many real-world listening contexts, such as attending to academic lectures, political speeches, and radio talk shows require much longer periods of concentration. It would be interesting to determine whether truly extended listening contexts (20+ minutes) would be comprehensible with 95% coverage, or whether the unknown words would eventually begin to affect comprehension, or at least make listening onerous. On the other hand, it might be that as the sense of the message begins to accumulate, more top-down processing can come into play and listening becomes easier.

Another decision made by the researchers was to allow the participants to listen to the passages twice, in order to avoid memory affecting the comprehension results. However, as most listening is a one-off affair, it would be interesting to know whether single listenings would also be consistently comprehensible at 95% coverage. A straightforward way of assessing this would be to repeat the van Zeeland & Schmitt study, but have some participants listen to the passages twice, as in the original study, and others only once. If the single listeners do much worse, then this might indicate that the 95% coverage figure is too optimistic, and the coverage figures from the single listenings might be a more appropriate indication of the necessary lexical coverage required for non-interactive listening.

Finally, the researchers decided not to allow participants to ask questions. This is convenient for research (as a way of equalizing the participants, since some will ask more questions than others due to personality factors) but it comes at some cost in ecological validity. Only a small proportion of real-life interpersonal listening takes place with no option to interact or ask questions (although this clearly is the case with media exposure such as listening to radio, TV, movies, online lectures, YouTube videos, etc.). In any case, the study should be replicated under more typical interpersonal conditions, which could be predicted to lower the coverage needed to a point below 95%.

2.2 Vocabulary size (Nation Reference Nation2006)

Perhaps the most important vocabulary size figures to establish definitively are those relating to L2 learners’ ability to be functional in general English in both the written and spoken modes. Once established, these figures should inform all non-specialized (e.g. non-English for Special Purposes (ESP)) English teaching pedagogy and materials design. To date the most influential paper in this area is undoubtedly Nation's 2006 study. Using a mini-corpus of five English novels (Lord Jim, Lady Chatterley's Lover, The Turn of the Screw, The Great Gatsby, and Tono-Bungay), Nation calculated that a learner would be required to know about 4,000 of the most frequent word families plus proper nouns to reach 95% lexical coverage, and around 8,000–9,000 families plus proper nouns to reach 98% coverage. He found similar figures for a corpus of newspapers. Turning to unscripted spoken English, Nation used two parts of the Wellington Corpus of Spoken English (n.d.). One part included talk-back radio, where listeners phone in with their spontaneous comments on the issue being discussed, and the other was made up of friendly conversation between family members and friends. Nation found that about 3,000 word families plus proper nouns provided more than 95% coverage, but that it took 6,000–7,000 word families to reach 98% coverage. He also investigated the movie Shrek, for which it took 4,000 word families plus proper nouns to reach 95% lexical coverage, and 7,000 to reach the 98% level. These figures are broadly in line with Webb & Rodgers’ findings based on the scripted talk in movies (Reference Webb and Rodgers2009a) and television (Reference Webb and Rodgers2009b).Footnote ⁴

To be clear, this research did not involve actual learners with knowledge of the 4,000 or 8,000 highest frequency word families; nor did it involve measuring learners’ comprehension. Instead, it used frequency profiling of Nation's target text collections and corpora to determine the number of word families that learners would hypothetically need to know to achieve a particular level of known word coverage (e.g. 98%). Nevertheless, Nation's vocabulary size figures (based on 98% coverage) of 6,000–7,000 word families for spoken English and 8,000–9,000 for written English are very widely cited. Given the impact of his study, it is important to replicate it to confirm (or revise) those figures.

An approximate replication approach also seems appropriate to address Nation's study as there are a number of variables that could usefully be manipulated. For initial replications, we propose to leave unchanged Nation's methodology for deriving vocabulary size, which uses his British National Corpus (BNC)-based word family lists as counting units, as it is well established and has proven its usefulness. What is needed are replications of Nation that use this same methodology but test much larger corpora of general English. Nation used rather small data sets in his influential study: the single novel Lady Chatterley's Lover (121,000 words), the five novels mentioned above taken together (474,000 words), a newspaper corpus (440,000 words), the script of the movie Shrek (10,000 words), and two parts of the Wellington Corpus of Spoken English (around 200,000 words). Nation's purpose was to determine the vocabulary sizes necessary to read and listen to general English in various contexts of use, so testing these various small sample corpora made sense. However, he conflated the individual results in order to come up with the overall vocabulary size figures discussed above. It is these global figures (6,000–7,000 word families for listening; 8,000–9,000 word families for reading) which now need to be checked with larger, more comprehensive corpora.

Two large current corpora against which the coverages of Nation's lists could usefully be tested are the Corpus of Contemporary American English (COCA) and the Corpus of Global Web-based English (GloWbE). The COCA was developed by Mark Davies and currently contains more than 450 million words, including 20 million words gathered each year from 1990 to 2012 (as of 22 August 2014). It is a balanced corpus, being equally divided among five genres/registers: spoken, fiction, popular magazines, newspapers, and academic journals. Importantly, it is not static, as it is updated at least twice each year, which promises to keep it current, instead of being a ‘snapshot’ of English at a single point in time like the BNC. The COCA is thus an excellent corpus from which to derive vocabulary size information. It is now available to be fully downloaded onto one's personal computer at http://corpus.byu.edu/coca/, which makes the suggested replications eminently feasible.

The GloWbE is a brand-new corpus (also created by Mark Davies and released in April 2013) consisting of 1.9 billion words from 1.8 million web pages in 20 different English-speaking countries. Given the importance of the internet for global communication and information transfer, it would be very interesting to determine how much vocabulary knowledge is necessary to comprehend this resource at the 95% and 98% levels of coverage, especially as it is so diverse and dynamic. It can be accessed at http://corpus2.byu.edu/glowbe/ and is also fully downloadable.

The BNC is another corpus possibility. At 100 million words, it is smaller than either the COCA or GloWbE. It is also becoming dated, as it was compiled from a range of sources in the latter part of the 20th century. However, its ten-million word spoken component is relatively large for an unscripted spoken corpus, and this could be useful for exploring the requirements for spoken English. The BNC can be consulted at www.natcorp.ox.ac.uk/ or http://corpus.byu.edu/bnc/.

Another variable that could be explored is the word lists which are used to interrogate the target corpora. Nation (Reference Nation2006) used a word list based on the BNC, but this British-English based metric may be due for updating and revision. Nation recently took a step in this direction by updating his original BNC-based frequency lists using COCA frequency information. His goal was to further increase the generality of his original lists by reducing their British bias and making them more applicable to both British and American contexts. (See Nation Reference Nation2012a, available online, for details of the procedure.) The differences between the new and old lists are extensive and this has implications for the previously established coverage levels. Assuming the new combined BNC-COCA lists are a better indication of word frequency, then everything that has been done using the original BNC-based lists is ripe for replication using these new lists. Such replications may well change the established picture considerably. For instance, applying these more American lists could result in a downwards revision of the figure of 8,000 – the word size needed to achieve the 98% coverage level according to Nation's (Reference Nation2006) investigation of materials that included the American novels The Turn of the Screw and The Great Gatsby and the American film script Shrek. In other words, in these replications, American-English texts would no longer be analysed using a British-English frequency scheme, and this might well reveal that the levels of coverage reported in Nation (Reference Nation2006) can be reached with smaller vocabulary sizes.

Another proposal pertaining to the counting unit used in investigations of coverage is to replace word families altogether, and use lemmas instead. Although lemma-based studies would not be comparable with the range of family-based studies the field is built upon, there would be a number of advantages. In particular, lemmas might be a suitable counting unit for research focusing on vocabulary pedagogy, as learners do not typically know all word family members (e.g. Schmitt & Zimmerman Reference Schmitt and Zimmerman2002; see Schmitt Reference Schmitt2010, Section 5.2.1, for a full discussion of pros and cons of different counting units). A lemmatized frequency list of the complete COCA has recently been made available (up to the first 100,000 words) by Mark Davies on his COCA website (www.wordfrequency.info/intro.asp). A comparison study of vocabulary size using one of Nation's word family wordlists and Davies's lemma wordlist would be interesting indeed, and would help in interpreting how generalizable Nation's word family figures are for pedagogical purposes. Since lemmatized lists are more easily made automatically via alphabetical grouping than family lists (which require manual work, e.g. to add unhappy to the happy family), this avenue of research could also identify important efficiencies in creating new lists from ever-evolving and dynamic corpora.Footnote ⁵

An extension to the vocabulary size issue would be to determine whether Nation's vocabulary coverage and size figures as calculated for general English also pertain to more specialized domain-specific contexts. That is, can we generalize his vocabulary requirements for general English to more specific domains, e.g. within Academic English or Professional English? A small number of studies (e.g. Hsu Reference Hsu2011, Dang & Webb Reference Dang and Webb2014) suggest that this is not particularly straightforward. The studies show that size figures for the same levels of coverage differ depending on the degree of specificity. For example, Dang & Webb (Reference Dang and Webb2014) found that while 8,000 word families achieved 98% coverage of a general academic corpus, the subdiscipline requirements ranged from 5,000 (social sciences) to 13,000 (life and medical sciences). This implies that it is necessary to develop specific size requirements tailored to various domains and contexts. It is beyond the remit of this article on replication to give details about how to operationalize this extension of Nation's research. Instead, we direct readers’ attention to Hsu (Reference Hsu2014), who provides a useful model of how this type of research might be carried out. Focusing on engineering, she profiled a 4.57 million-word corpus of English engineering textbooks. Using Nation's general BNC/COCA 25K lists, she found that knowledge of the first 2K plus proper nouns provided just 80.7% coverage of the corpus, and that 5,000 word families were needed to reach 95% coverage. Given that in Taiwan knowledge of 2,000 general service words is a high school graduation threshold, Hsu's aim was to develop a word list that would fill the 14.3% gap in coverage between the 2K level and 95% coverage. The resulting Engineering English Word List consists of just 729 word families, considerably fewer than the 3,000 general word families needed to achieve similar coverage. This study is a good example of how the principles of coverage and size discussed in our article can be used to deliver very focused and pedagogically useful vocabulary learning targets for particular domains and learning contexts.

2.3 Coverage, size, and learning (Cobb Reference Cobb2007)

Size and coverage can be used to calculate not just comprehension, but also how much vocabulary it is possible or probable to learn from a particular text. Comprehension and learnability are interrelated, with the 95–98% coverage figures commonly cited as ‘lexical thresholdsʼ for both, although there is only one empirical finding that we know of to support the learning aspect. Swanborn & De Glopper (Reference Swanborn and de Glopper1999) undertook a meta-analysis of 20 studies of word learning in first language (L1) reading. They determined that known-unknown word ratios were a significant predictor of successful inferencing, and located some evidence for a threshold at one unknown word in 37 known (or, when 97% of the words are known). This coverage figure for successful inferencing is remarkably close to the 98% figure identified for L2 comprehension. However, beyond Swanborn & De Glopper, the coverage-inferencing-learning link is largely a common sense intuition at this point, and would be a fruitful area for research. But what seems sure is that learners with a typical vocabulary size of 2,000 word families will not comprehend or learn much new vocabulary from a text with more than 10% of its vocabulary beyond the most frequent 3,000 word families in English (3K), as is typically the case with all but simplified texts. Nor will they consolidate any correct inferences they do manage to make with texts that recycle these words only once or twice, again as is typically the case.

Cobb's study looked at this learnability issue and started from the question, ‘Can an adequate L2 reading lexicon be built from reading alone?’ His methodology made the following choices:

• He used Nation's (Reference Nation2006) BNC-based frequency lists as a source of learning objectives, in this case the third 1,000 word families.
• He hypothesized as the participant of his study a typical academic ESL learner with a vocabulary size of just over 2,000 words (2,112 word families, sd = 1,036, is a rough international average for academic ESL learners, according to Laufer's (Reference Laufer2000) census of English for academic purposes (EAP) and ESP instructors in eight countries).
• He assumed a year of study as the learning period, one year being the typical allowance in ESP-EAP situations, or two in rare situations.
• He assumed a maximum total yearly reading diet of either the Press, Academic, or Fiction divisions of the Brown corpus (179,000; 163,000; and 175,000 word tokens respectively, any of which equals about six stories the size of Alice in Wonderland or about 25 academic studies. The nature of the sub-corpora was intended to represent roughly the types of texts academic learners might be assigned to read, and the amount to be a generous estimate of what such students actually would read (remembering that with lexicons of 2,500 words, such texts would be presenting at least one unknown word in ten and hence be rather arduous to get through).

The specific research question, then, was how many of the most frequent 3,000 word families in English (3K) are present in each of these collections, and in what coverage proportions, from the perspective of a learner with knowledge of just over 2,000 words. Random samples from each of these first, second, and third 1,000 levels of the BNC lists were matched against the contents of the three hypothesized reading diets. The finding was that while the first and second 1,000 word families are well represented in any of the diets, the third 1,000 families thin out rather dramatically in all of them, with only about half appearing even six times. (Eight to ten times seems to be the minimum figure for reliable incidental learning indicated by the research, e.g. Horst, Cobb & Meara Reference Horst, Cobb and Meara1998). In other words, not much progress with the third 1,000 words could be expected from this presumably substantial exposure to natural (ungraded) text, at least not in the year or sometimes two that are normally available. By implication, then, pedagogies other than reading alone are needed to assure adequate progress toward the coverage objectives discussed in earlier parts of this paper.

This somewhat pessimistic finding about vocabulary growth from reading is controversial to say the least. First, it goes against the view, once almost universally held and still common amongst practitioners, that contextual inference from self-selected input is sufficient for all levels of vocabulary development (e.g. Krashen Reference Krashen1985). Second, it is a type of study that has not been undertaken before. Nation (Reference Nation2014: 1) called it a ‘notable exceptionʼ to a general lack of corpus-based studies of the feasibility of learning large amounts of foreign language vocabulary through reading, and, as already mentioned, it is novel in attempting to extend coverage analysis from comprehension to acquisition. Thus, for reasons of both controversy and originality, this study's findings will need substantial further investigation before they are accepted, ideally in the form of replication studies rather than just commentary and discussion. The replication should ideally be of two types: one that varies the data of Cobb (Reference Cobb2007) and another that varies the assumptions about how or how much learners can read. Varying the data might involve, for example, pitching larger samples of second- and third-thousand words or other levels of words against different and possibly more representative corpora; varying the assumptions might involve basing calculations on different ideas about the amount of reading the hypothesized learners are able to perform in a year, or even empirical evidence, which would amount to replacing hypothetical with real learners. In fact, both types of replications of Cobb (Reference Cobb2007) have already been undertaken, and these will now be summarized and any remaining questions identified.

On the corpus side, Nation in a series of conference presentations (Reference Nation2012b), a working paper entitled ‘How much input do you need to learn the most frequent 9,000 words?ʼ, and a published paper with the same title (2014), vastly extended the scale of Cobb's (Reference Cobb2007) corpus analysis, pitching each of the first nine k-levels against a 3-million token corpus of novels and other types of corpora, discovering in some detail the number of each level of words that would be encountered per unit of time assuming different reading rates. For example, if 300,000 words of the novels corpus were read, then 830 of the third 1,000 most frequent words would be met an average of 12.6 times (and so on up to just under 3 million words for most of the ninth 1,000), which is roughly in line with Cobb's proposal that with 167,000 words an insufficient number of third 1,000 words would be met for systematic acquisition. Nation, however, went on to propose that 300,000 words of reading is, in fact, possible, provided that (1) the reading goals were set higher than they are now (although ‘there is no published research to support [these proposed reading figures] for learners of English as a foreign languageʼ, p. 7); and (2) the texts involved ‘were at the right level for [the learners] so that the target words would make up around 2% or less of the running words in the textʼ (p. 7) with unknown words therefore met in a ratio of about one unknown in 50 known. Natural text, however, in most cases will not provide unknown words in such a friendly ratio, so the point is moot. For a learner who knows 2,000 word families, even a novel such as Lady Chatterley will comprise almost 7% post-fourth 1,000 items, such that unknown words will be met in a ratio of at least one unknown in 15 (roughly one per two lines of text). Further, many types of texts are even more lexically sophisticated than the novels on which these figures are based, and still further come in less inference-friendly formats than the chronological flow of everyday events which typifies novels. To summarize, then, the corpus part of Nation's replication confirms and extends Cobb's initial and in retrospect pilot-level work, but the proposal part simply highlights the need for empirical work on how much of what type of texts learners can actually read.

This second type of replication, on the learner side, is the topic of McQuillan & Krashen's (Reference McQuillan and Krashen2008) direct response to Cobb (Reference Cobb2007). The form of the replication began with a review of existing literature on real rather than hypothetical L2 readers and their reading rates. These researchers questioned whether Cobb's proposed reading diet was actually particularly large for typical L2 learners, citing 11 reading-rate studies showing that real learners with lexicons of about 2,000 words can read a lot more than Cobb's 179,000 words of unsimplified text over a year, indeed rather more like 517,000 words. This would then mean that such learners would meet most of the third 1,000 target words enough times for learning and consolidation to occur.

Yet on actual inspection of McQuillan & Krashen's sources, (Cobb Reference Cobb2008), the posited larger amount of reading turns out to be the reading of simplified materials, not of academic or otherwise authentic texts. Thus, while more can undoubtedly be read using simplified materials, this is no guarantee that the particular target lexis (the third 1,000 word families, or beyond) is actually present in such materials. Most existing simplified texts focus on the first 1,000–2,000 word families, although other targets would be possible (see below). In other words, reading rates of these learners for authentic texts are presently unknown, and a true empirical replication of the Cobb (Reference Cobb2007) study with valid information about reading diet remains to be done.

Both the counting-up and the empirical types of replications are worth doing. Indeed both Nation and McQuillan & Krashen are the two parts of the single replication that is needed: a corpus-based work-out of the approximate learning opportunities, across the mid-frequency zones, in a range of relevant text types, that was based on a validated assessment of how much and what type of text learners at different levels are actually able to read. It is quite likely that the two parts will be engaged separately and assembled subsequently, although they could be done together, perhaps in a new research paradigm that might be called corporo-empirical research. It is important, however, that this work should be performed. The question of L2 vocabulary growth, particularly with regard to the demands of advanced level reading, has been unresolved for decades, but has now come in range and is answerable – in principle through research-informed corpus analysis.

Within these two main lines of replication, a number of variants could be usefully incorporated within an approximate replication methodology that might modify or even reverse the findings of Cobb (Reference Cobb2007). One possibility would be to vary the type of texts in the corpus. For example, a study might look at the learning opportunities in not just academic or other authentic texts but also in pedagogically modified texts. Indeed, Nation, in response to some of the issues raised above, has recently begun producing a complete set of graded readers that specifically target ‘mid-frequency’ vocabulary (that of the third to eighth 1,000 levels, as defined by Schmitt & Schmitt Reference Schmitt and Schmitt2014), such that significant numbers of word families in target k-level zones are met frequently, and mainly in environments of 98% known words. This work is described in Nation & Anthony (Reference Nation and Anthony2012), and in an undated information document by Nation (‘About mid-frequency readers,’ www.victoria.ac.nz/lals/about/staff/paul-nation). The first 13 of a projected 50 modified texts (nine fiction and four non-fiction), each simplified to a fourth, sixth, and eighth 1,000 word families target level, have been completed and are available on Nation's website. When completed, these 50 texts will form a corpus that will be eminently suitable for a combined corpus and empirical replication of Cobb (Reference Cobb2007) or indeed of Nation (Reference Nation2014).

Another variation on the text/corpus side that could form the basis of a useful replication would be to vary both text genre and degree of prior familiarity with the topic area. Neither Cobb's choice of Press, Academic, or Fiction sections of Brown, nor Nation's choice of out-of-copyright novels, even with lexical redesign, is typical reading for today's academic ESL learner. A corpus of what such learners are in fact reading could provide a very interesting replication of Cobb (Reference Cobb2007). A researcher might well find that academic ESL learners, reading in their domains, where they are building an accumulating knowledge base revolving around a limited number of themes (doing narrow reading), can indeed read larger amounts than those proposed by Cobb and can indeed build their lexicons substantially through reading. Variations on the learner/empirical side could be also undertaken, separately or in conjunction with variations on the corpus side. A particularly important element of the original study that could be usefully varied in an approximate replication is reading conditions. The assumption of convenience in Cobb (Reference Cobb2007), as in all coverage studies that we know of including those discussed in earlier parts of this paper, is that the reading occurs in unassisted conditions. But this is no longer how very many people actually read, particularly young people and students, given the click-on definitions, text-to-speech renditions, and Google searches beckoning within the web pages and PDF files on their computers and iPads. It is almost certain that more difficult texts can be comprehended and more vocabulary learned from resource integrated texts than with self-contained texts, but how much more? Changing the condition from unassisted to resource-assisted reading, perhaps in potential interaction with various kinds or degrees of training, one might well find that learners with emergent lexicons of about 2,000 word families can indeed read texts like Lady Chatterley with comprehension, enjoyment, and substantial vocabulary growth. Indeed, in the information document accompanying these designed-for-learning readers, Reference NationNation (n.d.) proposes precisely this method of engaging with his learning-enhanced texts.

The interesting questions then become, ‘How many words does one need to know to read texts connected to online resources?’ and ‘How many words can one learn from reading texts connected to online resources?’ Cobb (Reference Cobb2007) proposed that 170,000 words a year was a lot of reading for learners with 2,000 word families and that inferences of new word meanings would be hard going; but with resources, Nation's 300,000 or McQuillan & Krashen's 517,000 might well come in range, and easy look-ups make inferences straightforwardly confirmable. In other words, through these replications, Cobb's gloomy prognosis might be shown to be an artefact of a now largely defunct unassisted reading paradigm.

3. Conclusion

Knowing how much vocabulary is necessary to be functional in a language is crucial for setting vocabulary learning goals and designing syllabuses. Setting vocabulary size goals in which the English Language Teaching (ELT) community can be confident requires replication of both lexical coverage and vocabulary size studies, with related research into learning potential a useful adjunct. We can probably use the 95% and 98% coverage figures for written language, but for spoken discourse, there is a clear need to build upon the initial findings of van Zeeland & Schmitt (Reference van Zeeland and Schmitt2012). Nation (Reference Nation2006) provides clear learning targets for vocabulary size, but given the very considerable teaching and learning effort these substantial figures entail (6,000–7,000 word families for spoken discourse and 8,000–9,000 for written discourse), it is important to determine if these figures still hold true for other corpora, or if perhaps additional research will point to somewhat lower requirements (if we are lucky). Finally, we need to move beyond a merely corpus-driven discussion of the lexical needs for using language, and start looking into what learners can actually do with various vocabulary sizes, and how this affects their further learning, as Cobb (Reference Cobb2007) suggests. All of these replications should lead to a firmer establishment of vocabulary size requirements necessary to inform language pedagogy and assessment.

Acknowledgements

We wish to thank Mélodie Garnier, Benjamin Kremmel, Marijana Macis, Kholood Saigh, Michael Rodgers, Hilde van Zeeland, Laura Vilkaite and Paul Nation for their insightful comments on earlier versions of this paper.

Norbert Schmitt is Professor of Applied Linguistics at the University of Nottingham. He is interested in all aspects of second language vocabulary. He has published eight books (the latest being Researching vocabulary: A vocabulary research manual (2010, Palgrave Macmillan)), 50 journal articles, and 23 book chapters on various vocabulary topics. He currently sits on the editorial board of Language Testing. His personal website (www.norbertschmitt.co.uk) gives much more information about his research, and also provides a wealth of vocabulary resources for research and teaching.

Tom Cobb is Professor of Didactique des langues (or Applied Linguistics) at the University of Quebec at Montreal. He is interested in all aspects of second language vocabulary, focusing mainly on those that can be investigated or learned with the help of a computer. His Compleat Lexical Tutor website (www.lextutor.ca) has a host of resources for learners, teachers, and researchers as well as links to his research studies.

Marlise Horst is an associate professor in the Department of Education at Concordia University in Montreal, where she teaches courses in L2 vocabulary acquisition and the history of English for language teachers. Her current research explores opportunities to learn new L2 vocabulary via exposure to classroom input.

Diane Schmitt is a senior lecturer in EFL/TESOL at Nottingham Trent University and Chair of BALEAP. She teaches on the MA in English Language Teaching and also on a range of EAP courses. She has co-authored two textbooks on teaching vocabulary. Her areas of interest include: academic writing, plagiarism, vocabulary acquisition, language testing, materials development, and the international student experience.

Footnotes

¹ Michel et al. (Reference Michel, Yuan, Aiden, Veres, Gray, Pickett, Hoiberg, Clancy, Norvig, Orwan, Pinker, Nowak and Aiden2011) found 1,022,000 individual word forms in a computer analysis of a corpus of 5,195,769 digitized books.

² There are a number of units with which to count vocabulary, with the most common being individual word form, lemma, and word family. These counting units have often been ill-defined or conflated in the literature, and so for convenience, we will use ‘word’ as our general cover term in this paper. In cases where a more specific counting unit is possible/appropriate (e.g. word family), we will use that term. Another complexity concerns formulaic language. A great deal of language is made up of multiword units, but this formulaic language is seldom included in vocabulary counts. How to deal with formulaic language in counting, teaching, and testing vocabulary remains one of the most pressing issues in the field of vocabulary studies. However, as the measurement of the size and knowledge of formulaic language is still in its infancy, it will not be focused on in this paper dealing with replication. (See Schmitt Reference Schmitt2010 for details on all of these issues.)

³ One weakness of coverage figures is that they do not distinguish the importance of the individual words in a text. If unknown words are crucial to the understanding of a text (and this is likely for at least some), then high coverage figures will not by themselves ensure adequate comprehension. Research on the effect of particular words on comprehension has not yet been done, and is outside the remit of this paper on replication. However, the issue is probably central for the coverage argument, and needs to be developed as a complementary research strand.

⁴ The relatively congruent size requirements for a variety of written and spoken text types is partially a reflection of Zipf's law, where the most frequent words will always make up the majority of a text of any type.

⁵ Although there is not space to elaborate here, we feel that careful selection of appropriate counting units is essential for any future study, replication or not. Counting units need to be chosen in a principled manner, without word families being the automatic default just because they have been used before. It is also worth noting that although compiling word family lists requires an enormous amount of work in the first instance, once developed, their re-use requires relatively little editing.

References

Cobb, T. (2007). Computing the vocabulary demands of L2 reading. Language Learning & Technology 11.3, 38‒63.Google Scholar

Cobb, T. (2008). What the reading rate research does not show: Response to McQuillan & Krashen. Language Learning & Technology 12.3, 109‒114.Google Scholar

Dang, T. G. Y. & Webb, S. (2014). The lexical profile of academic spoken English. English for Specific Purposes 33, 66‒66.CrossRef Google Scholar

Field, J. (2004). An insight into listeners’ problems: Too much bottom-up or too much top-down? System 32, 363–377.CrossRef Google Scholar

Goulden, R., Nation, P. & Read, J. (1990). How large can a receptive vocabulary be? Applied Linguistics 11, 341‒341.CrossRef Google Scholar

Horst, M., Cobb, T. & Meara, P. (1998). Beyond A Clockwork Orange: Acquiring second language vocabulary through reading. Reading in a Foreign Language 11, 207‒207.Google Scholar

Hsu, W. (2011). The vocabulary thresholds of business textbooks and business research articles for EFL learners. English for Specific Purposes 30, 247‒247.CrossRef Google Scholar

Hsu, W. (2014). Measuring the vocabulary load of engineering textbooks for EFL undergraduates. English for Specific Purposes 33, 54‒54.CrossRef Google Scholar

Hu, M. & Nation, I. S. P. (2000). Vocabulary density and reading comprehension. Reading in a Foreign Language 23, 403‒403.Google Scholar

Krashen, S. (1985). The Input Hypothesis: Issues and implications. London: Longman.Google Scholar

Laufer, B. (1989). What percentage of text-lexis is essential for comprehension? In Lauren, C. & Nordman, M. (eds.), Special language: From humans thinking to thinking machines. Clevedon: Multilingual Matters, 316‒316.Google Scholar

Laufer, B. (2000). Task effect on instructed vocabulary learning: The hypothesis of ‘involvement’. Selected Papers from AILA ’99 Tokyo. Tokyo: Waseda University Press, 47‒62.Google Scholar

Laufer, B. & Ravenhorst-Kalovski, G. C. (2010). Lexical threshold revisited: Lexical text coverage, learners’ vocabulary size and reading comprehension. Reading in a Foreign Language 22, 15–30.Google Scholar

Lund, R. J. (1991). A comparison of second language listening and reading comprehension. The Modern Language Journal 75, 196–204.CrossRef Google Scholar

McQuillan, R. & Krashen, S. (2008). Commentary: Can free reading take you all the way: A response to Cobb (2007). Language Learning & Technology 12.1, 104‒104.Google Scholar

Michel, J-B., Yuan, K. S., Aiden, A. P., Veres, A., Gray, M. K., The Google Books Team, Pickett, J. P., Hoiberg, D., Clancy, D., Norvig, P., Orwan, J., Pinker, S., Nowak, M. A., Aiden, E. L. (2011). Quantitative analysis of culture using millions of digitized books. Science 331, 176–182.CrossRef Google Scholar PubMed

Nation, P. (n.d.). About the mid-frequency readers. Available at www.victoria.ac.nz/lals/about/staff/paul-nation/ Google Scholar

Nation, P. (2006). How large a vocabulary is needed for reading and listening? Canadian Modern Language Review 63, 59‒59.CrossRef Google Scholar

Nation, P. (2012a). The BNC/COCA word family lists. Available at www.victoria.ac.nz/lals/about/staff/publications/paul-nation/Information-on-the-BNC_COCA-word-family-lists.pdf.Google Scholar

Nation, P. (2012b). What does every ESOL teacher need to know? CLESOL plenary for Language Teaching Community Languages and ESOL. Palmerston North, NZ.Google Scholar

Nation, P. (2014). How much input do you need to learn the most frequent 9,000 words? Reading in a Foreign Language 26.2, 1‒1.Google Scholar

Nation, P. & Anthony, L. (2012). Mid-frequency readers. Journal of Extensive Reading 1.1, 5‒5.Google Scholar

Park, G. (2004). Comparison of L2 listening and reading comprehension by university students learning English in Korea. Foreign Language Annals 37, 448–458.CrossRef Google Scholar

Porte, G. (2012). Introduction. In Porte, G. (ed.), Replication research in applied linguistics. Cambridge: Cambridge University Press, 1‒1.Google Scholar

Rubin, J. (1994). A review of second language listening comprehension research. The Modern Language Journal 78, 199–221.CrossRef Google Scholar

Schmitt, N. (2010). Researching vocabulary: A vocabulary research manual. Basingstoke: Palgrave Macmillan.CrossRef Google Scholar

Schmitt, N., Jiang, X. & Grabe, W. (2011). The percentage of words known in a text and reading comprehension. The Modern Language Journal 95, 26‒26.CrossRef Google Scholar

Schmitt, N. & Schmitt, D. (2014). A reassessment of frequency and vocabulary size in L2 vocabulary teaching. Language Teaching 47.4, 484‒484.CrossRef Google Scholar

Schmitt, N. & Zimmerman, C. B. (2002). Derivative word forms: What do learners know? TESOL Quarterly 36, 145‒145.CrossRef Google Scholar

Swanborn, M. & de Glopper, K. (1999). Incidental word learning while reading: A meta-Analysis. Review of Educational Research 69.3, 261‒261.CrossRef Google Scholar

Vandergrift, L. (2011). L2 listening: Presage, process, product and pedagogy. In Hinkel, E. (ed.), Handbook of research in second language teaching and learning (vol. 2). Oxford UK/New York: Routledge, 455–471.Google Scholar

van Zeeland, H. & Schmitt, N. (2012). Lexical coverage in L1 and L2 listening comprehension: The same or different from reading comprehension? Applied Linguistics doi:10.1093/applin/ams074.Google Scholar

Webb, S. & Rodgers, M. P. H. (2009a). The lexical coverage of movies. Applied Linguistics 30, 407‒407.CrossRef Google Scholar

Webb, S. & Rodgers, M. P. H. (2009b). The vocabulary demands of television programs. Language Learning 59, 335‒335.CrossRef Google Scholar

Wellington Corpus of Spoken English. Information at www.victoria.ac.nz/lals/resources/corpora-default/corpora-wsc.Google Scholar

Article contents

How much vocabulary is needed to use English? Replication of van Zeeland & Schmitt (2012), Nation (2006) and Cobb (2007)

Abstract

1. Introduction

2. The original studies and suggested approaches to replication

2.1 Lexical coverage of spoken discourse (van Zeeland & Schmitt Reference van Zeeland and Schmitt2012)

2.2 Vocabulary size (Nation Reference Nation2006)

2.3 Coverage, size, and learning (Cobb Reference Cobb2007)

3. Conclusion

Acknowledgements

Footnotes

References

Save article to Kindle

Save article to Dropbox

Save article to Google Drive

Reply to: Submit a response

Your details

You have entered the maximum number of contributors

Conflicting interests