Small and large corpora in language learning

Guy Aston

Scuola superiore di lingue moderne per interpreti e traduttori
Università di Bologna
Corso Repubblica 136
47100 Forlì

April 1997

Learner access to corpora

There are two main practical applications of corpora in language learning which have been posited in the literature. In the first, which we might call the COBUILD approach, the corpus primarily constitutes a resource for the materials writer and teacher, in creating reference works and designing syllabuses and teaching materials. In the second, which we might call the data-driven learning approach, the corpus primarily constitutes a resource for the learner, either via printed concordances or hands-on concordancing of corpora. The distinction is summarised in a well-known quote in which Tim Johns advocates the DDL approach:

"What distinguishes the data-driven learning approach is the attempt to cut out the middleman ... and give direct access to the data so that the learner can take part in building up his or her own profiles of meanings and uses." (Johns 1991:30)

While the utility of corpus application from the first perspective is relatively uncontroversial (but see the debate between Sinclair 1991 and Widdowson 1991), whether you feel that giving learners direct access to corpora is a good thing depends largely on your view of the roles of teachers and learners. A recent article by Charles Owen provides a cautionary tale in this respect:

"Ah Peng, a research student from China, studying at a British university ... His latest piece of work contains the following sentence: Many more experimental studies require to be done before we can say that ... The teacher, British, educated in Britain, sure of his mother-tongue competence, has red-pencilled this and added a couple of notes. First, he suggests replacing require with need. Second, he says that if Ah Peng insists on using require then he could try Many more experimental studies are required before we can say that ... In other words, the verb require does not occur in the pattern attempted by Ah Peng ... Just for good measure, the teacher suggests that Ah Peng look it up in the ... Cobuild corpus ... So Ah Peng does this, and finds that require is indeed found in the passive ... However he also finds lines like this: decided that a large number of laws would require to be passed by a two-thirds majority. In fact, he finds more than a dozen of these ... Ah Peng greets the teacher next week with a triumphant gleam in his eye." (Owen 1996: 222-3)

This description highlights one of the main fascinations of the data-driven learning approach, namely that it potentially provides the learner with tools to challenge (or, if you prefer, lighten the weight of) the linguistic authority of the teacher. Owen does not make it clear whether his story is fact or fiction, but I can confirm that where I work, where we have for the last few months had access to the British National Corpus, we have had similar experiences: indeed one student told me that the best thing about the BNC was that she felt able to contradict her teachers.

Owen's reason for telling this story is not so much to highlight the revolutionary threat to the established order in language pedagogy that may be posed by learner access to corpora, as to stress that corpus use may not lead the learner to pedagogically appropriate generalisations. I have not checked COBUILD, but it seems probable that Ah Peng is overgeneralising from the few examples available, as well as wilfully ignoring the fact that the passive be required is by far the more frequent form. We might say that he has failed to appreciate that while extremely rich sources of information as to what typically is done in a language, corpora are much less reliable sources as to what can be done, offering no guarantee that the user will be able to induce abstract rules from the data correctly.

For Owen, the moral of this story is that corpus use must not lead teachers to abandon their own intuitive generalisations in pedagogy. I don't want to disagree with that: but equally, this doesn't mean that corpora should be abandoned either. It seems widely accepted that corpora can provide additional information about the ways language is used, complementing and indeed contradicting traditional sources, and helping to expand linguistic awareness. The point is that for this to take place effectively in a DDL context, learners need to use corpora appropriately, and this is not just a matter of technical skills in using concordancing software. It involves selecting appropriate corpora or subcorpora to interrogate, designing appropriate queries, and appropriately interpreting the results of those queries. Rather than listing hazards in all these areas, let me just cite one example which illustrates some of them fairly graphically. One of our students remembered encountering the word wight (meaning creature) in a story she had read. Was this word, she asked, used currently in contemporary English? The 100 million words of the British National Corpus seemed a reasonable place to look for an answer: a search for wight in the SARA index to the BNC in fact suggests that it is quite common, occurring 414 times. Only, of course, if you look at these occurrences, you find that they are virtually all references to the Isle of Wight. The simplest way to eliminate the noise and make the query more precise is to reformulate it as case-sensitive, thereby excluding cases with a capital W. This reduces the number of solutions to the 17 shown below:

$('wight') [the $ sign indicates case-sensitivity]

 1 am Gooch, by standing his ground and putting the wight on the front foot so that the head is over the fr
 2 He had felt a similar urge in the Barrow, as the wight's fingers came towards him, and there the temptat
 3 ith spells for the destruction of Mordor" in the wight's barrow.  `Glad would he have been to know its f
 4 osition of the noble dead in the barrow with the wight itself.  Does all glory decompose?  That is what
 5 /ht, xt/: for example,  with  (OE  wiht , PresE  wight ). 
 6                                     Thus, `with, wight" cannot appear as  wit, whereas `white" can.  To
 7  community, loss of the fricative and merger of  wight, white , or close approximation and overlap, had
 8 Weight.  [PS1SU]  What does it say?  [PS1SV]  Oh wight.  [PS1SU]  Right wight.  Quite wight.  So I I thi
 9 does it say?  [PS1SV]  Oh wight.  [PS1SU]  Right wight.  Quite wight.  So I I think you get bored with s
10 [PS1SV]  Oh wight.  [PS1SU]  Right wight.  Quite wight.  So I I think you get bored with some of this wh
11 r   With its blind song   That some sorry little wight more feeble and misguided than thyself   Take thy
12 th cystic fibrosis (five with liver disease) and wight control subjects.  Faecal samples were collected
13 relates to the need to er again I think maximize wight be a suitable word in here, er infrastructure.
14 o be ""buxom"", "obedient, biddable", to ""every wight"" while he is away in Flanders.   Linguistically
15  different from the readership of ""every gentil wight"" that is offered a warning and an invitation to
16  conform to the naive stereotype of the ""gentil wight"" - to share this insight.  |  ^ Exactly how far
17 imity of opinion directs the laughter of ""every wight"" at the end of the Miller's Tale (3847-9), the t

This concordance makes two things evident. The first is that the numerical datum alone still does not provide a reliable answer to the question posed. The second is that an enormous range of background knowledge, cultural as well as linguistic, has to be brought to bear in order to interpret these instances. Even after eliminating the Isle of Wight, we still have to distinguish quotations from medieval literature (Chaucer's "every wight" and "gentil wight": lines 14-17), mention as opposed to use in a historical linguistics textbook (5-7), potentially deliberate archaism in a poem by Ted Hughes (11), punning in conversation (8-10), and a range of misprints (1, 12, 13), as well as references to J.R.R. Tolkien's Lord of the rings and its pseudo-Old English (2-4). Much of this knowledge is likely to be beyond all but the most advanced and capable of second language learners. Ah Peng is not alone in being potentially misled by the abundance and variety of data available from a large corpus of this kind. Within a DDL approach, some of the first problems to arise concern the kinds of corpora learners should have access to, and how they can be helped to improve their ability to use them.
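
The effect of making the query case-sensitive can be illustrated with a minimal sketch. It does not reproduce SARA or its index: the function name kwic, the width parameter and the three-sentence sample text are all hypothetical, introduced here only to show how a case-sensitive match removes the Isle of Wight noise while leaving the remaining lines to be interpreted by the reader.

```python
import re

def kwic(text, word, case_sensitive=False, width=30):
    """Return keyword-in-context lines for whole-word matches of `word`."""
    flags = 0 if case_sensitive else re.IGNORECASE
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", flags)
    lines = []
    for m in pattern.finditer(text):
        # Take up to `width` characters of context on each side of the match.
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}}{m.group()}{right}")
    return lines

# Invented sample text, for illustration only (not BNC data).
sample = (
    "They took the ferry to the Isle of Wight. "
    "Some sorry little wight waited on the shore. "
    "The Isle of Wight was busy in summer."
)

all_hits = kwic(sample, "wight")                          # ignores case
lower_hits = kwic(sample, "wight", case_sensitive=True)   # lower-case only

for line in lower_hits:
    print(line)
```

As with the BNC query, the sketch only narrows the set of solutions. Deciding whether a surviving line is a quotation, a pun or a misprint still depends on the reader's background knowledge.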

Small or large?

The orthodox view on corpus size is the larger the better (Sinclair 1991). On the one hand, most types of linguistic events are extremely infrequent, as Zipf's law reminds us: even a general corpus of the size of the BNC provides very limited documentation of a word as infrequent as wight. On the other, since it is impossible to reliably sample textual populations whose extent and composition are unknown, increasing size is an obvious countermeasure: while there is no theoretical warrant for treating frequencies in a corpus like the BNC as reflecting those in English "as a whole", it can be argued that inasmuch as it is a large corpus which aims to provide broad coverage of language uses, figures from it will probably provide more useful general indications than ones from smaller and/or more specialised corpora. From the perspective of data-driven learning, however, the virtues of large corpora seem less readily apparent. Small corpora would seem to be both useful as instruments of language learning in their own right, and as means of training learners to use corpora appropriately.

While by large corpora I mean collections such as the Bank of English and the BNC, with 100 million words or more, by small corpora I mean ones in the 20,000-200,000 word range - smaller, that is, by several orders of magnitude. A glance at a list of small corpora provided by Flowerdew (1996) shows that as well as their size, what typically distinguishes such corpora from large ones is that they are far more specialised, by topic, by genre, or both. When corpora consisting of so few texts are put together, homogeneity becomes the watchword. With the exception of the Byte and New Scientist corpora, all those listed are under 250,000 words.

Some small corpora for language learning (Flowerdew 1996: 101; from Ma 1993)

Source                Corpus                                 Size (words)
King 1989             Academic lectures/tutorials             155k
King 1989             Scientific/technical journals           114k
Tribble/Jones 1990    ELT text pack corpus                     45k
Tribble 1990-91       English historical review corpus        105k
Tribble 1990-91       Longman Learner corpus                   55k
Mparutsa et al 1991   Economics corpus                         21k
Mparutsa et al 1991   Geology corpus                           34k
Mparutsa et al 1991   Philosophy corpus                         7k
Johns 1988            Transportation/highway engineering      100k
Johns 1988            Plant biology corpus                    100k
Johns 1991            New Scientist corpus                    760k
Johns 1991            Byte corpus                            1000k
Flowerdew 1993        Biology lectures and readings           104k
Ma 1993               Direct mail sales letters                16k

Large corpora, on the other hand, typically aim at a broad general coverage of language production. Even where they can be divided into subcorpora, those subcorpora are still pretty general categories. For instance, one of the small corpora we have been working with in Forlì is a collection of 14 articles dealing with hepatitis C from medical research journals: a total of 43k words. About the closest identifiable category of texts in the BNC is that of "written informative texts in applied science" (364 texts: 7.3 million words). Notwithstanding this enormous numerical difference, the number of texts of the specific type represented in the small corpus is in fact greater than the number of texts of that type contained in the BNC, where I have only found seven academic research articles dealing with hepatitis of any kind. If the learner's objective is to become familiar with a highly specific text-type (as is often the case in translating and interpreting work, as well as in LSP), even a very small specialised corpus may provide more plentiful documentation of many features of that type than can a large general one.

It is not just at the level of content, however, but also in terms of the methods they require and allow, that small specialised corpora offer advantages from a DDL perspective.

From small to large

The emphasis in this paper on small specialised corpora is to some extent a product of our particular concern in Forlì with training interpreters and translators, who need to be familiar with the conventions of the specific type of text they have to translate. In corpus linguistics, it has been claimed that specific corpora provide better training material for genre-specific NLP applications than do general ones (Biber 1993), and Biber and his colleagues have extended this argument to ESP:

"The markedly different patterns of linguistic form and function that occur across registers indicate that there is no single set of linguistic features that should be emphasized for all students, once they have mastered the rudiments of English grammar. Rather, it is important to teach the linguistic characteristics and functions of particular target registers, so that students will be able to control the language structures they encounter in actual discourse and to adjust their language use appropriately for different registers." (Biber et al 1994: 174)

This does not mean, however, that particular recurrent features in small corpora are necessarily restricted to that particular text-type. In this respect the small corpus provides a resource for generating hypotheses which can then be tested against larger corpora to ascertain to what extent a feature is generalised. Thus a learner investigating the hepatitis corpus may discover the recurrence of patients (who) developed hepatitis:

`developed' in the Hepatitis C corpus (1/3 solutions)

 1 ed hospitalized patients in the United States developed a hepatitis-like picture, our reported frequency
 2  Results [P] During the follow-up 78 patients developed ALT elevations higher than 2.5 times the upper l
 3 recipients in whom non-A, non-B hepatitis had developed an average of 18 years earlier is virtually iden
 4 e cases, before the onset of disease) and who developed anti-HCV in the first year after the hepatitis b
 5 l" patients closely followed up to six months developed any abnormality of transaminase activities after
 6 . Acute hepatitis occurred in 31 patients; 55 developed chronic disease. The remaining four patients had
 7  However, not all patients with high ALT [P]  developed chronic disease.  The peak ALT elevation was ass
 8 presence of shock or malignancy. Patients who developed chronic hepatitis did have higher mean ALT level
 9 ocumented hepatitis C exposure and the 50 who developed chronic hepatitis C were 17% and 20%, respective
10  chronicity. 21 of 65 (32%) biopsied patients developed cirrhosis at the end of the follow-up, and one f
11 and mild CAH (4).  Patients in this study who developed cirrhosis were about ten years older than those 
12 dice. None of those in whom chronic hepatitis developed continued to have significant symptoms. Hepatiti
13 eight patients (seven with chronic hepatitis) developed hepatic failure; and 3) life-table analysis show
14  transfusions of factor VIII. Of these, eight developed hepatitis (table). Of the seven patients who had
15 ansfusions of factor VIII. Three patients who developed hepatitis had previously received 11,15, and 19 
16  blood samples tested from those patients who developed hepatitis confirmed that all episodes of hepatit
17 tomatic illness. In 12 of the 17 patients who developed hepatitis the illness had an incubation period o
18 o received NHS factor VIII for the first time developed hepatitis, while eight out of the 15 who had rec
19 d had only NHS factor VIII in the past, three developed hepatitis. Eight patients had had both NHS and c
20  between 1 and 24 yr (mean=9.7 yr). Cirrhosis developed in 8 patients (20%) between 1.5 and 16 yr after 
21 r VIII units previously transfused. Hepatitis developed in all those nine patients, however, who had not
22                    RESULTS [P] NANB hepatitis developed in 65 of 1,070 (6.1%) patients who received tran
23  transfusion in the eight patients in whom it developed. It is noteworthy that although some patients pr
24  samples taken four weeks apart. Ten patients developed jaundice, and six of these were acutely ill (fig
25 ANSFUSION HEPATITIS.[P] [P]  135 patients who developed NANB post-transfusion hepatitis mostly after car
26 ). Of the 1151 recipients studied, 106 (9.2%) developed non-A, non-B hepatitis. [P]     To assess the re
27 han the number of units given to patients who developed non-A, non-B hepatitis who received blood that w
28 Netherlands 3.3 % (13) of transfused patients developed PHT. As previous studies (8) have shown that 2.2
29  (3%). About half of the cases with cirrhosis developed portal hypertension, and three of these died due
30 ere probably infected before transfusion. one developed probable and the other possible hepatitis.Both s
31 ansferase (ALT) within 48 h. The patients who developed symptoms or ALT elevations above 2.5 times the u
32 enzyme-linked immunosorbent assay (ELISA) was developed to detect circulating antibodies to the C-100 re
33 ed transfusions but in whom hepatitis had not developed. We present the results of the mortality survey;
34 ransfusions and in whom non-A non-B hepatitis developed with those in matched control groups of persons

The same learner can then turn to the BNC to investigate what other kinds of things may be developed by patients - complications, recurrences, and a wide range of other unpleasant and unpronounceable ailments:

(('patients')*('developed'))/5: British National Corpus (alternate solutions)

 1 ervices for attempted suicide patients have been developed. 
 2 nly been used in patients whose cancers are well developed. 
 3                        Over half of all patients developed a clinically important complication in the f
 4 However, comparing the subgroup of patients that developed abdominal pain with the subgroup that did no
 5 vett and colleagues reported on two patients who developed acute respiratory obstruction from swallowin
 6                   None of Vento's three patients developed AIDS during IFN therapy.
 7                                     Two patients developed an acute psychosis: the first, who had just 
 8 14 years) before this study.  Three patients had developed anastomotic disease recurrence and had requir
 9 e interferon treatment.  |  Most of the patients developed ascites, jaundice, and encephalopathy that p
10  18.2 years (range 726), six men.  Six patients developed cancer under the age of fifty.
11 a mean of 2.4 months, two (5%) of these patients developed cholangitis. 
12 nically `silent" period, the patients eventually developed cholestasis. 
13 t records that document all colitis patients who developed dysplasia, cancer, or had colonic surgery sin
14 ne death related to the procedure.  Two patients developed encephalopathy after TIPSS, in one patient th
15  the follow up period except in two patients who developed epithelial hyperplasia within the stent resul
16         Treatment was discontinued when patients developed evidence of cirrhosis or portal hypertension
17            Trial criteria  - Twenty one patients developed gall stone recurrence
18  gall stone recurrence  - Of the 21 patients who developed gall stone recurrence by the original criteri
19 nty three of these patients had, or subsequently developed, intrahepatic and/or extrahepatic bile duct 
20 efore the scheduled appointment. 4 patients also developed level-1 or in-situ melanomas; 3 of these pat
21 low up period, 22 (40%) of the 55 polyp patients developed metachronous polyps.
22 o in the placebo groups.  Excluding patients who developed pancreatitis, 43 (18%) developed abdominal p
23  and we have classified patients as having  de - developed postoperative sepsis if they have developed t
24 wo patients taking hormone replacement treatment developed recurrent stones.
25 ch facilities for attempted suicide patients are developed should endeavour to monitor trends in the beh
26 olerated in all but one patient.  Three patients developed strictures after radiotherapy but all were de
27                         Nine of the patients had developed the virus, and three had died [see p. 39161 f
28 aried from 39 to 64% (mean 57%).  Three patients developed ventricular fibrillation and pacing was requi 
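
The kind of proximity query used above can be sketched in a few lines. This is only an approximation under stated assumptions: I read the query as asking for patients followed by developed within a span of five words, and the function name within_span, the simple tokeniser and the sample sentences are all hypothetical rather than part of SARA or the BNC.

```python
import re

def within_span(tokens, first, second, span=5):
    """Return (i, j) positions where `first` is followed by `second`
    no more than `span` tokens later (case-insensitive)."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() != first:
            continue
        # Look ahead at most `span` tokens after position i.
        for j in range(i + 1, min(i + span + 1, len(tokens))):
            if tokens[j].lower() == second:
                hits.append((i, j))
    return hits

# Invented sample text, for illustration only (not BNC data).
sample = (
    "Two patients developed an acute psychosis. "
    "Services for such patients have been developed. "
    "A vaccine was developed before any patients were enrolled."
)

# Crude tokeniser: runs of letters and apostrophes; punctuation is dropped.
tokens = re.findall(r"[A-Za-z']+", sample)
hits = within_span(tokens, "patients", "developed")

for i, j in hits:
    print(" ".join(tokens[i:j + 1]))
```

Note that the sketch, like the first two BNC solutions listed, also retrieves cases where patients is not the subject of developed, and that it ignores sentence boundaries altogether. Sorting out such cases is left to the learner.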

Similarly, investigation can focus on observed absences in the small corpus. Noting the absence of the passive developed in the hepatitis corpus, the learner can turn to the BNC to investigate whether and where such passives do occur - i.e. what sorts of things are in fact developed - not illnesses, but approaches, models, concepts, devices, programmes, languages and the like, not to mention film:

('were')("developed"=VVN): British National Corpus (random 25 solutions)

 1 earliest Interactive Multimedia Terminals (IMTs) were developed using videodisc technology and, even tod
 2         Patterns like the `fix" and the `fiddle" were developed by skilled workers, in piecework systems
 3                        Thus, if such an approach were developed, one might hope that all the essential c
 4 , very efficient and very effective day centres, were developed, funded, provided mainly in the conurbat
 5   Various strategies, individual and collective, were developed to combat the organizational pressure.
 6 tives would be achieved more readily `if courses were developed on a unit basis". 
 7  proved too severe for the courts and exceptions were developed to it.  
 8 came a girls' school until 1980 when the grounds were developed for housing. 
 9        Quite complicated tooth-and-socket hinges were developed between the valves. 
10                    If this notion of integration were developed within our total society, the mentally h
11                                           KOMIKs were developed as a result of the work of SIDT's SEI! t
12                      First of all, new languages were developed for the scientist and engineer who did n
13 iques and crafts seen in the Early Minoan period were developed further. 
14 of matrix management and technoeconomic planning were developed and introduced, with each establishment
15                    The larger troops, plausibly, were developed as protection from diurnal predators. 
16 s time updating and regular reporting procedures were developed and the necessary programs written. 
17              For the rural population programmes were developed that encouraged families to produce crop
18 rly macroeconomic models with quantity rationing were developed by Solow and Stiglitz (1968), Barro and 
19         Long spines on the exterior of the shell were developed especially during the Carboniferous. 
20                       Devices, such as sundials, were developed to help keep track of these movements, a
21           The way in which the goals and targets were developed in Australia also differs noticeably fro
22                                            These were developed from qualities defined by Isaiah (c.11)
23                                             They were developed in Kodak D19 developer at 17°C, counters
24 iscussions of these questions, in so far as they were developed in argument, for that would lead the Boa
25 guage and of the literary text as a whole, which were developed under the joint influence of Saussure an 

Such a procedure would appear to facilitate the use of large corpora in a number of ways. In the first place, learners investigate the large corpus only when they already have some sort of criterion for classifying the data - in this case, looking for categories of unpleasantness that patients develop, and categories of things that are developed. Unlike Ah Peng, who is turning to the COBUILD corpus simply to find out whether the form require to + passive infinitive occurs, with no idea how eventual occurrences of it might be classified, the learner here starts from the discovery that a feature is (or is not) used in a certain way in a small corpus, and examines the large corpus to see if it is or is not used elsewhere in the same or in other ways. Johns (1991) describes DDL as involving three phases: observation, classification and generalisation. Here, roughly speaking, we might say that the mainly inductive work with the small corpus provides the basis for more deductive work with the large one. In the second place, it is worth noting that the initial work with the small corpus will tend to generate hypotheses which will, other things being equal, tend to be more precise, requiring a smaller number of solutions to be dealt with from the large corpus. Take the case of yesterday, and its absence as the first word of news flashes. I do not care to imagine any learner simply looking up the word yesterday in the BNC to find out how it is used, given the number - almost 20,000 - and variety of occurrences. A much more manageable query, in contrast, is to look for paragraph-initial occurrences in written texts dealing with world affairs, of which there are 132 in the BNC. An investigation of their sequential environments and the texts they come from confirms that yesterday virtually never begins a newspaper or periodical article of any kind, except as a genitive in the form Yesterday's X.
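
A restricted query of the paragraph-initial kind just described might be sketched as follows. The sketch assumes plain text in which paragraphs are separated by blank lines, which is not how the BNC is encoded, and the function name paragraph_initial and the sample text are hypothetical.

```python
import re

def paragraph_initial(text, word):
    """Return (form, paragraph) pairs for paragraphs whose first token
    is `word` or its genitive (word + "'s"), ignoring case."""
    hits = []
    # Assumption: paragraphs are separated by one or more blank lines.
    for para in re.split(r"\n\s*\n", text):
        tokens = para.split()
        if not tokens:
            continue
        # Remove surrounding punctuation; an internal apostrophe is kept.
        first = tokens[0].strip(".,;:!?\"'()").lower()
        if first in (word, word + "'s"):
            hits.append((first, para.strip()))
    return hits

# Invented sample text, for illustration only (not BNC data).
sample = (
    "Yesterday's vote surprised the government.\n\n"
    "The minister resigned yesterday.\n\n"
    "Yesterday the talks resumed."
)

hits = paragraph_initial(sample, "yesterday")
genitives = [para for form, para in hits if form.endswith("'s")]

for form, para in hits:
    print(form, "|", para)
```

Separating the genitive from the bare form in the results corresponds to the distinction drawn above between Yesterday's X and yesterday as an opening word.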


To sum up, I would suggest that work with small specialised corpora can be not only a valuable activity in its own right, as a means of discovering the characteristics of a particular area of language use, but also an instrument to help and train learners to use larger ones appropriately. Perhaps Ah Peng would have done better to look in a small specialised corpus before turning to COBUILD: for the record, the hepatitis C corpus provides no examples of active require followed by a passive infinitive. It does on the other hand provide examples of be required, one of which (line 10) is remarkably close to the case in question:

require/required/requires/requiring (Hepatitis C corpus: MicroConcord)

 1  long-term follow-up (probably c2-3 years) is required. A serologic correlate for chronicity would b
 2 s or who were immunosuppressed, (f) who might require blood transfusion after admission to the study
 3 ur studies, at least two abnormal values were required on measurements separated by at least three o
 4 sician can determine what therapy, if any, is required. [P] [P] Limitations and strengths [P] [P]  I
 5 ic assays  to detect HCV infection (6,7) have required the reexamination of previous studies of post
 6 tudies. A diagnosis of non-A, non-B hepatitis required the absence of hepatitis-B  surface antigen a
 7 another NANB agent can be interpreted without requiring the existence of such an agent or agents[13,
 8 .0 results. [P] [P]  Several assumptions were required to construct the statistical analysis of the 
 9 fections. PCR-based sequence analyses will be required to distinguish between these hypotheses. Howe
10 ith antiviral therapy. Additional studies are required to determine whether the response to such tre
11 pproximately 10yr. Prolonged follow-up may be required to fully assess the impact of posttransfusion
12 clinical hepatic failure as the end point, is required to establish clinically meaningful efficacy. 
13 e attending the Oxford Haemophilia centre who required treatment with factor VIII, were not suspecte

Noting the absence of active uses here might at least have encouraged Ah Peng to treat the COBUILD examples with greater care, looking for ways in which they were different from, rather than similar to, the example he had produced.

If we look up active forms of require followed by passive infinitives in the BNC, we can in fact see that these tend to occur in legal documents and in instruction manuals, and generally in negative, modalised, or conditional contexts. A prototypical example is Anti-glare screens require to be fitted where needed. I would not however recommend this search, any more than the one for wight, to any learner who had not acquired considerable skills in using corpora, having learned to approach the data with motivated hypotheses, and to beware of over-generalising. Working with small corpora seems one way of helping learners to acquire these skills.