1999. Textus 12: 289-314.
Given the fact that total bi-directional correspondences are extremely rare phenomena, we often have to search for second-best matches, and that means we have to select one of several alternatives, namely the one that fits the context best. [...] It should be possible to come up with this match if the translator consults a large corpus [...] and, by identifying the context pattern in question, finds the lexical unit that would 'naturally' be used in such a situation. All it needs is an operational definition of context and context pattern. (Teubert 1996: 241)
A paper on the learning of translation must espouse some view of what translation involves. So let me state some premises. Following Joseph (1998), I take it that translation involves interpreting a source text (ST), and then generating a target text (TT) in another language which strategically directs its intended audience to an interpretation of it - generally one which in certain respects matches the interpretation given to the source text. From this point of view - substantially corresponding, as I understand it, to Nida's notion of "dynamic equivalence" (1964) - would-be translators must develop interpretative and strategic competencies which they may well lack in at least one of the languages involved for a particular task, since translators are rarely balanced bilinguals, nor always specialists in the discourse domain in question. In addition, translating - like editing - calls for the ability to elaborate, compare and evaluate different strategies and interpretations in the light of externally-defined contextual restrictions. Translators typically work under commission, where specific target audiences, and specific interpretations of the source and/or of the target text are implied (Reiss 1981).
The translator thus needs resources which can suggest possible and probable interpretations of the ST, which can indicate effective strategies for achieving particular interpretations of the TT, and which can facilitate the evaluation of alternative strategies and interpretations. Varantola (1997) suggests that as much as 50% of the time spent on a translation can be dedicated to consulting reference materials. In this paper I review the roles which can be played by electronic corpora in improving the quality and speed of the translation process, in helping would-be translators to develop their interpretative and strategic competence, and in developing their sensitivity to the issues involved. While in no way wishing to suggest that electronic corpora are a touchstone to resolve the translator's many problems, I believe that they can satisfy three significant criteria for translation instruments:
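se facilitano il processo e portano ad una migliore qualità del prodotto, anche attraverso un aumento delle possibilità di scelta dell'utente; se offrono occasioni di apprendimento linguistico e metalinguistico; se permettono lo sviluppo di una capacità tecnica e critica nei confronti di simili strumenti. (Aston 1996: 308)
[whether they facilitate the process and lead to a better-quality product, not least by widening the user's range of choices; whether they offer opportunities for linguistic and metalinguistic learning; whether they allow users to develop technical and critical skills with respect to such tools.]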
Interest in corpora in the field of translation has come from two main perspectives, descriptive and practical. On the one hand, scholars have designed and analysed corpora of translations, comparing these with corpora of original texts in order to establish the characteristics peculiar to translations in particular SL-TL combinations (Gellerstam 1996), and indeed possible universals distinguishing translated texts (Baker 1998, Laviosa 1998). On the other hand, there has been a growing interest in corpora as aids in the processes of human and machine translation - a role which is my primary concern here. For this purpose, three main types of corpora have been proposed as relevant:
Figure 1: Comparable and parallel corpora

                            language A               language B
Comparable                  A. specialized corpus    B. specialized corpus of same design
Parallel (unidirectional)   A. specialized corpus    B. translations of texts contained in A
Parallel (bidirectional)    A1. specialized corpus   B1. specialized corpus of same design as A1
                            A2. translations of B1   B2. translations of A1
The obvious way in which corpora can help translators is as reference tools, as complements to traditional dictionaries and grammars. Thus the first sentence of Bruce Chatwin's Utz (1989a: 7) reads as follows:
An hour before dawn on March 7th 1974, Kaspar Joachim Utz died of a second and long-expected stroke, in his apartment at No. 5 Sirok Street, overlooking the Old Jewish Cemetery in Prague.

Let me focus on just one problem here, the translation of overlooking. If we examine the occasions where the word apartment occurs in the vicinity of overlooking in the 100-million word British National Corpus (http://info.ox.ac.uk/bnc), we find that apartments typically overlook mountains, rivers, oceans, ports, squares and gardens - all views which seem positively connotated. On the few occasions where what is overlooked is ugly, irony appears to be intended, as in:
Not only do they tolerate the fast-food shops serving up nutriment that top breeders wouldn't recommend for Fido, they go as far as purchasing two expensive weeks in a gruesome timeshare apartment, and sit smoking all day on a balcony overlooking the A9.

The corpus data thus suggests that overlooking has a positive semantic prosody (Sinclair 1991, Louw 1993) - a fact which is unmentioned by dictionaries, and might even be overlooked by a translator whose native language was English. It aids interpretation of the ST, raising the problem of whether Chatwin intends the Prague cemetery to be seen by the reader as a beauty spot, or whether he is being ironic - or indeed, whether he simply aims at ambiguity in this respect.
A corpus can also help the translator evaluate - or indeed come up with - a possible translation for this sentence. The Italian translation of Utz (Chatwin 1989b: 9) renders it as:
Il 7 marzo 1974, un'ora prima dell'alba, nel suo appartamento di via Sirok 5 che dava sul vecchio cimitero ebraico di Praga, Kaspar Utz morì di un secondo colpo da tempo previsto.

Does the choice of dava su share the positive connotations of overlooking, and allow a similar, possibly ironic, interpretation? In a small (2 million word) collection of Italian literary texts, we find the following instances of dava su:
     Lei non si vedeva. Ma il soggiorno dava su una veranda da cui una scaletta
   o negli onesti. La finestra di mezzo dava su un balcone di ferro. Concentr
 finestrone, dai vetri impolverati, che dava su di uno spalto esterno, da cui si
la vasca. Al chiaro di una finestra che dava su un cortile interno, le sensazio
be potuto uscire subito dalla porta che dava sul sottoponte. Ma, quasi a prende
    Muovendosi davanti alla vetrata che dava sul parco, il Bocchi vide i globi
omandante aveva una grande finestra che dava sul pozzo a lume; di fronte, con un
Bocchi abitava in un piccolo attico che dava sul Lungoparma, nel punto in cui

These citations offer little evidence that dava su has a distinctive prosody, and make it doubtful that this translation could be interpreted as ironic.
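A concordance of the kind shown for dava su can be approximated in a few lines of code. What follows is only a minimal sketch, not the software used for this paper: the function name `kwic`, the regular expression for the contracted forms of su, and the three-sentence sample (assembled from citations quoted above) are all assumptions made for illustration.

```python
import re

def kwic(text, pattern, width=30):
    """Return key-word-in-context lines for every match of `pattern`."""
    lines = []
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so that the node forms a column.
        lines.append(f"{left:>{width}} [{m.group(0)}] {right}")
    return lines

# "dava su" plus the common contracted forms (sul, sullo, sulla, sull', ...).
# Assumption: a plain regex stands in for proper tokenisation or tagging.
PATTERN = r"\bdava su(?:l|llo|lla|ll|i|gli|lle)?\b"

# Illustrative sample built from citations quoted above.
sample = (
    "Ma il soggiorno dava su una veranda. "
    "La finestra di mezzo dava su un balcone di ferro. "
    "Bocchi abitava in un piccolo attico che dava sul Lungoparma."
)

hits = kwic(sample, PATTERN)
for line in hits:
    print(line)
```

A pattern of this sort retrieves forms, not senses, so the user still has to read the citations and discard irrelevant uses by hand.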
Data from monolingual corpora may thus support interpretative and strategic hypotheses, or suggest that they should be rejected. They may also suggest alternative hypotheses. In the English corpus, overlooking tends to be associated with a particular set of collocates (garden, sea, hills, square etc.). If we search the Italian corpus for occurrences of equivalents to these collocates (giardino, mare, montagna, piazza, etc.) in the vicinity of words like appartamento, camera, casa and finestra, we find citations such as the following:
se vuole posso prenotarle una camera per domani stesso, una bella e linda cameretta con vista sul mare, vita sana, bagni di alghe, talassoterapia,

This citation suggests another possible translation of overlooking, namely con vista su. As we did with dava su, we can now test this against the corpus in order to see whether it is positively connotated, and whether there is evidence of its being used ironically - whether, that is, it occurs in similar contexts to overlooking.
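The search just described, for collocate equivalents "in the vicinity of" a set of node words, amounts to a proximity query. The sketch below shows one way of expressing it, assuming simple whitespace-and-punctuation tokenisation and exact word-form matching; the function name `cooccurrences` and the `span` parameter are hypothetical, and no lemmatisation is attempted (so cameretta does not count as camera).

```python
import re

def cooccurrences(text, nodes, collocates, span=8):
    """Return (node, collocate, context) triples where a collocate occurs
    within `span` tokens either side of a node word."""
    tokens = re.findall(r"\w+", text.lower())
    node_set = {n.lower() for n in nodes}
    coll_set = {c.lower() for c in collocates}
    hits = []
    for i, tok in enumerate(tokens):
        if tok not in node_set:
            continue
        lo, hi = max(0, i - span), min(len(tokens), i + span + 1)
        for j in range(lo, hi):
            if tokens[j] in coll_set:
                hits.append((tok, tokens[j], " ".join(tokens[lo:hi])))
    return hits

NODES = ["appartamento", "camera", "casa", "finestra"]
COLLOCATES = ["giardino", "mare", "montagna", "piazza"]

# Part of the citation quoted above.
citation = ("se vuole posso prenotarle una camera per domani stesso, "
            "una bella e linda cameretta con vista sul mare")

wide = cooccurrences(citation, NODES, COLLOCATES, span=12)
narrow = cooccurrences(citation, NODES, COLLOCATES, span=8)
print(wide[0][:2], len(narrow))
```

The choice of span matters: here camera and mare are twelve tokens apart, so a narrower window would miss the citation altogether.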
A monolingual general corpus also provides a rich language learning environment. Even if the dava su hypothesis is rejected, the process of doing so allows the user to learn much which may be of value in the future. Unlike the dictionary, a concordance leaves it to the user to work out how an expression is used from the data. This typically calls for deeper processing than does consulting a dictionary, thereby increasing the probability of learning (Hulstijn 1992). In more general terms, by drawing attention to the different ways expressions are typically used and with what frequencies, corpora can make learners more sensitive to issues of phraseology, register and frequency, which are poorly documented by other tools.
Corpora also allow much unpredictable, incidental learning. Almost any concordance is likely to contain unknown or unfamiliar uses, which may be noticed and explored by the user who is prepared to go off at a tangent to follow them up (Bernardini 1997, in press). Looking through the occurrences of dava su, I noticed the unfamiliar expression pozzo a lume. While I can roughly understand its meaning from the context, I may be able to get a better idea of its use and frequency by generating a concordance of all its occurrences in the corpus.
As translation aids, however, monolingual general corpora pose a number of difficulties:
These difficulties can be reduced by using corpora which are specialized, that is, which consist only of texts of a type similar to the ST and/or the desired TT. Such corpora may be extractable as sub-corpora from large general ones - though only limited specialization can be obtained without compromising representativeness (Sinclair 1991) - or they may be specifically collected - an investment which may be well worth the effort where the translator foresees doing a number of similar translations in the future, and which is therefore a useful exercise for any translator training course (Maia 1997).
Specialized corpora can be seen as a development of the tradition of using "parallel" texts in translation - i.e. collections of texts of the same kind as the ST and/or TT (Hartmann 1980; Williams 1996) - with electronic format enabling more rapid and systematic searching of larger quantities of text. Such corpora are particularly useful for the investigation of forms and meanings which are typical of that type of text (in particular terminology, but also features of register and text structure: Gavioli, forthcoming; Zanettin, forthcoming), and as an environment in which to prepare for work which has to be carried out under time constraints, such as speech interpreting. Varantola (1997) underlines how specialized corpora have high "reassurance value", particularly where the TT is in the translator's L2, insofar as they illustrate similar contexts to those of the translation being worked on.
Where specialized corpora have to be constructed by the user, this involves design decisions as to what texts to include and why. One of our early experiences in Forlì with specialized corpora involved learners who were translating material for the Melozzo centenary exhibition into English, for which we compiled English and Italian corpora from CD-ROMs of the National Gallery and of the Uffizi. Each corpus contained texts of similar types describing artists and their works, genres, schools and technical aspects for a lay public. While limited in size (under 100,000 words each), their specialization and authoritativeness made them appropriate resources for the task, and given their similar composition, the two corpora could also be treated as comparable. Today, corpora of texts of this type could also be compiled from the Internet (Pearson, forthcoming). Clients are also a potential source of relevant specialized texts.
With respect to general monolingual corpora, specialized ones are easier to handle and in many ways more informative. In particular:
N   File    Words   Hits   per 1,000 words
1   sx.11   1,436   17     11.84
2   rx.11     965   11     11.40
3   mx.11     914    8      8.75
4   rx.7      870    3      3.45

Reading the first of these, we find the expression fatal liver disease - a translation hypothesis which we can then investigate using the entire corpus.
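The final column of the table is consistent with a simple normalisation: hits divided by the number of words in the file, multiplied by 1,000. The sketch below recomputes it from the Words and Hits columns and ranks the files accordingly; the function name `per_thousand` is an assumption, and the code is not the tool that produced the original table.

```python
def per_thousand(hits, words):
    """Normalise a raw hit count to a rate per 1,000 running words."""
    return hits / words * 1000

# (file, words, hits) as given in the table above.
files = [("sx.11", 1436, 17), ("rx.11", 965, 11),
         ("mx.11", 914, 8), ("rx.7", 870, 3)]

# Rank files by density of hits, so the most relevant text is read first.
ranked = sorted(files, key=lambda f: per_thousand(f[2], f[1]), reverse=True)
rates = [f"{per_thousand(hits, words):.2f}" for _, words, hits in ranked]

for rank, ((name, words, hits), rate) in enumerate(zip(ranked, rates), 1):
    print(f"{rank}  {name:<6} {words:>6,} {hits:>4} {rate:>6}")
```

Ranking by the normalised rate and not by raw hits keeps a long text with a few scattered occurrences from outranking a short text that is densely concerned with the topic.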
While most work involving specialized corpora as translation aids has used TL corpora (Bowker 1998, Varantola 1997, Friedbichler and Friedbichler 1997), where comparable specialized corpora are available, these can also be used to investigate the SL and the ST, particularly where the conventions of the latter are relatively unfamiliar, as a means to identify routine and non-routine uses. Comparable corpora seem particularly useful for learning purposes, as a means of exploring a particular text-type in both languages prior to engaging in translation.
Since specialized corpora for a particular text-type are rarely available off-the-shelf, the translator needs to learn to construct such corpora - an experience which will develop awareness of their potential validity and reliability. Collecting a reasonably representative set of texts of a particular type requires a preliminary survey of the textual population and of its variability, as well as of the authoritativeness of candidate texts. Friedbichler and Friedbichler (1997) recommend selecting texts which have been subject to peer review, and which are where possible widely cited in the specialist literature (note 4); Varantola (1997) recommends avoiding texts written by non-native speakers.
It is clear that for any specialized corpus, the greater the variability of the text-type to be represented, the larger the corpus should be. In general, the larger the better, though there is clearly a point where the returns on expansion diminish. Friedbichler and Friedbichler (1997) suggest that for English, authoritative specialized corpora of 500,000 to 5 million words (according to the variability of the text-type) should provide solutions to 97% of the translator's questions. In what follows, a number of criteria for evaluating specialized corpora are proposed: in each case, the smaller the value the better.
All these measures are a function of variability within and across texts, and of corpus size (and in the case of the last measure, also of text size): a small but homogeneous corpus of weather reports may well have lower values than a much larger one of tourist guides. Values will also depend on the language of the texts: given the greater morphological complexity of the language, Italian corpora tend to have higher values than English ones (note 5).
The translator can use measures such as these to assess the reliability of a particular specialized corpus and hence to determine its required size. Values obtained on the last two measures can also be compared with the actual proportions of undocumented types encountered in the ST and/or TT, as an indication of the "goodness-of-fit" of the corpus for the text in question.
This fit will rarely be perfect, and in any case no specialized corpus is ever likely to document all the problems posed by a particular text. Specialized texts also use non-specialized language, and the intertextual background on which they draw will rarely be simply that of the text-type in question. There thus remains the need to recognize where general monolingual corpora should be called on, or where it may be useful to compile a corpus ad hoc to analyse a particular problem.
Specialized corpora will rarely document every word in an ST or TT, even if they are likely to provide a much fuller documentation for features typical of that text type than large general ones. One learner using a comparable specialized corpus on cancer of the colon in order to translate an English research article into Italian was completely nonplussed by an allusion in the ST to the holy plane, for which she could find no explanation or equivalent. In such cases, relevant information may be obtainable from a large monolingual corpus or, failing that, CD-ROMs or the Internet. We can in fact use the Internet to compile corpora ad hoc, using search engines to find all the texts containing particular expressions. Since the world-wide web is an ever-changing entity of dubious authority whose overall composition is unknown, considerable care must however be exercised in selecting texts and drawing inferences (Pearson, forthcoming).
The value of such ad hoc corpora can be illustrated by an example from Bertaccini and Aston (forthcoming), which focusses on the translation into English of a French newspaper article which contained the word clochemerlesques. Searches were made for clochemerl* in a CD-ROM of Le Monde, and using the Altavista search engine on the Internet (http://altavista.digital.com). Together, these turned up 20 French texts, analysis of which allowed for a fairly confident interpretation of the ST: Clochemerle was a comic novel by G. Chevallier which ridiculed factionism in village politics, apparently well-enough known as an archetype of petty factionism to be alluded to without explanation by French journalists.
How could it be translated in English? Searches for English examples of clochemerl* on the Internet, and in CD-ROMs of The Independent and The Daily Telegraph, suggested that Clochemerle was far from equally familiar to a British public, and that it was if anything associated with public conveniences. Did any archetype in British culture have similar associations to the French one? One possibility which came to mind was Gulliver's Travels, and the conflict in Lilliput between Big- and Little-endians as to the right way to crack an egg. However, further searches provided no evidence that reference to Lilliput, or to big-/little-endian, would have these associations for a general reader (the former seemed associated exclusively with size, and the latter were terms in computer architecture). The final (if less than fully satisfactory) solution was local squabbling, whose derogatory connotations were confirmed by a study of the semantic prosody of squabbl* in the BNC.
In such cases, an ad hoc corpus is clearly better than none, though very time-consuming to compile. Friedbichler and Friedbichler (1997) suggest that to be cost-effective, searches using corpora should not exceed an average of ten seconds: so the use of ad hoc corpora must be limited to a very small proportion of the problems posed by any translation.
A further limit of monolingual and comparable corpora as translation tools is the difficulty of generating hypotheses as to possible translations. The user must rely on known or suspected equivalences as heuristics to retrieve similar contexts in a TL corpus, providing a specification which is both sufficiently general to recall a range of possibilities, and sufficiently precise to limit the number of spurious hits. S/he must then verify that the citations retrieved are in fact sufficiently similar to those of the ST and/or the SL corpus. These procedures are both time-consuming and error-prone: an expression in the TL corpus may occur in a similar context to one in the SL corpus, yet in fact mean something different. For example, in attempting to translate the phrase loop ileostomy in a medical research article, Ferri (1999: 64) illustrates how a search for similar contexts in the TL found ileostomia su bacchetta. Without detailed medical knowledge, she initially assumed this term to be equivalent, while it is in fact hyponymous.
Greater certainty as to the equivalence of particular expressions can be obtained by using parallel corpora, consisting of original texts and their translations, where these are similar to the ST and TT. If the corpus is aligned, and suitable software is available, the user can locate all the occurrences of any expression along with the corresponding sentences in the other language.
There is however a dearth of parallel corpora for English and Italian, and relatively little parallel concordancing software for the PC (though see Barlow 1995, Woolls 1997). The examples which follow were extracted using Multiconcord (Woolls 1997), from its sample collection of different language versions of discussions in the European Parliament. This material has many limits, since we do not know which version constitutes the original text, and which a translation, or indeed a translation of a translation (Lauridsen 1996). Nevertheless, it can illustrate how a parallel corpus may provide a means of identifying translation hypotheses in a specialized environment.
The following concordance shows occurrences of the word establish and its equivalents in Italian (some citations are abbreviated for reasons of space):
We support the Socialist Group's demand for the President to establish a committee as soon as possible to conduct such a review.
Condividiamo la richiesta del gruppo socialista in base alla quale il Presidente dovrebbe istituire quanto prima un comitato per la realizzazione di questa modifica.

if we are to guarantee the quality and competitiveness of the European tourist industry, we shall have also to develop new forms of synergy with other Community policies, bringing in all of the interested parties in an effort to establish the conditions favourable to the development of the Union's tourist enterprises
per garantire la qualità e la competitività dell'industria europea del turismo, occorre inoltre sviluppare nuove sinergie con le altre politiche comunitarie, coinvolgendo tutte le parti interessate al fine di creare le condizioni favorevoli allo sviluppo delle imprese turistiche dell'Unione

Thus we need to establish a coherent European tourism policy which adds value above and beyond Member State level and against which we can judge and monitor the very considerable sums of money which are spent through other EU funds
ed è quindi necessario realizzare una politica europea per il turismo globale, che aggiunga valore al di sopra ed oltre il livello di Stato Membro e rispetto alla quale possiamo valutare e controllare le notevoli somme di denaro che vengono spese attraverso altri fondi europei

It is vital at this point that we establish diplomatic relations and therefore a dialogue with the current Kabul authorities,
Si rivela indispensabile in questo momento, instaurare relazioni diplomatiche e quindi un dialogo con le attuali autorità di Kabul,

It must put an end to the inconsistencies and finally establish a clear and independent foreign policy, at last shouldering its responsibilities, without hesitation and avoiding inconsistencies.
Metta fine alle sue contraddizioni e elabori finalmente una politica estera chiara, autonoma, si assuma finalmente le sue responsabilità, senza tentennamenti e senza contraddizioni.

We must ask the Union to establish whether the proposals made by these countries under the aegis of IGADD will be able to bring about a solution and if so to give them our support.
Invitiamo l'Unione a verificare se le proposte avanzate da questi Stati nell'ambito dell'IGAD siano tali da favorire una soluzione e, in caso positivo, la sollecitiamo a dare il suo sostegno.

We need more specific signs and we need clearer evidence that the Belarus Government does indeed want to establish a free and more democratic society.
Ci servono segni più precisi, così come deve essere precisa l'intenzione del governo bielorusso di instaurare a tutti gli effetti un sistema libero e democratico.

This illustrates a wide range of possible equivalents to establish: avviare, creare, elaborare, instaurare, realizzare, verificare. For the translator of an English text of this kind, it thus suggests a range of hypotheses which can be further investigated using a general or specialized TL corpus.
Not all expressions are paralleled by such a wide variety of equivalents. One of the most frequent lexical words in the Italian component of the corpus is relazione. The parallel English term is invariably report (unlike the British parliamentary paper). In contrast, under a third of the occurrences of another frequent word, favore, are paralleled by favour: parallel to votare a favore di we find vote for; parallel to accogliamo con favore, we welcome. The corpus suggests equivalents for technical terms, and a wider variety of possible translations for sub-technical lexis than are likely to be found in a bilingual dictionary, particularly at a phraseological level. It may also highlight syntactic contrasts, including differences in the organization of the text into sentences and paragraphs.
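The kind of query behind these observations, retrieving aligned pairs for a search word and seeing how its equivalents are distributed, can be sketched as follows. This is not Multiconcord's method: the function names, the crude stem matching, and the list of candidate stems are assumptions for illustration, and the sample pairs are abbreviated fragments of citations quoted above. Real use would require a sentence-aligned corpus and ideally lemmatisation.

```python
import re
from collections import Counter

def parallel_concordance(pairs, pattern):
    """Keep the aligned (source, target) pairs whose source side matches."""
    rx = re.compile(pattern, re.IGNORECASE)
    return [(src, tgt) for src, tgt in pairs if rx.search(src)]

def tally_equivalents(hits, stems):
    """Count which candidate equivalents (given as crude stems) appear on
    the target side of each hit; unmatched hits count as '(other)'."""
    counts = Counter()
    for _, tgt in hits:
        found = [label for label, stem in stems.items()
                 if re.search(rf"\b{stem}\w*", tgt, re.IGNORECASE)]
        counts.update(found or ["(other)"])
    return counts

# Abbreviated fragments of the aligned citations quoted in this paper.
pairs = [
    ("the President to establish a committee as soon as possible",
     "il Presidente dovrebbe istituire quanto prima un comitato"),
    ("in an effort to establish the conditions favourable",
     "al fine di creare le condizioni favorevoli"),
    ("that we establish diplomatic relations",
     "instaurare relazioni diplomatiche"),
    ("We must ask the Union to establish whether the proposals",
     "Invitiamo l'Unione a verificare se le proposte"),
    ("does indeed want to establish a free and more democratic society",
     "instaurare a tutti gli effetti un sistema libero e democratico"),
    ("we welcome", "accogliamo con favore"),
]

# Hypothetical candidate list: label -> stem to look for in the target.
stems = {"istituire": "istitu", "creare": "crea",
         "instaurare": "instaur", "verificare": "verific"}

hits = parallel_concordance(pairs, r"\bestablish\w*")
counts = tally_equivalents(hits, stems)
print(len(hits), counts.most_common())
```

A tally of this sort only reports which forms co-occur in aligned segments; whether they are genuinely equivalent in context still has to be judged from the citations themselves.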
Using such a corpus can also have a positive impact on learning. Where a variety of parallel realizations are encountered, this may help learners to distinguish between different contexts of use, and reduce their tendency to think in terms of one-to-one equivalence, as Ulrych (1997) illustrates in respect of parallel English realizations to ossia. More general problems may also be faced: Danielsson and Mühlenbock (forthcoming) illustrate how a parallel corpus can cast light on translation strategies for proper names, showing whether these are transcribed, translated, clarified or simplified. Johns (forthcoming) proposes a number of types of exercises using parallel concordances, for instance by blanking out the search word in language A and asking learners to infer it from the parallel citations provided in language B.
Since parallel concordances provide translations of each occurrence, citations are more likely to be immediately understandable for the user, diminishing the difficulties of retrieval and risks of misinterpretation associated with monolingual and comparable corpora. For the same reason, the scope for incidental learning may be increased. However, notwithstanding their apparent face validity, parallel corpora also introduce new dangers deriving from the assumption that parallel occurrences are effectively equivalent. It is necessary to ask whether the translations in the corpus are reliable and authoritative (note 6), and to bear in mind that the use of translations to identify equivalents inevitably implies "reduc[ing] the target language to a mirror image of the source language" (Teubert 1996: 250) - or the SL to a mirror image of the TL:
There is, for instance, no direct T[ranslation] E[quivalent] in English for the German word Schadenfreude [...] Therefore, we will rarely find occurrences of Schadenfreude in German translations of English texts. Generally speaking, translations in language B will contain 'grosso modo' only those lexical items which count as TEs for items of the vocabulary of language A. The same is true for syntax. The 'impersonal passive' (e.g. Es wurde viel getrunken, literally 'It was drunk a lot') is a fairly common syntactic construction in German for which there is no equivalent in English. (Teubert 1996: 247)
Using translations as models for the TT thus risks reproducing those features of "translationese" which have been identified by workers using corpora in descriptive translation studies: normalization, simplification, explicitation (Baker 1993, 1998), "sanitization" by reducing connotational meanings (Kenny 1998), increased cohesion (Øverås 1998), and lower lexical density, higher mean sentence length, and higher proportions of high-frequency words (Laviosa 1998). Gellerstam (1996) shows how translations into Swedish of English texts carry over many features of English vocabulary, syntax, and rhetoric when compared with comparable Swedish originals; Gavioli and Zanettin (1997) illustrate some similar features in Italian translations from English. Using parallel corpora seems likely to reinforce such tendencies (though it is of course possible that they may increase learners' awareness of these features, and hence their conscious control of them: Ulrych 1997).
The unreliability of the translations in parallel corpora makes it advisable to use them in conjunction with monolingual or comparable corpora, so that, for instance, a translation hypothesis derived from a parallel corpus can be tested against a collection of original texts in the language in question. The ideal parallel corpus, from this point of view, will be bidirectional or reciprocal (cf. Figure 1 above), allowing the user to see whether occurrences found in translations into language B are also found in original texts in language B, and whether these are translated into language A in the manner encountered in original texts in language A. Such a corpus combines the advantages of a parallel corpus with those of a comparable one: from this point of view, bidirectional English-Italian corpora would seem an important area for future research and development. Such corpora are however considerably more difficult to design and compile than comparable ones, given the need to create comparable collections of texts which have been translated, and to align the texts and translations prior to use. Given the amount of work involved, they are likely to be relatively unspecialized in order to extend their range of application (see e.g. the English-Norwegian parallel corpus: Johansson and Hofland 1993). Consequently there is still likely to be a role for comparable and unidirectional parallel corpora of a more specialized nature. One form of the latter may be compiled by the specialized translator (or their client), drawing on the texts that s/he has (had) translated in the past (cf. note 6).
It should be noted en passant that parallel concordancing software can also be used to analyse a single text and its translation. This is potentially a useful tool for translators to check and evaluate their own translations. Aligned versions of the ST and TT can be used to see whether a particular term in the ST has been translated consistently in the TT, or whether (given the tendency of translations to be less lexically varied than their source texts) a particular expression in the TT corresponds to a variety of expressions in the ST. Type/token ratios and lexical density measures for the ST and TT can also be compared, and evaluated by comparison with those found in comparable or parallel corpora of similar texts.
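Both checks mentioned here, consistency of term translation across an aligned ST and TT and comparison of type/token ratios, are easy to prototype. The sketch below is one possible formulation under simplifying assumptions: segments are already aligned one-to-one, terms are matched as plain substrings (so report would also match reported), and the three aligned segments are invented purely for illustration. The function names are hypothetical, and lexical density is not covered, since it requires part-of-speech information.

```python
import re
from collections import defaultdict

def tokens(text):
    """Lower-cased word tokens; punctuation is discarded."""
    return re.findall(r"\w+", text.lower())

def type_token_ratio(text):
    """Number of distinct word forms divided by number of running words."""
    toks = tokens(text)
    return len(set(toks)) / len(toks) if toks else 0.0

def renderings(aligned, st_term, tt_candidates):
    """For each aligned (ST, TT) segment containing `st_term`, record which
    candidate TT terms appear, as {candidate: [segment indices]}."""
    found = defaultdict(list)
    for idx, (st, tt) in enumerate(aligned):
        if st_term.lower() in st.lower():
            present = [c for c in tt_candidates if c.lower() in tt.lower()]
            for c in present or ["(none of the candidates)"]:
                found[c].append(idx)
    return dict(found)

# Invented segments, for illustration only.
aligned = [
    ("The committee approved the report.",
     "La commissione ha approvato la relazione."),
    ("The report was long.", "Il rapporto era lungo."),
    ("The vote was close.", "Il voto è stato combattuto."),
]

usage = renderings(aligned, "report", ["relazione", "rapporto"])
print(usage)  # more than one key signals an inconsistent rendering

st_ttr = type_token_ratio(" ".join(st for st, _ in aligned))
tt_ttr = type_token_ratio(" ".join(tt for _, tt in aligned))
print(round(st_ttr, 2), round(tt_ttr, 2))
```

Since type/token ratios fall as texts get longer, such figures are only comparable between texts of similar length, or when computed over samples of a fixed size.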
There is as yet little hard empirical evidence to demonstrate the effectiveness of corpora as translation and as learning tools. Williams (1996) found a 40% improvement in the recovery of correct equivalents when "parallel" texts were used as translation aids as opposed to bilingual dictionaries, and one might expect these results to be matched or bettered with larger collections of texts in electronic format, and the aid of retrieval software. In a pilot experiment Bowker (1998) found that learners using a specialized corpus of texts in the target language (their L1) showed greater correct term choice and idiomaticity than a matched group using bilingual dictionaries alone. On the other hand, Bernardini and Aston (forthcoming) found that on two translation tasks into the L2, learners using monolingual L2 dictionaries performed better than matched groups using a general L2 monolingual corpus. While learners seem to a large extent enthusiastic about using corpora, it remains to be shown just in what respects, and under what conditions, their performance as translators may improve as a consequence: we cannot for instance exclude the idea that training with corpora may also improve dictionary usage, by instilling greater attention to collocation and register. No research that I am aware of has yet attempted to compare the effectiveness of different types of corpora, or of different learner approaches to them; yet more difficult to measure are the overall effects of corpus use on learning, be this in terms of general linguistic knowledge and ability, or as relating to a specialized text-type.
In this climate of empirical uncertainty, arguments for and against the use of corpora in translator training must be of a theoretical nature, and can resort at best to anecdotal evidence. Where available and accessible, appropriate corpora appear able to provide better and faster solutions to many of the translator's problems in a unified environment, with positive effects on learning. They make possible more idiomatic, native-like interpretations of source texts and a use of more idiomatic, native-like strategies in target texts. It is our experience at Forlì that few trainee-translators who have used corpora would wish to be without them, notwithstanding (or because of?) the investment in time and effort required to compile corpora and to learn how to use them, and we expect that as the number of available corpora and the quantity of suitable software increases, the use of corpora for translation and translator-training will gather further momentum, with a growth in its cost-effectiveness.
1. The Parole project aims to produce general comparable corpora for all the languages of the EU (http://www.ilc.pi.cnr.it/parole/parole.html).
2. Parallel corpora can be extended to include multiple languages (Woolls 1997), or multiple translations of each text (Ulrych 1997, Malmkjaer 1998). As the value of such extensions seems more descriptive than pedagogic, I shall not discuss them here.
3. In the gave up sense, su is of course an adverb rather than a preposition. If the corpus used is tagged with part-of-speech codes (as is the case with the BNC and the Bank of English), it may be possible to avoid unwanted senses by searching for a specific part of speech, e.g. dare su=PRP (or an equivalent formalism). Part-of-speech tagging may also facilitate analysis, enabling the data to be sorted by part-of-speech code.
4. Bowker (1998) and Pearson (1996, 1998) argue that where specialised corpora are used to train translators in a specialised field, they should include a range of different types of text - expert, instructional, and popularised. The latter types, they argue, are likely to explain terms and concepts which are taken for granted in expert texts. However, it is important not to conflate these types in the corpus, since we would not, for example, expect popularised texts to have the same collocational and colligational regularities as specialist ones, nor to contain the same range of terms as the latter. Where the corpus is used to translate a specific text, the appropriate component should be given priority.
5. King (1997: 396) compares the number of types in translations of Le petit prince with the French original: scoring the latter as 100, figures for English and for Italian are 83 and 107 respectively.
6. This may, for instance, be dubious if all the translations in the corpus have been produced by the same translator, as is often the case with "translation memory" systems.