
Zhejiang University, Xiao Zhonghua, Corpus: Corpus LinguPPT.ppt
45 pages

Corpus design and types of corpora
Corpus Linguistics
Richard Xiao

Outline of the session
Corpus design issues
  Corpus representativeness
  Corpus balance
  Sampling
  Corpus size
Types of corpora
  Introducing some well-known English corpora of different types

Representativeness
A corpus is a collection of (1) machine-readable (2) authentic texts (including transcripts of spoken data) which is (3) sampled to be (4) representative of a particular language or language variety.
A corpus is different from a random collection of texts or an archive.
Representativeness is a defining feature of a corpus.
As language is infinite but a corpus has to be finite in size, we sample and proportionally include a wide range of text types to ensure maximum balance and representativeness.

Some definitions
“generally assembled with particular purposes in mind, and are often assembled to be (informally speaking) representative of some language or text type” (Leech 1992: 116)
“selected and ordered according to explicit linguistic criteria in order to be used as a sample of the language” (Sinclair 1996)
“A well-organized collection of data” (McEnery 2003)
“gathered according to explicit design criteria” (Tognini-Bonelli 2001: 2)
“built according to explicit design criteria for a specific purpose” (Atkins et al. 1992)
texts selected and put together “in a principled way” (Johansson 1998: 3)

What is representativeness?
“A corpus is thought to be representative of the language variety it is supposed to represent if the findings based on its contents can be generalized to the said language variety” (Leech 1991)
Representativeness refers to the extent to which a sample includes the full range of variability in a population (Biber 1993).

What is representativeness?
Representativeness is a fluid concept closely related to your research questions.
If you want a corpus which is representative of general English, a corpus representative of newspapers will not do.
If you want a corpus representative of newspapers, a corpus representative of The Times will not do.

Two types of representativeness
The representativeness of general corpora and that of (domain- or genre-specific) specialized corpora are measured in different ways.
General corpora
  Balance: the range of genres included in a corpus and their proportions
  Sampling: how the text chunks for each genre are selected
Specialized corpora
  Degree of closure/saturation: closure/saturation for a particular linguistic feature (e.g. size of lexicon) of a variety of language (e.g. computer manuals) means that the feature appears to be finite or is subject to very limited variation beyond a certain point, i.e. the curve of lexical growth flattens out.

Why should we care about representativeness?
Reader of corpus-based studies (assessment)
  To interpret the results of corpus research with caution, considering whether the corpus data and the method used in the study were appropriate
Corpus user (assessment)
  Important to “know your corpus”
  To decide whether a given corpus is appropriate for their specific research question
  To make appropriate claims on the basis of such a corpus
Corpus creator (assessment?)
  To make their corpus as representative as possible of the language (variety) it is claimed to represent
  To document design criteria explicitly and make the documentation available to corpus users

Criteria for text selection
The criteria used to select texts for a corpus are principally external.
The external vs. internal criteria correspond to Biber's (1993: 243) situational vs. linguistic perspectives.
  External criteria are defined situationally, irrespective of the distribution of linguistic features.
  Internal criteria are defined linguistically, taking into account the distribution of such features.
It is circular to use internal criteria, like the distribution of words or grammatical features, as the primary parameters for the selection of corpus data.
If the distribution of linguistic features is pre-determined when the corpus is designed, there is no point in analyzing such a corpus to discover naturally occurring linguistic feature distributions.

Criteria for text selection
Time?
  If a corpus is not regularly updated, it rapidly becomes unrepresentative (Hunston 2002).
  The relevance of permanence in corpus design actually depends on how we view a corpus: as a static or a dynamic language model.
  Static model: sample corpora (nearly all existing corpora, e.g. BNC, LOB/FLOB)
  Dynamic model: Bank of English

Criteria for text selection
Tips
  “Criteria for determining the structure of a corpus should be small in number, clearly separate from each other, and efficient as a group in delineating a corpus that is representative of the language or variety under examination.” (Sinclair 2005)

Corpus balance
A balanced corpus covers a wide range of text categories which are supposed to be representative of the language (variety) under consideration.
The proportions of different kinds of text it contains should correspond with informed and intuitive judgeme...
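
Returning to the degree of closure/saturation used to assess specialized corpora: the idea of a lexical growth curve that flattens out can be made concrete with a small sketch. The code below is only an illustration under stated assumptions, not a method given in the slides. The function names (`tokenize`, `lexical_growth`, `new_type_rate`), the crude regex tokenizer, and the toy "manual" text are all hypothetical. It counts how many distinct word types have been seen after every `step` tokens, then reports the share of new types in each segment; a rate that falls towards zero suggests the lexicon is approaching closure.

```python
import re


def tokenize(text):
    # Assumption: a crude lowercase, letters-only tokenizer for illustration.
    return re.findall(r"[a-z]+", text.lower())


def lexical_growth(tokens, step=1000):
    """Return (tokens_seen, types_seen) pairs, recorded every `step` tokens
    and once more at the end of the token list."""
    seen = set()
    curve = []
    total = len(tokens)
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        if i % step == 0 or i == total:
            curve.append((i, len(seen)))
    return curve


def new_type_rate(curve):
    """For each segment of the curve, return new types per token.
    Values near zero mean the growth curve is flattening out."""
    rates = []
    prev_tokens, prev_types = 0, 0
    for n_tokens, n_types in curve:
        rates.append((n_types - prev_types) / (n_tokens - prev_tokens))
        prev_tokens, prev_types = n_tokens, n_types
    return rates


# Toy text with a deliberately restricted vocabulary (invented for this sketch).
toy_text = "press the power button . " * 3 + "press the reset button . " * 3
tokens = tokenize(toy_text)

curve = lexical_growth(tokens, step=4)
print(curve)                 # [(4, 4), (8, 4), (12, 4), (16, 5), (20, 5), (24, 5)]
print(new_type_rate(curve))  # [1.0, 0.0, 0.0, 0.25, 0.0, 0.0]
```

With a real specialized corpus one would use a proper tokenizer and a much larger `step`; where to set the threshold for "flat enough" is a judgement call that this sketch does not settle.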
