Introduction to Corpus Linguistics. Sandrine Zufferey

ISBN 9781119779704




data as the only point of reference, both in a theoretical and a methodological sense. In this approach, linguists begin their research without any a priori assumptions and simply let hypotheses emerge from the corpus data (this is called a corpus-driven approach). This approach is far from unanimous among linguists working with an empirical methodology. On this point, we agree with Chomsky's metaphorical objection that working in this way would be the equivalent, for physicists, of hoping to discover the physical laws of the universe by looking out of the window. Observing data without a hypothesis often makes it impossible to make sense of the data. It is for this reason that the approach we adopt in this book corresponds to a corpus-based approach, which treats corpora as tools available to linguists for testing their hypotheses.

      As we have seen above, corpus linguistics as performed nowadays cannot do without computers. Although work related to corpus linguistics has existed for a long time (such as the indexing of the Bible by theologians, or the card-file compilation of dictionaries by scholars such as Antoine Furetière in French and Samuel Johnson in English), the discipline was only able to properly take off after the arrival of computing.

      In particular, concordancers are useful for searching for all the occurrences of a word, together with their context of use, and for displaying the results line by line in a single query. These tools also make it possible to establish the list of words contained in the corpus, together with their frequency, and to generate a list of keywords matching the content of a corpus. In the case of corpora containing texts as well as their translations, certain tools called aligners make it possible to align the content of the corpus sentence by sentence. Once that is done, bilingual concordancers search directly for the occurrences of a word in one of the two languages of the corpus, and simultaneously extract the matching sentence in the other language. We will learn how to use these tools in Chapter 5, which is devoted to the presentation of the main French corpora, as well as the tools for analyzing them.
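The core operation of a concordancer, the keyword-in-context (KWIC) display, can be sketched in a few lines of Python. The corpus text below is invented for illustration; real concordancers such as those presented in Chapter 5 add sorting, frequency lists and keyword extraction on top of this basic search.

```python
import re

def concordance(text, word, width=30):
    """Return keyword-in-context (KWIC) lines for every occurrence of `word`."""
    lines = []
    for match in re.finditer(r"\b%s\b" % re.escape(word), text, re.IGNORECASE):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Pad the left context so the keyword is aligned in a column.
        lines.append("%s[%s]%s" % (left.rjust(width), match.group(), right))
    return lines

# A tiny illustrative text standing in for a real corpus.
text = ("The corpus contains many words. A corpus must be sampled carefully, "
        "and every corpus has a purpose.")
for line in concordance(text, "corpus"):
    print(line)
```

Each printed line shows one occurrence of the search word between brackets, aligned in a column with its left and right context, which is exactly the line-by-line display described above.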

      Another problem might arise if we decide to study the use of relative clauses such as “the girl who is intelligent” or “the violin which was left on the bus”. For this study, a good starting point would be to look for relative pronouns such as who or which in order to find occurrences of relative clauses in the corpus. The problem is that these pronouns are also used in interrogative sentences such as “Who do you prefer?” or “Which hat is yours?” In this case, looking up the grammatical category of the word will not solve the problem, because in both uses they are pronouns. In order to find only the occurrences of who and which as relative pronouns, we need a corpus in which the syntactic structure of each sentence has been analyzed in such a way that a grammatical function is assigned to each word and words are grouped into syntactic constituents. Tools for analyzing the syntactic structure of sentences have been developed in the field of natural language processing. These automatic analyses still require human checks to correct errors, but their performance is continually improving. The arrival of these tools has greatly accelerated research in corpus linguistics. We will discuss this issue in Chapter 7, which is devoted to annotations.
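The point can be illustrated with a toy annotated corpus in which each token carries, besides its part-of-speech tag, a grammatical function of the kind a syntactically parsed corpus would provide. All labels below are hand-assigned and invented for illustration: the part-of-speech tag alone (PRON) cannot separate relative from interrogative pronouns, but the function label can.

```python
# Each token is a triple: (word, part_of_speech, grammatical_function).
# Labels are hand-assigned here; a parsed corpus would supply them.
sentences = [
    [("Who", "PRON", "interrogative"), ("do", "AUX", "aux"),
     ("you", "PRON", "subject"), ("prefer", "VERB", "root")],
    [("the", "DET", "det"), ("girl", "NOUN", "head"),
     ("who", "PRON", "relative"), ("is", "VERB", "copula"),
     ("intelligent", "ADJ", "attribute")],
    [("the", "DET", "det"), ("violin", "NOUN", "head"),
     ("which", "PRON", "relative"), ("was", "AUX", "aux"),
     ("left", "VERB", "root"), ("on", "ADP", "prep"),
     ("the", "DET", "det"), ("bus", "NOUN", "object")],
]

def find_relative_pronouns(sents):
    """Keep only tokens tagged PRON whose syntactic function is 'relative'."""
    hits = []
    for sent in sents:
        for word, pos, func in sent:
            if pos == "PRON" and func == "relative":
                hits.append(word)
    return hits

print(find_relative_pronouns(sentences))
```

Filtering on the part of speech alone would wrongly return the interrogative “Who” from the first sentence; filtering on the grammatical function returns only the two relative pronouns.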

      But corpus linguistics was not developed thanks to the creation of such tools alone. Above all, it is the general development of computers and the digital revolution which have made the greatest advances possible. In fact, the increase in the computing power of machines – as well as in their memory – has made it possible to build ever larger corpora. Until the 1980s, a corpus of a million words was considered to be a very large corpus. For instance, the first reference corpora (such as the Brown corpus developed for American English in the early 1960s) were about this size. At the same time, the arrival of cassette recorders on the market enabled the creation of the first oral corpora containing an exact transcription of speech, rather than a summary taken down in shorthand.

      The marketing of scanners in the 1980s later made it possible to digitize a significant amount of data, and corpora began to reach larger sizes, up to 20 million words. Then, with the democratization of computer use, the amount of digitally disseminated text greatly accelerated the growth of corpora. Finally, since the beginning of the 21st century, the wide dissemination of documents online via the Internet has given another dimension to the size of the corpora available to researchers. At present, the Google Books corpus, for example, contains more than 500 billion words, which represents approximately 4% of all the books ever published (Michel et al. 2011). We will discuss the possible uses of such a corpus in the following chapters. In Chapter 6, we will also see that the Internet potentially offers an exceptional data resource for corpus linguistics, but that data gathered from the Internet cannot be used without additional processing steps if data quality is to be guaranteed.

      We have seen that computers help us to work on very large corpora and automatically count word occurrences, find keywords, etc. The need to use a large amount of data and the desire to quantify the presence of linguistic elements in a corpus correspond to a quantitative research methodology. This methodology involves observing or manipulating variables, as well as the use of statistical tests. The main objective is to test a limited number of variables, in a highly controlled environment whenever possible and on a language sample that can be representative of the phenomenon studied. This can later make it possible to generalize the results obtained to the whole language or to a part of the target language (e.g. journalistic language). These methods nonetheless imply a certain form of reductionism and a simplification of reality. Ultimately, however, the accumulation of studies with well-defined and properly controlled variables may provide a global and realistic picture of a phenomenon.
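As a sketch of such a statistical test, the log-likelihood measure widely used in corpus linguistics (Dunning's G²) compares a word's frequency in two corpora and asks whether the difference could be due to chance. The word and all counts below are invented for illustration.

```python
import math

def log_likelihood(freq_a, size_a, freq_b, size_b):
    """Dunning's log-likelihood (G2) for a word observed freq_a times in a
    corpus of size_a words and freq_b times in a corpus of size_b words.
    With 1 degree of freedom, G2 above 3.84 is significant at p < 0.05."""
    # Expected frequencies if the word were equally common in both corpora.
    expected_a = size_a * (freq_a + freq_b) / (size_a + size_b)
    expected_b = size_b * (freq_a + freq_b) / (size_a + size_b)
    g2 = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:
            g2 += observed * math.log(observed / expected)
    return 2 * g2

# Hypothetical counts: a word occurs 120 times in a 500,000-word corpus A
# and 60 times in a 500,000-word corpus B.
g2 = log_likelihood(120, 500_000, 60, 500_000)
print(round(g2, 2))  # → 20.39, well above 3.84, so the difference is significant
```

This is exactly the kind of controlled comparison described above: a single variable (the word's frequency) is tested across two comparable samples, and the test quantifies how confident we can be in generalizing the observed difference.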

      Let us take an example. Suppose we want to test the hypothesis that women talk more about their feelings than men. To test this hypothesis by means of a corpus study, we should first make sure that we are comparing recordings of men and women produced in the same context, for example, in the context of friendly discussions around