Introduction to Corpus Linguistics. Sandrine Zufferey

ISBN 9781119779704




from it. On the other hand, other disciplines such as mathematics or philosophy are traditionally based on a rationalist approach, since mathematicians and philosophers use their own reasoning to build theories and draw conclusions, rather than relying on the collection and observation of external data. Philosophers often resort to thought experiments, but these are not experiments in the empirical sense of the term, because they are based on the reflective abilities of researchers.

      Although corpus linguistics has experienced strong growth over the past 20 years, the empirical grounding of linguistics is not new. Linguists have long used observational data. In the 19th century, for example, linguists compared Indo-European languages in an attempt to reconstruct their common origin. This research was based on existing data about languages spoken in Europe, such as German, French and English. Similarly, in the first half of the 20th century in the United States, the so-called distributionist approach to syntax studied sentence formation through the syntactic structures that appeared in text corpora and, from there, tried to infer how language functions in general. Around the late 1950s, the use of corpora was almost completely interrupted in certain fields of linguistics, such as syntax, following the work of the American linguist Noam Chomsky. Chomsky defended a strictly rationalist methodological approach to linguistics and fiercely opposed any use of external data. He raised numerous objections against the use of such data. We will briefly review them, to show in what ways most of them have lost their raison d’être in the context of current research.

      Chomsky’s first objection to the use of corpora, which is also the most fundamental one, is that corpora contain language samples produced by speakers. According to him, linguistics should not focus on the linguistic performance of speakers, but on the competence they have in their mother tongue, which he calls their internal language. The problem is that what people produce when they speak (their performance) does not necessarily reflect what they know about their language (their competence). For example, under the effect of stress or fatigue, speakers sometimes produce slips of the tongue or make language mistakes. From time to time, almost everybody conjugates an irregular verb incorrectly and produces the form “he eated” instead of “he ate”. However, if speakers who produced this form were recorded and then asked whether they had spoken correctly, they would almost certainly notice the mistake and be able to supply the correct form, “he ate”. Conversely, a speaker could pronounce a word like “serendipity” after hearing someone else use it, without really knowing its meaning. These examples illustrate the fact that the words speakers utter are not always a true reflection of their linguistic competence. Thus, according to Chomsky, studying corpora puts linguists on the wrong track, because it leads them to consider language from the point of view of production, which is merely a biased reflection of the rules of language.

      This limitation has led to Chomsky’s third criticism of corpora, namely that a corpus can never contain the whole of a language and that, therefore, the above-mentioned biases cannot be resolved. According to him, this problem is all the more serious because even if a corpus were very large and included a representative portion of the language, linguists could not analyze it in full, since it is impossible to manually analyze the content of billions of sentences.

      Chomsky’s last two objections have largely become obsolete due to advances in computer science. The size of corpora has increased exponentially over the past 20 years, and corpus analysis tools have also made considerable progress. It has thus become possible to analyze very large amounts of data, which mirror the language much more accurately than the corpora available when Chomsky formulated his objections. We will return to this in section 1.4, devoted to the connections between computer science and corpus linguistics. In addition to these technological advances, theoretical and methodological advances have also largely made it possible to eliminate or control the other types of bias mentioned by Chomsky. For example, good practice when building a corpus is to document accurately the type of language it contains. This helps to avoid, for example, analyzing the language of a single aphasic subject by mistake, a risk Chomsky might point to. It is nonetheless true that a corpus can only show what it contains, and therefore the fact that a word or a structure is not attested in a corpus cannot constitute definitive proof of its absence from the language. Thus, for research questions relating to phenomena that are rare or hard to observe in a corpus, it may be advisable to complement corpus research with another empirical method, namely the experimental method. As we will see later in this chapter, this method shares the use of a quantitative methodology with corpus linguistics.
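      The point about absence of evidence can be made concrete with a small calculation. The sketch below is only an illustration and is not taken from the book: it assumes, unrealistically, that every token in a corpus is an independent draw, and the function name and the frequency value are hypothetical. Under that assumption, an item with per-token relative frequency p fails to appear in a corpus of N tokens with probability (1 - p)^N.

```python
def prob_unattested(rel_freq: float, corpus_size: int) -> float:
    """Probability that an item never appears in a corpus.

    Simplifying assumption (hypothetical, for illustration only):
    each of the corpus_size tokens is an independent draw, and the
    item has the same per-token relative frequency rel_freq everywhere.
    """
    if not 0.0 <= rel_freq <= 1.0:
        raise ValueError("rel_freq must lie between 0 and 1")
    if corpus_size < 0:
        raise ValueError("corpus_size must not be negative")
    # The item is missed on every single token with probability (1 - p).
    return (1.0 - rel_freq) ** corpus_size


if __name__ == "__main__":
    # Hypothetical rare word: one occurrence per ten million tokens.
    rare_word_freq = 1e-7
    for size in (1_000_000, 10_000_000, 100_000_000):
        p_zero = prob_unattested(rare_word_freq, size)
        print(f"{size:>11,} tokens: P(zero occurrences) = {p_zero:.3f}")
```

      Under these assumptions, such a word would be missing from a one-million-token corpus roughly nine times out of ten, even though it belongs to the language. A zero count is therefore weak evidence on its own, which is why rare phenomena may call for larger corpora or for complementary experimental data.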

      What is more, in many areas of linguistics such as lexicology, language acquisition and sociolinguistics, relying on the internal judgments of linguists is simply not conceivable. Linguists cannot study children’s language by remembering how they spoke as children, or make assumptions about language differences between men and women by imagining how they would speak if they were a man or a woman. In all these fields, the use of text corpora has long been the obvious choice, and corpus use was never interrupted as a result of Chomsky’s work. The paradigm shift of recent decades has taken place in areas where a purely rationalist methodology is conceivable, for example syntax.