What is corpus linguistics?

A few decades ago, scientists could only dream of automating linguistic research. The work was done by hand and involved large numbers of students, errors of inattention were likely, and, most importantly, it all took a great deal of time.

With the development of computer technology, it became possible to conduct studies an order of magnitude faster, and today corpus linguistics is one of the most promising areas in the study of language. Its main feature is the use of large amounts of text gathered into a single, specially annotated database called a corpus.

Today there are many corpora, created for different purposes and from different linguistic material, ranging in size from millions to tens of billions of lexical units. The field is recognized as promising and shows significant progress toward both applied and research goals. Specialists who deal with natural language in any way are advised to become familiar with text corpora, at least at a basic level.

History of corpus linguistics

The formation of this field is associated with the creation of the Brown Corpus in the US in the early 1960s. The collection contained only 1 million word forms, and today a corpus of that size would be completely uncompetitive. To a large extent, this is due to the pace of development of computer technology, as well as the growing demand for new research resources.

In the 1990s corpus linguistics took shape as a full-fledged, independent discipline, and collections of texts were compiled and annotated for several dozen languages. During this period, for example, the British National Corpus was created, containing 100 million words.

As this branch of linguistics develops, text collections keep growing (reaching billions of lexical units), and the markup becomes more and more diverse. Today you can find online corpora of written and spoken language, multilingual and educational corpora, corpora focused on fiction or academic literature, and many other varieties.

Types of corpora

Corpora in corpus linguistics can be classified on several grounds. Intuitively, the basis for classification can be the language of the texts (Russian, German), the mode of access (open, closed, commercial), or the genre of the source material (fiction, documentary, academic, journalism).

An interesting case is the collection of material representing spoken language. Since deliberately recording such speech would create artificial conditions for the respondents, and the resulting material could not be called "spontaneous", modern corpus linguistics took a different path. A volunteer is equipped with a microphone, and all the conversations he or she takes part in during the day are recorded. The people around the volunteer may, of course, be unaware that their everyday conversation is contributing to the development of science.

Later, the audio recordings are stored in a database and accompanied by a written transcript. This makes possible the markup needed to create a corpus of everyday spoken language.

Application

Wherever language is used, corpora can be used as well. Corpus methods in linguistics can serve the following purposes:

  • Creating sentiment analysis programs, actively used in politics and business to track positive and negative feedback from voters and customers, respectively.
  • Linking corpora to dictionaries and translation systems to improve their performance.
  • A variety of research tasks that contribute to understanding the structure of the language, the history of its development and predictions of its change in the near future.
  • Development of information retrieval systems based on morphological, syntactic, semantic and other characteristics.
  • Optimization of the work of various linguistic systems, etc.

Using corpora

The interface of a corpus resource resembles a typical search engine: the user is prompted to enter a word or a combination of words to search for in the database. Besides exact queries, there is an advanced search that lets you find text by almost any linguistic criterion.

The basis for the search can be:

  • Part of speech;
  • Grammatical features;
  • Semantics;
  • Stylistic and emotional coloring.

In addition, you can combine search criteria for a sequence of words: for example, to find all occurrences of a verb in the present tense, first person singular, followed by the preposition "c" and a noun in the accusative case. Such a simple task takes the user a few seconds and only a few clicks in the relevant fields.
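A combined query of this kind can be thought of as a pattern matched against a sequence of annotated tokens. The following is a minimal sketch under stated assumptions, not the implementation of any real corpus engine: the `Token` class, the tag names ("VERB", "ADP", "NOUN") and the feature labels are hypothetical, and the preposition "на" is used purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    form: str                        # word as it appears in the text
    pos: str                         # part of speech (hypothetical tag set)
    feats: frozenset = frozenset()   # grammatical features (hypothetical labels)


def matches(token, cond):
    """Check one token against one condition (a dict of optional constraints)."""
    if "form" in cond and token.form.lower() != cond["form"]:
        return False
    if "pos" in cond and token.pos != cond["pos"]:
        return False
    # Every requested feature must be present on the token.
    return set(cond.get("feats", ())) <= token.feats


def find_sequences(tokens, pattern):
    """Return the start indices where consecutive tokens satisfy the pattern."""
    n = len(pattern)
    return [
        i
        for i in range(len(tokens) - n + 1)
        if all(matches(tokens[i + j], pattern[j]) for j in range(n))
    ]


# Hand-annotated toy sentence: "Я смотрю на картину" ("I look at the picture").
tokens = [
    Token("Я", "PRON", frozenset({"nom", "sg", "1p"})),
    Token("смотрю", "VERB", frozenset({"pres", "1p", "sg"})),
    Token("на", "ADP"),
    Token("картину", "NOUN", frozenset({"acc", "sg"})),
]

# Verb (present, 1st person, singular) + preposition + noun in the accusative.
pattern = [
    {"pos": "VERB", "feats": {"pres", "1p", "sg"}},
    {"pos": "ADP", "form": "на"},
    {"pos": "NOUN", "feats": {"acc"}},
]

hits = find_sequences(tokens, pattern)
```

Here `hits` holds the positions at which the three-word pattern begins. A real corpus would answer such queries from prebuilt indexes rather than by scanning every token.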

The search itself can be run over all subcorpora or over a single selected one, depending on the task at hand.

Process of creation

A corpus is typically built in several stages:

  1. First, it is decided which texts will form the basis of the corpus. For practical purposes, journalistic and newspaper materials and online comments are often used. Research projects use a variety of corpus types, but the texts should be selected according to some common principle.
  2. The resulting set of texts is preprocessed: errors, if any, are corrected, and a bibliographic and extralinguistic description of each text is prepared.
  3. All non-text information is removed: graphics, pictures and tables are deleted.
  4. Tokens, usually corresponding to words, are identified for further processing.
  5. Finally, morphological, syntactic and other markup is applied to the resulting set of elements.

The result of all these operations is a syntactic structure whose elements each carry a part of speech, grammatical features and, in some cases, semantic features.
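The stages listed above can be sketched roughly as a small pipeline. This is only an illustration under simplifying assumptions: non-text material is represented by HTML-like tags, tokens are found with a regular expression, and `toy_tagger` is a hypothetical placeholder for a real morphological analyzer. All function names are invented for the example.

```python
import re


def clean(raw):
    """Remove non-text residue (here: HTML-like tags) and squeeze whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", raw)
    return re.sub(r"\s+", " ", no_tags).strip()


def tokenize(text):
    """Split text into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def toy_tagger(token):
    # Placeholder only: a real corpus would call a morphological analyzer here.
    return "WORD" if token[0].isalnum() else "PUNCT"


def annotate(tokens, tagger):
    """Attach a tag to every token using the supplied tagger function."""
    return [{"form": t, "tag": tagger(t)} for t in tokens]


def build_document(raw, meta):
    """Combine the metadata description, cleaned text and annotated tokens."""
    text = clean(raw)
    return {"meta": meta, "text": text, "tokens": annotate(tokenize(text), toy_tagger)}


doc = build_document(
    "<p>Corpora grow <img src='x.png'> fast.</p>",
    {"source": "example", "genre": "news"},  # bibliographic description (made up)
)
```

Keeping each stage as a separate function makes it easy to swap the placeholder tagger for a proper analyzer without touching the cleaning or tokenization steps.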

Difficulties in building corpora

It is important to understand that gathering a lot of words or sentences is not enough to obtain a corpus. On the one hand, the collection of texts should be balanced, that is, it should represent different types of texts in certain proportions. On the other hand, the contents of the corpus must be specially annotated.

The first issue is settled by agreement: for example, the collection includes 60% fiction and 20% documentary texts, and a certain share is given to written representations of spoken language, legislative acts, scientific works, etc. No ideal recipe for a balanced corpus exists today.
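Checking a collection against agreed proportions is simple arithmetic. The sketch below assumes each document is described only by its genre and token count. The 60% and 20% targets come from the example above, while the 20% share for spoken material and all the document sizes are made up for illustration.

```python
from collections import Counter


def genre_shares(documents):
    """Share of tokens per genre; each document is a (genre, token_count) pair."""
    totals = Counter()
    for genre, n_tokens in documents:
        totals[genre] += n_tokens
    grand_total = sum(totals.values())  # assumes a non-empty collection
    return {genre: n / grand_total for genre, n in totals.items()}


def balance_gaps(shares, targets):
    """Actual share minus agreed share for every genre named in the targets."""
    return {genre: shares.get(genre, 0.0) - target for genre, target in targets.items()}


documents = [("fiction", 500_000), ("documentary", 300_000), ("fiction", 200_000)]
targets = {"fiction": 0.60, "documentary": 0.20, "spoken": 0.20}  # assumed agreement

shares = genre_shares(documents)
gaps = balance_gaps(shares, targets)
```

A positive gap means the genre is over-represented and a negative one means more material of that type is needed. In this toy collection spoken material is missing entirely.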

The second issue, the markup of content, is harder to solve. There are special programs and algorithms for automatic markup, but they do not give a 100% correct result, can fail, and require manual revision. The possibilities and difficulties here are described in detail in Zakharov's work on corpus linguistics.

The markup of the text is carried out on several levels, which we will list below.

Morphological markup

We remember from school that Russian has different parts of speech, each with its own characteristics. For example, the verb has the categories of mood and tense, which the noun does not have. A native speaker declines nouns and conjugates verbs without hesitation, but manual labor is not a practical way to annotate a corpus of 100 million words. A computer can perform all the necessary operations, but it has to be taught to do so first.

Morphological markup is needed so that the computer can "understand" each word as a certain part of speech with certain grammatical features. Since Russian (like any other language) has a number of regular rules, it is possible to build an automatic procedure for morphological analysis by giving the machine a set of algorithms. However, there are exceptions to the rules, as well as various complicating factors. As a result, purely automatic analysis is still far from ideal: even a 4% error rate means 4 million words in a 100-million-word corpus that need manual revision.

This problem is described in detail in V. P. Zakharov's "Corpus Linguistics".
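The interplay of regular rules, exceptions and residual errors can be shown with a toy analyzer. This is a sketch under loud assumptions: it uses a handful of English suffix rules instead of Russian morphology to stay short, and the rule table, exception list and tag names are all invented for the example.

```python
# Irregular forms are looked up first; the rules below cover the regular cases.
EXCEPTIONS = {
    "went": ("go", "VERB", "past"),
    "mice": ("mouse", "NOUN", "plural"),
}

# (suffix, part of speech, feature); toy English rules, checked in order.
RULES = [
    ("ing", "VERB", "participle"),
    ("ed", "VERB", "past"),
    ("s", "NOUN", "plural"),
]


def analyze(word):
    """Return (lemma, part of speech, feature) for a single word form."""
    w = word.lower()
    if w in EXCEPTIONS:
        return EXCEPTIONS[w]
    for suffix, pos, feat in RULES:
        # Require at least two characters of stem before the suffix.
        if w.endswith(suffix) and len(w) > len(suffix) + 1:
            return (w[: -len(suffix)], pos, feat)
    return (w, "UNKNOWN", None)


def words_to_review(corpus_size, error_rate):
    """Expected number of wrongly tagged words that need manual revision."""
    return round(corpus_size * error_rate)


regular = analyze("walked")    # handled by a rule
irregular = analyze("went")    # handled by the exception list
mistake = analyze("bus")       # wrongly treated as a plural noun
backlog = words_to_review(100_000_000, 0.04)
```

The word "bus" shows how a blindly applied rule produces a wrong analysis, which is exactly the kind of error that accumulates into millions of words needing manual revision.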

Syntactic markup

Syntactic analysis, or parsing, is the procedure that determines the relationships between words in a sentence. With the help of a set of algorithms, it becomes possible to identify the subject, predicate, objects and various constructions in a text. By finding out which words in a sequence are heads and which are dependents, we can effectively extract information from the text and train the machine to return only the information we are interested in in response to a search query.

Incidentally, modern search engines rely on this to return specific figures instead of lengthy texts in response to queries such as "how many calories in an apple" or "the distance from Moscow to Petersburg". However, to understand even the basics of the process described, you need to consult "Introduction to Corpus Linguistics" or another basic textbook.
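The head and dependent relations described above are often stored as a table in which every word points to its head. The sketch below assumes a hand-annotated English sentence and relation labels in the style of Universal Dependencies ("nsubj", "obj", "det"). The helper functions are hypothetical and no parser is involved.

```python
# Each token: (id, form, head id, relation). Head 0 marks the root of the sentence.
sentence = [
    (1, "The", 2, "det"),
    (2, "student", 3, "nsubj"),
    (3, "reads", 0, "root"),
    (4, "a", 5, "det"),
    (5, "book", 3, "obj"),
]


def dependents(sent, head_id, relation=None):
    """Forms of the words attached to head_id, optionally filtered by relation."""
    return [
        form
        for _id, form, head, rel in sent
        if head == head_id and (relation is None or rel == relation)
    ]


def core_facts(sent):
    """Extract the predicate with its subject and object from one sentence."""
    root_id, predicate = next((i, form) for i, form, head, _rel in sent if head == 0)
    return {
        "predicate": predicate,
        "subject": dependents(sent, root_id, "nsubj"),
        "object": dependents(sent, root_id, "obj"),
    }


facts = core_facts(sentence)
```

Once the main words are separated from their dependents in this way, a query can be answered with the subject, predicate or object alone instead of the whole sentence.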

Semantic markup

The semantics of a word is, in simple terms, its meaning. A widely used approach in semantic analysis is to assign tags to a word that reflect its membership in a set of semantic categories and subcategories. Such information is valuable for improving algorithms for sentiment analysis of text, automatic summarization and other tasks that use the methods of corpus linguistics.

These categories form a tree. It has a number of "roots", which are abstract words with very broad semantics. As the tree branches, nodes are formed that contain increasingly specific lexical items. For example, the word "being" can be associated with concepts such as "man" and "animal". The first will branch further into professions, kinship terms and nationalities, and the second into classes and species of animals.
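Such a tree can be stored as simple child-to-parent links. The sketch below reuses the "being", "man" and "animal" example from the text, while the more specific nodes ("teacher", "mammal", "dog") and the function names are assumptions added for illustration.

```python
# Child -> parent links; a root has parent None.
PARENT = {
    "being": None,
    "man": "being",
    "animal": "being",
    "profession": "man",
    "kinship term": "man",
    "nationality": "man",
    "mammal": "animal",
    "teacher": "profession",  # assumed example node
    "dog": "mammal",          # assumed example node
}


def semantic_path(word):
    """Chain of categories from the root down to the word itself."""
    path = []
    node = word
    while node is not None:
        path.append(node)
        node = PARENT[node]  # raises KeyError for words missing from the tree
    return path[::-1]


def share_category(a, b, category):
    """True if both words fall under the given semantic category."""
    return category in semantic_path(a) and category in semantic_path(b)


teacher_path = semantic_path("teacher")
both_beings = share_category("teacher", "dog", "being")
both_animals = share_category("teacher", "dog", "animal")
```

The path from the root acts as the set of semantic tags for a word, so searching by a broad category amounts to checking whether that category appears on the path.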

Application of information retrieval systems

The applications of corpus linguistics cover a wide variety of fields. Corpora are used for compiling and correcting dictionaries, creating machine translation systems, summarization, fact extraction, sentiment detection and other kinds of text processing.

In addition, such resources are actively used in the study of the world's languages and of the mechanisms by which language functions as a whole. Access to a large volume of pre-prepared information makes it possible to study quickly and comprehensively the trends in the development of languages, the formation of neologisms and set expressions, and changes in the meanings of lexical units.

Since working with such large volumes of data requires automation, today corpus linguistics and computational linguistics interact closely.

National Corpus of the Russian language

This corpus (abbreviated as NKRY) includes a number of subcorpora, which allow the resource to be used for a wide variety of tasks.

Materials in the NKRY database are subdivided into:

  • Media publications of the 1990s and 2000s, both domestic and foreign;
  • Recordings of spoken language;
  • Accentologically annotated texts (i.e. with stress marks);
  • Dialect speech;
  • Poetic works;
  • Materials with syntactic markup, etc.

The information system also includes subcorpora with parallel translations of works from Russian into English, German, French and many other languages (and vice versa).

The database also has a section of historical texts representing written Russian at various periods of its development. There is also an educational corpus, which can be useful to foreigners learning Russian.

The National Corpus of the Russian language includes 400 million lexical units and in many respects surpasses a significant part of the language corpora of Europe.

Prospects

The fact that corpus linguistics laboratories operate in Russian universities, as well as in foreign ones, speaks in favor of recognizing this field as promising. Applied and research work with the information retrieval resources under consideration also feeds the development of certain areas of high technology, such as question-answering systems, as discussed above.

Further development of corpus linguistics is predicted at all levels: from the technical, with new algorithms that optimize the searching and processing of information, more capable computers and more RAM, to the everyday, as users find more and more ways to use such resources in daily life and work.

Finally

In the middle of the last century, 2017 was a distant future in which spacecraft plowed the expanses of the universe and robots did all the work for people. In reality, science still abounds in blank spots and is making desperate attempts to answer questions that have troubled humanity for centuries. Questions about how language works occupy an honorable place among them, and corpus and computational linguistics can help us answer them.

Processing large data sets makes it possible to detect patterns that were previously inaccessible, to predict the development of particular language features, and to monitor the formation of words in real time.

At a practical, global level, corpora can be considered, for example, as a potential tool for assessing public sentiment: the Internet is an ever-expanding database of texts created by real users, including comments, reviews, articles and many other forms of speech.

In addition, work with corpora contributes to the development of the very technologies behind information search, familiar to us from Google or Yandex, as well as machine translation and electronic dictionaries.

It can be confidently asserted that corpus linguistics is only taking its first steps and will develop rapidly in the near future.

Copyright © 2018 en.birmiss.com. Theme powered by WordPress.