What is a corpus?

A text corpus is a specific set of linguistic data stored in electronic form. It is made up of texts, usually of many different styles and genres, accompanied by linguistic information. Search tools make it possible to find and select the desired linguistic units and the information attached to them. Corpora are a good source of authentic data, which linguists can use to describe accurately the function and meaning of words, as well as other linguistic phenomena, such as word statistics and collocability. Using a corpus helps ordinary users to better understand the system of a language and to verify or enrich their knowledge of how linguistic units function in real situations. It is not a replacement for linguistic reference books.

The Slovak National Corpus is a scientific and research project for building an electronic text corpus. In its first period, it focused on contemporary written Slovak texts from 1955 to 2005. In its second and third periods, it was expanded to provide a wider array of texts, including texts from other periods (before 1955 and after 2005). It also covers other language varieties (spoken Slovak and, to a limited extent, dialects). Since 2002, the SNK Department of the Ľ. Štúr Institute of Linguistics at SAS has been carrying out systematic and comprehensive research on the Slovak language and subsequently putting it into electronic form.

For a clearer understanding of terms, see Výberový slovník termínov z korpusovej lingvistiky (Selective Dictionary of Corpus Linguistics Terms).

What are the types of corpora?

Corpora are divided according to:

1. language

- monolingual corpora are available for many languages (national corpora)

- bilingual and multilingual (parallel) corpora: original texts and their translations

2. language form

- besides corpora of written texts, there are also spoken corpora

3. size

- the first corpora, built before 1975, contained fewer than 1 million word forms. At present, there are several corpora containing billions of words

4. type of text

- corpora may be either general (not further specified) or specialized, depending on the source and scope of the linguistic issue (a single-author corpus, a corpus of informal speech, a corpus of the most recent texts designed to capture neologisms, etc.)

5. mode of preservation (storage)

- corpora can either be preserved in basic text form or be lemmatised (each word form is accompanied by its base form) and morphologically, syntactically, semantically, or stylistically annotated

6. date of origin

- synchronic corpora illustrate the contemporary state of a language

- diachronic corpora document the stages of a language's evolution over time

Corpora can be representative or balanced. A representative corpus is a wide set of real examples of the use of a national language in all its forms and diversity (diverse text types, genres, various authors, etc.). A balanced corpus is usually based on proportional sampling of the main text types; other parameters, such as genre and domain, are only recorded.
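The idea of proportional sampling can be illustrated with a minimal Python sketch. The text types, their shares, and the helper `sample_balanced` are invented for this illustration and do not describe the actual composition of any real corpus. The sketch also counts whole texts, whereas real projects typically balance by the number of tokens.

```python
import random

def sample_balanced(texts_by_type, proportions, total, seed=0):
    """Draw about `total` texts so that each text type gets its target share."""
    rng = random.Random(seed)  # fixed seed keeps the illustration repeatable
    sample = {}
    for text_type, share in proportions.items():
        quota = round(total * share)
        pool = texts_by_type[text_type]
        # Never ask for more texts than the pool holds.
        sample[text_type] = rng.sample(pool, min(quota, len(pool)))
    return sample

# Hypothetical pools of text identifiers, grouped by main text type.
texts_by_type = {
    "journalistic": [f"news_{i}" for i in range(50)],
    "fiction": [f"novel_{i}" for i in range(30)],
    "professional": [f"paper_{i}" for i in range(20)],
}
# Hypothetical target shares (they sum to 1).
proportions = {"journalistic": 0.5, "fiction": 0.3, "professional": 0.2}

balanced = sample_balanced(texts_by_type, proportions, total=20)
for text_type, chosen in balanced.items():
    print(text_type, len(chosen))
```

With these assumed shares, a sample of 20 texts contains 10 journalistic, 6 fiction and 4 professional texts. Genre or domain labels could be stored alongside each text without influencing the sampling.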

How is a corpus built?

Building a corpus involves several stages:

Getting permission for use. Corpora are used for non-commercial purposes, and the texts are obtained from contributors (authors, publishers and other copyright holders) under a license agreement.
Data capture. Most of the material is obtained in electronic form or from the Internet; other material is digitized using OCR technology or by transcription.
Corpus data processing. In the initial phase, the plain text is extracted. It must not contain any non-text elements (special symbols, pictures, graphs, etc.). Subsequently, the text is converted to a unified format. The conversion also covers tokenization, the process of breaking a stream of text up into its smallest elements, called tokens (text units). In the following phase, the text is usually tagged: information is attached to the text and to individual words, including information on text structure, part of speech (POS), word function, semantics, etc.
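The tokenization and tagging steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the regular expression is a deliberately simple tokenizer, and `TOY_LEXICON` is a hypothetical hand-made word list standing in for a real lemmatizer and POS tagger. It does not reproduce the tools used by any actual corpus project.

```python
import re

# A token is either a run of word characters (letters with diacritics
# included) or a single punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

# Hypothetical mini-lexicon: word form -> (lemma, part of speech).
TOY_LEXICON = {
    "corpora": ("corpus", "NOUN"),
    "are": ("be", "VERB"),
    "texts": ("text", "NOUN"),
}

def tokenize(text):
    """Break plain text into tokens."""
    return TOKEN_PATTERN.findall(text)

def annotate(tokens):
    """Attach a lemma and a POS tag to each token; unknown forms get 'UNK'."""
    rows = []
    for token in tokens:
        lemma, pos = TOY_LEXICON.get(token.lower(), (token.lower(), "UNK"))
        rows.append((token, lemma, pos))
    return rows

tokens = tokenize("Corpora are collections of texts.")
for token, lemma, pos in annotate(tokens):
    # One token per line, with tab-separated annotation columns.
    print(f"{token}\t{lemma}\t{pos}")
```

The one-token-per-line layout with annotation columns is one common way of storing a processed corpus in a unified format. Words missing from the toy lexicon are simply marked `UNK` here.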

How can corpora be used?

Corpora are used in various areas of research and applications, such as:

1. Corpus Linguistics

A branch of mathematical (computational) linguistics that studies language phenomena in large amounts of “real world” text, where words and other phenomena can be observed in their natural context. Based on the analysis of corpus texts, linguistic theories can be verified and new hypotheses or theories developed. Corpora are particularly useful in lexicography, where they help lexicographers write more accurate and meaningful dictionaries.

2. Natural Language Processing

Some outcomes of NLP, such as word lists, collocations and word co-occurrences, word frequency, etc. can also be used in nonlinguistic applications. Examples include text processing systems (automatic grammar or spell checkers, machine translation), speech recognition systems, etc.
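As a rough illustration of such outcomes, the following Python sketch counts word frequencies and adjacent word pairs in an invented sample sentence. Raw pair counts are only a crude stand-in for collocations. Real corpus tools usually rely on statistical association measures, which are not shown here.

```python
from collections import Counter

# Invented sample text, already split into tokens.
tokens = "the old man and the old dog saw the sea".split()

# Word frequency list.
word_freq = Counter(tokens)

# Co-occurrence of adjacent words (bigrams).
bigram_freq = Counter(zip(tokens, tokens[1:]))

print(word_freq.most_common(2))    # [('the', 3), ('old', 2)]
print(bigram_freq.most_common(1))  # [(('the', 'old'), 2)]
```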

3. Teaching languages

A corpus can serve as a database of phrases and sentences for teaching a foreign language or the mother tongue. A small corpus together with a dictionary can be part of a computer program for teaching. It provides evidence of how words are used in the contexts in which they occur.
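One common way to show words in their contexts is a concordance with the keyword in the centre (often called KWIC, "keyword in context"). The Python sketch below is a minimal illustration with an invented sentence. The function name `kwic` and the `window` parameter are assumptions made for this example, not part of any particular corpus tool.

```python
def kwic(tokens, keyword, window=3):
    """Return (left context, keyword, right context) for every occurrence."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, token, right))
    return lines

# Invented sentence in which "bank" occurs with two different meanings.
text = "She will bank the money before the river bank floods again"
concordance = kwic(text.split(), "bank", window=2)

for left, key, right in concordance:
    # Right-align the left context so the keywords line up in a column.
    print(f"{left:>20}  [{key}]  {right}")
```

Seeing both occurrences side by side lets a learner compare the verb use ("will bank the money") with the noun use ("the river bank").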