Morphological annotation of texts in the Slovak National Corpus

Morphological annotation is a fundamental and very common type of linguistic information found in corpora, especially for inflectional languages. It comprises the grammatical (part-of-speech) and morphological features of a word in context. It is usually preceded by lemmatization: the assignment of a base form to a particular lexeme.

In the Slovak National Corpus there are two types of morphological annotation and lemmatization:

1. manual morphological annotation in the subcorpus r-mak, based on a set of tags and rules, including the lemmatization rules,

2. automatic morphological annotation, which uses the manually annotated subcorpus r-mak as training data and applies the same set of tags and rules, with some exceptions that are mentioned elsewhere.

All the tags can be viewed in the following charts. Examples are taken from the manually annotated subcorpus.

Noun	Preposition	Punctuation
Adjective	Conjunction	Undefinable part of speech
Pronoun	Particle	Non-verbal element
Numeral	Interjection	Foreign language citation
Verb	Reflexive morpheme	Number
Participle	Conditional morpheme	Proper name
Adverb	Abbreviation, symbol	Incorrect spelling

The whole document about morphological annotation can be found in PDF format here. A portion of the texts in the manually annotated subcorpus can be found in the section entitled Corpus Structure.

All text units (tokens: strings of characters between two spaces, as well as punctuation marks preceded by spaces) are processed during morphological annotation. This is necessary in order to get the absolute co-occurrence. In further processing, each token is assigned two attributes: lemma and tag.
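The token definition above can be approximated with a short Python sketch. It assumes that a token is either a run of word characters or a single punctuation mark; the real SNK tokenization rules are documented in the PDF and are likely to treat abbreviations, hyphenated words and similar cases differently. The names tokenize and annotate_skeleton are hypothetical and are not part of the SNK tools.

```python
import re
from typing import Dict, List, Optional

# Assumption: a token is a run of word characters or one punctuation mark.
# This only approximates the definition given in the text.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(text: str) -> List[str]:
    """Split text into word tokens and single-character punctuation tokens."""
    return TOKEN_RE.findall(text)


def annotate_skeleton(text: str) -> List[Dict[str, Optional[str]]]:
    """Create one record per token with empty lemma and tag attributes.

    The lemma and tag are left as None here; in the corpus they are filled
    in by manual or automatic annotation.
    """
    return [{"word": token, "lemma": None, "tag": None} for token in tokenize(text)]


if __name__ == "__main__":
    for record in annotate_skeleton("Nevedeli o tom."):
        print(record)
```

Each record carries the word form together with placeholders for the two attributes, lemma and tag, that annotation assigns to every token.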

A lemma is the dictionary form of a token. In manual annotation all lemmas start with a lower-case letter. To designate a proper name, we use a lower-case “r” at the end of the tag, following a colon ( : ). In manual annotation, negated verb forms are lemmatized as the negation of the infinitive. Affirmation and negation are marked by “+” and “-” respectively. All the negated forms are lemmatized without the morpheme “ne-”, e.g. for Nevedeli o tom (They didn't know about it.) the automatic lemmatization is Vedieť, o, to (to know, about, it).

Morphological tags consist of letters of the Latin alphabet, digits and mathematical symbols. Each category or specific feature is assigned a particular character, which can apply to several parts of speech (e.g. x, y, z stand for the positive, comparative and superlative of adjectives or adverbs). The set of characters assigned to one token forms a single tag.

A tag indicates the values of the formal categories relevant to the token. Tags in the SNK have a variable number of symbols, but the order of the symbols is fixed. The first symbol assigns the word to a part of speech or to a word class (specific text units such as punctuation marks, etc.); it is followed by symbols for grammatical categories (obligatory) or symbols for specific groups. For further information about tokenization, lemmatization and morphological annotation see the PDF file (270 kB).
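The positional structure of a tag can be illustrated with a short Python sketch. It relies only on conventions mentioned in this text: the tag starts with the part of speech or word class, “+” and “-” mark affirmation and negation, x, y, z mark the degree, and “:r” marks a proper name. That the polarity and degree symbols stand at the end of the tag is an assumption of this sketch, as are the function parse_tag, the record ParsedTag and the sample tag strings; the authoritative tag tables are in the PDF documentation.

```python
from dataclasses import dataclass
from typing import Optional

# Symbols taken from the text; their position at the end of the tag is assumed.
DEGREES = {"x": "positive", "y": "comparative", "z": "superlative"}
POLARITY = {"+": "affirmative", "-": "negative"}


@dataclass
class ParsedTag:
    word_class: str          # first symbol: part of speech or word class
    features: str            # remaining symbols, in their fixed order
    degree: Optional[str]    # set only if the last symbol is x, y or z
    polarity: Optional[str]  # set only if the last symbol is + or -
    proper_name: bool        # True if the tag ends with ":r"


def parse_tag(tag: str) -> ParsedTag:
    """Split a tag string into its word-class symbol and trailing features."""
    core, _, flag = tag.partition(":")
    if not core:
        raise ValueError("tag has no word-class symbol: %r" % tag)
    features = core[1:]
    last = features[-1:]  # empty string when the tag is one symbol long
    return ParsedTag(
        word_class=core[0],
        features=features,
        degree=DEGREES.get(last),
        polarity=POLARITY.get(last),
        proper_name=(flag == "r"),  # only the ":r" marker from the text is handled
    )


if __name__ == "__main__":
    # Illustrative tag strings, not checked against the official tag tables.
    for sample in ("SSms1:r", "VKesc+", "AAms1x"):
        print(sample, parse_tag(sample))
```

The sketch does not check that a degree symbol occurs only with adjectives or adverbs, so it should be read as an illustration of the fixed symbol order rather than as a validator.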

Searching in Bonito 2 (web client)

You can search the corpus using the corpus manager Manatee or the client Bonito. You can search for a word form or a lemma.

Choose the desired corpus. For a morphologically annotated corpus, choose r-mak (version 1.0 or 2.0).
Under Query Type, choose Lemma to search for a base form or Word Form to search for a particular word form.
In the Query box, enter the base form (for Lemma) or the particular word form (for Word Form).
To see all the characteristics, click View Options in the leftmost column and, under Attributes, tick lemma or tag, either for the key word only or for all the words in the displayed context.
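Manatee-based tools also accept queries written in the corpus query language (CQL), where a single token is described as [attribute="value"] and the value is interpreted as a regular expression. The following Python sketch builds such expressions for the attributes discussed here. The attribute names lemma and tag come from this text; the attribute name word for word forms, and the helper names cql_token and cql_sequence, are assumptions of this sketch and should be checked against the corpus configuration.

```python
from typing import Iterable, Tuple


def cql_token(attribute: str, value: str) -> str:
    """Build a CQL expression that matches one token by one attribute.

    The value is passed through unchanged, so regular-expression
    metacharacters in it keep their special meaning in the query.
    """
    if '"' in value:
        # Quotes are rejected rather than escaped to keep the sketch simple.
        raise ValueError("double quotes are not supported in this sketch")
    return '[%s="%s"]' % (attribute, value)


def cql_sequence(pairs: Iterable[Tuple[str, str]]) -> str:
    """Build a query for adjacent tokens from (attribute, value) pairs."""
    return " ".join(cql_token(attribute, value) for attribute, value in pairs)


if __name__ == "__main__":
    print(cql_token("lemma", "vedieť"))
    print(cql_sequence([("word", "Nevedeli"), ("lemma", "o")]))
```

A lemma query of this kind corresponds to choosing Lemma as the Query Type in the web form, and a word query corresponds to choosing Word Form.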