Written corpora

Parallel corpora

Web corpus

The current version of the corpus, web-2.0, contains 1 045 558 148 tokens and was released in March 2012.

The previous version, web-1.0, was released in 2011. The corpus, containing 952 095 260 tokens, was developed jointly with the Faculty of Informatics, Masaryk University, in Brno.

Text corpus from Wikipedia and Uncyclopedia

The current version, wiki-2014-02, was released in February 2014. It contains 37 548 997 tokens and includes only the Slovak texts from Wikipédia and Necyklopédia.

The corpus of legal regulations of the Slovak Republic, legal-1.0, was prepared in collaboration with the Ministry of Justice of the Slovak Republic. It has been available since 2011 and contains 146 million tokens.

Omnia

Omnia consists of the following corpora: prim-6.0-public-all, s-hovor-4.0, legal-1.1, web-1.1 and web-1.2. Duplicate content has been removed from these corpora and several minor changes have been applied. In tokenization, compounds are treated as a single token. In lemmatization, negative forms are assigned the lemma of the corresponding affirmative form. The corpus was derived from the SNK by V. Benko. It serves primarily as material for members of the department of lexicology and lexicography.
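The lemmatization rule for negative forms described above could be sketched as follows. This is only an illustration and not the procedure actually used to build Omnia: the function name `affirmative_lemma`, the toy lexicon `AFFIRMATIVE_LEMMAS` and the assumption that negation is marked by the prefix "ne" are all hypothetical. The prefix is stripped only when the remainder is an attested affirmative lemma, so that words which merely begin with "ne" (such as "nedeľa", Sunday) are left unchanged.

```python
# Hypothetical toy lexicon of affirmative lemmas; a real pipeline would
# consult a full morphological dictionary instead.
AFFIRMATIVE_LEMMAS = {"pekný", "robiť", "možný"}

# Assumption: negation is expressed by the prefix "ne".
NEGATIVE_PREFIX = "ne"


def affirmative_lemma(lemma: str) -> str:
    """Return the affirmative counterpart of a negative lemma.

    The prefix is removed only if the remaining string is found in the
    lexicon; otherwise the lemma is returned unchanged.
    """
    lower = lemma.lower()
    if lower.startswith(NEGATIVE_PREFIX):
        candidate = lower[len(NEGATIVE_PREFIX):]
        if candidate in AFFIRMATIVE_LEMMAS:
            return candidate
    return lemma


if __name__ == "__main__":
    for word in ["nepekný", "nerobiť", "nedeľa", "robiť"]:
        print(word, "->", affirmative_lemma(word))
```

The lexicon check is the main design choice here: blind prefix stripping would damage ordinary words, while the lookup keeps the mapping conservative at the cost of missing negative forms whose affirmative lemma is not listed.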

The current version, omnia-2.0-public, was released in July 2013 and contains 2 239 413 083 tokens. The previous versions were for internal use only.

Slovak Terminology Database

The database contains about 6 000 terminological records on 23 related subjects.

WordNet

WordNet is a lexical database that records semantic relations between words. It is aligned with Princeton WordNet 3.0: Slovak synsets are linked to their English equivalents.

Corpus of Spoken Slovak

The latest version, s-hovor-4.0, is a 2.6-million-token corpus that has been available since August 2012.

Corpus of Historical Slovak

The first version, hks-1.0, contains 370 758 tokens and has been available since November 2012. It consists of electronically processed texts published in Pramene k dejinám slovenčiny, I-III (Sources for the History of Slovak).

Corpus of the Crimean Tatar language