Web Corpus
1. Text on page: Mária Šimková
The current version of the corpus, web-5.0, was released in January 2020 and contains 4 042 363 283 tokens.
2. Corpus design and development: R. Garabík, corpus data: V. Benko, final development: I. Uhliarik.
The corpus is lemmatised and morphologically annotated with the MorphoDiTa tagger, which has been trained and tuned on the tagset developed by the SNK. Each text is provided with basic information about its URL and time of acquisition.
Version web-4.0
Version web-4.0 of the corpus, containing 2 963 462 451 tokens, was released in January 2018.
1. Corpus design and development: R. Garabík, new corpus data: V. Benko.
The corpus is lemmatised and morphologically annotated with the MorphoDiTa tagger, which has been trained and tuned on the tagset developed by the SNK. Each text is provided with basic information about its URL and time of acquisition.
Version web-3.0
1. Corpus design and development: R. Garabík, R. Brída, new corpus data: V. Benko.
Version web-3.0 of the corpus, containing 2 372 769 958 tokens, was released in March 2015.
The web corpus was compiled from three sources: Slovak texts downloaded from the web and provided by the Faculty of Informatics of Masaryk University in Brno in 2010 (988 474 323 tokens, including duplicate content and texts in Czech); Slovak texts downloaded from the web by the SNC during 2011–2012 (489 869 717 tokens, excluding duplicate content and foreign texts); and Slovak texts from the Araneum project (3 221 914 708 tokens, including duplicate content and foreign texts).
The corpus texts are lemmatised and morphologically annotated, and a bibliography is provided. The lists of the 1000 most frequent word forms and lemmas are available here.
Version web-2.0
1. Corpus design and development: R. Garabík, R. Brída.
Version web-2.0, containing 1 045 558 148 tokens, was released in March 2012.
Here you can find a list of the 1000 most frequent word forms and lemmas, as well as the complete lists by frequency.
Version web-1.0
1. Corpus design and development: R. Garabík, data provided by the Faculty of Informatics, Masaryk University in Brno.
The first version, web-1.0, was released in 2011. The corpus, containing 952 095 260 tokens, was developed jointly with the Faculty of Informatics, Masaryk University in Brno.