Attributes and structural values in the publicly available SNC corpora
1. Written corpora − synchronic, general
corpus |
size − number of tokens / number of words |
lemmatisation, morphological annotation |
year of release |
characteristics |
attributes |
structures |
1 688 million tokens / 1 355 million words |
yes |
2020 |
all publicly available texts in SNC (71.0 % journalistic, 16.8 % fiction, 11.3 % professional and 0.9 % other texts) |
word, lemma, tag, word_lc, lemma_lc |
doc, s, p, g, noise, hi |
|
1 650 million tokens / 1 323 million words |
yes |
2020 |
corpus excluding: texts with incorrectly used diacritics, texts written before 1955, texts from outside the territory of Slovakia, texts from linguistic journals |
word, lemma, tag, word_lc, lemma_lc |
doc, s, p, g, noise, hi |
|
572 million tokens / 459 million words |
yes |
2020 |
balanced subcorpus (33.3 % journalistic, 33.3 % fiction, 33.3 % professional texts) |
word, lemma, tag, word_lc, lemma_lc |
doc, s, p, g, noise, hi |
|
1 163 million tokens / 932 million words |
yes |
2020 |
subcorpus of journalistic (informational) texts |
word, lemma, tag, word_lc, lemma_lc |
doc, s, p, g, noise, hi |
|
189 million tokens / 153 million words |
yes |
2020 |
subcorpus of scientific, professional and popular science texts |
word, lemma, tag, word_lc, lemma_lc |
doc, s, p, g, hi |
|
283 million tokens / 226 million words |
yes |
2020 |
subcorpus of fiction texts |
word, lemma, tag, word_lc, lemma_lc |
doc, s, p, g, hi |
|
1 361 million tokens / 1 093 million words |
yes |
2020 |
subcorpus of original texts written in Slovak |
word, lemma, tag, word_lc, lemma_lc |
doc, s, p, g, hi |
|
97 million tokens / 78 million words |
yes |
2020 |
subcorpus of original fiction texts written in Slovak |
word, lemma, tag, word_lc, lemma_lc |
doc, s, p, g, hi |
|
109 million tokens / 87 million words |
yes |
2020 |
specific corpus of texts from years 1955–1989 (4.0 % journalistic, 81.2 % fiction, 11.1 % professional and 3.7 % other texts) |
word, lemma, tag, prec, word_lc, lemma_lc |
doc, s, p, g, hi |
|
1 652 million tokens / 1 282 million words |
yes |
2020 |
all publicly available texts in SNC (74.0 % journalistic, 16.0 % fiction, 9.2 % professional and 0.9 % other texts) |
word, lemma, tag, prec, word_lc, lemma_lc |
doc, s, p, g, noise, hi |
|
1 621 million tokens / 1 257 million words |
yes |
2020 |
corpus excluding: texts with incorrectly used diacritics, texts written before 1955, texts from outside the territory of Slovakia, texts from linguistic journals |
word, lemma, tag, prec, word_lc, lemma_lc |
doc, s, p, g, noise, hi |
|
454 million tokens / 355 million words |
yes |
2020 |
balanced subcorpus (33.3 % journalistic, 33.3 % fiction, 33.3 % professional texts) |
word, lemma, tag, prec, word_lc, lemma_lc |
doc, s, p, g, noise, hi |
|
1 194 million tokens / 920 million words |
yes |
2020 |
subcorpus of journalistic (informational) texts |
word, lemma, tag, prec, word_lc, lemma_lc |
doc, s, p, g, noise, hi |
|
150 million tokens / 117 million words |
yes |
2020 |
subcorpus of scientific, professional and popular science texts |
word, lemma, tag, prec, word_lc, lemma_lc |
doc, s, p, g, hi |
|
263 million tokens / 208 million words |
yes |
2020 |
subcorpus of fiction texts |
word, lemma, tag, prec, word_lc, lemma_lc |
doc, s, p, g, hi |
|
1 258 million tokens / 977 million words |
yes |
2020 |
subcorpus of original texts written in Slovak |
word, lemma, tag, prec, word_lc, lemma_lc |
doc, s, p, g, hi |
|
93 million tokens / 74 million words |
yes |
2020 |
subcorpus of original fiction texts written in Slovak |
word, lemma, tag, prec, word_lc, lemma_lc |
doc, s, p, g, hi |
|
99 million tokens / 79 million words |
yes |
2020 |
specific corpus of texts from years 1955–1989 (4.5 % journalistic, 78.6 % fiction, 12.4 % professional and 4.4 % other texts) |
word, lemma, tag, prec, word_lc, lemma_lc |
doc, s, p, g, hi |
|
1 477 million tokens / 1 160 million words |
yes |
2018 |
all publicly available texts in SNC (71.1 % journalistic, 15.4 % fiction, 8.5 % professional and 5.0 % other texts) |
word, lemma, tag, prec, word_lc, lemma_lc |
doc, s, p, g, noise, hi |
|
1 369 million tokens / 1 076 million words |
yes |
2018 |
corpus excluding: texts with incorrectly used diacritics, texts written before 1955, texts from outside the territory of Slovakia, texts from linguistic journals |
word, lemma, tag, prec, word_lc, lemma_lc |
doc, s, p, g, noise, hi |
|
377 million tokens / 298 million words |
yes |
2018 |
balanced subcorpus (33.3 % journalistic, 33.3 % fiction, 33.3 % professional texts) |
word, lemma, tag, prec, word_lc, lemma_lc |
doc, s, p, g, noise, hi |
|
1 010 million tokens / 791 million words |
yes |
2018 |
subcorpus of journalistic (informational) texts |
word, lemma, tag, prec, word_lc, lemma_lc |
doc, s, p, g, noise, hi |
|
122 million tokens / 96 million words |
yes |
2018 |
subcorpus of scientific, professional and popular science texts |
word, lemma, tag, prec, word_lc, lemma_lc |
doc, s, p, g, hi |
|
224 million tokens / 178 million words |
yes |
2018 |
subcorpus of fiction texts |
word, lemma, tag, prec, word_lc, lemma_lc |
doc, s, p, g, hi |
|
1 043 million tokens / 822 million words |
yes |
2018 |
subcorpus of original texts written in Slovak |
word, lemma, tag, prec, word_lc, lemma_lc |
doc, s, p, g, hi |
|
83 million tokens / 66 million words |
yes |
2018 |
subcorpus of original fiction texts written in Slovak |
word, lemma, tag, prec, word_lc, lemma_lc |
doc, s, p, g, hi |
|
84 million tokens / 67 million words |
yes |
2018 |
specific corpus of texts from years 1955–1989 (5.3 % journalistic, 75.3 % fiction, 14.0 % professional and 5.4 % other texts) |
word, lemma, tag, prec, word_lc, lemma_lc |
doc, s, p, g, hi |
|
1 250 million tokens / 972 million words |
yes |
2015 |
all publicly available texts in SNC (65.1 % journalistic, 15.1 % fiction, 9.5 % professional and 10.3 % other texts) |
word, lemma, tag, prec |
doc, s, p, g |
|
1 089 million tokens / 849 million words |
yes |
2015 |
corpus excluding: texts with incorrectly used diacritics, texts written before 1955, texts from outside the territory of Slovakia, texts from linguistic journals |
word, lemma, tag, prec |
doc, s, p, g |
|
341 million tokens / 267 million words |
yes |
2015 |
balanced subcorpus (33.3 % journalistic, 33.3 % fiction, 33.3 % professional texts) |
word, lemma, tag, prec |
doc, s, p, g |
|
771 million tokens / 597 million words |
yes |
2015 |
subcorpus of journalistic (informational) texts |
word, lemma, tag, prec |
doc, s, p, g |
|
114 million tokens / 89 million words |
yes |
2015 |
subcorpus of scientific, professional and popular science texts |
word, lemma, tag, prec |
doc, s, p, g |
|
188 million tokens / 149 million words |
yes |
2015 |
subcorpus of fiction texts |
word, lemma, tag, prec |
doc, s, p, g |
|
807 million tokens / 630 million words |
yes |
2015 |
subcorpus of original texts written in Slovak |
word, lemma, tag, prec |
doc, s, p, g |
|
65 million tokens / 52 million words |
yes |
2015 |
subcorpus of original fiction texts written in Slovak |
word, lemma, tag, prec |
doc, s, p, g |
|
67 million tokens / 54 million words |
yes |
2015 |
specific corpus of texts from years 1955–1989 (7.4 % journalistic, 69.3 % fiction, 16.6 % professional and 6.7 % other texts) |
word, lemma, tag, prec |
doc, s, p, g |
|
830 million tokens / 656 million words |
yes |
2013 |
all publicly available SNC texts (68.8 % journalistic, 13.9 % fiction, 15.3 % professional and 2 % other texts) |
word, lemma, tag, prec |
doc, s, p, g |
|
773 million tokens / 610 million words |
yes |
2013 |
corpus excluding: texts with incorrectly used diacritics, texts written before 1955, texts from outside the territory of Slovakia, texts from linguistic journals |
word, lemma, tag, prec |
doc, s, p, g |
|
317 million tokens / 252 million words |
yes |
2013 |
balanced subcorpus (33.3 % journalistic, 33.3 % fiction, 33.3 % professional texts) |
word, lemma, tag, prec |
doc, s, p, g |
|
541 million tokens / 425 million words |
yes |
2013 |
subcorpus of journalistic (informational) texts |
word, lemma, tag, prec |
doc, s, p, g |
|
106 million tokens / 84 million words |
yes |
2013 |
subcorpus of scientific, professional and popular science texts |
word, lemma, tag, prec |
doc, s, p, g |
|
114 million tokens / 91 million words |
yes |
2013 |
subcorpus of fiction texts |
word, lemma, tag, prec |
doc, s, p, g |
|
558 million tokens / 441 million words |
yes |
2013 |
subcorpus of original texts written in Slovak |
word, lemma, tag, prec |
doc, s, p, g |
|
35 million tokens / 28 million words |
yes |
2013 |
subcorpus of original Slovak fiction texts |
word, lemma, tag, prec |
doc, s, p, g |
|
63 million tokens / 51 million words |
yes |
2013 |
specific corpus of texts from years 1955–1989 (11.9 % journalistic, 55.5 % fiction, 24.1 % professional and 8.5 % other texts) |
word, lemma, tag, prec |
doc, s, p, g |
|
1 155 million tokens / 939 million words |
yes |
2013 |
all publicly available SNC texts (77.8 % journalistic, 9.8 % fiction, 11 % professional, 1.4 % other texts) |
word, lemma, tag, prec |
doc, s, p, g |
|
719 million tokens / 599 million words |
yes |
2011 |
all publicly available SNC texts (73 % journalistic, 14 % fiction, 12 % professional, 1 % other texts) |
word, lemma, tag, prec |
doc, s, p, br, noise, picture, head, hi, equation, table |
|
526 million tokens / 429 million words |
yes |
2009 |
all publicly available SNC texts (65 % journalistic, 17 % fiction, 16 % professional, 2 % other texts) |
word, lemma, tag, prec |
doc, s, p, br, noise, picture, head, hi, equation, table |
|
339 million tokens / 276 million words |
yes |
2007 |
all publicly available SNC texts (57 % journalistic, 21.5 % fiction, 18.5 % professional, 3 % other texts) |
word, lemma, tag, hlemma, htag |
doc, s, p, br, noise, picture, head, hi, equation, table |
|
294 million tokens / 240 million words |
yes |
2006 |
all publicly available SNC texts (63 % journalistic, 20 % fiction, 12 % professional, 5 % other texts) |
word, lemma, tag, hlemma, htag |
doc, s, p, br, noise, picture, head, hi, equation, table |
|
4 042 million tokens / 3 326 million words |
yes |
2020 |
corpus of Slovak texts available on the web |
word, lemma, tag, prec, word_lc, lemma_lc |
doc, p, s, g, pgap, sgap |
|
2 963 million tokens / 2 440 million words |
yes |
2018 |
corpus of Slovak texts available on the web |
word, lemma, tag, prec, word_lc, lemma_lc |
doc, p, s, g, pgap, sgap |
|
2 372 million tokens / 1 993 million words |
yes |
2015 |
corpus of Slovak texts available on the web |
word, lemma, tag, prec |
doc, p, s, g, gap |
|
51 million tokens / 38 million words |
yes |
2020 |
corpus of texts from Slovak Wikipédia |
word, lemma, tag, prec |
doc, s, p, m, g |
|
47 million tokens / 35 million words |
yes |
2018 |
corpus of texts from Slovak Wikipédia and Necyklopédia |
word, lemma, tag, prec |
doc, s, p |
|
45 million tokens / 34 million words |
yes |
2017 |
corpus of texts from Slovak Wikipédia and Necyklopédia |
word, lemma, tag, prec |
doc, s, p |
|
43 million tokens / 34 million words |
yes |
2016 |
corpus of texts from Slovak Wikipédia and Necyklopédia |
word, lemma, tag, prec |
doc, s, p |
|
40 million tokens / 32 million words |
yes |
2015 |
corpus of texts from Slovak Wikipédia and Necyklopédia |
word, lemma, tag, prec |
doc, s, p |
|
253 million tokens / 203 million words |
yes |
2018 |
The reference corpus prim-7.0-frk was the source for Frekvenčný slovník slovenčiny na báze Slovenského národného korpusu (Slovak Frequency Dictionary Based on the Slovak National Corpus), as well as for the examples listed in the publication Skloňovanie podstatných mien v slovenčine s korpusovými príkladmi (Declension of the Slovak Nouns with Corpus Examples). |
word, lemma, tag, prec |
doc, s, p, g |
|
1.2 million tokens / 978 000 words |
yes |
2017 |
manually morphologically annotated corpus (30.6 % journalistic, 50.2 % fiction, 19.2 % professional texts) |
word, lemma, tag |
doc, s, p, br, noise, picture, head, hi, equation, table |
|
1.2 million tokens / 978 000 words |
yes |
2016 |
manually morphologically annotated corpus (28.5 % journalistic, 44.5 % fiction, 27 % professional texts) |
word, lemma, tag |
doc, s, p, br, noise, picture, head, hi, equation, table |
|
1.2 million tokens / 977 000 words |
yes |
2013 |
manually morphologically annotated corpus (36.2 % journalistic, 44.9 % fiction, 18.9 % professional texts) |
word, lemma, tag |
doc, s, p, hi |
2. Written corpora − synchronic, specialised
corpus |
size − number of tokens / number of words |
lemmatisation, morphological annotation |
year of release |
characteristics |
attributes |
structures |
66 million tokens / 54 million words |
yes |
2014 |
corpus of religious texts |
word, lemma, tag, prec |
doc, s, p, g |
|
1.6 million tokens / 1.2 million words |
yes |
2014 |
corpus of copywriting texts |
word, lemma, tag, prec |
doc, s, p, g |
|
165 million tokens / 140 million words |
yes |
2016 |
corpus of economic texts (3.76 % professional and 96.24 % journalistic texts from the field of economics, banking, trade, management and merchandising) |
word, lemma, tag, prec |
doc, s, p, g |
|
20 million tokens / 17 million words |
yes |
2014 |
corpus of economic texts (81.4 % professional and 18.6 % journalistic texts from the field of economics, banking, trade, management and merchandising) |
word, lemma, tag, prec |
doc, s, p, g |
|
39 million tokens / 30 million words |
yes |
2016 |
corpus of humanistic texts |
word, lemma, tag, prec |
doc, s, p, g |
|
1.5 million tokens / 1.3 million words |
yes |
2015 |
corpus of judicial decisions |
word, lemma, tag, prec |
doc, s, p |
|
49 million tokens / 40 million words |
yes |
|
corpus of legal texts (deduplicated) |
word, lemma, tag, ftag, rgtag |
doc, p, s, s0, g |
|
147 million tokens / 114 million words |
yes |
2011 |
corpus of legal texts |
|
|
3. Written corpora − parallel
corpus |
size − number of tokens / number of words |
lemmatisation, morphological annotation |
year of release (first version released in) |
characteristics |
attributes |
structures |
163 million tokens / 108 million words |
yes |
2014 |
Slovak-Bulgarian parallel corpus: 78 million tokens in Slovak half, 85 million tokens in Bulgarian half |
word, lemma, tag |
doc, s |
|
418 million tokens / 306 million words |
yes |
2016 |
Slovak-Czech parallel corpus: 209 million tokens in Slovak half, 209 million tokens in Czech half |
word, lemma, tag |
doc, s |
|
31.5 million tokens / 25.0 million words |
yes |
2018 |
Slovak-Czech parallel corpus, subcorpus fiction: 15.7 million tokens in Slovak half, 15.8 million tokens in Czech half |
word, lemma, tag |
doc, s |
|
446 million tokens / 300 million words |
yes |
2016 |
Slovak-German parallel corpus: 220 million tokens in Slovak half, 226 million tokens in German half |
word, lemma, tag |
doc, s |
|
556 million tokens / 436 million words |
yes |
2015 |
Slovak-English parallel corpus: 261 million tokens in Slovak half, 295 million tokens in English half |
word, lemma, tag |
doc, s |
|
449 million tokens / 332 million words |
yes |
2016 |
Slovak-French parallel corpus: 217 million tokens in Slovak half, 232 million tokens in French half |
word, lemma, tag |
doc, s |
|
99 million tokens / 75 million words |
yes |
2015 |
Slovak-Hungarian parallel corpus: 51 million tokens in Slovak half, 48 million tokens in Hungarian half |
word, lemma, tag |
doc, s |
|
3.9 million tokens |
yes |
2015 |
Slovak-Hungarian parallel corpus: 2.0 million tokens in Slovak half, 1.9 million tokens in Hungarian half |
word, lemma, tag |
doc, s |
|
5.0 million tokens / 4.1 million words |
yes |
2018 |
Slovak-Latin parallel corpus: 2.7 million tokens in Slovak half, 2.3 million tokens in Latin half |
word, lemma, tag |
doc, s |
|
1.3 million tokens / 1.0 million words |
yes |
2017 |
Slovak-Romanian parallel corpus: 603 000 tokens in Slovak half, 689 000 tokens in Romanian half |
word, lemma, tag |
doc, s |
|
8.2 million tokens / 6.5 million words |
yes |
2018 |
Slovak-Polish parallel corpus: 4.1 million tokens in Slovak half, 4.1 million tokens in Polish half |
word, lemma, tag |
doc, s |
|
8.5 million tokens / 6.6 million words |
yes |
2014 |
Slovak-Russian parallel corpus: 4.2 million tokens in Slovak half, 4.2 million tokens in Russian half |
word, lemma, tag |
doc, s |
4. Written corpora of texts before the year 1955
corpus |
size − number of tokens / number of words |
lemmatisation, morphological annotation |
year of release |
characteristics |
attributes |
structures |
2.1 million tokens / 1.6 million words |
no |
2015 |
corpus of texts from 864–1843 |
word |
doc, s, p, g |
|
24 million tokens / 19 million words |
no |
2015 |
corpus of texts from 1843–1954 |
word |
doc, s, p, g |
5. Historical corpus
corpus |
size − number of tokens / number of words |
lemmatisation, morphological annotation |
year of release |
characteristics |
attributes |
structures |
998 000 tokens / 731 000 words |
no |
2020 |
Corpus of historical Slovak |
word, lemma |
doc, s, p, g, noise, rem, miss |
|
918 000 tokens / 668 000 words |
no |
2016 |
Corpus of historical Slovak |
word, lemma |
doc, s, p, g |
|
836 000 tokens / 600 000 words |
no |
2015 |
Corpus of historical Slovak |
word, lemma |
doc, s, p, g |
|
552 000 tokens / 422 000 words |
no |
2014 |
Corpus of historical Slovak |
word, lemma |
doc, s, p, g |
|
371 000 tokens |
no |
2012 |
Corpus of historical Slovak |
word, nword |
doc, s, p, g |
6. Spoken corpora − synchronic, standard
corpus |
size − number of tokens / number of words |
lemmatisation, morphological annotation |
year of release |
characteristics |
attributes |
structures |
6.6 million tokens / 5.5 million words |
yes |
2017 |
Corpus of spoken Slovak |
word, pron, lemma, tag, prec |
||
5.7 million tokens / 4.7 million words |
yes |
2015 |
Corpus of spoken Slovak |
word, pron, lemma, tag, prec |
doc, section, turn, event, sync, background, who, spk |
|
2.6 million tokens / 2.2 million words |
yes |
2012 |
Corpus of spoken Slovak |
word, pron, lemma, tag, prec |
doc, section, turn, event, sync, background, who, spk |
|
2.1 million tokens / 1.4 million words |
yes |
2011 |
Corpus of spoken Slovak |
word, pron, lemma, tag, dcount |
doc, section, turn, event, sync, background, who |
|
679 000 tokens / 561 000 words |
yes |
2010 |
Corpus of spoken Slovak |
word, pron, lemma, tag, dcount |
doc, section, turn, event, sync, background, who |
|
128 000 tokens / 104 000 words |
yes |
2008 |
Corpus of spoken Slovak |
word, pron, lemma, tag, dcount |
doc, section, turn, event, sync, background, who |
7. Corpora of dialects of the SNC
corpus |
size − number of tokens / number of words |
lemmatisation, morphological annotation |
year of release |
characteristics |
attributes |
structures |
712 000 tokens / 571 000 words |
no |
2020 |
Corpora of dialects of the SNC |
word, lemma |
doc, spk, s, p, rem |
|
495 000 tokens / 403 000 words |
no |
2016 |
Corpora of dialects of the SNC |
word, lemma |
doc, spk, s, p, rem |
|
329 000 tokens / 252 000 words |
no |
2015 |
Corpora of dialects of the SNC |
word, lemma |
doc, spk, s, p, rem |
|
74 000 tokens / 55 000 words |
no |
2014 |
Corpora of dialects of the SNC |
word, lemma |
doc, s, p |
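
The size cells in the tables above follow a few textual patterns, for example `1 688 million tokens / 1 355 million words`, `978 000 words` or `3.9 million tokens`. As a minimal sketch, assuming those cells have been copied out as plain strings, they could be converted to integer counts so that releases can be compared numerically. The function name `parse_size` and the regular expression are illustrative assumptions and are not part of any SNC tool.

```python
import re

# Multipliers for the unit words that appear in the size cells.
# An absent unit (e.g. "978 000 words") means the number is already absolute.
_UNITS = {"million": 1_000_000, "": 1}

# A number (digits, optional decimal point, optional space as thousands
# separator), an optional "million", then the kind of unit being counted.
_SIZE_PART = re.compile(r"([\d.][\d. ]*?)\s*(million)?\s+(tokens|words)")


def parse_size(cell: str) -> dict:
    """Return e.g. {'tokens': 1688000000, 'words': 1355000000} for a size cell.

    Kinds that are not mentioned in the cell are simply missing from the result.
    """
    result = {}
    for number, unit, kind in _SIZE_PART.findall(cell):
        value = float(number.replace(" ", ""))  # "1 688" -> 1688.0
        result[kind] = round(value * _UNITS[unit])
    return result


if __name__ == "__main__":
    size = parse_size("1 688 million tokens / 1 355 million words")
    # Share of tokens that are counted as words in this release.
    print(size, size["words"] / size["tokens"])
```

Cells that list only tokens, such as `3.9 million tokens`, yield a dictionary without a `words` key, so callers should check for the key before computing a ratio.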