Attributes and structural values in the publicly available SNC corpora

Contents

Written corpora − synchronous, general
Written corpora − synchronous, specialised
Written corpora − parallel
Written corpora of texts before the year 1955
Historical corpus
Spoken corpora − synchronous, standard
Corpora of dialects of the SNC

1. Written corpora − synchronous, general

corpus	size − number of tokens / number of words	lemmatisation, morphological annotation	year of release	characteristics	attributes	structures
prim-10.0-public-all	1 688 million tokens / 1 355 million words	yes	2020	all publicly available texts in SNC (71.0 % journalistic, 16.8 % fiction, 11.3 % professional and 0.9 % other texts)	word, lemma, tag, word_lc, lemma_lc	doc, s, p, g, noise, hi
prim-10.0-public-sane	1 650 million tokens / 1 323 million words	yes	2020	corpus excluding: texts with incorrectly used diacritics, texts written before 1955, texts from outside the territory of Slovakia, texts from linguistic journals	word, lemma, tag, word_lc, lemma_lc	doc, s, p, g, noise, hi
prim-10.0-public-vyv	572 million tokens / 459 million words	yes	2020	balanced subcorpus (33.3 % journalistic, 33.3 % fiction, 33.3 % professional texts)	word, lemma, tag, word_lc, lemma_lc	doc, s, p, g, noise, hi
prim-10.0-public-inf	1 163 million tokens / 932 million words	yes	2020	subcorpus of journalistic (informational) texts	word, lemma, tag, word_lc, lemma_lc	doc, s, p, g, noise, hi
prim-10.0-public-prf	189 million tokens / 153 million words	yes	2020	subcorpus of scientific, professional and popular science texts	word, lemma, tag, word_lc, lemma_lc	doc, s, p, g, hi
prim-10.0-public-img	283 million tokens / 226 million words	yes	2020	subcorpus of fiction texts	word, lemma, tag, word_lc, lemma_lc	doc, s, p, g, hi
prim-10.0-public-sk	1 361 million tokens / 1 093 million words	yes	2020	subcorpus of original texts written in Slovak	word, lemma, tag, word_lc, lemma_lc	doc, s, p, g, hi
prim-10.0-public-img-sk	97 million tokens / 78 million words	yes	2020	subcorpus of original fiction texts written in Slovak	word, lemma, tag, word_lc, lemma_lc	doc, s, p, g, hi
r1955az1989-7.0	109 million tokens / 87 million words	yes	2020	specific corpus of texts from years 1955–1989 (4.0 % journalistic, 81.2 % fiction, 11.1 % professional and 3.7 % other texts)	word, lemma, tag, prec, word_lc, lemma_lc	doc, s, p, g, hi
prim-9.0-public-all	1 652 million tokens / 1 282 million words	yes	2020	all publicly available texts in SNC (74.0 % journalistic, 16.0 % fiction, 9.2 % professional and 0.9 % other texts)	word, lemma, tag, prec, word_lc, lemma_lc	doc, s, p, g, noise, hi
prim-9.0-public-sane	1 621 million tokens / 1 257 million words	yes	2020	corpus excluding: texts with incorrectly used diacritics, texts written before 1955, texts from outside the territory of Slovakia, texts from linguistic journals	word, lemma, tag, prec, word_lc, lemma_lc	doc, s, p, g, noise, hi
prim-9.0-public-vyv	454 million tokens / 355 million words	yes	2020	balanced subcorpus (33.3 % journalistic, 33.3 % fiction, 33.3 % professional texts)	word, lemma, tag, prec, word_lc, lemma_lc	doc, s, p, g, noise, hi
prim-9.0-public-inf	1 194 million tokens / 920 million words	yes	2020	subcorpus of journalistic (informational) texts	word, lemma, tag, prec, word_lc, lemma_lc	doc, s, p, g, noise, hi
prim-9.0-public-prf	150 million tokens / 117 million words	yes	2020	subcorpus of scientific, professional and popular science texts	word, lemma, tag, prec, word_lc, lemma_lc	doc, s, p, g, hi
prim-9.0-public-img	263 million tokens / 208 million words	yes	2020	subcorpus of fiction texts	word, lemma, tag, prec, word_lc, lemma_lc	doc, s, p, g, hi
prim-9.0-public-sk	1 258 million tokens / 977 million words	yes	2020	subcorpus of original texts written in Slovak	word, lemma, tag, prec, word_lc, lemma_lc	doc, s, p, g, hi
prim-9.0-public-img-sk	93 million tokens / 74 million words	yes	2020	subcorpus of original fiction texts written in Slovak	word, lemma, tag, prec, word_lc, lemma_lc	doc, s, p, g, hi
r1955az1989-6.0	99 million tokens / 79 million words	yes	2020	specific corpus of texts from years 1955–1989 (4.5 % journalistic, 78.6 % fiction, 12.4 % professional and 4.4 % other texts)	word, lemma, tag, prec, word_lc, lemma_lc	doc, s, p, g, hi
prim-8.0-public-all	1 477 million tokens / 1 160 million words	yes	2018	all publicly available texts in SNC (71.1 % journalistic, 15.4 % fiction, 8.5 % professional and 5.0 % other texts)	word, lemma, tag, prec, word_lc, lemma_lc	doc, s, p, g, noise, hi
prim-8.0-public-sane	1 369 million tokens / 1 076 million words	yes	2018	corpus excluding: texts with incorrectly used diacritics, texts written before 1955, texts from outside the territory of Slovakia, texts from linguistic journals	word, lemma, tag, prec, word_lc, lemma_lc	doc, s, p, g, noise, hi
prim-8.0-public-vyv	377 million tokens / 298 million words	yes	2018	balanced subcorpus (33.3 % journalistic, 33.3 % fiction, 33.3 % professional texts)	word, lemma, tag, prec, word_lc, lemma_lc	doc, s, p, g, noise, hi
prim-8.0-public-inf	1 010 million tokens / 791 million words	yes	2018	subcorpus of journalistic (informational) texts	word, lemma, tag, prec, word_lc, lemma_lc	doc, s, p, g, noise, hi
prim-8.0-public-prf	122 million tokens / 96 million words	yes	2018	subcorpus of scientific, professional and popular science texts	word, lemma, tag, prec, word_lc, lemma_lc	doc, s, p, g, hi
prim-8.0-public-img	224 million tokens / 178 million words	yes	2018	subcorpus of fiction texts	word, lemma, tag, prec, word_lc, lemma_lc	doc, s, p, g, hi
prim-8.0-public-sk	1 043 million tokens / 822 million words	yes	2018	subcorpus of original texts written in Slovak	word, lemma, tag, prec, word_lc, lemma_lc	doc, s, p, g, hi
prim-8.0-public-img-sk	83 million tokens / 66 million words	yes	2018	subcorpus of original fiction texts written in Slovak	word, lemma, tag, prec, word_lc, lemma_lc	doc, s, p, g, hi
r1955az1989-5.0	84 million tokens / 67 million words	yes	2018	specific corpus of texts from years 1955–1989 (5.3 % journalistic, 75.3 % fiction, 14.0 % professional and 5.4 % other texts)	word, lemma, tag, prec, word_lc, lemma_lc	doc, s, p, g, hi
prim-7.0-public-all	1 250 million tokens / 972 million words	yes	2015	all publicly available texts in SNC (65.1 % journalistic, 15.1 % fiction, 9.5 % professional and 10.3 % other texts)	word, lemma, tag, prec	doc, s, p, g
prim-7.0-public-sane	1 089 million tokens / 849 million words	yes	2015	corpus excluding: texts with incorrectly used diacritics, texts written before 1955, texts from outside the territory of Slovakia, texts from linguistic journals	word, lemma, tag, prec	doc, s, p, g
prim-7.0-public-vyv	341 million tokens / 267 million words	yes	2015	balanced subcorpus (33.3 % journalistic, 33.3 % fiction, 33.3 % professional texts)	word, lemma, tag, prec	doc, s, p, g
prim-7.0-public-inf	771 million tokens / 597 million words	yes	2015	subcorpus of journalistic (informational) texts	word, lemma, tag, prec	doc, s, p, g
prim-7.0-public-prf	114 million tokens / 89 million words	yes	2015	subcorpus of scientific, professional and popular science texts	word, lemma, tag, prec	doc, s, p, g
prim-7.0-public-img	188 million tokens / 149 million words	yes	2015	subcorpus of fiction texts	word, lemma, tag, prec	doc, s, p, g
prim-7.0-public-sk	807 million tokens / 630 million words	yes	2015	subcorpus of original texts written in Slovak	word, lemma, tag, prec	doc, s, p, g
prim-7.0-public-img-sk	65 million tokens / 52 million words	yes	2015	subcorpus of original fiction texts written in Slovak	word, lemma, tag, prec	doc, s, p, g
r1955az1989-4.0	67 million tokens / 54 million words	yes	2015	specific corpus of texts from years 1955–1989 (7.4 % journalistic, 69.3 % fiction, 16.6 % professional and 6.7 % other texts)	word, lemma, tag, prec	doc, s, p, g
prim-6.1-public-all	830 million tokens / 656 million words	yes	2013	all publicly available SNC texts (68.8 % journalistic, 13.9 % fiction, 15.3 % professional and 2 % other texts)	word, lemma, tag, prec	doc, s, p, g
prim-6.1-public-sane	773 million tokens / 610 million words	yes	2013	corpus excluding: texts with incorrectly used diacritics, texts written before 1955, texts from outside the territory of Slovakia, texts from linguistic journals	word, lemma, tag, prec	doc, s, p, g
prim-6.1-public-vyv	317 million tokens / 252 million words	yes	2013	balanced subcorpus (33.3 % journalistic, 33.3 % fiction, 33.3 % professional texts)	word, lemma, tag, prec	doc, s, p, g
prim-6.1-public-inf	541 million tokens / 425 million words	yes	2013	subcorpus of journalistic (informational) texts	word, lemma, tag, prec	doc, s, p, g
prim-6.1-public-prf	106 million tokens / 84 million words	yes	2013	subcorpus of scientific, professional and popular science texts	word, lemma, tag, prec	doc, s, p, g
prim-6.1-public-img	114 million tokens / 91 million words	yes	2013	subcorpus of fiction texts	word, lemma, tag, prec	doc, s, p, g
prim-6.1-public-sk	558 million tokens / 441 million words	yes	2013	subcorpus of original texts written in Slovak	word, lemma, tag, prec	doc, s, p, g
prim-6.1-public-img-sk	35 million tokens / 28 million words	yes	2013	subcorpus of original Slovak fiction texts	word, lemma, tag, prec	doc, s, p, g
r55az89-3.0	63 million tokens / 51 million words	yes	2013	specific corpus of texts from years 1955–1989 (11.9 % journalistic, 55.5 % fiction, 24.1 % professional and 8.5 % other texts)	word, lemma, tag, prec	doc, s, p, g
prim-6.0-public-all	1 155 million tokens / 939 million words	yes	2013	all publicly available SNC texts (77.8 % journalistic, 9.8 % fiction, 11 % professional, 1.4 % other texts)	word, lemma, tag, prec	doc, s, p, g
prim-5.0-public-all	719 million tokens / 599 million words	yes	2011	all publicly available SNC texts (73 % journalistic, 14 % fiction, 12 % professional, 1 % other texts)	word, lemma, tag, prec	doc, s, p, br, noise, picture, head, hi, equation, table
prim-4.0-public-all	526 million tokens / 429 million words	yes	2009	all publicly available SNC texts (65 % journalistic, 17 % fiction, 16 % professional, 2 % other texts)	word, lemma, tag, prec	doc, s, p, br, noise, picture, head, hi, equation, table
prim-3.0-public-all	339 million tokens / 276 million words	yes	2007	all publicly available SNC texts (57 % journalistic, 21.5 % fiction, 18.5 % professional, 3 % other texts)	word, lemma, tag, hlemma, htag	doc, s, p, br, noise, picture, head, hi, equation, table
prim-2.1-public-all	294 million tokens / 240 million words	yes	2006	all publicly available SNC texts (63 % journalistic, 20 % fiction, 12 % professional, 5 % other texts)	word, lemma, tag, hlemma, htag	doc, s, p, br, noise, picture, head, hi, equation, table
web-5.0	4 042 million tokens / 3 326 million words	yes	2020	corpus of Slovak texts available on the web	word, lemma, tag, prec, word_lc, lemma_lc	doc, p, s, g, pgap, sgap
web-4.0	2 963 million tokens / 2 440 million words	yes	2018	corpus of Slovak texts available on the web	word, lemma, tag, prec, word_lc, lemma_lc	doc, p, s, g, pgap, sgap
web-3.0	2 372 million tokens / 1 993 million words	yes	2015	corpus of Slovak texts available on the web	word, lemma, tag, prec	doc, p, s, g, gap
wiki-2019-08	51 million tokens / 38 million words	yes	2020	corpus of texts from Slovak Wikipédia	word, lemma, tag, prec	doc, s, p, m, g
wiki-2018-03	47 million tokens / 35 million words	yes	2018	corpus of texts from Slovak Wikipédia and Necyklopédia	word, lemma, tag, prec	doc, s, p
wiki-2017-02	45 million tokens / 34 million words	yes	2017	corpus of texts from Slovak Wikipédia and Necyklopédia	word, lemma, tag, prec	doc, s, p
wiki-2016-02	43 million tokens / 34 million words	yes	2016	corpus of texts from Slovak Wikipédia and Necyklopédia	word, lemma, tag, prec	doc, s, p
wiki-2015-02	40 million tokens / 32 million words	yes	2015	corpus of texts from Slovak Wikipédia and Necyklopédia	word, lemma, tag, prec	doc, s, p
prim-7.0-frk	253 million tokens / 203 million words	yes	2018	The reference corpus prim-7.0-frk was the source for Frekvenčný slovník slovenčiny na báze Slovenského národného korpusu (Slovak Frequency Dictionary Based on the Slovak National Corpus), as well as for the examples listed in the publication Skloňovanie podstatných mien v slovenčine s korpusovými príkladmi (Declension of the Slovak Nouns with Corpus Examples).	word, lemma, tag, prec	doc, s, p, g
r-mak-6.0	1.2 million tokens / 978 000 words	yes	2017	manually morphologically annotated corpus (30.6 % journalistic, 50.2 % fiction, 19.2 % professional texts)	word, lemma, tag	doc, s, p, br, noise, picture, head, hi, equation, table
r-mak-5.0	1.2 million tokens / 978 000 words	yes	2016	manually morphologically annotated corpus (28.5 % journalistic, 44.5 % fiction, 27 % professional texts)	word, lemma, tag	doc, s, p, br, noise, picture, head, hi, equation, table
r-mak-4.0	1.2 million tokens / 977 000 words	yes	2013	manually morphologically annotated corpus (36.2 % journalistic, 44.9 % fiction, 18.9 % professional texts)	word, lemma, tag	doc, s, p, hi
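
The rows above follow a fixed column layout (corpus, size, annotation, year, characteristics, attributes, structures), so they can be read into records for filtering or comparison. The following is a minimal sketch, assuming the table is available as tab-separated text; the names `CorpusRow`, `parse_count` and `parse_row` are hypothetical and are not part of any SNC tool.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CorpusRow:
    name: str
    tokens: Optional[int]
    words: Optional[int]
    annotated: str
    year: str
    characteristics: str
    attributes: List[str]
    structures: List[str]


# Matches sizes such as "1 688 million tokens", "1.2 million tokens", "978 000 words".
_COUNT = re.compile(r"^\s*([\d.\s]+?)\s*(million)?\s*(?:tokens|words)\s*$")


def parse_count(text: str) -> Optional[int]:
    """Convert one size expression to an integer, or None if it does not parse."""
    m = _COUNT.match(text)
    if m is None:
        return None
    # Drop the spaces used as thousands separators before converting.
    number = float("".join(m.group(1).split()))
    if m.group(2):
        number *= 1_000_000
    return int(round(number))


def parse_row(line: str) -> CorpusRow:
    """Split one tab-separated table row; missing trailing cells become empty."""
    cells = line.rstrip("\n").split("\t")
    cells += [""] * (7 - len(cells))
    sizes = cells[1].split("/")
    tokens = parse_count(sizes[0])
    words = parse_count(sizes[1]) if len(sizes) > 1 else None
    # Attribute names are single identifiers, so split on commas or whitespace.
    attributes = [a for a in re.split(r"[,\s]+", cells[5]) if a]
    structures = [s.strip() for s in cells[6].split(",") if s.strip()]
    return CorpusRow(cells[0], tokens, words, cells[2], cells[3],
                     cells[4], attributes, structures)
```

With such records one could, for example, select every corpus whose attributes include `word_lc`, or compute the ratio of words to tokens per corpus.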

2. Written corpora − synchronous, specialised

corpus	size − number of tokens / number of words	lemmatisation, morphological annotation	year of release	characteristics	attributes	structures
blf-2.0	66 million tokens / 54 million words	yes	2014	corpus of religious texts	word, lemma, tag, prec	doc, s, p, g
cw-2014-all	1.6 million tokens / 1.2 million words	yes	2014	corpus of copywriting texts	word, lemma, tag, prec	doc, s, p, g
ecn-2.0-public	165 million tokens / 140 million words	yes	2016	corpus of economic texts (3.76 % professional and 96.24 % journalistic texts from the field of economics, banking, trade, management and merchandising)	word, lemma, tag, prec	doc, s, p, g
ecn-1.0-public	20 million tokens / 17 million words	yes	2014	corpus of economic texts (81.4 % professional and 18.6 % journalistic texts from the field of economics, banking, trade, management and merchandising)	word, lemma, tag, prec	doc, s, p, g
hum-1.0-public	39 million tokens / 30 million words	yes	2016	corpus of humanities texts	word, lemma, tag, prec	doc, s, p, g
judikat-1.0	1.5 million tokens / 1.3 million words	yes	2015	corpus of judicial decisions	word, lemma, tag, prec	doc, s, p
legal-1.1	49 million tokens / 40 million words	yes		corpus of legal texts (deduplicated)	word, lemma, tag, ftag, rgtag	doc, p, s, s0, g
legal-1.0	147 million tokens / 114 million words	yes	2011	corpus of legal texts

3. Written corpora − parallel

corpus	size − number of tokens / number of words	lemmatisation, morphological annotation	year of release (first version released in)	characteristics	attributes	structures
par-skbg-free-0.1	163 million tokens / 108 million words	yes, both languages	2014 (2014)	Slovak-Bulgarian parallel corpus: 78 million tokens in Slovak half, 85 million tokens in Bulgarian half	word, lemma, tag	doc, s
par-skcs-all-4.0	418 million tokens / 306 million words	yes, both languages	2016 (2010)	Slovak-Czech parallel corpus: 209 million tokens in Slovak half, 209 million tokens in Czech half	word, lemma, tag	doc, s
par-skcs-fic-5.0	31.5 million tokens / 25.0 million words	yes, both languages	2018 (2010)	Slovak-Czech parallel corpus, subcorpus fiction: 15.7 million tokens in Slovak half, 15.8 million tokens in Czech half	word, lemma, tag	doc, s
par-skde-all-2.0	446 million tokens / 300 million words	yes, both languages	2016 (2014)	Slovak-German parallel corpus: 220 million tokens in Slovak half, 226 million tokens in German half	word, lemma, tag	doc, s
par-sken-4.0	556 million tokens / 436 million words	yes, both languages	2015 (2010)	Slovak-English parallel corpus: 261 million tokens in Slovak half, 295 million tokens in English half	word, lemma, tag	doc, s
par-skfr-all-3.0	449 million tokens / 332 million words	yes, both languages	2016 (2006)	Slovak-French parallel corpus: 217 million tokens in Slovak half, 232 million tokens in French half	word, lemma, tag	doc, s
par-skhu-1.0	99 million tokens / 75 million words	yes, both languages	2015 (2014)	Slovak-Hungarian parallel corpus: 51 million tokens in Slovak half, 48 million tokens in Hungarian half	word, lemma, tag	doc, s
par-skhu-0.2	3.9 million tokens	yes, both languages	2015 (2014)	Slovak-Hungarian parallel corpus: 2.0 million tokens in Slovak half, 1.9 million tokens in Hungarian half	word, lemma, tag	doc, s
par-skla-3.0	5.0 million tokens / 4.1 million words	yes, both languages	2018 (2012)	Slovak-Latin parallel corpus: 2.7 million tokens in Slovak half, 2.3 million tokens in Latin half	word, lemma, tag	doc, s
par-skro-1.1	1.3 million tokens / 1.0 million words	yes, both languages	2017 (2016)	Slovak-Romanian parallel corpus: 603 000 tokens in Slovak half, 689 000 tokens in Romanian half	word, lemma, tag	doc, s
par-skpl-1.0	8.2 million tokens / 6.5 million words	yes, both languages	2018 (2018)	Slovak-Polish parallel corpus: 4.1 million tokens in Slovak half, 4.1 million tokens in Polish half	word, lemma, tag	doc, s
par-skru-2.0	8.5 million tokens / 6.6 million words	yes, both languages	2014 (2012)	Slovak-Russian parallel corpus: 4.2 million tokens in Slovak half, 4.2 million tokens in Russian half	word, lemma, tag	doc, s
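
Each row above gives a total token count and the token counts of the two language halves, all rounded. A small sketch of a consistency check follows, assuming the figures are entered by hand in millions; the function name `halves_match` and the tolerance rule (1.5 times the rounding step, since three rounded numbers are involved) are illustrative assumptions, not something the SNC documentation prescribes.

```python
def halves_match(total: float, half_a: float, half_b: float, step: float) -> bool:
    """Return True if the two halves add up to the total within rounding error.

    `step` is the rounding granularity of the published figures, in the same
    unit as the counts (for example 1.0 for whole millions, 0.1 for one
    decimal place). Each of the three rounded numbers can be off by step / 2,
    so the sum may differ from the total by up to 1.5 * step.
    """
    return abs(total - (half_a + half_b)) <= 1.5 * step


# Figures taken from the table above, in millions of tokens.
print(halves_match(163, 78, 85, step=1.0))     # par-skbg-free-0.1
print(halves_match(8.5, 4.2, 4.2, step=0.1))   # par-skru-2.0
```

The second call shows why a tolerance is needed: 4.2 + 4.2 gives 8.4 rather than the published 8.5, which is consistent with rounding of the underlying exact counts.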

4. Written corpora of texts before the year 1955

corpus	size − number of tokens / number of words	lemmatisation, morphological annotation	year of release	characteristics	attributes	structures
r864az1843-1.0	2.1 million tokens / 1.6 million words	no	2015	corpus of texts from 864–1843	word	doc, s, p, g
r1843az1954-1.0	24 million tokens / 19 million words	no	2015	corpus of texts from 1843–1954	word	doc, s, p, g

5. Historical corpus

corpus	size − number of tokens / number of words	lemmatisation, morphological annotation	year of release	characteristics	attributes	structures
hist-5.0	998 000 tokens / 731 000 words	no	2020	Corpus of historical Slovak	word, lemma	doc, s, p, g, noise, rem, miss
hist-4.0	918 000 tokens / 668 000 words	no	2016	Corpus of historical Slovak	word, lemma	doc, s, p, g
hist-3.0	836 000 tokens / 600 000 words	no	2015	Corpus of historical Slovak	word, lemma	doc, s, p, g
hist-2.0	552 000 tokens / 422 000 words	no	2014	Corpus of historical Slovak	word, lemma	doc, s, p, g
hist-1.0	371 000 tokens	no	2012	Corpus of historical Slovak	word, nword	doc, s, p, g

6. Spoken corpora − synchronous, standard

corpus	size − number of tokens / number of words	lemmatisation, morphological annotation	year of release	characteristics	attributes	structures
s-hovor-6.0	6.6 million tokens / 5.5 million words	yes	2017	Corpus of spoken Slovak	word, pron, lemma, tag, prec	structures for s-hovor-6.0
s-hovor-5.0	5.7 million tokens / 4.7 million words	yes	2015	Corpus of spoken Slovak	word, pron, lemma, tag, prec	doc, section, turn, event, sync, background, who, spk
s-hovor-4.0	2.6 million tokens / 2.2 million words	yes	2012	Corpus of spoken Slovak	word, pron, lemma, tag, prec	doc, section, turn, event, sync, background, who, spk
s-hovor-3.0	2.1 million tokens / 1.4 million words	yes	2011	Corpus of spoken Slovak	word, pron, lemma, tag, dcount	doc, section, turn, event, sync, background, who
s-hovor-2.0	679 000 tokens / 561 000 words	yes	2010	Corpus of spoken Slovak	word, pron, lemma, tag, dcount	doc, section, turn, event, sync, background, who
s-hovor-1.0	128 000 tokens / 104 000 words	yes	2008	Corpus of spoken Slovak	word, pron, lemma, tag, dcount	doc, section, turn, event, sync, background, who

7. Corpora of dialects of the SNC

corpus	size − number of tokens / number of words	lemmatisation, morphological annotation	year of release	characteristics	attributes	structures
dialekt-4.0	712 000 tokens / 571 000 words	no	2020	Corpora of dialects of the SNC	word, lemma	doc, spk, s, p, rem
dialekt-3.0	495 000 tokens / 403 000 words	no	2016	Corpora of dialects of the SNC	word, lemma	doc, spk, s, p, rem
dialekt-2.0	329 000 tokens / 252 000 words	no	2015	Corpora of dialects of the SNC	word, lemma	doc, spk, s, p, rem
dialekt-1.0	74 000 tokens / 55 000 words	no	2014	Corpora of dialects of the SNC	word, lemma	doc, s, p

Slovak National Corpus
