Publicly available SNC corpora

1. Written corpora − synchronic, general

corpus

size − number of tokens / number of words

lemmatisation, morphological annotation

year of release

characteristics

prim-10.0-juls-all

1 961 million tokens / 1 572 million words

yes


internal corpus

monolingual corpus, comprised of all texts published or written after the year 1955

prim-10.0-public-all

1 688 million tokens / 1 355 million words

yes

2022

monolingual corpus, comprised of all texts published or written after the year 1955 and licensed for online search (71.0 % journalistic, 16.8 % fiction, 11.3 % professional and 0.9 % other texts)

prim-10.0-juls-sane

1 921 million tokens / 1 540 million words

yes


internal corpus

monolingual corpus excluding texts with incorrect diacritics, from outside the territory of Slovakia, from linguistic journals, scholarly works etc.

prim-10.0-public-sane

1 650 million tokens / 1 323 million words

yes

2022

monolingual corpus of texts licensed for online search, excluding texts with incorrect diacritics, from outside the territory of Slovakia, from linguistic journals, scholarly works etc. − the corpus is further divided into subcorpora

prim-10.0-public-vyv

572 million tokens / 459 million words

yes

2022

subcorpus balanced with regard to style (33.3 % journalistic, 33.3 % fiction, 33.3 % professional texts)

prim-10.0-public-inf

1 163 million tokens / 932 million words

yes

2022

subcorpus of journalistic (informational) texts

prim-10.0-public-prf

189 million tokens / 153 million words

yes

2022

subcorpus of scientific, professional and popular science texts

prim-10.0-public-img

283 million tokens / 226 million words

yes

2022

subcorpus of fiction texts

prim-10.0-public-sk

1 361 million tokens / 1 093 million words

yes

2022

subcorpus of original texts written in Slovak

prim-10.0-public-img-sk

97 million tokens / 78 million words

yes

2022

subcorpus of original fiction texts written in Slovak

r1955az1989-7.0

109 million tokens / 87 million words

yes

2022

subcorpus of texts from years 1955–1989 (4.0 % journalistic, 81.2 % fiction, 11.1 % professional and 3.7 % other texts)

prim-9.0-juls-all

1 870 million tokens / 1 455 million words

yes


internal corpus

monolingual corpus, comprised of all texts published or written after the year 1955

prim-9.0-public-all

1 652 million tokens / 1 282 million words

yes

2020

monolingual corpus, comprised of all texts published or written after the year 1955 and licensed for online search (74.0 % journalistic, 16.0 % fiction, 9.2 % professional and 0.9 % other texts)

prim-9.0-juls-sane

1 838 million tokens / 1 429 million words

yes


internal corpus

monolingual corpus excluding texts with incorrect diacritics, from outside the territory of Slovakia, from linguistic journals, scholarly works etc.

prim-9.0-public-sane

1 621 million tokens / 1 257 million words

yes

2020

monolingual corpus of texts licensed for online search, excluding texts with incorrect diacritics, from outside the territory of Slovakia, from linguistic journals, scholarly works etc. − the corpus is further divided into subcorpora

prim-9.0-public-vyv

454 million tokens / 355 million words

yes

2020

subcorpus balanced with regard to style (33.3 % journalistic, 33.3 % fiction, 33.3 % professional texts)

prim-9.0-public-inf

1 194 million tokens / 920 million words

yes

2020

subcorpus of journalistic (informational) texts

prim-9.0-public-prf

150 million tokens / 117 million words

yes

2020

subcorpus of scientific, professional and popular science texts

prim-9.0-public-img

263 million tokens / 208 million words

yes

2020

subcorpus of fiction texts

prim-9.0-public-sk

1 258 million tokens / 977 million words

yes

2020

subcorpus of original texts written in Slovak

prim-9.0-public-img-sk

93 million tokens / 74 million words

yes

2020

subcorpus of original fiction texts written in Slovak

r1955az1989-6.0

99 million tokens / 79 million words

yes

2020

subcorpus of texts from years 1955–1989 (4.5 % journalistic, 78.6 % fiction, 12.4 % professional and 4.4 % other texts)

prim-8.0-juls-all

1 647 million tokens / 1 295 million words

yes


internal corpus

monolingual corpus, comprised of all texts published or written after the year 1955

prim-8.0-public-all

1 477 million tokens / 1 160 million words

yes

2018

monolingual corpus, comprised of all texts published or written after the year 1955 and licensed for online search (71.1 % journalistic, 15.4 % fiction, 8.5 % professional and 5.0 % other texts)

prim-8.0-juls-sane

1 518 million tokens / 1 195 million words

yes


internal corpus

monolingual corpus excluding texts with incorrect diacritics, from outside the territory of Slovakia, from linguistic journals, scholarly works etc.

prim-8.0-public-sane

1 369 million tokens / 1 076 million words

yes

2018

monolingual corpus of texts licensed for online search, excluding texts with incorrect diacritics, from outside the territory of Slovakia, from linguistic journals, scholarly works etc. − the corpus is further divided into subcorpora

prim-8.0-public-vyv

377 million tokens / 298 million words

yes

2018

subcorpus balanced with regard to style (33.3 % journalistic, 33.3 % fiction, 33.3 % professional texts)

prim-8.0-public-inf

1 010 million tokens / 791 million words

yes

2018

subcorpus of journalistic (informational) texts

prim-8.0-public-prf

122 million tokens / 96 million words

yes

2018

subcorpus of scientific, professional and popular science texts

prim-8.0-public-img

224 million tokens / 178 million words

yes

2018

subcorpus of fiction texts

prim-8.0-public-sk

1 043 million tokens / 822 million words

yes

2018

subcorpus of original texts written in Slovak

prim-8.0-public-img-sk

83 million tokens / 66 million words

yes

2018

subcorpus of original fiction texts written in Slovak

r1955az1989-5.0

84 million tokens / 67 million words

yes

2018

subcorpus of texts from years 1955–1989 (5.3 % journalistic, 75.3 % fiction, 14.0 % professional and 5.4 % other texts)

prim-7.0-juls-all

1 437 million tokens / 1 119 million words

yes


internal corpus

monolingual corpus, comprised of all texts published or written after the year 1955

prim-7.0-public-all

1 250 million tokens / 972 million words

yes

2015

monolingual corpus, comprised of all texts published or written after the year 1955 and licensed for online search (65.1 % journalistic, 15.1 % fiction, 9.5 % professional and 10.3 % other texts)

prim-7.0-juls-sane

1 202 million tokens / 938 million words

yes


internal corpus

monolingual corpus excluding texts with incorrect diacritics, from outside the territory of Slovakia, from linguistic journals, scholarly works etc.

prim-7.0-public-sane

1 089 million tokens / 849 million words

yes

2015

monolingual corpus of texts licensed for online search, excluding texts with incorrect diacritics, from outside the territory of Slovakia, from linguistic journals, scholarly works etc. − the corpus is further divided into subcorpora

prim-7.0-public-vyv

341 million tokens / 267 million words

yes

2015

subcorpus balanced with regard to style (33.3 % journalistic, 33.3 % fiction, 33.3 % professional texts)

prim-7.0-public-inf

771 million tokens / 597 million words

yes

2015

subcorpus of journalistic texts

prim-7.0-public-prf

114 million tokens / 89 million words

yes

2015

subcorpus of scientific, professional and popular science texts

prim-7.0-public-img

188 million tokens / 149 million words

yes

2015

subcorpus of fiction texts

prim-7.0-public-sk

807 million tokens / 630 million words

yes

2015

subcorpus of original texts written in Slovak

prim-7.0-public-img-sk

65 million tokens / 52 million words

yes

2015

subcorpus of original fiction texts written in Slovak

r1955az1989-4.0

67 million tokens / 54 million words

yes

2015

subcorpus of texts from years 1955–1989 (7.4 % journalistic, 69.3 % fiction, 16.6 % professional and 6.7 % other texts)

prim-6.1-public-all

830 million tokens / 656 million words

yes

2013

monolingual corpus, comprised of all texts published or written after the year 1955 and licensed for online search (68.8 % journalistic, 13.9 % fiction, 15.3 % professional and 2.0 % other texts)

r55az89-3.0

63 million tokens / 51 million words

yes

2013

subcorpus of texts from years 1955–1989 (11.9 % journalistic, 55.5 % fiction, 24.1 % professional and 8.5 % other texts)

prim-6.0-public-all

1 155 million tokens / 939 million words

yes

2013

monolingual corpus, comprised of all texts published or written after the year 1955 and licensed for online search (77.8 % journalistic, 9.8 % fiction, 11.0 % professional and 1.4 % other texts)

prim-5.0-public-all

719 million tokens / 599 million words

yes

2011

monolingual corpus, comprised of all texts published or written after the year 1955 and licensed for online search (73 % journalistic, 14 % fiction, 12 % professional and 1 % other texts)

r55az89-2.0

44 million tokens / 35 million words

yes

2011

subcorpus of texts from years 1955–1989

prim-4.0-public-all

526 million tokens / 429 million words

yes

2009

monolingual corpus, comprised of all texts published or written after the year 1955 and licensed for online search (65 % journalistic, 17 % fiction, 16 % professional and 2 % other texts)

r55az89-1.0

40 million tokens / 32 million words

yes

2009

subcorpus of texts from years 1955–1989

prim-3.0-public-all

339 million tokens / 276 million words

yes

2007

monolingual corpus, comprised of all texts published or written after the year 1955 and licensed for online search (57 % journalistic, 21.5 % fiction, 18.5 % professional and 3 % other texts)

prim-2.1-public-all

294 million tokens / 240 million words

yes

2006

monolingual corpus, comprised of all texts published or written after the year 1955 and licensed for online search (63 % journalistic, 20 % fiction, 12 % professional and 5 % other texts)

prim-2.0-public-all

250 million tokens

pilot

2005

monolingual corpus, comprised of all texts published or written after the year 1955 and licensed for online search

prim-1.0-public-all

182 million tokens

test

2004

monolingual corpus, comprised of all texts published or written after the year 1955 and licensed for online search

prim-0.2-public-all

170 million tokens

no

2003

monolingual corpus, comprised of all texts published or written after the year 1955 and licensed for online search

prim-0.1-public-all

30 million tokens

no

2003

monolingual corpus, comprised of all texts published or written after the year 1955 and licensed for online search
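
The size column above follows a regular pattern: a number with space-grouped thousands, a magnitude word, then "tokens" or "words". A minimal Python sketch of reading such cells mechanically is shown here. The function name parse_size and the regular expression are illustrative assumptions, not part of any SNC tooling; the sample cell is copied from the table.

```python
import re

# Multipliers for the magnitude words used in the size column.
UNITS = {"thousand": 1_000, "million": 1_000_000}

# One measurement, e.g. "1 961 million tokens" or "978 thousand words".
PART = re.compile(r"([\d .]+?)\s*(thousand|million)\s+(tokens|words)")


def parse_size(cell):
    """Return a dict with the 'tokens' and/or 'words' counts listed in a size cell."""
    counts = {}
    for number, unit, kind in PART.findall(cell):
        # Thousands are grouped with spaces ("1 961"), so drop them first.
        value = float(number.replace(" ", ""))
        counts[kind] = round(value * UNITS[unit])
    return counts


# Size cell of prim-10.0-juls-all, as listed above.
size = parse_size("1 961 million tokens / 1 572 million words")
# Share of tokens that are counted as words.
ratio = size["words"] / size["tokens"]
```

Cells that list only tokens, such as "250 million tokens" for prim-2.0-public-all, yield a dictionary without a "words" key, so callers should check for it before computing a ratio.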

2. Written corpora − synchronic, web

corpus

size − number of tokens / number of words

lemmatisation, morphological annotation

year of release

characteristics

web-5.0

4 042 million tokens / 3 326 million words

yes

2020

corpus of Slovak texts available on the web

web-4.0

2 963 million tokens / 2 440 million words

yes

2018

corpus of Slovak texts available on the web

web-3.0

2 372 million tokens / 1 993 million words

yes

2015

corpus of Slovak texts available on the web

web-2.0

1 046 million tokens / 839 million words

yes

2012

corpus of Slovak texts available on the web

web-1.0

952 million tokens / 773 million words

yes

2011

corpus of Slovak texts available on the web

wiki-2019-08

51 million tokens / 38 million words

yes

2020

corpus of texts from Slovak Wikipédia (as of 2019-08-01)

wiki-2018-03

47 million tokens / 35 million words

yes

2018

corpus of texts from Slovak Wikipédia and Necyklopédia (as of 2018-03-15)

wiki-2017-02

45 million tokens / 34 million words

yes

2017

corpus of texts from Slovak Wikipédia and Necyklopédia (as of 2017-02-28)

wiki-2016-02

43 million tokens / 34 million words

yes

2016

corpus of texts from Slovak Wikipédia and Necyklopédia (as of 2016-02-26)

wiki-2015-02

40 million tokens / 32 million words

yes

2015

corpus of texts from Slovak Wikipédia and Necyklopédia (as of 2015-02-28)

3. Written corpora − synchronic, specialised

corpus

size − number of tokens / number of words

lemmatisation, morphological annotation

year of release

characteristics

blf-2.0

66 million tokens / 54 million words

yes

2014

corpus of religious texts

blf-1.0

15 million tokens / 12 million words

yes

2008

corpus of religious texts

cw-2014-all

1.6 million tokens / 1.2 million words

yes

2014

corpus of copywriting texts

ecn-2.0-public

165 million tokens / 140 million words

yes

2016

corpus of economic texts (3.8 % professional and 96.2 % journalistic texts from the field of economics, banking, trade, management and merchandising)

ecn-1.0-public

20 million tokens / 17 million words

yes

2014

corpus of economic texts (81.4 % professional and 18.6 % journalistic texts from the field of economics, banking, trade, management and merchandising)

hum-1.0-public

39 million tokens / 30 million words

yes

2016

corpus of humanistic texts

judikat-1.0

1.5 million tokens / 1.3 million words

yes

2015

corpus of judicial decisions

legal-1.1

49 million tokens / 40 million words

yes

2013

corpus of legal texts (deduplicated)

legal-1.0

147 million tokens / 114 million words

yes

2011

corpus of legal texts

prim-7.0-frk

253 million tokens / 203 million words

yes

2018

The reference corpus prim-7.0-frk was the source for Frekvenčný slovník slovenčiny na báze Slovenského národného korpusu (Slovak Frequency Dictionary Based on the Slovak National Corpus), as well as for the examples listed in the publication Skloňovanie podstatných mien v slovenčine s korpusovými príkladmi (Declension of the Slovak Nouns with Corpus Examples).

r-mak-6.0

1.2 million tokens / 978 thousand words

yes

2017

manually morphologically annotated corpus (30.6 % journalistic, 50.2 % fiction, 19.2 % professional texts)

r-mak-5.0

1.2 million tokens / 978 thousand words

yes

2016

manually morphologically annotated corpus (28.5 % journalistic, 44.5 % fiction, 27 % professional texts)

r-mak-4.0

1.2 million tokens / 977 thousand words

yes

2013

manually morphologically annotated corpus (36.2 % journalistic, 44.9 % fiction, 18.9 % professional texts)

r-mak-3.0

1.2 million tokens / 984 thousand words

yes

2008

manually morphologically annotated corpus (36.7 % journalistic, 44.3 % fiction, 19.0 % professional texts)

r-mak-2.0

511 thousand tokens / 410 thousand words

yes

2007

manually morphologically annotated corpus (28.9 % journalistic, 58.1 % fiction, 13.0 % professional texts)

r-mak-1.0

322 thousand tokens / 257 thousand words

yes

2006

manually morphologically annotated corpus (41.8 % journalistic, 57.9 % fiction, 0.2 % professional texts)

4. Written corpora − parallel

corpus

size − number of tokens / number of words

lemmatisation, morphological annotation

year of release (first version released in)

characteristics

par-skbg-0.1

163 million tokens / 108 million words

yes, both languages

2014

Slovak-Bulgarian parallel corpus: 78 million tokens in Slovak half, 85 million tokens in Bulgarian half

par-skcs-all-4.0

418 million tokens / 306 million words

yes, both languages

2016 (2010)

Slovak-Czech parallel corpus: 209 million tokens in Slovak half, 209 million tokens in Czech half

par-skcs-fic-5.0

31.5 million tokens / 25.0 million words

yes, both languages

2018 (2010)

Slovak-Czech parallel corpus, subcorpus fiction: 15.7 million tokens in Slovak half, 15.8 million tokens in Czech half

par-skde-all-2.0

446 million tokens / 300 million words

yes, both languages

2016 (2014)

Slovak-German parallel corpus: 220 million tokens in Slovak half, 226 million tokens in German half

par-sken-4.0

556 million tokens / 436 million words

yes, both languages

2015 (2010)

Slovak-English parallel corpus: 261 million tokens in Slovak half, 295 million tokens in English half

par-skfr-3.0

449 million tokens / 332 million words

yes, both languages

2016 (2006)

Slovak-French parallel corpus: 217 million tokens in Slovak half, 233 million tokens in French half

par-skhu-1.0

99 million tokens / 75 million words

yes, both languages

2015 (2014)

Slovak-Hungarian parallel corpus: 51 million tokens in Slovak half, 48 million tokens in Hungarian half

par-skla-3.0

5.0 million tokens / 4.1 million words

yes, both languages

2018 (2012)

Slovak-Latin parallel corpus: 2.7 million tokens in Slovak half, 2.3 million tokens in Latin half

par-skpl-1.0

8.2 million tokens / 6.5 million words

yes, both languages

2018 (2018)

Slovak-Polish parallel corpus: 4.1 million tokens in Slovak half, 4.1 million tokens in Polish half

par-skro-1.1

1.3 million tokens / 1.0 million words

yes, both languages

2017 (2016)

Slovak-Romanian parallel corpus: 603 thousand tokens in Slovak half, 689 thousand tokens in Romanian half

par-skru-2.0

8.5 million tokens / 6.6 million words

yes, both languages

2014 (2012)

Slovak-Russian parallel corpus: 4.2 million tokens in Slovak half, 4.2 million tokens in Russian half
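
Each parallel corpus lists a total size and the size of its two language halves, all rounded independently, so the halves need not add up exactly to the total. The Python sketch below checks this for a few rows copied from the table above. The helper name half_sum_gap and the tolerance of one million tokens are assumptions made for illustration.

```python
# (corpus, total, Slovak half, other half), in millions of tokens,
# copied from the parallel-corpora table above.
ROWS = [
    ("par-skbg-0.1", 163, 78, 85),
    ("par-skcs-all-4.0", 418, 209, 209),
    ("par-skde-all-2.0", 446, 220, 226),
    ("par-sken-4.0", 556, 261, 295),
    ("par-skfr-3.0", 449, 217, 233),
    ("par-skhu-1.0", 99, 51, 48),
]


def half_sum_gap(total, slovak, other):
    """Difference between the sum of the two halves and the published total."""
    return slovak + other - total


# Gap per corpus; a non-zero value points to rounding in the published figures.
gaps = {name: half_sum_gap(total, sk, other) for name, total, sk, other in ROWS}

# Assumed tolerance: rounding each figure to the nearest million can shift
# the sum by at most one million tokens.
consistent = all(abs(gap) <= 1 for gap in gaps.values())
```

Under these figures only par-skfr-3.0 shows a gap (217 + 233 = 450 against a listed total of 449), which is consistent with rounding rather than with an error in the table.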

5. Written corpora of texts before the year 1955

corpus

size − number of tokens / number of words

lemmatisation, morphological annotation

year of release

characteristics

r864az1843-1.0

2.1 million tokens / 1.6 million words

no

2015

corpus of texts from 864–1843: texts transcribed into contemporary Slovak, orthography as used in the latest edition

r1843az1954-1.0

24 million tokens / 19 million words

no

2015

corpus of texts from 1843–1954: texts transcribed into contemporary Slovak, orthography as used in the latest edition

6. Historical corpus

corpus

size − number of tokens / number of words

lemmatisation, morphological annotation

year of release

characteristics

hist-5.0

998 thousand tokens / 731 thousand words

no

2020

corpus of historical Slovak: source materials (in original spelling)

hist-4.0

918 thousand tokens / 668 thousand words

no

2015

corpus of historical Slovak: source materials (in original spelling)

hist-3.0

836 thousand tokens / 600 thousand words

no

2015

corpus of historical Slovak: source materials (in original spelling)

hist-2.0

552 thousand tokens / 422 thousand words

no

2014

corpus of historical Slovak: source materials (in original spelling)

hist-1.0

371 thousand tokens

no

2012

corpus of historical Slovak: source materials (in original spelling)

7. Spoken corpora − synchronic, standard

corpus

size − number of tokens / number of words

lemmatisation, morphological annotation

year of release

characteristics

s-hovor-6.0

6.6 million tokens / 5.5 million words

yes

2017

corpus of spoken Slovak: speech utterances and their transcriptions into standardized Slovak covering the whole territory of Slovakia

s-hovor-6.0-sane

3.7 million tokens / 3.0 million words

yes

2017

subcorpus of the Corpus of Spoken Slovak: utterances and their transcriptions, excluding the recordings provided by the Nation’s Memory Institute

s-hovor-6.0-upn

2.9 million tokens / 2.4 million words

yes

2017

subcorpus of the Corpus of Spoken Slovak: utterances and their transcriptions from the Oral History project of the Nation’s Memory Institute

s-hovor-5.0

5.7 million tokens / 4.7 million words

yes

2015

corpus of spoken Slovak: speech utterances and their transcriptions into standardized Slovak covering the whole territory of Slovakia

s-hovor-5.0-sane

3.6 million tokens / 3.0 million words

yes

2015

subcorpus of the Corpus of Spoken Slovak: utterances and their transcriptions, excluding the recordings provided by the Nation’s Memory Institute

s-hovor-5.0-upn

2.1 million tokens / 1.8 million words

yes

2015

subcorpus of the Corpus of Spoken Slovak: utterances and their transcriptions from the Oral History project of the Nation’s Memory Institute

s-hovor-4.0

2.6 million tokens / 2.2 million words

yes

2012

corpus of spoken Slovak: speech utterances and their transcriptions into standardized Slovak covering the whole territory of Slovakia

s-hovor-4.0-sane

1.6 million tokens / 1.3 million words

yes

2012

subcorpus of the Corpus of Spoken Slovak: utterances and their transcriptions, excluding the recordings provided by the Nation’s Memory Institute

s-hovor-4.0-upn

1.0 million tokens / 0.9 million words

yes

2012

subcorpus of the Corpus of Spoken Slovak: utterances and their transcriptions from the Oral History project of the Nation’s Memory Institute

s-hovor-3.0

2.1 million tokens / 1.4 million words

yes

2011

corpus of spoken Slovak: speech utterances and their transcriptions into standardized Slovak covering the whole territory of Slovakia

s-hovor-2.0

679 thousand tokens / 561 thousand words

yes

2010

corpus of spoken Slovak: speech utterances and their transcriptions into standardized Slovak covering the whole territory of Slovakia

s-hovor-1.0

128 thousand tokens / 104 thousand words

yes

2008

corpus of spoken Slovak: speech utterances and their transcriptions into standardized Slovak covering the whole territory of Slovakia

8. Corpora of dialects of the SNC

corpus

size − number of tokens / number of words

lemmatisation, morphological annotation

year of release

characteristics

dialekt-4.0

712 thousand tokens / 571 thousand words

no

2018

corpus of dialects of the Slovak National Corpus: published texts based on dialect audio or transcribed recordings that cover various dialect areas

dialekt-3.0

495 thousand tokens / 403 thousand words

no

2016

corpus of dialects of the Slovak National Corpus: published texts based on dialect audio or transcribed recordings that cover various dialect areas

dialekt-2.0

329 thousand tokens / 252 thousand words

no

2015

corpus of dialects of the Slovak National Corpus: published texts based on dialect audio or transcribed recordings that cover various dialect areas

dialekt-1.0

74 thousand tokens / 55 thousand words

no

2014

corpus of dialects of the Slovak National Corpus: published texts based on dialect audio or transcribed recordings that cover various dialect areas