Structure of the corpus prim-6.1
Version prim-6.1 consists of the following publicly available subcorpora (texts that were incorrectly converted in prim-6.0 are excluded):
- prim-6.1-public-all – all publicly available SNC texts (68.8 % journalistic, 13.9 % fiction, 15.3 % professional and 2 % other texts), 829 771 945 tokens, 655 572 511 words
- prim-6.1-public-sane – publicly available texts, excluding those with incorrect diacritics, those from before 1955, those from outside the territory of Slovakia, and those from linguistic journals, 773 493 137 tokens, 610 493 493 words
- prim-6.1-public-vyv – balanced subcorpus (33.3 % journalistic, 33.3 % fiction, 33.3 % professional texts), 317 496 718 tokens, 251 519 537 words
- prim-6.1-public-inf – subcorpus of journalistic (informational) texts, 540 812 859 tokens, 425 325 094 words
- prim-6.1-public-prf – subcorpus of scientific, professional and non-fiction texts, 105 886 349 tokens, 83 885 837 words
- prim-6.1-public-img – subcorpus of fiction texts, 113 820 575 tokens, 90 714 140 words
- prim-6.1-public-sk – subcorpus of original Slovak texts, 558 261 948 tokens, 440 708 351 words
- prim-6.1-public-img-sk – subcorpus of original Slovak fiction texts, 35 283 156 tokens, 28 260 019 words
- r55az89-3.0 – specialized corpus of texts from the years 1955–1989 (11.9 % journalistic, 55.5 % fiction, 24.1 % professional and 8.5 % other texts), 62 885 729 tokens, 50 531 833 words.
The corpus carries detailed bibliographic, style and genre annotation; it is also lemmatized and morphologically annotated.
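
The size figures in the list above can be cross-checked with a few lines of Python. The following is only a minimal sketch: the names `SUBCORPORA`, `ALL_SHARES`, `words_per_token` and `approx_tokens_by_genre` are hypothetical helpers introduced here for illustration, and the per-genre estimate assumes that the stated percentages are measured in tokens, which the description does not say explicitly.

```python
# Token and word counts copied from the list of subcorpora above.
SUBCORPORA = {
    "prim-6.1-public-all": (829_771_945, 655_572_511),
    "prim-6.1-public-vyv": (317_496_718, 251_519_537),
    "prim-6.1-public-img-sk": (35_283_156, 28_260_019),
    "r55az89-3.0": (62_885_729, 50_531_833),
}

# Genre shares (per cent) stated for prim-6.1-public-all.
ALL_SHARES = {
    "journalistic": 68.8,
    "fiction": 13.9,
    "professional": 15.3,
    "other": 2.0,
}


def words_per_token(tokens: int, words: int) -> float:
    """Return the proportion of tokens that are counted as words."""
    return words / tokens


def approx_tokens_by_genre(total_tokens: int, shares: dict) -> dict:
    """Estimate per-genre token counts.

    Assumption: the percentage shares refer to tokens (not texts or words).
    """
    return {
        genre: round(total_tokens * pct / 100)
        for genre, pct in shares.items()
    }


if __name__ == "__main__":
    for name, (tokens, words) in SUBCORPORA.items():
        print(f"{name}: {words_per_token(tokens, words):.3f} words per token")

    all_tokens = SUBCORPORA["prim-6.1-public-all"][0]
    for genre, count in approx_tokens_by_genre(all_tokens, ALL_SHARES).items():
        print(f"{genre}: about {count:,} tokens")
```

Because the published shares are rounded to one decimal place, the per-genre estimates are approximate and should not be read as official counts.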