Structure of the corpus prim-8.0
Version prim-8.0 of the SNC comprises the following publicly available subcorpora:
- prim-8.0-public-all – all publicly available SNC texts (71.10 % journalistic, 15.22 % fiction, 8.51 % professional and 5.17 % other texts), 1 477 447 216 tokens, 1 160 286 731 words
- prim-8.0-public-sane – excludes texts with incorrect diacritics, texts published before 1955, texts from outside the territory of Slovakia, and texts from linguistic journals (73.75 % journalistic, 16.33 % fiction, 8.91 % professional, 1.01 % other texts), 1 368 990 447 tokens, 1 076 309 519 words
- prim-8.0-public-vyv – balanced subcorpus (33.33 % journalistic, 33.33 % fiction, 33.33 % professional texts), 377 138 077 tokens, 297 524 160 words
- prim-8.0-public-inf – subcorpus of journalistic (informational) texts, 1 009 613 215 tokens, 791 376 893 words
- prim-8.0-public-prf – subcorpus of scientific, professional and non-fiction texts, 121 926 591 tokens, 96 084 340 words
- prim-8.0-public-img – subcorpus of fiction texts, 223 552 510 tokens, 177 545 076 words
- prim-8.0-public-sk – subcorpus of original Slovak texts (81.24 % journalistic, 7.91 % fiction, 9.53 % professional, 1.32 % other texts), 1 042 623 207 tokens, 821 878 724 words
- prim-8.0-public-img-sk – subcorpus of original Slovak fiction texts, 82 503 983 tokens, 65 627 003 words
- r1955az1989-4.0 – specific corpus of texts from the years 1955–1989 (5.11 % journalistic, 75.73 % fiction, 13.82 % professional, 5.34 % other texts), 83 631 422 tokens, 66 825 217 words
The corpus has detailed bibliographic, style and genre annotation. It is lemmatized and morphologically annotated with the MorphoDiTa tagger, which was trained at the Slovak National Corpus.
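
The token and word counts in the subcorpus list can be compared with a little arithmetic. The following is a minimal Python sketch, not part of the SNC tooling: the names `SIZES`, `parse_count` and `word_token_ratio` are hypothetical, and the figures are copied from the list for three of the subcorpora. It parses the space-separated numbers and computes the share of tokens that are counted as words.

```python
# Sizes copied from the subcorpus list: (tokens, words).
# Spaces are used as thousands separators, as in the list.
SIZES = {
    "prim-8.0-public-all": ("1 477 447 216", "1 160 286 731"),
    "prim-8.0-public-vyv": ("377 138 077", "297 524 160"),
    "prim-8.0-public-img": ("223 552 510", "177 545 076"),
}


def parse_count(text: str) -> int:
    """Convert a count such as '1 477 447 216' to an integer."""
    return int(text.replace(" ", ""))


def word_token_ratio(name: str) -> float:
    """Return words / tokens for the named subcorpus (hypothetical helper)."""
    tokens_text, words_text = SIZES[name]
    return parse_count(words_text) / parse_count(tokens_text)


if __name__ == "__main__":
    for name in SIZES:
        print(f"{name}: {word_token_ratio(name):.4f}")
```

For the three subcorpora shown, the ratio falls between 0.78 and 0.80, so roughly four out of five tokens are counted as words.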