Structure of the corpus prim-7.0
Version prim-7.0 of the SNC comprises the following publicly available subcorpora:
- prim-7.0-public-all – all publicly available SNC texts (65.1 % journalistic, 15.1 % fiction, 9.5 % professional and 10.3 % other texts), 1 250 382 876 tokens, 971 799 239 words
- prim-7.0-public-sane – excludes texts with incorrect diacritics, texts dating from before 1955, texts from outside the territory of Slovakia, and texts from linguistic journals, 1 089 102 930 tokens, 848 547 025 words
- prim-7.0-public-vyv – balanced subcorpus (33.3 % journalistic, 33.3 % fiction, 33.3 % professional texts), 340 708 046 tokens, 266 732 524 words
- prim-7.0-public-inf – subcorpus of journalistic (informational) texts, 771 248 707 tokens, 597 141 681 words
- prim-7.0-public-prf – subcorpus of scientific, professional and non-fiction texts, 114 081 861 tokens, 89 152 482 words
- prim-7.0-public-img – subcorpus of fiction texts, 187 749 798 tokens, 149 220 076 words
- prim-7.0-public-sk – subcorpus of original Slovak texts, 806 707 046 tokens, 629 681 531 words
- prim-7.0-public-img-sk – subcorpus of original Slovak fiction texts, 65 009 205 tokens, 51 839 437 words
- r1955az1989-4.0 – specific corpus of texts from the years 1955–1989 (7.4 % journalistic, 69.3 % fiction, 16.6 % professional and 6.7 % other texts), 67 392 068 tokens, 53 998 092 words.
The corpus is provided with detailed bibliographic, style and genre annotation; it is also lemmatized and morphologically annotated.
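
The token and word counts in the subcorpus list can be compared directly. The following is a minimal sketch, not part of the SNC tooling: it copies the figures from the list into a dictionary and prints the share of words among tokens for each subcorpus. The names `COUNTS` and `word_share` are hypothetical, and the reading that tokens exceed words because tokens also include punctuation and similar non-word items is an assumption, not something stated on this page.

```python
# Token and word counts copied from the subcorpus list: (tokens, words).
COUNTS = {
    "prim-7.0-public-all": (1_250_382_876, 971_799_239),
    "prim-7.0-public-sane": (1_089_102_930, 848_547_025),
    "prim-7.0-public-vyv": (340_708_046, 266_732_524),
    "prim-7.0-public-inf": (771_248_707, 597_141_681),
    "prim-7.0-public-prf": (114_081_861, 89_152_482),
    "prim-7.0-public-img": (187_749_798, 149_220_076),
    "prim-7.0-public-sk": (806_707_046, 629_681_531),
    "prim-7.0-public-img-sk": (65_009_205, 51_839_437),
    "r1955az1989-4.0": (67_392_068, 53_998_092),
}


def word_share(tokens: int, words: int) -> float:
    """Return words as a percentage of tokens, rounded to one decimal place."""
    return round(100 * words / tokens, 1)


# Print one line per subcorpus: name, token count, and share of words.
for name, (tokens, words) in COUNTS.items():
    print(f"{name:24s} {tokens:>15,d} tokens  {word_share(tokens, words):5.1f} % words")
```

Keeping the figures in one table like this makes it easy to recompute derived values if the counts change in a later corpus version.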