Structure of the corpus prim-6.0
Version prim-6.0 comprises the following publicly available subcorpora:
- prim-6.0-public-all – all publicly available SNC texts (77.8 % journalistic, 9.8 % fiction, 11 % professional and 1.4 % other texts), 1 155 742 085 tokens, 881 084 173 words
- prim-6.0-public-sane – public texts, excluding those with incorrect diacritics, those from before 1955, those from outside the territory of Slovakia, and those from linguistic journals, 1 121 400 341 tokens, 854 175 017 words
- prim-6.0-public-vyv – balanced subcorpus (33.3 % journalistic, 33.3 % fiction, 33.3 % professional texts), 313 465 778 tokens, 244 845 182 words
- prim-6.0-public-inf – subcorpus of journalistic (informational) texts, 888 867 082 tokens, 669 224 534 words
- prim-6.0-public-prf – subcorpus of scientific, professional and non-fiction texts, 106 482 645 tokens, 84 326 245 words
- prim-6.0-public-img – subcorpus of fiction texts, 113 570 423 tokens, 90 466 310 words
- prim-6.0-public-sk – subcorpus of original Slovak texts, 905 332 650 tokens, 683 770 527 words
- prim-6.0-public-img-sk – subcorpus of original Slovak fiction texts, 34 773 737 tokens, 27 842 352 words
- r55az89-3.0 – specific corpus of texts from the years 1955–1989 (11.9 % journalistic, 55.5 % fiction, 24.1 % professional and 8.5 % other texts), 62 885 729 tokens, 50 531 833 words
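
The sizes listed above can be compared with a few lines of Python. This is only a minimal sketch: the counts are copied from the list, while the dictionary `SUBCORPORA` and the helper names `word_token_ratio` and `token_share_of_all` are hypothetical and not part of any SNC tool. The token shares it prints need not match the percentages quoted in the list, since the list does not say how those percentages were calculated.

```python
# (tokens, words) for each subcorpus, copied from the list above.
SUBCORPORA = {
    "prim-6.0-public-all": (1_155_742_085, 881_084_173),
    "prim-6.0-public-sane": (1_121_400_341, 854_175_017),
    "prim-6.0-public-vyv": (313_465_778, 244_845_182),
    "prim-6.0-public-inf": (888_867_082, 669_224_534),
    "prim-6.0-public-prf": (106_482_645, 84_326_245),
    "prim-6.0-public-img": (113_570_423, 90_466_310),
    "prim-6.0-public-sk": (905_332_650, 683_770_527),
    "prim-6.0-public-img-sk": (34_773_737, 27_842_352),
    "r55az89-3.0": (62_885_729, 50_531_833),
}


def word_token_ratio(name: str) -> float:
    """Return the number of words per token for one subcorpus."""
    tokens, words = SUBCORPORA[name]
    return words / tokens


def token_share_of_all(name: str) -> float:
    """Return the size of a subcorpus in tokens relative to prim-6.0-public-all."""
    tokens, _ = SUBCORPORA[name]
    all_tokens, _ = SUBCORPORA["prim-6.0-public-all"]
    return tokens / all_tokens


if __name__ == "__main__":
    for name in SUBCORPORA:
        print(
            f"{name:24s} "
            f"words/tokens = {word_token_ratio(name):.3f}  "
            f"tokens vs. public-all = {token_share_of_all(name):.1%}"
        )
```

Note that r55az89-3.0 is a separate corpus, so its figure in the last column is a size comparison rather than a share of prim-6.0-public-all.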