Structure of the corpus prim-5.0
Version prim-5.0 contains the following publicly available subcorpora:
- prim-5.0-public-all – all publicly available SNC texts, 719 499 235 tokens (73 % journalistic, 14 % fiction, 12 % professional and 1 % other texts)
- prim-5.0-public-inf – subcorpus of journalistic (informational) texts, 514 588 190 tokens
- prim-5.0-public-prf – subcorpus of scientific and professional texts, 82 390 173 tokens
- prim-5.0-public-img – subcorpus of fiction texts, 99 235 619 tokens
- prim-5.0-public-sk – subcorpus of original texts in Slovak, 508 662 478 tokens
- prim-5.0-public-skimg – subcorpus of original fiction texts in Slovak, 31 745 338 tokens
- prim-5.0-public-sane – subcorpus excluding texts that fail one or more of the quality criteria (texts lacking correct diacritics, texts not in the contemporary standardized language, non-linguistic texts), 699 496 280 tokens
- prim-5.0-vyv – balanced subcorpus (33.3 % journalistic, 33.3 % fiction, 33.3 % professional texts), 247 180 756 tokens
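
The token counts in the list can be compared with a few lines of Python. The following is only a minimal sketch: the dictionary `TOKEN_COUNTS` and the helpers `parse_tokens` and `share_of_total` are hypothetical names introduced for illustration, not part of any corpus tool, and the figures are copied by hand from the list above. Shares are computed relative to prim-5.0-public-all only, so they need not coincide with the rounded genre percentages quoted in the list, whose basis is not stated here.

```python
# Token counts copied from the list above; spaces act as thousands separators.
# The dictionary and function names are illustrative assumptions.
TOKEN_COUNTS = {
    "prim-5.0-public-all": "719 499 235",
    "prim-5.0-public-inf": "514 588 190",
    "prim-5.0-public-prf": "82 390 173",
    "prim-5.0-public-img": "99 235 619",
}


def parse_tokens(text: str) -> int:
    """Convert a count such as '719 499 235' to an integer."""
    # str.split() with no argument also splits on non-breaking spaces.
    return int("".join(text.split()))


def share_of_total(name: str, total_name: str = "prim-5.0-public-all") -> float:
    """Return the percentage of total_name's tokens that name accounts for."""
    part = parse_tokens(TOKEN_COUNTS[name])
    total = parse_tokens(TOKEN_COUNTS[total_name])
    return 100.0 * part / total


if __name__ == "__main__":
    for name in TOKEN_COUNTS:
        count = parse_tokens(TOKEN_COUNTS[name])
        print(f"{name}: {count:,} tokens, {share_of_total(name):.1f} % of public-all")
```

Keeping the counts as strings in the original notation makes it easy to check them against the list by eye before any arithmetic is done.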