Corpus Structure

Monolingual corpus of written texts

The current version, prim-6.1, has been available since September 2013. The publicly available subcorpus contains more than 829 million tokens.

Only two versions of the corpus are publicly available. Several earlier versions also exist and can be accessed on request:

prim-6.0 – released in 2013 (77.8% journalistic, 9.8% fiction, 11% professional and 1.4% other texts), 1 155 million tokens
prim-5.0 – released in 2011 (73% journalistic, 14% fiction, 12% professional and 1% other texts), 719 million tokens
prim-4.0 – released in 2009 (65% journalistic, 17% fiction, 16% professional and 2% other texts), 526 million tokens
prim-3.0 – released in 2007 (57% journalistic, 21.5% fiction, 18.5% professional and 3% other texts), 350 million tokens
prim-2.1 – released in 2006 (63% journalistic, 20% fiction, 12% professional and 5% other texts), 300 million tokens
prim-2.0 – released in 2005, 250 million tokens
prim1 – released in 2004, 182 million tokens

r-mak-3.0 – manually morphologically annotated corpus (44.3% fiction, 36.7% journalistic and 19.0% professional texts); 1 207 813 tokens
r-mak-2.0 – manually morphologically annotated corpus (58.1% fiction, 28.9% journalistic and 13.0% professional texts); 511 534 tokens
r-mak-1.0 – manually morphologically annotated corpus (57.9% fiction, 41.8% journalistic and 0.2% professional texts); 322 600 tokens
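
The genre shares listed above can be converted into approximate absolute sizes. The following is a minimal Python sketch that assumes the percentages refer to shares of tokens (the listing does not say whether they are measured in tokens or in texts). The names VERSIONS and tokens_per_genre are hypothetical and used only for illustration; the figures are copied from three of the prim entries.

```python
# Genre shares (percent) and total size (millions of tokens),
# copied from the version list. Assumption: shares are by token count.
VERSIONS = {
    "prim-6.0": ({"journalistic": 77.8, "fiction": 9.8,
                  "professional": 11.0, "other": 1.4}, 1155),
    "prim-5.0": ({"journalistic": 73.0, "fiction": 14.0,
                  "professional": 12.0, "other": 1.0}, 719),
    "prim-4.0": ({"journalistic": 65.0, "fiction": 17.0,
                  "professional": 16.0, "other": 2.0}, 526),
}


def tokens_per_genre(shares, total_millions):
    """Return the approximate size of each genre in millions of tokens."""
    return {genre: round(total_millions * pct / 100, 1)
            for genre, pct in shares.items()}


if __name__ == "__main__":
    for name, (shares, total) in VERSIONS.items():
        print(name, tokens_per_genre(shares, total))
```

Under that assumption, the 14% fiction share of prim-5.0 (719 million tokens) comes to roughly 100.7 million tokens. The results are only estimates, since the published percentages are rounded.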