Omnia consists of the following corpora: prim-6.0-public-all, s-hovor-4.0, legal-1.1, web-1.1 and web-1.2. Duplicate content has been removed from the corpora and several minor changes have been applied. In tokenization, compounds are treated as a single token. In lemmatization, negative forms are assigned the lemma of the corresponding affirmative form. The corpus was derived from the Slovak National Corpus (SNK) by V. Benko. It serves primarily as working material for members of the department of lexicology and lexicography.
The current version, omnia-2.0-public, was released in July 2013. The database contains 2 239 413 083 tokens. Previous versions were for internal use only.
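
The lemmatization convention for negative forms can be illustrated with a minimal Python sketch. It is not the procedure actually used to build Omnia: the prefix "ne", the toy lexicon and the function name affirmative_lemma are all assumptions made for illustration only.

```python
# Hypothetical illustration: map a negative lemma to its affirmative form.
# The lexicon entries and the "ne" prefix are assumptions, not Omnia data.
AFFIRMATIVE_LEMMAS = {"robiť", "vedieť", "byť"}
NEGATIVE_PREFIX = "ne"


def affirmative_lemma(lemma, lexicon=AFFIRMATIVE_LEMMAS):
    """Return the affirmative lemma if stripping the prefix yields a known lemma."""
    if lemma.startswith(NEGATIVE_PREFIX):
        candidate = lemma[len(NEGATIVE_PREFIX):]
        # Only strip the prefix when the remainder is a known affirmative lemma,
        # so words that merely begin with "ne" are left unchanged.
        if candidate in lexicon:
            return candidate
    return lemma


if __name__ == "__main__":
    for word in ["nerobiť", "robiť", "nebo"]:
        print(word, "->", affirmative_lemma(word))
```

The lexicon check is the main design choice here: stripping the prefix unconditionally would also alter words that only happen to start with the same letters.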