Slovak-Czech Parallel Corpus

The latest version, par-skcs-4.0, was released in May 2016. The database contains 418.5 million tokens (209.2 million in the Slovak half, 209.3 million in the Czech half).

The corpus consists of two parts: the subcorpus of fiction and the free subcorpus. Besides fiction, the subcorpus of fiction (19 million tokens) also contains other texts, e.g. popular science and literature of fact. Users can query the corpus using NoSketch Engine or through a simple WWW interface. The subcorpus par-skcs-fic-4.0 contains the same texts as par-skcs-fic-3.0.

The free subcorpus consists of EU legal texts and reports, as well as computer and other manuals translated from a third language (English). The texts can be downloaded here.

The Slovak-Czech Parallel Corpus is a database of texts that are translations of each other: Slovak texts translated into Czech, or vice versa. The texts are automatically sentence-aligned. The Slovak texts are automatically morphologically annotated by the taggers Morče and MorphoDiTa, which have been trained and tuned on the tagset developed by the Slovak National Corpus. The Czech texts are annotated by the tagger Morče, which has been trained and tuned on the tagset developed by the Czech National Corpus.

There are several ways to query the corpus:

through NoSketch Engine, in the Czech half or in the Slovak half. Knowledge of NoSketch Engine and CQL is highly recommended.
through a dictionary interface. It does not contain the whole corpus, only automatically selected translation equivalents.
through a simple WWW interface. Enter the query term (a Slovak or Czech word, or a regular expression) into the input field Search. In the selection box corpus, choose the desired source for the term (par-skcs-*-sk for Slovak texts and par-skcs-*-cs for Czech texts). Clicking on the leftmost column displays a short bibliography.
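
For readers new to CQL, the following minimal Python sketch shows how a word or a regular expression is wrapped into a CQL token expression of the kind NoSketch Engine accepts. The helper names cql_token and cql_sequence are hypothetical and not part of any corpus tool. The attribute name word is an assumption, since the attributes available in par-skcs are not listed here.

```python
# Hypothetical helpers for composing CQL query strings by hand.
# The attribute name "word" is an assumption; check the corpus
# documentation for the attributes actually available.

def cql_token(pattern, attribute="word"):
    """Wrap a word or regular expression in a one-token CQL expression."""
    # Double quotes delimit the value in CQL, so escape any inside the pattern.
    escaped = pattern.replace('"', '\\"')
    return f'[{attribute}="{escaped}"]'


def cql_sequence(patterns, attribute="word"):
    """Join several token expressions into a query for adjacent tokens."""
    return " ".join(cql_token(p, attribute) for p in patterns)


if __name__ == "__main__":
    print(cql_token("pes"))                 # exact word form
    print(cql_token("ps.*"))                # regular expression
    print(cql_sequence(["dobrý", "de.*"]))  # two adjacent tokens
```

The resulting string would be pasted into the CQL query field. The simple WWW interface, by contrast, takes the bare word or regular expression without the bracket syntax.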

Version 3.0

The corpus par-skcs-3.0 was released in January 2014. The database contained 240 million tokens (119.4 million in the Slovak half, 119.53 million in the Czech half).

The subcorpus of fiction par-skcs-fic-3.0 contained 19 million tokens (approximately 9.5 million for each half).

Version 2.0

The corpus par-skcs-2.0 contained 6 433 thousand sentence pairs (approximately 120 million tokens for each half).

The subcorpus of fiction contained 740 thousand sentence pairs (approximately 10 million tokens for each half).

Version 1.0

The corpus par-skcs-1.0 contained 735 thousand sentence pairs (10 million tokens per language).

Development of the free corpus was supported by the EC grant FP7-ICT-2009-5 Bringing Machine Translation for European Languages to the User – Enlarged European Union (EuroMatrixPlus-X).

Developed jointly by: Slovak National Corpus, Ľ. Štúr Institute of Linguistics, Slovak Academy of Sciences; Czech National Corpus, Faculty of Arts, Charles University in Prague; and Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University in Prague.