
Slovak-English parallel corpus

The query interface for the complete combined corpus is here.

The Slovak-English Parallel Corpus is a database of automatically sentence-aligned texts in Slovak and English.

Enter the query term in the Search input field; CQL syntax is also supported. In the corpus selection box, choose the source to search (par-sken-*-sk for Slovak texts, par-sken-*-en for English texts). Clicking the leftmost column displays a short bibliography.
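As an illustration of the CQL syntax mentioned above, the following minimal Python sketch assembles query strings that could be pasted into the Search field. The helper names `cql_token` and `cql_sequence` are hypothetical, and the attribute names `word`, `lemma` and `tag` are assumptions based on common corpus-manager conventions; this page does not list the attributes actually available in the par-sken corpora.

```python
def cql_token(**attrs):
    """Build one CQL token expression, e.g. [lemma="house" & tag="NN.*"].

    Attribute names (word, lemma, tag) are assumed, not confirmed
    for this corpus. Values are treated as regular expressions by CQL.
    """
    if not attrs:
        return "[]"  # an empty bracket pair matches any single token
    conditions = " & ".join(
        '{}="{}"'.format(name, str(value).replace('"', '\\"'))
        for name, value in attrs.items()
    )
    return "[" + conditions + "]"


def cql_sequence(*tokens):
    """Join token expressions into a query for adjacent tokens."""
    return " ".join(tokens)


# An adjective followed by the lemma "house" (Penn Treebank tags on the English side).
query = cql_sequence(cql_token(tag="JJ.*"), cql_token(lemma="house"))
print(query)
```

Building the string in code only helps when many similar queries are needed; for a single lookup, typing the query directly into the interface is simpler.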

Slovak texts are automatically morphologically annotated with the same tagset as in the Slovak National Corpus. English texts are part-of-speech tagged with the Penn Treebank Tagset.

Version 2.0

(available since 2012-09-08)

The corpus consists of two parts – the subcorpus of “fiction” and the free subcorpus.

The subcorpus “fiction” contains 4.3 million sentence pairs (63.3 million tokens in the English half, 53.9 million tokens in the Slovak one).

The subcorpus query interface is here.

The free subcorpus consists of less copyright-encumbered texts and can be downloaded here.

There is also a corpus query interface for the combined “fiction” + free subcorpus, containing 10 million sentence pairs (196 million tokens in the English half, 173 million tokens in the Slovak one).
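The version 2.0 figures can be put side by side with a little arithmetic. The sketch below computes average tokens per sentence pair and estimates the size of the free subcorpus by subtracting the "fiction" figures from the combined ones. That subtraction is an assumption (it treats the combined corpus as exactly "fiction" plus free, with rounded published figures), so the derived numbers are rough estimates, not official counts.

```python
# Published version 2.0 figures, in millions (from the text above).
fiction = {"pairs": 4.3, "en_tokens": 63.3, "sk_tokens": 53.9}
combined = {"pairs": 10.0, "en_tokens": 196.0, "sk_tokens": 173.0}


def tokens_per_pair(part):
    """Average number of tokens per aligned sentence pair, per language."""
    return (part["en_tokens"] / part["pairs"],
            part["sk_tokens"] / part["pairs"])


# Estimated free subcorpus: combined minus "fiction" (an assumption, see above).
free_estimate = {key: combined[key] - fiction[key] for key in combined}

for name, part in [("fiction", fiction),
                   ("combined", combined),
                   ("free (estimated)", free_estimate)]:
    en_avg, sk_avg = tokens_per_pair(part)
    print(f"{name}: {part['pairs']:.1f}M pairs, "
          f"{en_avg:.1f} EN / {sk_avg:.1f} SK tokens per pair")
```

In each part the English half comes out longer than the Slovak half, which follows directly from the published token counts.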

Version 1.0

Version 1.0 contained 1.6 million sentence pairs (24 million tokens in the English half, 20 million tokens in the Slovak one).


Development of the corpus was supported by the EC grant FP7-ICT-2009-5 Bringing Machine Translation for European Languages to the User – Enlarged European Union (EuroMatrixPlus-X).

Developed jointly by the Slovak National Corpus, Ľ. Štúr Institute of Linguistics, Slovak Academy of Sciences, and the Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University in Prague.