Language data

The data is jointly released by the Slovak National Corpus, Ľ. Štúr Institute of Linguistics, Slovak Academy of Sciences, and the Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University in Prague.

These and other datasets relevant to machine translation (MT) are also available from the CLARIN ERIC repository on the LINDAT/CLARIN project page.

To get access to the files, please contact us.

Translation tables for the Moses MT system

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License.

These tables will help you build your own MT system.

English->Slovak based on parallel corpus of fiction
Slovak->English based on parallel corpus of fiction
English->Slovak factorized (lemma+tag) model, based on parallel corpus of fiction
Slovak->English factorized (lemma+tag) model, based on parallel corpus of fiction
English->Slovak based on parallel corpus of fiction+Europarl v6
Slovak->English based on parallel corpus of fiction+Europarl v6
Czech->Slovak based on parallel Slovak-Czech corpus and Europarl v6
Slovak->Czech based on parallel Slovak-Czech corpus and Europarl v6
Slovak->Czech supplementary phrase table of inflected word forms
Czech->Slovak supplementary phrase table of inflected word forms
Slovak->English supplementary phrase table of inflected noun word forms
English->Slovak supplementary phrase table of inflected noun word forms
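
Moses phrase tables are plain text, one phrase pair per line, with fields separated by "|||": source phrase, target phrase, a list of scores, and optionally further fields such as word alignments and counts. The following is a minimal Python sketch of reading one such line. The sample line is invented for illustration and is not taken from the tables listed above, and the number of scores and extra fields in these particular files is an assumption to check against the data. In factored tables, the factors of a token are typically joined by a single "|", so splitting on the triple bar should leave them intact.

```python
from typing import NamedTuple, Tuple


class PhraseEntry(NamedTuple):
    source: str                # source-language phrase
    target: str                # target-language phrase
    scores: Tuple[float, ...]  # translation model scores, in file order
    extra: Tuple[str, ...]     # any remaining fields (alignment, counts, ...)


def parse_phrase_line(line: str) -> PhraseEntry:
    """Split one Moses-style phrase-table line into its fields."""
    fields = [field.strip() for field in line.rstrip("\n").split("|||")]
    if len(fields) < 3:
        raise ValueError(f"expected at least 3 fields, got {len(fields)}")
    source, target, raw_scores = fields[:3]
    scores = tuple(float(value) for value in raw_scores.split())
    return PhraseEntry(source, target, scores, tuple(fields[3:]))


# Hypothetical example line, not real data from the released tables.
sample = "the house ||| ten dom ||| 0.5 0.25 0.5 0.125 ||| 0-0 1-1 ||| 4 4 2"
entry = parse_phrase_line(sample)
print(entry.source, "->", entry.target, entry.scores)
```

Real phrase tables are often large and distributed compressed, so in practice the lines would be streamed (for example with gzip.open in text mode) rather than loaded into memory at once.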

You can get more complete language models here.

Parallel corpora (English-Slovak)

Slovak texts are automatically lemmatized and morphologically annotated with the Slovak National Corpus tagset. English texts are lemmatized and part-of-speech tagged with the Penn Treebank tagset.

Corpus	Source	Sentence pairs
Official Journal of the European Union	http://apertium.eu/data (oj4-ss-1)	3272180
OPUS, the open parallel corpus	http://opus.lingfil.uu.se/
OPUS-EMEA		1054178
OPUS-EUconst		10119
OPUS-KDE4		105425
OPUS-PHP		31173
JRC-Acquis 3.0	http://langtech.jrc.it/JRC-Acquis.html	1115765
The European Commission webpage	http://ec.europa.eu/	13050
Europarl v6	http://www.statmt.org/europarl/	460779

Parallel corpora (Slovak-Czech)

Slovak texts are automatically lemmatized and morphologically annotated with the Slovak National Corpus tagset. Czech texts are automatically lemmatized and morphologically annotated with the Czech National Corpus tagset.

Corpus	Source	Sentence pairs
Official Journal of the European Union	http://apertium.eu/data (oj4-ss-1)	3078210
OPUS, the open parallel corpus	http://opus.lingfil.uu.se/
OPUS-EMEA		1067905
OPUS-EUconst		10630
OPUS-KDE4		97260
OPUS-PHP		28084
JRC-Acquis 3.0	http://langtech.jrc.it/JRC-Acquis.html	926082
The European Commission webpage	http://ec.europa.eu/	24190
Europarl v6	http://www.statmt.org/europarl/	459089
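
To get a rough sense of the overall training data size, the sentence-pair counts from the two tables above can be tallied as in the Python sketch below. The counts are copied from the tables; the variable and function names are made up for this illustration. The totals are simple sums and assume nothing about overlap between the corpora, so the number of distinct sentence pairs may well be lower.

```python
# Sentence-pair counts copied from the English-Slovak table.
EN_SK = {
    "Official Journal of the European Union": 3272180,
    "OPUS-EMEA": 1054178,
    "OPUS-EUconst": 10119,
    "OPUS-KDE4": 105425,
    "OPUS-PHP": 31173,
    "JRC-Acquis 3.0": 1115765,
    "The European Commission webpage": 13050,
    "Europarl v6": 460779,
}

# Sentence-pair counts copied from the Slovak-Czech table.
SK_CS = {
    "Official Journal of the European Union": 3078210,
    "OPUS-EMEA": 1067905,
    "OPUS-EUconst": 10630,
    "OPUS-KDE4": 97260,
    "OPUS-PHP": 28084,
    "JRC-Acquis 3.0": 926082,
    "The European Commission webpage": 24190,
    "Europarl v6": 459089,
}


def summarize(counts):
    """Return the total and each corpus's share of that total."""
    total = sum(counts.values())
    shares = {name: pairs / total for name, pairs in counts.items()}
    return total, shares


for label, counts in (("English-Slovak", EN_SK), ("Slovak-Czech", SK_CS)):
    total, shares = summarize(counts)
    print(f"{label}: {total} sentence pairs in total")
    for name, share in sorted(shares.items(), key=lambda item: -item[1]):
        print(f"  {name}: {share:.1%}")
```

In both language pairs the Official Journal of the European Union accounts for more than half of the listed sentence pairs, which is worth keeping in mind when judging how well a model trained on this data will fit other domains.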

Supported by the EC grant FP7-ICT-2009-5 Bringing Machine Translation for European Languages to the User – Enlarged European Union (EuroMatrixPlus-X).