Language data
The data is jointly released by the Slovak National Corpus, Ľ. Štúr Institute of Linguistics, Slovak Academy of Sciences, and the Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University in Prague.
These and other datasets relevant to machine translation (MT) are also available from the CLARIN ERIC repository on the LINDAT-Clarin project page.
To get access to the files, please contact us.
Translation tables for the Moses MT system
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License.
These tables will help you build your own MT system.
Slovak->English based on parallel corpus of fiction
English->Slovak factored (lemma+tag) model, based on parallel corpus of fiction
Slovak->English factored (lemma+tag) model, based on parallel corpus of fiction
English->Slovak based on parallel corpus of fiction+Europarl v6
Slovak->English based on parallel corpus of fiction+Europarl v6
Czech->Slovak based on parallel Slovak-Czech corpus and Europarl v6
Slovak->Czech based on parallel Slovak-Czech corpus and Europarl v6
Slovak->Czech supplementary phrase table of inflected word forms
Czech->Slovak supplementary phrase table of inflected word forms
Slovak->English supplementary phrase table of inflected noun word forms
English->Slovak supplementary phrase table of inflected noun word forms
You can get more complete language models here.
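As a rough illustration of how such a table can be inspected before plugging it into Moses, the sketch below reads entries from a phrase table. It assumes the files follow the standard Moses text format, in which each line holds `source ||| target ||| scores`, optionally followed by alignment and count fields, and in which factored tables separate factors with a single `|`. The names `PhraseEntry`, `parse_phrase_table_line` and `iter_phrase_table`, as well as the sample line, are hypothetical and not part of the released data.

```python
import gzip
from typing import Iterator, NamedTuple, Tuple


class PhraseEntry(NamedTuple):
    source: str
    target: str
    scores: Tuple[float, ...]


def parse_phrase_table_line(line: str) -> PhraseEntry:
    """Split one Moses-style phrase table line into source, target and scores."""
    # Fields are separated by "|||"; factors inside a token use a single "|",
    # so splitting on the triple bar leaves factored tokens intact.
    fields = [field.strip() for field in line.rstrip("\n").split("|||")]
    if len(fields) < 3:
        raise ValueError(f"expected at least 3 fields, got {len(fields)}")
    scores = tuple(float(value) for value in fields[2].split())
    return PhraseEntry(fields[0], fields[1], scores)


def iter_phrase_table(path: str) -> Iterator[PhraseEntry]:
    """Yield entries from a plain or gzip-compressed phrase table file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():  # skip blank lines
                yield parse_phrase_table_line(line)


# Hypothetical sample line, invented for illustration only.
sample = "dom ||| house ||| 0.5 0.25 0.5 0.125 ||| 0-0 ||| 4 4 2"
entry = parse_phrase_table_line(sample)
print(entry.source, "->", entry.target, entry.scores)
```

Any fields after the scores (alignments, counts) are ignored here; a real pipeline would normally leave the table untouched and let Moses load it directly.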
Parallel corpora (English-Slovak)
Slovak texts are automatically lemmatized and morphologically annotated with the Slovak National Corpus tagset. English texts are lemmatized and part-of-speech tagged with the Penn Treebank Tagset.
Corpus source                         Sentence pairs
http://apertium.eu/data (oj4-ss-1)    3272180
OPUS-EMEA                             1054178
OPUS-EUconst                          10119
OPUS-KDE4                             105425
OPUS-PHP                              31173
                                      1115765
                                      13050
                                      460779
Parallel corpora (Slovak-Czech)
Slovak texts are automatically lemmatized and morphologically annotated with the Slovak National Corpus tagset. Czech texts are automatically lemmatized and morphologically annotated with the Czech National Corpus tagset.
Corpus source                         Sentence pairs
http://apertium.eu/data (oj4-ss-1)    3078210
OPUS-EMEA                             1067905
OPUS-EUconst                          10630
OPUS-KDE4                             97260
OPUS-PHP                              28084
                                      926082
                                      24190
                                      459089
Supported by the EC grant FP7-ICT-2009-5 Bringing Machine Translation for European Languages to the User – Enlarged European Union (EuroMatrixPlus-X).