Slovak-French Parallel Corpus

The current version, par-skfr-3.0, was released in November 2016. The corpus includes approximately 450 million tokens (216.6 million in the Slovak part and 232.5 million in the French part).

The corpus consists of two parts: the subcorpus of fiction and the free subcorpus. You can query the subcorpus of fiction (containing about 10 million tokens) by using NoSketch Engine. The subcorpus of fiction, par-skfr-fic-3.0, has been extended mainly with texts by Jules Verne, which make up 57 per cent of the subcorpus.

The Slovak-French Parallel Corpus is a database of texts that are translations of each other: Slovak texts translated into French or vice versa, as well as translations from a third language into both Slovak and French. The texts are automatically aligned at the sentence level. The Slovak texts are automatically morphologically annotated by the taggers Morče and MorphoDiTa, which have been trained and tuned on the tagset developed by the Slovak National Corpus; the French texts are annotated by TreeTagger.

Users can query either the Slovak half or the French half of the corpus through the NoSketch Engine web interface. Knowledge of NoSketch Engine and CQL (Corpus Query Language) is highly recommended.
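As a rough illustration of what a CQL query looks like, the following Python sketch assembles simple token expressions as strings. The helper names cql_token and cql_sequence are hypothetical, and the attribute names word, lemma and tag, as well as the example tag value NOM, are assumptions about how this corpus is configured. Check the attributes actually listed in the NoSketch Engine interface before relying on them.

```python
def cql_token(**attrs: str) -> str:
    """Build one CQL token expression, e.g. [lemma="dom" & tag="S.*"].

    Attribute names (word, lemma, tag) are assumptions, not confirmed
    for this corpus. Values are treated as CQL regular expressions.
    """
    if not attrs:
        return "[]"  # in CQL, an empty pair of brackets matches any single token
    parts = []
    for name, pattern in attrs.items():
        if '"' in pattern:
            # Keep the sketch simple: refuse values that would need escaping.
            raise ValueError("double quotes are not supported in this sketch")
        parts.append(f'{name}="{pattern}"')
    # Conditions on the same token are combined with "&".
    return "[" + " & ".join(parts) + "]"


def cql_sequence(*tokens: str) -> str:
    """Join token expressions into a query matching consecutive tokens."""
    return " ".join(tokens)


# Hypothetical query: the lemma "dom", then any one token, then the word "na".
query = cql_sequence(cql_token(lemma="dom"), cql_token(), cql_token(word="na"))
print(query)
```

The resulting string can be pasted into the CQL field of the query form. The same pattern applies to the French half, using French lemmas and the tags produced by TreeTagger.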

Version 2.0

Version par-skfr-2.0 was released in May 2016. The corpus included 441.5 million tokens (213.3 million in the Slovak part and 228.2 million in the French part). The subcorpus of fiction, par-skfr-fic-2.0, is identical to its previous version.

Version 1.0

Version par-skfr-1.0 was released in October 2015. It included approximately 350 million tokens (167.4 million in the Slovak part and 181.28 million in the French part).

Version 0.1

The first, test version of the Slovak-French parallel corpus was released in 2006. It was preceded by the first parallel corpus of the Slovak National Corpus (SNC), which contained approximately 125 million tokens (more than 59 million in the Slovak half and 66 million in the French half). Apart from fiction, the subcorpus also contained free translations of EU texts, which had been included in it as in the first parallel corpus of the SNC.