Corpus of Spoken Slovak

The current version, s-hovor-6.0, contains 6.6 million tokens and has been available since November 2017. The corpus comprises 760 audio recordings, amounting to more than 714 hours of recorded utterances.

The first version, s-hovor, was released in December 2008, followed by s-hovor-2.0 in January 2010, s-hovor-3.0 in February 2011, s-hovor-4.0 in August 2012 and s-hovor-5.0 in April 2015.

Starting with version s-hovor-6.0, the symbols used for transcription (turn.[ogg|spx|flac]) are available directly in the NoSketch Engine search tool, and users can also listen to the relevant part of the audio recording. Versions 4.0, 5.0 and 6.0 include two subcorpora: s-hovor-x-upn, which contains transcribed recordings of witnesses from the Oral History project at the Nation’s Memory Institute, and s-hovor-x-sane, which contains the remaining recordings from the primary corpus. The corpus can be queried through a Bonito client (as part of the SNC registration) or through the WWW interface, where the text transcription is aligned to the audio.

The text transcriptions are lemmatized and morphologically annotated. The transcription metadata contain information about the participant, origin and content of the audio recording.

A user can enter a word, lemma or pronunciation in the input field and the transcription will be displayed.
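For users who prefer to write queries by hand, a word or lemma lookup of this kind can be sketched as a CQL (Corpus Query Language) expression, the query syntax used by NoSketch Engine. The sketch below only builds query strings. The helper names are hypothetical, and the attribute names `word` and `lemma` are the conventional ones in Sketch Engine corpora, assumed here rather than confirmed for s-hovor. The attribute that holds pronunciation is not named in this description, so it is left out.

```python
def cql_token(attribute: str, value: str) -> str:
    """Build a single-token CQL expression such as [lemma="byť"].

    Hypothetical helper: the attribute names are assumptions, not taken
    from the corpus documentation.
    """
    # Escape backslashes and double quotes so the value stays inside
    # the quoted CQL string.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'[{attribute}="{escaped}"]'


def cql_sequence(attribute: str, values: list[str]) -> str:
    """Join several token expressions into a query for adjacent tokens."""
    return " ".join(cql_token(attribute, v) for v in values)


# Query for every form of one lemma, then for a two-word sequence.
print(cql_token("lemma", "byť"))
print(cql_sequence("word", ["dobrý", "deň"]))
```

The resulting strings can be pasted into the CQL query field of the search interface. Values in CQL are treated as regular expressions, so characters such as `.` or `*` in the input keep their regex meaning.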
