Corpus of Spoken Slovak

The current version, s-hovor-5.0, contains 5.72 million tokens and has been available since April 2015. The corpus comprises 695 audio recordings, totalling more than 600 hours of recorded utterances.

The first version, s-hovor, was released in December 2008, followed by s-hovor-2.0 in January 2010, s-hovor-3.0 in February 2011 and s-hovor-4.0 in August 2012.

Versions 4.0 and 5.0 include two subcorpora: s-hovor-x-upn, which contains transcribed recordings of witnesses from the Oral History project of the Nation’s Memory Institute, and s-hovor-x-sane, which contains the remaining recordings from the primary corpus. A user can query the corpus through a Bonito client (as part of the SNC registration) or through the WWW interface, where the text transcription is aligned with the audio.

The text transcriptions are lemmatized and morphologically annotated. The transcription metadata contain information about the participant, origin and content of the audio recording.

A user can enter a word, lemma or pronunciation in the input field, and the matching transcription will be displayed.
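
For users who prefer to type queries by hand in the Bonito client, the three search types above can be sketched as attribute queries in CQL, the query language used by Bonito. The following Python helper is only an illustration: the function names are hypothetical, and the attribute name "pron" for pronunciation is an assumption, since the actual attribute names used in s-hovor are not stated here. Note also that CQL treats the quoted value as a regular expression, so this sketch escapes only quotes and backslashes.

```python
# Hypothetical mapping from search type to corpus attribute name.
# "word" and "lemma" are common CQL attribute names; "pron" is an
# assumption and may differ in the actual s-hovor corpus.
ATTRIBUTES = {
    "word": "word",
    "lemma": "lemma",
    "pronunciation": "pron",
}


def escape_cql(value: str) -> str:
    """Escape backslashes and double quotes for use inside a quoted CQL value."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def build_query(kind: str, value: str) -> str:
    """Build a single-token CQL query such as [lemma="byť"]."""
    if kind not in ATTRIBUTES:
        raise ValueError(f"unknown search type: {kind!r}")
    return f'[{ATTRIBUTES[kind]}="{escape_cql(value)}"]'


if __name__ == "__main__":
    print(build_query("word", "bola"))   # [word="bola"]
    print(build_query("lemma", "byť"))   # [lemma="byť"]
```

The resulting string would be pasted into the client's query field; the attribute names should be checked against the corpus documentation before use.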