Annotation of the research and development project realisation


Project title: Integrated Computational Processing of the Slovak Language for Linguistic Research Purposes

Project number: 2003SP200280307

State programme title: Aktuálne otázky rozvoja spoločnosti (Current Issues of the Development of Society)

Project contractor: Jazykovedný ústav Ľ. Štúra SAV, Bratislava (Ľudovít Štúr Institute of Linguistics of the SAV)

Principal investigator: PhDr. Mária Šimková

R&D field number according to the code list of fields: 060208

Keywords: corpus, corpus linguistics, language technologies, tokenisation, lemmatisation, linguistic annotation, tag set, conversion, representativeness of the corpus, terminology database, digitisation of linguistic research, parallel corpora


The project Integrated Computational Processing of the Slovak Language for Linguistic Research Purposes was carried out at the Ľudovít Štúr Institute of Linguistics of the SAV in Bratislava from 7 July 2003 to 31 December 2006. The contractually agreed result to be achieved within the contract period was a linguistically annotated, representative corpus of contemporary Slovak texts containing 200 million tokens and available on the internet, together with a conception and partial results of the computational processing of linguistic research in Slovakia. To meet this goal, the Contract stipulated the following scope of work: sociolinguistic analysis of the style-genre distribution of contemporary Slovak texts and the corresponding stratification of texts in the Slovak National Corpus; collection of texts and copyright clearance for them on the basis of the Contract of Other Use of the Work; technical processing of texts; testing of existing (foreign) software and continuous development of the Department's own software; conceptual preparation and execution of the linguistic annotation of texts; and preparation of several specialised subcorpora and databases.

The team exceeded twofold the initial goal of releasing a corpus of 200 million tokens on the internet. To create, release and administer the Corpus, it was essential to elaborate the overall conception as well as guidelines for the respective areas of text collection and processing, to test foreign software tools, and to develop new ones in line with the current state of information technologies and with respect to the specificities of the Slovak language.

The elimination of imbalance in the Corpus and the incorporation of missing texts could thus continue according to plan. The stylistically balanced corpus in the latest version reaches 200 million tokens and can be considered the first version of a representative corpus of contemporary Slovak texts. The balanced corpus from this new version will also be distributed on CD/DVD media, especially for didactic purposes.

Conceptual preparation and execution of the linguistic annotation of texts: morphological and syntactic annotation was carried out fully in accordance with the required content guidelines. By developing the morphological analyser for the Slovak language, the Department met and even went beyond the requirement for continuous development of its own software.

Reducing the last set of tasks to three enabled the project team to meet this point of the required project scope in full. The Slovak terminology database has completed its preparation phase. The database of lexicographical works, together with linguistic resources, contains 12 items overall (excluding different volumes of some works or issues of journals and proceedings) and represents an excellent source of information on the Slovak language and its research. Both databases are important and generally useful parts of the computational processing of linguistic research in Slovakia. Of the parallel corpora, two are available and ready to serve comparative research as well as didactic purposes for foreigners or translators.

In spite of certain limitations and problems resulting from the continuous financing of the State research and development programme project Integrated Computational Processing of the Slovak Language for Linguistic Research Purposes, the team of the SNC Department created a solid basis for the systematic development of computational and corpus linguistics in Slovakia and for the computational processing of Slovak as a natural language.


Number of pages:

301

Number of supplements:

31

Number of drawings:

Number of copies:

13