Bibliographical, Style and Genre Annotation
Bibliographical, style and genre annotation is an essential part of the primary processing of corpus texts. Information about the identity and basic structure of a text is useful for archiving, citation, statistical evaluation of parameters, and investigating the distribution of language units and language phenomena in particular texts. The annotation is displayed at the bottom of the Bonito client window when you right-click the desired line in a concordance list.
The annotation consists of keys together with values, which can be either free (e.g. author's name) or chosen from a fixed set (e.g. genre). Keys can refer to style and genre characteristics of text. The main categories are type of text (literary, journalistic, professional, live communication), genre (poem, novel, short story, article, etc.) and domain (subject area, e.g. science, law, politics, economy). These categories can be further divided. Other keys provide the bibliographic details of a source and information about the author and text. The keys under which the relevant information can be found are listed below.
Date format
All the dates are expressed according to ISO 8601 YYYY-MM-DD, e.g. 1998-05-23 in order to make them clear and keep them organized.
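As a minimal sketch of how a date value could be checked against this format, the following Python function accepts only the YYYY-MM-DD form. The function name `parse_annotation_date` is hypothetical and is not part of the corpus tools.

```python
import re
from datetime import date

# Exactly four, two and two ASCII digits separated by hyphens.
ISO_DATE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")

def parse_annotation_date(value: str) -> date:
    """Parse a YYYY-MM-DD string; raise ValueError for anything else."""
    if not ISO_DATE.fullmatch(value):
        raise ValueError(f"not in YYYY-MM-DD form: {value!r}")
    # fromisoformat also rejects impossible dates such as 1998-02-30.
    return date.fromisoformat(value)
```

For example, `parse_annotation_date("1998-05-23")` returns a `date` object, while a value such as `23.5.1998` raises `ValueError`.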
External annotation
External annotation uses a key-value structure. A value is a string of characters that ends at the end of its line, so multi-line names are excluded. The values may be either free (e.g. name of author) or chosen from specified values (e.g. genre). Optional flags consist of a set of flags separated by commas. Each flag establishes a particular characteristic of a value. The following values have a special meaning (they are not necessarily meaningful for all keys):
- ... (three dots)
- undefined value. The value is not defined because it could not be defined; the annotation is incomplete. It should not appear in real annotation.
- (an empty value or whitespace)
- the same as „...“. It is the default value in automatic annotation, so we expect it to appear in practice.
- missing key
- treated the same as the undefined value („...“ or empty)
- XXX
- unknown value. It cannot be determined, e.g. the author's name of an article.
- YYY
- undefinable value. It cannot be defined or has no meaning, e.g. gender of author (in collaborative work), gender of translator (if not a translation).
- MIX
- mixture. Mixed values, e.g. author is a hermaphrodite.
- MSC
- other. If the value is not defined in the set of values, e.g. author is a eunuch.
- TTT
- unknown value which needs to be defined. The annotation must be completed, the value added.
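The special values above can be sketched as a small lookup. This is an illustration only: the function name `classify_value` and the returned labels are assumptions, and a missing key is represented here by `None`.

```python
from typing import Optional

# Special values listed above. Matching is case-sensitive on purpose:
# lowercase "msc" is a regular Authsex value, while "MSC" means "other".
SPECIAL_VALUES = {
    "...": "undefined",
    "XXX": "unknown",
    "YYY": "undefinable",
    "MIX": "mixture",
    "MSC": "other",
    "TTT": "unknown, to be completed",
}

def classify_value(value: Optional[str]) -> str:
    """Return the special meaning of an annotation value, or 'regular'."""
    # A missing key (None) and an empty or whitespace-only value
    # are both treated like the undefined value "...".
    if value is None or value.strip() == "":
        return "undefined"
    return SPECIAL_VALUES.get(value.strip(), "regular")
```

With this sketch, `classify_value("YYY")` gives `"undefinable"` and an ordinary value such as an author's name gives `"regular"`.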
Annotation of the bank
None of the following keys is mandatory except SourceId. Keys are given in the form Title (abbreviation). The meaning of each key is described under it, and its possible values are listed unless the value is free.
Name (Name)
- name of text.
Origname (OrgN)
- original name of text (in translation).
Author (Auth)
- author's name. As listed in resources under the standards for bibliographic records.
Origauthor (OrgA)
- original author's name (not under Slovak bibliographical rules), e.g. „Mirosława Siędzikowska“.
Translator (Trnr)
- name of translator. YYY, if not a translation.
Translation (Trnn)
- determines whether the text has been translated.
Values:
- trn
- translation
- org
- original text
- ftr
- loosely translated, retold text
- YYY
- combination of a translated and original text (e.g. a collection of short stories)
ISBN (ISBN)
- ISBN number.
ISSN (ISSN)
- ISSN number.
SourceId (ScId)
- ID of the document in the archive (it remains the same in the bank).
Id (Id)
- unambiguous ID within the bank.
Rhyme (Rhym)
- rhyming.
Values:
- nrh
- unrhymed
- rhy
- rhymed
- MIX
- partially rhymed
Type (Type)
- type of text.
Values:
- img
- literary (imaginative) text
- inf
- journalistic (informative) text
- prf
- professional text
- liv
- live communication
Subtype (SubT)
- subtype of text.
Subtype (SubT), subtype of text: values
| for Type=img (literary (imaginative) text) | for Type=inf (journalistic (informative) text) | for Type=prf (professional text) | for Type=liv (live communication) |
|---|---|---|---|
| poe | pub | sci | spk |
| pro | adv | pop | wri |
| dra | adm | txb | |
| | | enc | |
| | | man | |
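Because the allowed Subtype values depend on Type, a consistency check can be sketched as a lookup keyed by Type. The mapping below is transcribed from the table above; the names `SUBTYPES_BY_TYPE` and `subtype_matches_type` are hypothetical.

```python
# Subtype codes allowed for each Type, as read from the table above.
SUBTYPES_BY_TYPE = {
    "img": {"poe", "pro", "dra"},
    "inf": {"pub", "adv", "adm"},
    "prf": {"sci", "pop", "txb", "enc", "man"},
    "liv": {"spk", "wri"},
}

def subtype_matches_type(text_type: str, subtype: str) -> bool:
    """Return True if the subtype is listed for the given type."""
    # Unknown types have no allowed subtypes.
    return subtype in SUBTYPES_BY_TYPE.get(text_type, set())
```

The same pattern would apply to Genre by Type and Subdomain by Domain, with the corresponding tables as data.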
Genre (Genr)
- genre.
Genre (Genr), genre: values
| for Type=img (literary (imaginative) text) | for Type=inf (journalistic (informative) text) | for Type=prf (professional text) |
|---|---|---|
| ver | doc (documentary) | mon |
| son | ann (announce) | hnd |
| scd | lst (list) | dis |
| scf | rpt (report) | std |
| scr | anl (analytic) | abs |
| nov | pbb (belles-lettres) | tcl |
| col | spc | rfl |
| ess | dsc | lct |
| mem | | crs |
| let | | crt |
| chr | | opn |
| sen | | ins |
| dia | | rig |
| | | dpl |
| | | ref |
Subgenre (SubG)
Values:
for Genre = nov, col, ver (not for ess)
- crm
- crime, detective
- scf
- sci-fi, fantasy
- adn
- adventurous, westerns
- rms
- romance novels
- bel
- belles lettres
- jun
- junior literature
- trv
- travel literature
- fac
- nonfiction
Domain (Domn)
- domain.
Values:
- ars
- artistic science
- hum
- human science
- law
- law
- nat
- natural science
- tec
- technology
- ecn
- economy, management
- blf
- belief, supernatural
- lif
- life style
- ins
- interdisciplinary science
- plt
- politics
Subdomain (SubD)
- subdomain.
Subdomain (SubD), subdomain: values
| for Domain = ars | for Domain = hum | for Domain = law | for Domain = nat | for Domain = tec |
|---|---|---|---|---|
| mus | his | bil | agr | tra |
| cin | psy | jud | med | ene |
| arc | edu | jur | pha | ind |
| art | soc | | zoo | com |
| the | phi | | bot | bui |
| lit | inf | | bio | sta |
| | pol | | che | |
| | lin | | mat | |
| | eth | | ggr | |
| | cul | | phy | |
| | swo | | met | |
| | | | geo | |
| | | | env | |

| for Domain = ecn | for Domain = blf | for Domain = lif | for Domain = ins | for Domain = plt |
|---|---|---|---|---|
| eco | rel | hou | no subdomain | no subdomain |
| mng | teo | fsh | | |
| mer | exc | spo | | |
| | | sct | | |
| | | amu | | |
| | | min | | |
| | | reg | | |
| | | cnl | | |
| | | clt | | |
Medium (Medi)
- medium.
Values:
- lib
- book
- ebk
- e-book
- nws
- newspaper
- jou
- journal
- ste
- studying materials
- net
- the Internet and other (pre-internet) networks. These include specific Internet newspapers, websites, e-mail, usenet contributions, contributions to fora, and live communication. Note that print newspapers downloaded from the Internet are „nws“, electronic books intended primarily for publishing are „lib“, but the e-books primarily intended for on-screen viewing are „net“.
- for
- form
- occ
- occasional (miscellanies)
- npu
- non-published texts, handwritings
- tvf
- television, cinema
- rad
- radio
Authsex (AutS)
- sex of author.
Values:
- msc
- masculine
- fem
- feminine
Lang (Lang)
- language of the work as a three-letter ISO 639-2 code; “slk” stands for the Slovak language. The value is generated automatically. Non-Slovak texts do not usually occur in the Corpus.
Varieta (Vari)
- language variant of document. It is Slovak mostly.
Values:
- std
- standard Slovak
- nst
- non-standard Slovak
- ost
- old standard / before the orthography reform in 1953
Paragraphs (Para)
- determines the text division.
Values:
- tru
- true; text divided into paragraphs
- fls
- false; information on text division lost
Emphasis (Emph)
- information on presence of an original highlighted text.
Values:
- tru
- true
- fls
- false
Diacritics (Dcrt)
- text with correct or incorrect diacritics.
Values:
- tru
- true; correct diacritic marks
- fls
- false; incorrect or missing diacritic marks
Transsex (TrnS)
- sex of translator, see Authsex.
Origlang (OrgL)
- original language of the work according to ISO 639-3 (http://www.sil.org/iso639-3/codes.asp). Translations of already translated texts are marked with „>“ (U+003E GREATER-THAN SIGN), for example: eng>ger.
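A value such as eng>ger can be split into its chain of language codes. This is a sketch only; the function name `split_origlang` is hypothetical, and the code simply preserves the order in which the codes are written without interpreting which end is the original language.

```python
def split_origlang(value: str) -> list[str]:
    """Split an Origlang value such as 'eng>ger' into its language codes."""
    # A value without '>' yields a single-element list.
    return [code.strip() for code in value.split(">")]
```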
Date (Date)
- issue date.
Dateorig (OrgD)
- original issue date (first issue, it might be identical with “Date”), original issue date of translations.
Conglomerate (Cong)
- identification of conglomerate which the text is a part of.
Bogocong (Bogo)
- Multi-letter record of a conglomerate.
Comment (Comn)
- comment. It differs from the comment in archives.
Corrected (Corr)
- Document corrected or not.
Bibliography (Bibl)
- bibliography.
Noises
Noise is marked by the XML element <noise/>, which replaces the unidentified parts of a text. It can be relevant for spoken and diachronic corpora.
Images
- there are two types of images.
image
A big image has an information value of its own. It is marked <picture/> or <picture caption="information on picture"/>.
Head-lines
There is only one type of head-line, it is marked <h1></h1>.
Highlighted text
There is only one type of highlighted text, it is marked <em></em>.
Hyphen/dash
If the type cannot be easily identified, we use U+002D HYPHEN-MINUS (-). U+2010 HYPHEN is used, for example, in „Rakúsko-Uhorsko“.
U+2014 EM DASH (—) is used for writing dashes, e.g. „Peniaze — radosť“. U+2212 MINUS SIGN (−) can be used as a unary or binary operator. However, the operator is unlikely to be distinguished in the source document; in that case we use U+002D HYPHEN-MINUS (-).
U+00AD SOFT HYPHEN is not clearly defined. It does not usually appear in the Corpus.
U+2011 NON-BREAKING HYPHEN is equal to U+2010 HYPHEN and never used in the Corpus.
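Since U+00AD and U+2011 are not expected in the Corpus, a text can be scanned for them. The following is a minimal sketch using Python's standard `unicodedata` module; the function name `find_unexpected_hyphens` is hypothetical.

```python
import unicodedata

# Hyphen-like characters that normally do not appear in the Corpus.
UNEXPECTED = {"\u00ad", "\u2011"}

def find_unexpected_hyphens(text: str) -> list[tuple[int, str]]:
    """Return (position, Unicode name) for each soft or non-breaking hyphen."""
    return [
        (i, unicodedata.name(ch))
        for i, ch in enumerate(text)
        if ch in UNEXPECTED
    ]
```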
Formulae
Mathematical, chemical and other formulae are marked <equation/>. Simple formulae, chemical compounds, reactions, etc. that are used by the general public (such as H₂O) are written with Unicode characters. We do not use LETTERLIKE SYMBOLS (e.g. instead of U+212A KELVIN SIGN we use U+004B LATIN CAPITAL LETTER K).
For subscripts and superscripts we use the corresponding Unicode characters, such as U+00B9 SUPERSCRIPT ONE, U+2074 SUPERSCRIPT FOUR and U+207B SUPERSCRIPT MINUS, e.g. 10⁶ km².
For multiplication signs we use U+00D7 MULTIPLICATION SIGN × or U+00B7 MIDDLE DOT ·, depending on the source text. We do not make corrections! If “H2O” is used in the original text, we use “H2O”.
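One possible way to replace letterlike signs is Unicode normalization form NFC, which maps U+212A KELVIN SIGN to U+004B and leaves superscripts, subscripts and plain digits unchanged. This is an assumption for illustration, not the procedure actually used for the Corpus. It covers only the signs that have a canonical equivalent (U+212A, U+2126 OHM SIGN, U+212B ANGSTROM SIGN), and NFC also composes decomposed accented letters, which may or may not be wanted.

```python
import unicodedata

def fold_letterlike(text: str) -> str:
    """Replace letterlike signs that have a canonical equivalent (NFC)."""
    # NFC does not touch superscript or subscript digits, so "10⁶ km²"
    # and "H₂O" stay as they are; a plain "H2O" is not "corrected" either.
    return unicodedata.normalize("NFC", text)
```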
Tables
Tables are tagged as <table/> or <table caption="information on table"/>.
Quotation marks
Quotation marks of the original document are used. There are the following main styles:
"double English ASCII quotation marks"
- U+0022 QUOTATION MARK U+0022 QUOTATION MARK
'single English ASCII quotation marks'
- U+0027 APOSTROPHE U+0027 APOSTROPHE
„correct Slovak double quotation marks“
- U+201E DOUBLE LOW-9 QUOTATION MARK U+201C LEFT DOUBLE QUOTATION MARK (misleading UNICODE name!)
„incorrect Slovak double quotation marks”
- U+201E DOUBLE LOW-9 QUOTATION MARK U+201D RIGHT DOUBLE QUOTATION MARK
‚correct Slovak single quotation marks‘
- U+201A SINGLE LOW-9 QUOTATION MARK U+2018 LEFT SINGLE QUOTATION MARK (misleading UNICODE name!)
‚incorrect Slovak single quotation marks’
- U+201A SINGLE LOW-9 QUOTATION MARK U+2019 RIGHT SINGLE QUOTATION MARK
“correct English double quotation marks”
- U+201C LEFT DOUBLE QUOTATION MARK U+201D RIGHT DOUBLE QUOTATION MARK
‘correct English single quotation marks’
- U+2018 LEFT SINGLE QUOTATION MARK U+2019 RIGHT SINGLE QUOTATION MARK
‹guillemet single›
- U+2039 SINGLE LEFT-POINTING ANGLE QUOTATION MARK U+203A SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
«guillemet double»
- U+00AB LEFT-POINTING DOUBLE ANGLE QUOTATION MARK U+00BB RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
›inverted guillemet single‹
- U+203A SINGLE RIGHT-POINTING ANGLE QUOTATION MARK U+2039 SINGLE LEFT-POINTING ANGLE QUOTATION MARK
»inverted guillemet double«
- U+00BB RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK U+00AB LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
There is a difference between U+0027 APOSTROPHE (') and U+2019 RIGHT SINGLE QUOTATION MARK (’), then between U+2039 SINGLE LEFT-POINTING ANGLE QUOTATION MARK (‹) and U+003C LESS-THAN SIGN (<) (likewise the right-pointing) as well as between U+201A SINGLE LOW-9 QUOTATION MARK (‚) and U+002C COMMA (,).
If incorrect quotation marks are found in the source document (e.g. ,comma and apostrophe' or , ,two commas and two apostrophes' '), we do not change them. They will be corrected during the transformation from the bank into the corpusoid. In LaTeX, two commas (, ,) stand for the double low-9 quotation mark!
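The opening and closing pairs listed above can be collected in a lookup table to identify the style of a quoted string. This is a sketch for illustration; the names `QUOTE_STYLES` and `quote_style` are hypothetical, and the function only looks at the first and last character.

```python
# (opening, closing) code points mapped to the style names used above.
QUOTE_STYLES = {
    ("\u0022", "\u0022"): "double English ASCII",
    ("\u0027", "\u0027"): "single English ASCII",
    ("\u201e", "\u201c"): "correct Slovak double",
    ("\u201e", "\u201d"): "incorrect Slovak double",
    ("\u201a", "\u2018"): "correct Slovak single",
    ("\u201a", "\u2019"): "incorrect Slovak single",
    ("\u201c", "\u201d"): "correct English double",
    ("\u2018", "\u2019"): "correct English single",
    ("\u2039", "\u203a"): "guillemet single",
    ("\u00ab", "\u00bb"): "guillemet double",
    ("\u203a", "\u2039"): "inverted guillemet single",
    ("\u00bb", "\u00ab"): "inverted guillemet double",
}

def quote_style(quoted: str) -> str:
    """Name the quotation style by the first and last character."""
    if len(quoted) < 2:
        return "unrecognized"
    return QUOTE_STYLES.get((quoted[0], quoted[-1]), "unrecognized")
```

A string that opens with U+002C COMMA instead of U+201A is reported as unrecognized, which matches the distinction between look-alike characters drawn above.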
Some keys in the bank
Conglomerate
For books and similar works, a conglomerate consists of the author's name, a hyphen (-) and the name of the work.
For journals, newspapers, etc. conglomerates are:
* journals
- www.bratislava.sk 1993-oct
- Kynologická revue 2001/05 (monthly, single issue in a conglomerate)
- Literárny týždenník 1998-oct (weekly, whole month in a conglomerate)
- Služba slova 2003/1 (monthly)
* newspapers
- Sninské noviny 2004/20
- SME 1998-may
* miscellanies
- Zborník Slovenského národného múzea - História 39
- Jozef Mlacek (Red.) - Studia Academica Slovaca 26
* books and similar works
- Elizabeth Adlerová - Žena je šťastie
- Bhagavadgíta
- Martin Ondrejka - Štúdium zirkónu a jeho využitie v súčasnej magmatickej petrológii
- Martin Pipíška, Jozef Augustín - Možnosti a obmedzenia využitia biologických systémov pre remediáciu pôdy kontaminovanej rádionuklidmi
Bogocong
For authorial books, the bogocong consists of a multi-letter abbreviation: the author's initials and an ordinal number assigned to the work of that author (starting at 1). For collaborative works, we use only the first letters of the surnames and the ordinal number. For journals and newspapers, the bogocong consists of a journal abbreviation followed by YY/MM (YY - year, MM - month) or YY/CC (CC – journal issue).
* journals
- BA10/93
- KR01/05
- LT10/98
- SS01/03
* newspapers
- SN04/20
- SME 98/05
* miscellanies
- HIS39
- SAS26
* books and similar works
- EAdl5
- BHAG1
- MOnd1
- PA1
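Two of the stated rules can be sketched in code: the collaborative-work rule (first letters of surnames plus ordinal number) and the periodical rule in its YY/CC form. Both function names are hypothetical. Note that some of the examples listed above appear to order the two numbers differently, so this sketch illustrates only the rule as stated, not every listed example.

```python
def bogocong_collaborative(surnames: list[str], ordinal: int) -> str:
    """First letters of the surnames followed by the ordinal number."""
    return "".join(s[0].upper() for s in surnames) + str(ordinal)

def bogocong_periodical(abbreviation: str, year: int, issue: int) -> str:
    """Abbreviation followed by YY/CC (two-digit year, two-digit issue)."""
    return f"{abbreviation}{year % 100:02d}/{issue:02d}"
```

For instance, the surnames Pipíška and Augustín with ordinal 1 give PA1, and the abbreviation SN with year 2004 and issue 20 gives SN04/20, both of which agree with the examples above.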
Bibliography
* journals
- http://www.bratislava.sk. Bratislava: 2004.
- Kynologická revue. Veľká Ida: Ster.
- Literárny týždenník, Bratislava: Vydavateľstvo Spolku Slovenských spisovateľov 1997.
- Služba slova. Homiletická príloha Cirkevných listov pre evanjelických a.v. kňazov. Bratislava: VMV ECAV, 2003, roč. 52, č. 1.
* newspapers
- Sninské noviny. Regionálny týždenník. Snina: Ing. Michal Fečík - PRESS, 2004, roč. 2, č. 20.
- SME. Denník. Bratislava: Petit Press 7.5.1998
* miscellanies
- Zborník Slovenského národného múzea - História, 1999, roč. 93, č. 39.
- Studia Academica Slovaca 30. Prednášky XXXIII. letnej školy slovenského jazyka a kultúry. Red. J. Mlacek. Bratislava: Stimul 1997. 289 s.
* books and similar works
- Adler, Elizabeth: Žena je šťastie. Bratislava: Práca 1993. 486 s. Preklad: Anna Rácová.
- Bhagavadgíta: rozhovor Boha s človekom. Bratislava: Hevi 1997. 111 s. Preklad: Milan Polášek.
- Ondrejka, Martin: Štúdium zirkónu a jeho využitie v súčasnej magmatickej petrológii. Práca k doktorandskému minimu. Bratislava: Prírodovedecká fakulta UK 2002. 59 s.
- Pipíška, Martin, Augustín, Jozef: Možnosti a obmedzenia využitia biologických systémov pre remediáciu pôdy kontaminovanej rádionuklidmi. In: Nova Biotechnologica III, Revue fakulty prírodných vied UCM Trnava, 2003, č. 2, s. 18-31.