=============================================================
Human Label Variation in Attribution and Discourse (Hlava AD)
=============================================================

Authors
=======

Šárka Zikánová (Charles University, Faculty of Mathematics and Physics),
Jiří Mírovský (Charles University, Faculty of Mathematics and Physics),
Anna Nedoluzhko (Charles University, Faculty of Mathematics and Physics),
Eva Hajičová (Charles University, Faculty of Mathematics and Physics),
Šárka Dohnalová (Charles University, Faculty of Arts),
Anna Kmječová,
Eliška Nodlová (Charles University, Faculty of Arts),
Dominik Teska (Charles University, Faculty of Arts)

Introduction
============

Human Label Variation in Attribution and Discourse (Hlava AD) is a collection of commented parallel annotations (five annotators) of explicit inter-sentential discourse relations in Czech. Each relation holds between a complex sentence containing a verb of attribution (saying, thinking) and the sentence that follows it. The main aim of the annotation is to capture how often the following sentence is understood as continuing the direct/reported speech and how often as continuing the author's own speech. The dataset also contains fillers (complex sentences with other types of verbs).

Please visit https://ufal.mff.cuni.cz/hvar/hlava-ad for detailed and updated information about the corpus.

Data Description
================

Hlava AD comprises 512 sentence pairs (221 attributions, 291 fillers), each accompanied by the preceding context. Each pair is annotated independently by five annotators in parallel. The annotators' decisions are documented in the "Commentary" column. Additionally, each annotator indicates whether they required the preceding context to make their annotation.

The texts come from the Prague Dependency Treebank - Consolidated 1.0 (Hajič et al., 2020).

Data Source and Format
======================

After downloading the corpus from http://hdl.handle.net/11234/1-5819, the annotations can be found in directory data.
The annotations are available in three data formats:

- Hlava_AD.xls - MS Excel table
- Hlava_AD.ods - Open Document Format Spreadsheet
- Hlava_AD.tsv - tab-separated values, UTF-8, no column headings

Annotation instructions (in Czech) can be found in directory doc.

Description of the column format
================================

The column format used in Hlava AD consists of 42 fields:

- Fields A-F identify the items in the Hlava AD corpus and in the source corpora.
- Fields G-K carry additional information about the original discourse annotation in the source corpora.
- Fields L-O contain the annotators' prompts (preceding context, the pair of observed sentences and the discourse connective).
- Fields P-AA summarize the results of the multiple annotations (counts and overviews of the target clauses of the discourse relations and of the degree to which the annotators needed the preceding context for understanding).
- Fields AB-AP comprise the individual annotations. For each annotator, they give the target clause of the discourse relation, the level of the need for context and the annotator's free commentary.

Item identification
-------------------

ID: Item identification number in the Hlava AD corpus

Role:
- attribution (the left sentence contains a finite verb of attribution)
- filler (the left sentence does not contain verbs of attribution)

Source
------

Subcorpus:
- PDTSC (Prague Dependency Treebank of Spoken Czech)
- PDT (Prague Dependency Treebank)

Mode:
- spoken
- written

File: file name in the source corpus

Start node id: identifier of the starting node of the discourse relation in the source corpus (can be in the left or in the right sentence, depending on the semantics of the relation).

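The field groups listed at the start of this section (A-F, G-K, L-O, P-AA, AB-AP) can be explored with a minimal Python sketch. Because Hlava_AD.tsv has no column headings, the sketch addresses fields by the spreadsheet letters used in this README. It assumes that the TSV parses as plain tab-separated text without quoting and that every row has exactly 42 fields; the names COLUMNS, GROUPS, read_items and select_group are illustrative only and are not part of the corpus distribution.

```python
import csv

FIELD_COUNT = 42  # number of fields stated in this README


def column_letters(count):
    """Return spreadsheet-style column letters: A, B, ..., Z, AA, AB, ..."""
    letters = []
    for position in range(1, count + 1):
        name = ""
        remaining = position
        while remaining > 0:
            remaining, offset = divmod(remaining - 1, 26)
            name = chr(ord("A") + offset) + name
        letters.append(name)
    return letters


COLUMNS = column_letters(FIELD_COUNT)

# Inclusive letter ranges of the field groups listed in this README.
# The group names themselves are hypothetical labels chosen for this sketch.
GROUPS = {
    "identification": ("A", "F"),
    "source_annotation": ("G", "K"),
    "prompt": ("L", "O"),
    "summary": ("P", "AA"),
    "individual": ("AB", "AP"),
}


def read_items(lines):
    """Yield one dict per row, keyed by column letter.

    `lines` is any iterable of text lines, e.g. an open file.
    Assumption: fields are not quoted, so quote characters are kept literally.
    """
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != FIELD_COUNT:
            raise ValueError(f"expected {FIELD_COUNT} fields, got {len(row)}")
        yield dict(zip(COLUMNS, row))


def select_group(item, group):
    """Return only the fields of one group, in column order."""
    first, last = GROUPS[group]
    start, end = COLUMNS.index(first), COLUMNS.index(last)
    return {letter: item[letter] for letter in COLUMNS[start:end + 1]}


# Usage (the path is an assumption about where the archive was unpacked):
#
#   with open("data/Hlava_AD.tsv", encoding="utf-8", newline="") as handle:
#       for item in read_items(handle):
#           print(item["A"], select_group(item, "prompt"))
```

Addressing fields by letter keeps the sketch aligned with the spreadsheet versions of the data (Hlava_AD.xls, Hlava_AD.ods) without guessing the exact order of fields within each group.
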
PDT-C (annotation in the source corpora)
----------------------------------------

Attribution node = target:
- yes (the node of the finite attribution verb in the left sentence is annotated as a target (see Note 1) of the discourse relation coming from the right sentence)
- no (the discourse relation coming from the right sentence has a different target in the left sentence)
- N/A (in filler items: there is no attribution verb. The discourse relation coming from the right sentence has a different target in the left sentence.)

[Note 1: For the sake of this annotation, the term “target” means simply the other discourse argument connected to the right sentence by the discourse connective; prototypically, it is expected to be located in the left sentence; it does not reflect any semantic (temporal, causal) sequence of the discourse arguments.]

Target root: the word form serving as the root of the target clause in the left sentence (e.g. řekl, “said”; mainly a finite verb or a conjunction)

Governing verb (lemma): the tectogrammatical lemma of the governing (attribution) verb in the left sentence (e.g., říci, “to_say”)

Root of direct speech:
- yes (the root of the content (dependent) clause in the left sentence is described as a root of direct speech in the PDT-C, with the value is_dsp_root = 1)
- no (the root of the content (dependent) clause in the left sentence is not assigned this value in the PDT-C; is_dsp_root is not set)

Discourse relation: semantic type of the discourse relation between the left and the right sentences. (The set of semantic labels in the Prague Discourse Treebank and related corpora is available, e.g., in Zikánová et al., 2015.)

Annotators' prompt
------------------

Preceding context: three sentences immediately preceding the pair of sentences under observation

Left sentence: a complex sentence consisting of at least one governing and one dependent clause.
In attribution items, the governing verbs are verbs of attribution; in filler items, they are of other semantic types. The potential targets of a discourse relation coming from the right sentence are marked in red (finite verbs) and blue (conjunctions, punctuation marks for clause coordination).

Right sentence: the second sentence of the discourse relation, with a free syntactic structure. The discourse connective is marked in green. (The discourse connective, or a part of it, can also occur in the left sentence. In these cases, it is marked in green as well.)

Discourse connective (PDT-C): the form of the discourse connective as captured in the source corpus (e.g., ale, “but”).

Results summary
---------------

(a) Target variation: overview of the targets (mainly finite verbs) in the left sentence that the annotators have chosen

Target roots Σ: number of distinct target roots chosen by the annotators (with 1 meaning full inter-annotator agreement and 5 corresponding to 5 different solutions)

Ann1-5: overview of individual annotators’ solutions, e.g. řekl – má – má – řekl – řekl (said – has – has – said – said)

(b) Context required variation: overview of the degree to which the annotators felt a need for the preceding context

Context required Σ: number of distinct levels of context need reported by the annotators (with 1 meaning the same level of need for all the annotators and 3 meaning the maximal difference in the need)

Ann1-5: overview of individual annotators’ solutions, e.g. 1-0-0-2-2

Levels:
- 0 (I did not need any preceding context to understand how the discourse connective relates the right sentence to the left one)
- 1 (I needed the preceding context, it helped me to understand what the discourse connective relates the right sentence to)
- 2 (I used the preceding context but it was not enough to understand what the discourse connective relates the right sentence to)

Individual annotations
----------------------

Ann1-5: individual solutions of annotators 1-5

Target root: the root of the target clause of the discourse relation coming from the right sentence and expressed by the discourse connective marked in green
- verb form (one of the finite verbs proposed to annotators in the left sentence and marked in red, e.g. řekl, “said”)
- conjunctions and punctuation marks (one of the solutions proposed in the left sentence, marked in blue, e.g. a, “and”)
- ! (the target is not in the left sentence; it is in the preceding context or further to the left)
- ? (I do not understand the relation)

Commentary: free text (why the annotators see the target in a certain clause, explanations, paraphrases, hesitation, why they exclude some solutions)

Citation
========

Please cite Hlava AD when using the corpus for your research:

Šárka Zikánová, Jiří Mírovský, Anna Nedoluzhko, Eva Hajičová, Šárka Dohnalová, Anna Kmječová, Eliška Nodlová and Dominik Teska: Human Label Variation in Attribution and Discourse. Data/software, ÚFAL MFF UK, Prague, Czech Republic, LINDAT/CLARIAH-CZ: http://hdl.handle.net/11234/1-5819, Dec 2024.

Licence
=======

Hlava AD is distributed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) licence.

For more information and updates, see https://ufal.mff.cuni.cz/hvar/hlava-ad

References
==========

Zikánová, Š. et al. 2015. Discourse and Coherence. Prague: Charles University. Accessible from https://ufal.mff.cuni.cz/pmltqdoc/Discourse_101_Book_Chapter_8.pdf

Hajič, J. et al. 2020. Prague Dependency Treebank - Consolidated 1.0 (PDT-C 1.0). Data/software, LINDAT/CLARIAH-CZ, URL: http://hdl.handle.net/11234/1-3185.

Acknowledgement
===============

The work on Hlava AD was financed by GAČR project 24-11132S "Disagreement in corpus annotation and variation in human understanding of text". This work used language resources developed, stored or distributed by the LINDAT/CLARIAH-CZ project of the Ministry of Education of the Czech Republic (project LM2023062).