========================================
Prague Discourse Treebank 4.0 (PDiT 4.0)
========================================


Authors
=======
Pavlína Synková (Charles University, Faculty of Mathematics and Physics),
Jiří Mírovský (Charles University, Faculty of Mathematics and Physics),
Marie Paclíková (Charles University, Faculty of Arts),
Lucie Poláková (Charles University, Faculty of Mathematics and Physics),

and from previous versions:
Magdaléna Rysová, Veronika Scheller, Jana Zdeňková, Šárka Zikánová, Eva Hajičová


Introduction
============

The Prague Discourse Treebank 4.0 (PDiT 4.0; Synková et al., 2024) is
an annotation of discourse relations marked by primary and secondary
discourse connectives in the whole data of the Prague Dependency
Treebank - Consolidated 2.0 (PDT-C 2.0; Hajič et al., 2024).
Compared with the previous versions, covering the whole PDT-C 2.0
significantly increases the size of the annotated data.

PDiT 4.0 annotates discourse relations in the whole PDT-C 2.0,
i.e. in all of its four subcorpora:

- Prague Dependency Treebank (PDT); in the previous versions,
  PDiT only covered these data
- Prague Czech-English Dependency Treebank, the Czech part (PCEDT-cz);
  newly annotated in PDiT 4.0
- Prague Dependency Treebank of Spoken Czech (PDTSC); newly annotated
  in PDiT 4.0
- Faust; newly annotated in PDiT 4.0

Since version 3.0, PDiT has used two taxonomies of discourse relation
types:

- the original Prague taxonomy of discourse types, and
- the Penn Discourse Treebank 3.0 (PDTB 3.0; Prasad et al., 2019) taxonomy
  of discourse senses.

Also since version 3.0, the PDiT data have been offered in two formats:

- the native format of the PDT-C 2.0, i.e. the Prague Markup Language,
  where the discourse relations are annotated on top of deep-syntax
  dependency trees (tectogrammatics), and
- the Penn Discourse Treebank 3.0 (PDTB 3.0; Prasad et al., 2019) format
  of stand-off discourse annotation on plain texts.

Please visit https://ufal.mff.cuni.cz/pdit4.0 for detailed and
updated information about the corpus.

Published in December 2024.


How to Get the Data
===================

The Prague Discourse Treebank 4.0 can be downloaded from the LINDAT-CLARIAH-CZ
repository in its two data formats:

- PDiT 4.0 in the PML data format, as a part of the PDT-C 2.0 (all underlying
  annotation layers + discourse annotation as a part of the tectogrammatical
  layer): http://hdl.handle.net/11234/1-5813,
- PDiT 4.0 in the PDTB 3.0 data format (raw texts + stand-off discourse
  annotation): http://hdl.handle.net/11234/1-5680.


Data Format
===========

Data in the PML format
----------------------

The tree editor TrEd (Pajas and Štěpánek, 2008) can be used to open and
browse the data in the PML format. The editor can be downloaded for various
platforms from its home page (https://ufal.mff.cuni.cz/tred/). Please
follow the installation instructions given on that page for your
operating system.

After installing TrEd, install the required extension:

1. Start TrEd.
2. In the top menu, select Setup -> Manage Extensions...; a dialog window
   with a list of installed extensions appears.
3. Click on the button "Get New Extensions"; a dialog window with a list
   of available (not yet installed) extensions appears.
4. Make sure that at least the appropriate version of the extension "Prague
   Dependency Treebank - Consolidated" is checked for installation (if it
   is not in the list, it may have already been installed).
5. Click on the button "Install Selected"; the selected extensions (along
   with their dependencies) get installed.
6. Close all TrEd windows, including the main application window, and start
   TrEd again.

To see the discourse annotation of a document on the tectogrammatical layer,
open the respective file with the extension .t.gz. By default, orange
discourse arrows are displayed without any additional information. Press 'd'
to see more discourse-related information.

In case of trouble with the installation of TrEd or with browsing the data,
please contact the authors at (tred at ufal.mff.cuni.cz).

Data in the PDTB format
-----------------------

Please find a description of the fields of the PDTB stand-off format in
the Appendix.

The data are compatible with the Penn Discourse Treebank annotation tool
Annotator (Lee et al., 2016).


Citation
========

Please cite PDiT 4.0 when using the corpus for your research:

Pavlína Synková, Jiří Mírovský, Marie Paclíková, Lucie Poláková,
Magdaléna Rysová, Veronika Scheller, Jana Zdeňková, Šárka Zikánová
and Eva Hajičová: Prague Discourse Treebank 4.0. Data/software,
ÚFAL MFF UK, Prague, Czech Republic, LINDAT/CLARIAH-CZ:
http://hdl.handle.net/11234/1-5680, Dec 2024.


Licence
=======

PDiT 4.0 is distributed under the Creative Commons
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) licence.

For more information and updates, see
https://ufal.mff.cuni.cz/pdit4.0


Acknowledgement
===============

The work on version 4.0 of the corpus was financed by GAČR project
22-03269S "Methods for rapid discourse annotation in selected corpora". 

This work has used language resources developed, stored or
distributed by the LINDAT/CLARIAH-CZ project of the Ministry
of Education of the Czech Republic (project LM2023062).


Appendix - Description of the column format
===========================================

The column format used in PDiT 4.0 consists of 44 fields.
Fields 0-33 correspond to the fields defined in the PDTB 3.0; their
description is taken from the PDTB 3.0 annotation manual and slightly
adapted for PDiT 4.0. Fields 34-43 carry additional information added
only in PDiT 4.0. Some of the fields are not used in PDiT 4.0 but are
kept for compatibility with the PDTB 3.0 format (they are marked
as 'not used'):

0 - Relation Type - Explicit, AltLex, AltLexC
1 - Conn SpanList - SpanList of the Explicit Connective or
    the AltLex/AltLexC selection
2 - Conn Src - Connective’s Source (not used)
3 - Conn Type - Connective’s Type (not used)
4 - Conn Pol - Connective’s Polarity (not used)
5 - Conn Det - Connective’s Determinacy (not used)
6 - Conn Feat SpanList - Connective’s Feature SpanList (not used)
7 - Conn1 - Explicit Connective Head
8 - SClass1A - Semantic Class of the Connective
9 - SClass1B - Second Semantic Class of the First Connective (not used)
10 - Conn2 - Second Implicit Connective (not used)
11 - SClass2A - First Semantic Class of the Second Connective (not used)
12 - SClass2B - Second Semantic Class of the Second Connective (not used)
13 - Sup1 SpanList - SpanList of the First Argument’s Supplement
14 - Arg1 SpanList - SpanList of the First Argument
15 - Arg1 Src - First Argument’s Source (not used)
16 - Arg1 Type - First Argument’s Type (not used)
17 - Arg1 Pol - First Argument’s Polarity (not used)
18 - Arg1 Det - First Argument’s Determinacy (not used)
19 - Arg1 Feat SpanList - SpanList of the First Argument’s Feature (not used)
20 - Arg2 SpanList - SpanList of the Second Argument
21 - Arg2 Src - Second Argument’s Source (not used)
22 - Arg2 Type - Second Argument’s Type (not used)
23 - Arg2 Pol - Second Argument’s Polarity (not used)
24 - Arg2 Det - Second Argument’s Determinacy (not used)
25 - Arg2 Feat SpanList - SpanList of the Second Argument’s Feature (not used)
26 - Sup2 SpanList - SpanList of the Second Argument’s Supplement
27 - Adju Reason - The Adjudication Reason (not used)
28 - Adju Disagr - The type of the Adjudication disagreement (not used)
29 - PB Role - The PropBank role of the PropBank verb (not used)
30 - PB Verb - The PropBank verb of the main clause of this relation (not used)
31 - Offset - The Conn SpanList of Explicit/AltLex/AltLexC tokens
32 - Provenance - Indicates whether the token is a new PDTB3 token
     or has a corresponding PDTB2 token (not used)
33 - Link - The link id of the token
34 - Discourse Type - The original discourse type in the Prague taxonomy
35 - Conn Text - Text representation of field 31 (Offset)
36 - Conn Feat Text - Text representation of field 6 (Conn Feat SpanList) (not used)
37 - Sup1 Text - Text representation of field 13 (Sup1 SpanList)
38 - Arg1 Text - Text representation of field 14 (Arg1 SpanList)
39 - Arg1 Feat Text - Text representation of field 19 (Arg1 Feat SpanList) (not used)
40 - Arg2 Text - Text representation of field 20 (Arg2 SpanList)
41 - Arg2 Feat Text - Text representation of field 25 (Arg2 Feat SpanList) (not used)
42 - Sup2 Text - Text representation of field 26 (Sup2 SpanList)
43 - Genre - The genre of the document
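
The field list above can be turned into a small reader. The following is
a minimal Python sketch, not part of the PDiT 4.0 distribution. It assumes
that each relation is stored on one line with the 44 fields separated by
a pipe character ("|") and that a SpanList is written as "start..end"
pairs joined by ";", as in the PDTB 3.0 format; please check both
assumptions against the actual data files. The function names
parse_relation and parse_span_list are hypothetical.

```python
from typing import Dict, List, Tuple

# Field names in the order given in the Appendix (indices 0-43).
FIELD_NAMES = [
    "Relation Type", "Conn SpanList", "Conn Src", "Conn Type", "Conn Pol",
    "Conn Det", "Conn Feat SpanList", "Conn1", "SClass1A", "SClass1B",
    "Conn2", "SClass2A", "SClass2B", "Sup1 SpanList", "Arg1 SpanList",
    "Arg1 Src", "Arg1 Type", "Arg1 Pol", "Arg1 Det", "Arg1 Feat SpanList",
    "Arg2 SpanList", "Arg2 Src", "Arg2 Type", "Arg2 Pol", "Arg2 Det",
    "Arg2 Feat SpanList", "Sup2 SpanList", "Adju Reason", "Adju Disagr",
    "PB Role", "PB Verb", "Offset", "Provenance", "Link",
    "Discourse Type", "Conn Text", "Conn Feat Text", "Sup1 Text",
    "Arg1 Text", "Arg1 Feat Text", "Arg2 Text", "Arg2 Feat Text",
    "Sup2 Text", "Genre",
]


def parse_relation(line: str, sep: str = "|") -> Dict[str, str]:
    """Split one annotation line into a dict keyed by field name.

    The separator is an assumption (PDTB 3.0 convention), not a fact
    stated in this README.
    """
    values = line.rstrip("\n").split(sep)
    if len(values) != len(FIELD_NAMES):
        raise ValueError(
            f"expected {len(FIELD_NAMES)} fields, got {len(values)}"
        )
    return dict(zip(FIELD_NAMES, values))


def parse_span_list(span_list: str) -> List[Tuple[int, int]]:
    """Parse 'start..end;start..end' into (start, end) integer pairs.

    The span syntax is assumed from the PDTB 3.0 format; an empty
    string yields an empty list.
    """
    spans = []
    for part in span_list.split(";"):
        if part:
            start, end = part.split("..")
            spans.append((int(start), int(end)))
    return spans


# Usage with a made-up line (not taken from the corpus):
values = [""] * len(FIELD_NAMES)
values[0] = "Explicit"
values[20] = "10..25;30..35"
relation = parse_relation("|".join(values))
arg2_spans = parse_span_list(relation["Arg2 SpanList"])
```

Fields marked as 'not used' are expected to be empty in the data, so
they come back as empty strings in the resulting dictionary.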


References
==========

Jan Hajič et al.: Prague Dependency Treebank - Consolidated 2.0 (PDT-C 2.0).
Data/software, LINDAT-CLARIAH, URL: http://hdl.handle.net/11234/1-5813, 2024.

Alan Lee, Rashmi Prasad, Bonnie Webber and Aravind Joshi: Annotating
discourse relations with the PDTB Annotator. In Proceedings of
COLING 2016, the 26th International Conference on Computational
Linguistics: System Demonstrations, pp. 121-125, 2016.

Petr Pajas and Jan Štěpánek: Recent Advances in a Feature-Rich Framework
for Treebank Annotation. In: The 22nd International Conference on
Computational Linguistics - Proceedings of the Conference, The Coling
2008 Organizing Committee, Manchester, UK, ISBN 978-1-905593-45-3,
pp. 673-680, 2008.

Rashmi Prasad, Bonnie Webber, Alan Lee and Aravind Joshi: 
Penn Discourse Treebank Version 3.0. Data/Software, Linguistic Data
Consortium, University of Pennsylvania, Philadelphia, LDC2019T05, 2019.

Pavlína Synková, Magdaléna Rysová, Jiří Mírovský, Lucie Poláková,
Veronika Scheller, Jana Zdeňková, Šárka Zikánová and Eva Hajičová:
Prague Discourse Treebank 3.0. Data/software, ÚFAL MFF UK, Prague,
Czech Republic, LINDAT/CLARIAH-CZ: http://hdl.handle.net/11234/1-4875,
Dec 2022.

Pavlína Synková, Jiří Mírovský, Marie Paclíková, Lucie Poláková,
Magdaléna Rysová, Veronika Scheller, Jana Zdeňková, Šárka Zikánová
and Eva Hajičová: Prague Discourse Treebank 4.0. Data/software,
ÚFAL MFF UK, Prague, Czech Republic, LINDAT/CLARIAH-CZ:
http://hdl.handle.net/11234/1-5680, Dec 2024.