The corpus contains Czech expressive speech recorded using scenario-based approach by a professional female speaker. The scenario was created on the basis of previously recorded natural dialogues between a computer and seniors. and European Commission Sixth Framework Programme
Information Society Technologies Integrated Project IST-34434
PDTSC 1.0 is a multi-purpose corpus of spoken language. 768,888 tokens, 73,374 sentences and 7,324 minutes of spontaneous dialog speech have been recorded, transcribed and edited in several interlinked layers: audio recordings, automatic and manual transcription and manually reconstructed text.
PDTSC 1.0 is a delayed release of data annotated in 2012. It is an update of Prague Dependency Treebank of Spoken Language (PDTSL) 0.5 (published in 2009). In 2017, Prague Dependency Treebank of Spoken Czech (PDTSC) 2.0 was published as an update of PDTSC 1.0.
The Prague Dependency Treebank 3.5 is the 2018 edition of the core Prague Dependency Treebank (PDT). It contains all PDT annotation made at the Institute of Formal and Applied Linguistics under various projects between 1996 and 2018 on the original texts, i.e., all annotation from PDT 1.0, PDT 2.0, PDT 2.5, PDT 3.0, PDiT 1.0 and PDiT 2.0, plus corrections, new structure of basic documentation and new list of authors covering all previous editions. The Prague Dependency Treebank 3.5 (PDT 3.5) contains the same texts as the previous versions since 2.0; there are 49,431 annotated sentences (832,823 words) on all layers, from tectogrammatical annotation to syntax to morphology. There are additional annotated sentences for syntax and morphology; the totals for the lower layers of annotation are: 87,913 sentences with 1,502,976 words at the analytical layer (surface dependency syntax) and 115,844 sentences with 1,956,693 words at the morphological layer of annotation (these totals include the annotation with the higher layers annotated as well). Closely linked to the tectogrammatical layer is the annotation of sentence information structure, multiword expressions, coreference, bridging relations and discourse relations.
The first edition of a speech corpus with a speech reconstruction layer (edited transcript).
The project of speech reconstruction of Czech and English has been started at UFAL together with the PIRE project in 2005, and has gradually grown from ideas to (first) annotation specification, annotation software and actual annotation. It is part of the Prague Dependency Treebank family of annotated corpus resources and tools, to which it adds the spoken language layer(s). and LC536; MSM0021620838; IST-034344; ME838