ELITR Minuting Corpus consists of transcripts of meetings in Czech and English, their manually created summaries ("minutes") and manual alignments between the two.
Czech meetings are in the computer science and public administration domains and English meetings are in the computer science domain.
Each transcript has one or multiple corresponding minutes files. Alignments are only provided for a portion of the data.
This corpus contains 59 Czech and 120 English meeting transcripts, consisting of 71097 and 87322 dialogue turns respectively. For Czech meetings, we provide 147 total minutes with 55 of them aligned. For English meetings, it is 256 total minutes with 111 of them aligned.
Please find a more detailed description of the data in the included README and stats.tsv files.
If you use this corpus, please cite:
Nedoluzhko, A., Singh, M., Hledíková, M., Ghosal, T., and Bojar, O.
(2022). ELITR Minuting Corpus: A novel dataset for automatic minuting
from multi-party meetings in English and Czech. In Proceedings of the
13th International Conference on Language Resources and Evaluation
(LREC-2022), Marseille, France, June. European Language Resources
Association (ELRA). In print.
@inproceedings{elitr-minuting-corpus:2022,
author = {Anna Nedoluzhko and Muskaan Singh and Marie
Hled{\'{\i}}kov{\'{a}} and Tirthankar Ghosal and Ond{\v{r}}ej Bojar},
title = {{ELITR} {M}inuting {C}orpus: {A} Novel Dataset for
Automatic Minuting from Multi-Party Meetings in {E}nglish and {C}zech},
booktitle = {Proceedings of the 13th International Conference
on Language Resources and Evaluation (LREC-2022)},
year = 2022,
month = {June},
address = {Marseille, France},
publisher = {European Language Resources Association (ELRA)},
note = {In print.}
}
A richly annotated and genre-diversified language resource, The Prague Dependency Treebank – Consolidated 1.0 (PDT-C 1.0, or PDT-C in short in the sequel) is a consolidated release of the existing PDT-corpora of Czech data, uniformly annotated using the standard PDT scheme. PDT-corpora included in PDT-C: Prague Dependency Treebank (the original PDT contents, written newspaper and journal texts from three genres); Czech part of Prague Czech-English Dependency Treebank (translated financial texts, from English), Prague Dependency Treebank of Spoken Czech (spoken data, including audio and transcripts and multiple speech reconstruction annotation); PDT-Faust (user-generated texts). The difference from the separately published original treebanks can be briefly described as follows: it is published in one package, to allow easier data handling for all the datasets; the data is enhanced with a manual linguistic annotation at the morphological layer and new version of morphological dictionary is enclosed; a common valency lexicon for all four original parts is enclosed. Documentation provides two browsing and editing desktop tools (TrEd and MEd) and the corpus is also available online for searching using PML-TQ.
Ministerstvo školství, mládeže a tělovýchovy České republiky@@LM2018101@@LINDAT/CLARIAH-CZ: Digitální výzkumná infrastruktura pro jazykové technologie, umění a humanitní vědy@@nationalFunds@@✖[remove]2