ELITR Minuting Corpus consists of transcripts of meetings in Czech and English, their manually created summaries ("minutes") and manual alignments between the two.
Czech meetings are in the computer science and public administration domains and English meetings are in the computer science domain.
Each transcript has one or multiple corresponding minutes files. Alignments are only provided for a portion of the data.
This corpus contains 59 Czech and 120 English meeting transcripts, consisting of 71097 and 87322 dialogue turns respectively. For Czech meetings, we provide 147 total minutes with 55 of them aligned. For English meetings, it is 256 total minutes with 111 of them aligned.
Please find a more detailed description of the data in the included README and stats.tsv files.
If you use this corpus, please cite:
Nedoluzhko, A., Singh, M., Hledíková, M., Ghosal, T., and Bojar, O.
(2022). ELITR Minuting Corpus: A novel dataset for automatic minuting
from multi-party meetings in English and Czech. In Proceedings of the
13th International Conference on Language Resources and Evaluation
(LREC-2022), Marseille, France, June. European Language Resources
Association (ELRA). In print.
@inproceedings{elitr-minuting-corpus:2022,
author = {Anna Nedoluzhko and Muskaan Singh and Marie
Hled{\'{\i}}kov{\'{a}} and Tirthankar Ghosal and Ond{\v{r}}ej Bojar},
title = {{ELITR} {M}inuting {C}orpus: {A} Novel Dataset for
Automatic Minuting from Multi-Party Meetings in {E}nglish and {C}zech},
booktitle = {Proceedings of the 13th International Conference
on Language Resources and Evaluation (LREC-2022)},
year = 2022,
month = {June},
address = {Marseille, France},
publisher = {European Language Resources Association (ELRA)},
note = {In print.}
}
This entry contains the SumeCzech dataset and the metric RougeRAW used for evaluation. Both the dataset and the metric are described in the paper "SumeCzech: Large Czech News-Based Summarization Dataset" by Milan Straka et al.
The dataset is distributed as a set of Python scripts which download the raw HTML pages from CommonCrawl and then process them into the required format.
The MPL 2.0 license applies to the scripts downloading the dataset and to the RougeRAW implementation.
Note: sumeczech-1.0-update-230225.zip is the updated release of the SumeCzech download script, including the original RougeRAW evaluation metric. The download script was modified to use the updated CommonCraw download URL and to support Python 3.10 and Python 3.11. However, the downloaded dataset is still exactly the same. The original archive sumeczech-1.0.zip was renamed to sumeczech-1.0-obsolete-180213.zip and is kept for reference.
SumeCzech-NER
SumeCzech-NER contains named entity annotations of SumeCzech 1.0 (Straka et al. 2018, SumeCzech: Large Czech News-Based Summarization Dataset).
Format
The dataset is split into four files. Files are in jsonl format. There is one JSON object on each line of the file. The most important fields of JSON objects are:
- dataset: train, dev, test, oodtest
- ne_abstract: list of named entity annotations of article's abstract
- ne_headline: list of named entity annotations of article's headline
- ne_text: list of name entity annotations of article's text
- url: article's URL that can be used to match article across SumeCzech and SumeCzech-NER
Annotations
We used SpaCy's NER model trained on CoNLL-based extended CNEC 2.0. The model achieved a 78.45 F-Score on the dataset's testing set. The annotations are in IOB2 format. The entity types are: Numbers in addresses, Geographical names, Institutions, Media names, Artifact names, Personal names, and Time expressions.
Tokenization
We used the following Python code for tokenization:
from typing import List
from nltk.tokenize import word_tokenize
def tokenize(text: str) -> List[str]:
for mark in ('.', ',', '?', '!', '-', '–', '/'):
text = text.replace(mark, f' {mark} ')
tokens = word_tokenize(text)
return tokens