This machine translation test set contains 2223 Czech sentences collected within the FAUST project (https://ufal.mff.cuni.cz/grants/faust, http://hdl.handle.net/11234/1-3308).
Each original (noisy) sentence was normalized (clean1 and clean2) and translated to English independently by two translators.
Grammar Error Correction Corpus for Czech (GECCC) consists of 83 058 sentences and covers four diverse domains, including essays written by native students, informal website texts, essays written by Romani ethnic minority children and teenagers and essays written by nonnative speakers. All domains are professionally annotated for GEC errors in a unified manner, and errors were automatically categorized with a Czech-specific version of ERRANT released at https://github.com/ufal/errant_czech
The dataset was introduced in the paper Czech Grammar Error Correction with a Large and Diverse Corpus that was accepted to TACL. Until published in TACL, see the arXiv version: https://arxiv.org/pdf/2201.05590.pdf
Grammar Error Correction Corpus for Czech (GECCC) consists of 83 058 sentences and covers four diverse domains, including essays written by native students, informal website texts, essays written by Romani ethnic minority children and teenagers and essays written by nonnative speakers. All domains are professionally annotated for GEC errors in a unified manner, and errors were automatically categorized with a Czech-specific version of ERRANT released at https://github.com/ufal/errant_czech
The dataset was introduced in the paper Czech Grammar Error Correction with a Large and Diverse Corpus that was accepted to TACL. Until published in TACL, see the arXiv version: https://arxiv.org/pdf/2201.05590.pdf
This version fixes double annotation errors in train and dev M2 files, and also contains more metadata information.
This package contains data used in the IWPT 2021 shared task. It contains training, development and test (evaluation) datasets. The data is based on a subset of Universal Dependencies release 2.7 (http://hdl.handle.net/11234/1-3424) but some treebanks contain additional enhanced annotations. Moreover, not all of these additions became part of Universal Dependencies release 2.8 (http://hdl.handle.net/11234/1-3687), which makes the shared task data unique and worth a separate release to enable later comparison with new parsing algorithms. The package also contains a number of Perl and Python scripts that have been used to process the data during preparation and during the shared task. Finally, the package includes the official primary submission of each team participating in the shared task.
Mapping table for the article Hajič et al., 2024: Mapping Czech Verbal Valency to PropBank Argument Labels, in LREC-COLING 2024, as preprocess by the algorithm described in the paper. This dataset i smeant for verification (replicatoin) purposes only. It will b manually processed further to arrive at a workable CzezchpropBank, to be used in Czech UMR annotation, to be further updated during the annotation. The resulting PropBank frame files fir Czech are expected to be available with some future releases of UMR, containing Czech UMR annotation, or separately.
We define "optimal reference translation" as a translation thought to be the best possible that can be achieved by a team of human translators. Optimal reference translations can be used in assessments of excellent machine translations.
We selected 50 documents (online news articles, with 579 paragraphs in total) from the 130 English documents included in the WMT2020 news test (http://www.statmt.org/wmt20/) with the aim to preserve diversity (style, genre etc.) of the selection. In addition to the official Czech reference translation provided by the WMT organizers (P1), we hired two additional translators (P2 and P3, native Czech speakers) via a professional translation agency, resulting in three independent translations. The main contribution of this dataset are two additional translations (i.e. optimal reference translations N1 and N2), done jointly by two translators-cum-theoreticians with an extreme care for various aspects of translation quality, while taking into account the translations P1-P3. We publish also internal comments (in Czech) for some of the segments.
Translation N1 should be closer to the English original (with regards to the meaning and linguistic structure) and female surnames use the Czech feminine suffix (e.g. "Mai" is translated as "Maiová"). Translation N2 is more free, trying to be more creative, idiomatic and entertaining for the readers and following the typical style used in Czech media, while still preserving the rules of functional equivalence. Translation N2 is missing for the segments where it was not deemed necessary to provide two alternative translations. For applications/analyses needing translation of all segments, this should be interpreted as if N2 is the same as N1 for a given segment.
We provide the dataset in two formats: OpenDocument spreadsheet (odt) and plain text (one file for each translation and the English original). Some words were highlighted using different colors during the creation of optimal reference translations; this highlighting and comments are present only in the odt format (some comments refer to row numbers in the odt file). Documents are separated by empty lines and each document starts with a special line containing the document name (e.g. "# upi.205735"), which allows alignment with the original WMT2020 news test. For the segments where N2 translations are missing in the odt format, the respective N1 segments are used instead in the plain-text format.
A richly annotated and genre-diversified language resource, The Prague Dependency Treebank – Consolidated 1.0 (PDT-C 1.0, or PDT-C in short in the sequel) is a consolidated release of the existing PDT-corpora of Czech data, uniformly annotated using the standard PDT scheme. PDT-corpora included in PDT-C: Prague Dependency Treebank (the original PDT contents, written newspaper and journal texts from three genres); Czech part of Prague Czech-English Dependency Treebank (translated financial texts, from English), Prague Dependency Treebank of Spoken Czech (spoken data, including audio and transcripts and multiple speech reconstruction annotation); PDT-Faust (user-generated texts). The difference from the separately published original treebanks can be briefly described as follows: it is published in one package, to allow easier data handling for all the datasets; the data is enhanced with a manual linguistic annotation at the morphological layer and new version of morphological dictionary is enclosed; a common valency lexicon for all four original parts is enclosed. Documentation provides two browsing and editing desktop tools (TrEd and MEd) and the corpus is also available online for searching using PML-TQ.
RobeCzech is a monolingual RoBERTa language representation model trained on Czech data. RoBERTa is a robustly optimized Transformer-based pretraining approach. We show that RobeCzech considerably outperforms equally-sized multilingual and Czech-trained contextualized language representation models, surpasses current state of the art in all five evaluated NLP tasks and reaches state-of-theart results in four of them. The RobeCzech model is released publicly at https://hdl.handle.net/11234/1-3691 and https://huggingface.co/ufal/robeczech-base, both for PyTorch and TensorFlow.
Supplementary materials for the paper “Processing of explicit and implicit contrastive and temporal discourse relations in Czech” (submitted to Discourse Processes)
The SynSemClass 3.5 synonym verb lexicon investigates semantic ‘equivalence’ of verb senses and their valency behavior in parallel Czech-English and German-English language resources, i.e., relates verb meanings with respect to contextually-based verb synonymy.
The Czech lexicon entries are linked to PDT-Vallex (http://hdl.handle.net/11858/00-097C-0000-0023-4338-F), Vallex (http://hdl.handle.net/11234/1-3524), and CzEngVallex (http://hdl.handle.net/11234/1-1512).
The English lexicon entries are linked to EngVallex (http://hdl.handle.net/11858/00-097C-0000-0023-4337-2), CzEngVallex (http://hdl.handle.net/11234/1-1512), FrameNet (https://framenet.icsi.berkeley.edu/fndrupal/), VerbNet (https://uvi.colorado.edu/ and http://verbs.colorado.edu/verbnet/index.html), PropBank (http://propbank.github.io/), Ontonotes (http://clear.colorado.edu/compsem/index.php?page=lexicalresources&sub=ontonotes), and English Wordnet (https://wordnet.princeton.edu/).
The German lexicon entries are linked to Woxikon (https://synonyme.woxikon.de), E-VALBU (https://grammis.ids-mannheim.de/verbvalenz), and GUP (http://alanakbik.github.io/multilingual.html; https://github.com/UniversalDependencies/UD_German-GSD).