This package contains data sets for development (Section dev) and testing (Section test) of machine translation of sentences from summaries of medical articles between Czech, English, French, German, Hungarian, Polish, Spanish
and Swedish. Version 2.0 extends the previous version by adding Hungarian, Polish, Spanish, and Swedish translations.
This dataset contains annotation of PDT using Czech WordNet ontology: http://hdl.handle.net/11858/00-097C-0000-0001-4880-3
Data is stored in PML format. This is a stand-off annotation and for most use cases it requires PDT 2.0 and the Czech WordNet 1.9 PDT that we have used for annotation. and 1ET100300517, 1ET201120505
Mapping table for the article Hajič et al., 2024: Mapping Czech Verbal Valency to PropBank Argument Labels, in LREC-COLING 2024, as preprocess by the algorithm described in the paper. This dataset i smeant for verification (replicatoin) purposes only. It will b manually processed further to arrive at a workable CzezchpropBank, to be used in Czech UMR annotation, to be further updated during the annotation. The resulting PropBank frame files fir Czech are expected to be available with some future releases of UMR, containing Czech UMR annotation, or separately.
Czech morphological dictionary developed originally by Jan Hajič as a spelling checker and lemmatization dictionary. Currently it contains full morphological information for each covered wordform, as well as some derivational, semantic and named entity information.
Czech morphological dictionary developed originally by Jan Hajič as a spelling checker and lemmatization dictionary. Currently it contains full morphological information for each covered wordform, as well as some derivational, semantic and named entity information.
Czech morphological dictionary developed originally by Jan Hajič as a spelling checker and lemmatization dictionary. Currently it contains full morphological information for each covered wordform, as well as some derivational, semantic and named entity information.
MorfFlex CZ 2.0 is the Czech morphological dictionary developed originally by Jan Hajič as a spelling checker and lemmatization dictionary. MorfFlex is a flat list of lemma-tag-wordform triples. For each wordform, full inflectional information is coded in a positional tag. Wordforms are organized into entries (paradigm instances or paradigms in short) according to their formal morphological behavior. The paradigm (set of wordforms) is identified by a unique lemma. Apart from traditional morphological categories, the description also contains some semantic, stylistic and derivational information. For more details see a comprehensive specification of the Czech morphological annotation http://ufal.mff.cuni.cz/techrep/tr64.pdf .
Slovak morphological dictionary modeled after the Czech one. It consists of (word form, lemma, POS tag) triples, reusing the Czech morphological system for POS tags and lemma descriptions.
This multilingual resource contains corpora for 14 languages, gathered at the occasion of the 1.2 edition of the PARSEME Shared Task on semi-supervised Identification of Verbal MWEs (2020). These corpora were meant to serve as additional "raw" corpora, to help discovering unseen verbal MWEs.
The corpora are provided in CONLL-U (https://universaldependencies.org/format.html) format. They contain morphosyntactic annotations (parts of speech, lemmas, morphological features, and syntactic dependencies). Depending on the language, the information comes from treebanks (mostly Universal Dependencies v2.x) or from automatic parsers trained on UD v2.x treebanks (e.g., UDPipe).
VMWEs include idioms (let the cat out of the bag), light-verb constructions (make a decision), verb-particle constructions (give up), inherently reflexive verbs (help oneself), and multi-verb constructions (make do).
For the 1.2 shared task edition, the data covers 14 languages, for which VMWEs were annotated according to the universal guidelines. The corpora are provided in the cupt format, inspired by the CONLL-U format.
Morphological and syntactic information – not necessarily using UD tagsets – including parts of speech, lemmas, morphological features and/or syntactic dependencies are also provided. Depending on the language, the information comes from treebanks (e.g., Universal Dependencies) or from automatic parsers trained on treebanks (e.g., UDPipe).
This item contains training, development and test data, as well as the evaluation tools used in the PARSEME Shared Task 1.2 (2020). The annotation guidelines are available online: http://parsemefr.lif.univ-mrs.fr/parseme-st-guidelines/1.2
This dataset adds annotation of multiword expressions and multiword named entities to the original PDT 2.0 data. The annotation is stand-off, stored in the same PML format as the original PDT 2.0 data. It is to be used together with the PDT 2.0. and grant 1ET201120505 of the Academy of Sciences of the Czech Republic and grant MSM0021620838 of the Ministry of Youth, Education and Sport of The Czech Republic