Corpus of texts in 12 languages. For each language, we provide one training, one development and one testing set acquired from Wikipedia articles. Moreover, each language dataset contains a (substantially larger) training set collected from (general) Web texts. All sets are disjoint, except for the Wikipedia and Web training sets, which can contain similar sentences. Data are segmented into sentences, which are further tokenized into words.
All data in the corpus contain diacritics. To strip diacritics from them, use the Python script diacritization_stripping.py contained in the attached stripping_diacritics.zip. The script has two modes. We generally recommend the one called uninames, which behaves better for some languages.
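To illustrate what a Unicode-name-based stripping mode might do, here is a minimal sketch. It is not the code of diacritization_stripping.py; the function name strip_diacritics_uninames and the exact strategy (cutting the Unicode character name at " WITH " and looking up the base letter) are assumptions made for illustration. One plausible reason such an approach helps for some languages is that letters like "ł" or "đ" have no canonical decomposition, so decomposition-based stripping leaves them unchanged.

```python
import unicodedata


def strip_diacritics_uninames(text: str) -> str:
    """Hypothetical sketch: map each letter to the base letter named in its Unicode name."""
    # Compose first so that "e" + combining acute becomes a single "é".
    text = unicodedata.normalize("NFC", text)
    out = []
    for ch in text:
        if ch.isalpha():
            # e.g. "LATIN SMALL LETTER R WITH CARON" -> "LATIN SMALL LETTER R"
            name = unicodedata.name(ch, "")
            base, sep, _ = name.partition(" WITH ")
            if sep:
                try:
                    ch = unicodedata.lookup(base)
                except KeyError:
                    pass  # no character with the shortened name; keep the original
        out.append(ch)
    return "".join(out)


print(strip_diacritics_uninames("Příliš žluťoučký kůň"))
```

Combining marks that have no precomposed form are left in place by this sketch; a production script would need to handle them separately.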
The code for training a model based on recurrent neural networks for diacritics restoration is located at https://github.com/arahusky/diacritics_restoration.
The CzEngClass synonym verb lexicon is a result of a project investigating semantic ‘equivalence’ of verb senses and their valency behavior in parallel Czech-English language resources, i.e., relating verb meanings with respect to contextually-based verb synonymy. The lexicon entries are linked to PDT-Vallex (http://hdl.handle.net/11858/00-097C-0000-0023-4338-F), EngVallex (http://hdl.handle.net/11858/00-097C-0000-0023-4337-2), CzEngVallex (http://hdl.handle.net/11234/1-1512), FrameNet (https://framenet.icsi.berkeley.edu/fndrupal/), VerbNet (http://verbs.colorado.edu/verbnet/index.html), PropBank (http://verbs.colorado.edu/%7Empalmer/projects/ace.html), Ontonotes (http://verbs.colorado.edu/html_groupings/), and Czech (http://hdl.handle.net/11858/00-097C-0000-0001-4880-3) and English Wordnets (https://wordnet.princeton.edu/). The dataset also includes a file reflecting the annotators' choices in assigning verbs to classes.
The CzEngClass synonym verb lexicon is a result of a project investigating semantic ‘equivalence’ of verb senses and their valency behavior in parallel Czech-English language resources, i.e., relating verb meanings with respect to contextually-based verb synonymy. The lexicon entries are linked to PDT-Vallex (http://hdl.handle.net/11858/00-097C-0000-0023-4338-F), EngVallex (http://hdl.handle.net/11858/00-097C-0000-0023-4337-2), CzEngVallex (http://hdl.handle.net/11234/1-1512), FrameNet (https://framenet.icsi.berkeley.edu/fndrupal/), VerbNet (http://verbs.colorado.edu/verbnet/index.html), PropBank (http://verbs.colorado.edu/%7Empalmer/projects/ace.html), Ontonotes (http://verbs.colorado.edu/html_groupings/), and Czech (http://hdl.handle.net/11858/00-097C-0000-0001-4880-3) and English Wordnets (https://wordnet.princeton.edu/). The dataset also includes files reflecting the annotators' choices and agreement in assigning verbs to classes.
The CzEngClass synonym verb lexicon is a result of a project investigating semantic ‘equivalence’ of verb senses and their valency behavior in parallel Czech-English language resources, i.e., relating verb meanings with respect to contextually-based verb synonymy. The lexicon entries are linked to PDT-Vallex (http://hdl.handle.net/11858/00-097C-0000-0023-4338-F), EngVallex (http://hdl.handle.net/11858/00-097C-0000-0023-4337-2), CzEngVallex (http://hdl.handle.net/11234/1-1512), FrameNet (https://framenet.icsi.berkeley.edu/fndrupal/), VerbNet (http://verbs.colorado.edu/verbnet/index.html), PropBank (http://verbs.colorado.edu/%7Empalmer/projects/ace.html), Ontonotes (http://verbs.colorado.edu/html_groupings/), and Czech (http://hdl.handle.net/11858/00-097C-0000-0001-4880-3) and English Wordnets (https://wordnet.princeton.edu/).
This machine translation test set contains 2223 Czech sentences collected within the FAUST project (https://ufal.mff.cuni.cz/grants/faust, http://hdl.handle.net/11234/1-3308).
Each original (noisy) sentence was normalized (clean1 and clean2) and translated to English independently by two translators.
Czech morphological dictionary developed originally by Jan Hajič as a spelling checker and lemmatization dictionary. Currently it contains full morphological information for each covered wordform, as well as some derivational, semantic and named entity information.
MorfFlex CZ 2.0 is the Czech morphological dictionary developed originally by Jan Hajič as a spelling checker and lemmatization dictionary. MorfFlex is a flat list of lemma-tag-wordform triples. For each wordform, full inflectional information is coded in a positional tag. Wordforms are organized into entries (paradigm instances, or paradigms for short) according to their formal morphological behavior. Each paradigm (a set of wordforms) is identified by a unique lemma. Apart from traditional morphological categories, the description also contains some semantic, stylistic and derivational information. For more details, see the comprehensive specification of the Czech morphological annotation: http://ufal.mff.cuni.cz/techrep/tr64.pdf
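The flat triple structure described above can be sketched in a few lines of Python. This is an illustration only, under several assumptions that should be checked against the distributed files and the specification: that each line holds one tab-separated triple in the order lemma, tag, wordform; that the first five tag positions encode part of speech, detailed part of speech, gender, number and case; and that "-" marks an unused position. The sample entries and the helper names (read_triples, group_by_lemma, decode_tag) are made up for the example.

```python
import io
from collections import defaultdict

# Assumed meaning of the first five positions of the positional tag.
TAG_POSITIONS = ("pos", "subpos", "gender", "number", "case")


def read_triples(lines):
    """Yield (lemma, tag, wordform) from tab-separated lines (assumed column order)."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        lemma, tag, form = line.split("\t")
        yield lemma, tag, form


def group_by_lemma(triples):
    """Collect the wordforms of each paradigm under its lemma."""
    paradigms = defaultdict(list)
    for lemma, tag, form in triples:
        paradigms[lemma].append((form, tag))
    return dict(paradigms)


def decode_tag(tag):
    """Name the leading tag positions, skipping positions marked '-'."""
    return {name: value for name, value in zip(TAG_POSITIONS, tag) if value != "-"}


# Illustrative sample, not copied from the dictionary.
sample = (
    "žena\tNNFS1-----A----\tžena\n"
    "žena\tNNFS2-----A----\tženy\n"
    "žena\tNNFP3-----A----\tženám\n"
)
paradigms = group_by_lemma(read_triples(io.StringIO(sample)))
for form, tag in paradigms["žena"]:
    print(form, decode_tag(tag))
```

Grouping by lemma recovers the paradigm view from the flat list, which is convenient for lookups by lemma; for the full dictionary a streaming pass may be preferable to holding everything in memory.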
The original SDP 2014 and 2015 data collections were made available under task-specific ‘evaluation’ licenses to registered SemEval participants. In mid-2016, all original data were bundled with system submissions, supporting software, an additional SDP-style collection of semantic dependency graphs, and additional background material (from which some of the SDP target representations were derived) for release through the Linguistic Data Consortium (with LDC catalogue number LDC2016T10).
One of the four English target representations (viz. DM) and the entire Czech data (in the PSD target representation) are not derivative of LDC-licensed annotations and, thus, can be made available for direct download (Open SDP; version 1.1; April 2016) under a more permissive licensing scheme, viz. the Creative Commons Attribution-NonCommercial-ShareAlike license. This package also includes some ‘richer’ meaning representations from which the English bi-lexical DM graphs derive, viz. scope-underspecified logical forms and more abstract, non-lexicalized ‘semantic networks’. The latter of these are formally (if not linguistically) similar to Abstract Meaning Representation (AMR) and are available in a range of serializations, including in AMR-like syntax.
Please use the following bibliographic reference for the SDP 2016 data:
@string{C:LREC = {{I}nternational {C}onference on
{L}anguage {R}esources and {E}valuation}}
@string{LREC:16 = {Proceedings of the 10th } # C:LREC}
@string{L:LREC:16 = {Portoro\v{z}, Slovenia}}
@inproceedings{Oep:Kuh:Miy:16,
author = {Oepen, Stephan and Kuhlmann, Marco and Miyao, Yusuke
and Zeman, Daniel and Cinkov{\'a}, Silvie
and Flickinger, Dan and Haji\v{c}, Jan
and Ivanova, Angelina and Ure\v{s}ov{\'a}, Zde\v{n}ka},
title = {Towards Comparability of Linguistic Graph Banks for Semantic Parsing},
booktitle = LREC:16,
year = 2016,
address = L:LREC:16,
pages = {3991--3995}
}
The original SDP 2014 and 2015 data collections were made available under task-specific ‘evaluation’ licenses to registered SemEval participants. In mid-2016, all original data were bundled with system submissions, supporting software, an additional SDP-style collection of semantic dependency graphs, and additional background material (from which some of the SDP target representations were derived) for release through the Linguistic Data Consortium (with LDC catalogue number LDC2016T10).
One of the four English target representations (viz. DM) and the entire Czech data (in the PSD target representation) are not derivative of LDC-licensed annotations and, thus, can be made available for direct download (Open SDP; version 1.2; January 2017) under a more permissive licensing scheme, viz. the Creative Commons Attribution-NonCommercial-ShareAlike license. This package also includes some ‘richer’ meaning representations from which the English bi-lexical DM graphs derive, viz. scope-underspecified logical forms and more abstract, non-lexicalized ‘semantic networks’. The latter of these are formally (if not linguistically) similar to Abstract Meaning Representation (AMR) and are available in a range of serializations, including in AMR-like syntax.
Version 1.1 was released in April 2016. Version 1.2 adds the 2015 Turku system, which was accidentally left out of version 1.1.
Please use the following bibliographic reference for the SDP 2016 data:
@string{C:LREC = {{I}nternational {C}onference on
{L}anguage {R}esources and {E}valuation}}
@string{LREC:16 = {Proceedings of the 10th } # C:LREC}
@string{L:LREC:16 = {Portoro\v{z}, Slovenia}}
@inproceedings{Oep:Kuh:Miy:16,
author = {Oepen, Stephan and Kuhlmann, Marco and Miyao, Yusuke
and Zeman, Daniel and Cinkov{\'a}, Silvie
and Flickinger, Dan and Haji\v{c}, Jan
and Ivanova, Angelina and Ure\v{s}ov{\'a}, Zde\v{n}ka},
title = {Towards Comparability of Linguistic Graph Banks for Semantic Parsing},
booktitle = LREC:16,
year = 2016,
address = L:LREC:16,
pages = {3991--3995}
}
The valency lexicon PDT-Vallex has been built in close connection with the annotation of the Prague Dependency Treebank project (PDT) and its successors (mainly the Prague Czech-English Dependency Treebank project, PCEDT). It contains over 11000 valency frames for more than 7000 verbs which occurred in the PDT or PCEDT. It is available in an electronically processable format (XML) together with the aforementioned treebanks (to be viewed and edited by TrEd, the PDT/PCEDT main annotation tool), and also in a more human-readable form including corpus examples (see the WEBSITE link below). The main feature of the lexicon is its linking to the annotated corpora: each occurrence of each verb is linked to the appropriate valency frame, with additional (generalized) information about its usage and surface morphosyntactic form alternatives.