CoNLL 2017 and 2018 shared tasks:
Multilingual Parsing from Raw Text to Universal Dependencies
This package contains the test data in the form in which they ware presented
to the participating systems: raw text files and files preprocessed by UDPipe.
The metadata.json files contain lists of files to process and to output;
README files in the respective folders describe the syntax of metadata.json.
For full training, development and gold standard test data, see
Universal Dependencies 2.0 (CoNLL 2017)
Universal Dependencies 2.2 (CoNLL 2018)
See the download links at http://universaldependencies.org/.
For more information on the shared tasks, see
http://universaldependencies.org/conll17/
http://universaldependencies.org/conll18/
Contents:
conll17-ud-test-2017-05-09 ... CoNLL 2017 test data
conll18-ud-test-2018-05-06 ... CoNLL 2018 test data
conll18-ud-test-2018-05-06-for-conll17 ... CoNLL 2018 test data with metadata
and filenames modified so that it is digestible by the 2017 systems.
Supplementary files for a comparative study of word-formation without the addition of derivational affixes (conversion) in English and Czech.
The two .csv files contain 300 verb-noun conversion pairs in English and 300 verb-noun conversion pairs in Czech, i.e. pairs where either the noun is created from the verb or the verb is created from the noun without the use of derivational affixes. In English, the noun and verb in the conversion pair have the same form. In Czech, the noun and verb in the conversion pair differ in inflectional affixes.
The pairs are supplied with manual semantic annotation based on cognitive event schemata.
A file with the Appendix includes a list of dictionary definition phrases used as a basis for the semantic annotation.
Universal Segmentations (UniSegments) is a collection of lexical resources capturing morphological segmentations harmonised into a cross-linguistically consistent annotation scheme for many languages. The annotation scheme consists of simple tab-separated columns that stores a word and its morphological segmentations, including pieces of information about the word and the segmented units, e.g., part-of-speech categories, type of morphs/morphemes etc. The current public version of the collection contains 38 harmonised segmentation datasets covering 30 different languages.