CorefUD is a collection of previously existing datasets annotated with coreference, which we converted into a common annotation scheme. In total, CorefUD in its current version 1.0 consists of 17 datasets for 11 languages. The datasets are enriched with automatic morphological and syntactic annotations that are fully compliant with the standards of the Universal Dependencies project. All the datasets are stored in the CoNLL-U format, with coreference- and bridging-specific information captured by attribute-value pairs located in the MISC column. The collection is divided into a public edition and a non-public (ÚFAL-internal) edition. The publicly available edition is distributed via LINDAT-CLARIAH-CZ and contains 13 datasets for 10 languages (1 dataset for Catalan, 2 for Czech, 2 for English, 1 for French, 2 for German, 1 for Hungarian, 1 for Lithuanian, 1 for Polish, 1 for Russian, and 1 for Spanish), excluding the test data. The non-public edition is available internally to ÚFAL members and contains additional 4 datasets for 2 languages (1 dataset for Dutch, and 3 for English), which we are not allowed to distribute due to their original license limitations. It also contains the test data portions for all datasets. When using any of the harmonized datasets, please get acquainted with its license (placed in the same directory as the data) and cite the original data resource too. Version 1.0 consists of the same corpora and languages as the previous version 0.2; however, the English GUM dataset has been updated to a newer and larger version, and in the Czech/English PCEDT dataset, the train-dev-test split has been changed to be compatible with OntoNotes. Nevertheless, the main change is in the file format (the MISC attributes have new form and interpretation).
CorefUD is a collection of previously existing datasets annotated with coreference, which we converted into a common annotation scheme. In total, CorefUD in its current version 1.1 consists of 21 datasets for 13 languages. The datasets are enriched with automatic morphological and syntactic annotations that are fully compliant with the standards of the Universal Dependencies project. All the datasets are stored in the CoNLL-U format, with coreference- and bridging-specific information captured by attribute-value pairs located in the MISC column. The collection is divided into a public edition and a non-public (ÚFAL-internal) edition. The publicly available edition is distributed via LINDAT-CLARIAH-CZ and contains 17 datasets for 12 languages (1 dataset for Catalan, 2 for Czech, 2 for English, 1 for French, 2 for German, 2 for Hungarian, 1 for Lithuanian, 2 for Norwegian, 1 for Polish, 1 for Russian, 1 for Spanish, and 1 for Turkish), excluding the test data. The non-public edition is available internally to ÚFAL members and contains additional 4 datasets for 2 languages (1 dataset for Dutch, and 3 for English), which we are not allowed to distribute due to their original license limitations. It also contains the test data portions for all datasets. When using any of the harmonized datasets, please get acquainted with its license (placed in the same directory as the data) and cite the original data resource too. Compared to the previous version 1.0, the version 1.1 comprises new languages and corpora, namely Hungarian-KorKor, Norwegian-BokmaalNARC, Norwegian-NynorskNARC, and Turkish-ITCC. In addition, the English GUM dataset has been updated to a newer and larger version, and the conversion pipelines for most datasets have been refined (a list of all changes in each dataset can be found in the corresponding README file).
CorefUD is a collection of previously existing datasets annotated with coreference, which we converted into a common annotation scheme. In total, CorefUD in its current version 1.2 consists of 25 datasets for 16 languages. The datasets are enriched with automatic morphological and syntactic annotations that are fully compliant with the standards of the Universal Dependencies project. All the datasets are stored in the CoNLL-U format, with coreference- and bridging-specific information captured by attribute-value pairs located in the MISC column. The collection is divided into a public edition and a non-public (ÚFAL-internal) edition. The publicly available edition is distributed via LINDAT-CLARIAH-CZ and contains 21 datasets for 15 languages (1 dataset for Ancient Greek, 1 for Ancient Hebrew, 1 for Catalan, 2 for Czech, 3 for English, 1 for French, 2 for German, 2 for Hungarian, 1 for Lithuanian, 2 for Norwegian, 1 for Old Church Slavonic, 1 for Polish, 1 for Russian, 1 for Spanish, and 1 for Turkish), excluding the test data. The non-public edition is available internally to ÚFAL members and contains additional 4 datasets for 2 languages (1 dataset for Dutch, and 3 for English), which we are not allowed to distribute due to their original license limitations. It also contains the test data portions for all datasets. When using any of the harmonized datasets, please get acquainted with its license (placed in the same directory as the data) and cite the original data resource, too. Compared to the previous version 1.1, the version 1.2 comprises new languages and corpora, namely Ancient_Greek-PROIEL, Ancient_Hebrew-PTNK, English-LitBank, and Old_Church_Slavonic-PROIEL. In addition, English-GUM and Turkish-ITCC have been updated to newer versions, conversion of zeros in Polish-PCC has been improved, and the conversion pipelines for multiple other datasets have been refined (a list of all changes in each dataset can be found in the corresponding README file).
The original SDP 2014 and 2015 data collections were made available under task-specific ‘evaluation’ licenses to registered SemEval participants. In mid-2016, all original data has been bundled with system submissions, supporting software, an additional SDP-style collection of semantic dependency graphs, and additional background material (from which some of the SDP target representations were derived) for release through the Linguistic Data Consortium (with LDC catalogue number LDC2016 T10).
One of the four English target representations (viz. DM) and the entire Czech data (in the PSD target representation) are not derivative of LDC-licensed annotations and, thus, can be made available for direct download (Open SDP; version 1.1; April 2016) under a more permissive licensing scheme, viz. the Creative Common Attribution-NonCommercial-ShareAlike license. This package also includes some ‘richer’ meaning representations from which the English bi-lexical DM graphs derive, viz. scope-underspecified logical forms and more abstract, non-lexicalized ‘semantic networks’. The latter of these are formally (if not linguistically) similar to Abstract Meaning Representation (AMR) and are available in a range of serializations, including in AMR-like syntax.
Please use the following bibliographic reference for the SDP 2016 data:
@string{C:LREC = {{I}nternational {C}onference on
{L}anguage {R}esources and {E}valuation}}
@string{LREC:16 = {Proceedings of the 10th } # C:LREC}
@string{L:LREC:16 = {Portoro\v{z}, Slovenia}}
@inproceedings{Oep:Kuh:Miy:16,
author = {Oepen, Stephan and Kuhlmann, Marco and Miyao, Yusuke
and Zeman, Daniel and Cinkov{\'a}, Silvie
and Flickinger, Dan and Haji\v{c}, Jan
and Ivanova, Angelina and Ure\v{s}ov{\'a}, Zde\v{n}ka},
title = {Towards Comparability of Linguistic Graph Banks for Semantic Parsing},
booktitle = LREC:16
year = 2016,
address = L:LREC:16,
pages = {3991--3995}
}
The original SDP 2014 and 2015 data collections were made available under task-specific ‘evaluation’ licenses to registered SemEval participants. In mid-2016, all original data has been bundled with system submissions, supporting software, an additional SDP-style collection of semantic dependency graphs, and additional background material (from which some of the SDP target representations were derived) for release through the Linguistic Data Consortium (with LDC catalogue number LDC2016 T10).
One of the four English target representations (viz. DM) and the entire Czech data (in the PSD target representation) are not derivative of LDC-licensed annotations and, thus, can be made available for direct download (Open SDP; version 1.2; January 2017) under a more permissive licensing scheme, viz. the Creative Common Attribution-NonCommercial-ShareAlike license. This package also includes some ‘richer’ meaning representations from which the English bi-lexical DM graphs derive, viz. scope-underspecified logical forms and more abstract, non-lexicalized ‘semantic networks’. The latter of these are formally (if not linguistically) similar to Abstract Meaning Representation (AMR) and are available in a range of serializations, including in AMR-like syntax.
Version 1.1 was released April 2016. Version 1.2 adds the 2015 Turku system, which was accidentally left out from version 1.1.
Please use the following bibliographic reference for the SDP 2016 data:
@string{C:LREC = {{I}nternational {C}onference on
{L}anguage {R}esources and {E}valuation}}
@string{LREC:16 = {Proceedings of the 10th } # C:LREC}
@string{L:LREC:16 = {Portoro\v{z}, Slovenia}}
@inproceedings{Oep:Kuh:Miy:16,
author = {Oepen, Stephan and Kuhlmann, Marco and Miyao, Yusuke
and Zeman, Daniel and Cinkov{\'a}, Silvie
and Flickinger, Dan and Haji\v{c}, Jan
and Ivanova, Angelina and Ure\v{s}ov{\'a}, Zde\v{n}ka},
title = {Towards Comparability of Linguistic Graph Banks for Semantic Parsing},
booktitle = LREC:16
year = 2016,
address = L:LREC:16,
pages = {3991--3995}
}
Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. The annotation scheme is based on (universal) Stanford dependencies (de Marneffe et al., 2006, 2008, 2014), Google universal part-of-speech tags (Petrov et al., 2012), and the Interset interlingua for morphosyntactic tagsets (Zeman, 2008).
Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. The annotation scheme is based on (universal) Stanford dependencies (de Marneffe et al., 2006, 2008, 2014), Google universal part-of-speech tags (Petrov et al., 2012), and the Interset interlingua for morphosyntactic tagsets (Zeman, 2008). This is the second release of UD Treebanks, Version 1.1.
Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. The annotation scheme is based on (universal) Stanford dependencies (de Marneffe et al., 2006, 2008, 2014), Google universal part-of-speech tags (Petrov et al., 2012), and the Interset interlingua for morphosyntactic tagsets (Zeman, 2008).
Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. The annotation scheme is based on (universal) Stanford dependencies (de Marneffe et al., 2006, 2008, 2014), Google universal part-of-speech tags (Petrov et al., 2012), and the Interset interlingua for morphosyntactic tagsets (Zeman, 2008).
Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. The annotation scheme is based on (universal) Stanford dependencies (de Marneffe et al., 2006, 2008, 2014), Google universal part-of-speech tags (Petrov et al., 2012), and the Interset interlingua for morphosyntactic tagsets (Zeman, 2008).