dc.contributor.author Nedoluzhko, Anna
dc.contributor.author Novák, Michal
dc.contributor.author Popel, Martin
dc.contributor.author Žabokrtský, Zdeněk
dc.contributor.author Zeldes, Amir
dc.contributor.author Zeman, Daniel
dc.contributor.author Bourgonje, Peter
dc.contributor.author Cinková, Silvie
dc.contributor.author Hajič, Jan
dc.contributor.author Hardmeier, Christian
dc.contributor.author Krielke, Pauline
dc.contributor.author Landragin, Frédéric
dc.contributor.author Lapshinova-Koltunski, Ekaterina
dc.contributor.author Martí, M. Antònia
dc.contributor.author Mikulová, Marie
dc.contributor.author Ogrodniczuk, Maciej
dc.contributor.author Recasens, Marta
dc.contributor.author Stede, Manfred
dc.contributor.author Straka, Milan
dc.contributor.author Toldova, Svetlana
dc.contributor.author Vincze, Veronika
dc.contributor.author Žitkus, Voldemaras
dc.date.accessioned 2022-04-06T12:53:57Z
dc.date.available 2022-04-06T12:53:57Z
dc.date.issued 2022-04-06
dc.identifier.uri http://hdl.handle.net/11234/1-4698
dc.description CorefUD is a collection of previously existing datasets annotated with coreference, which we converted into a common annotation scheme. In total, CorefUD in its current version 1.0 consists of 17 datasets for 11 languages. The datasets are enriched with automatic morphological and syntactic annotations that are fully compliant with the standards of the Universal Dependencies project. All the datasets are stored in the CoNLL-U format, with coreference- and bridging-specific information captured by attribute-value pairs located in the MISC column. The collection is divided into a public edition and a non-public (ÚFAL-internal) edition. The publicly available edition is distributed via LINDAT-CLARIAH-CZ and contains 13 datasets for 10 languages (1 dataset for Catalan, 2 for Czech, 2 for English, 1 for French, 2 for German, 1 for Hungarian, 1 for Lithuanian, 1 for Polish, 1 for Russian, and 1 for Spanish), excluding the test data. The non-public edition is available internally to ÚFAL members and contains 4 additional datasets for 2 languages (1 dataset for Dutch and 3 for English), which we are not allowed to distribute due to their original license limitations. It also contains the test data portions for all datasets. When using any of the harmonized datasets, please read its license (placed in the same directory as the data) and cite the original data resource too. Version 1.0 consists of the same corpora and languages as the previous version 0.2; however, the English GUM dataset has been updated to a newer and larger version, and in the Czech/English PCEDT dataset, the train-dev-test split has been changed to be compatible with OntoNotes. Nevertheless, the main change is in the file format (the MISC attributes have a new form and interpretation).
dc.language.iso cat
dc.language.iso ces
dc.language.iso nld
dc.language.iso eng
dc.language.iso fra
dc.language.iso deu
dc.language.iso hun
dc.language.iso lit
dc.language.iso pol
dc.language.iso rus
dc.language.iso spa
dc.publisher Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
dc.relation info:eu-repo/grantAgreement/EC/H2020/825303
dc.relation.replaces http://hdl.handle.net/11234/1-4598
dc.relation.isreplacedby http://hdl.handle.net/11234/1-5053
dc.rights Licence CorefUD v0.2
dc.rights.uri https://lindat.mff.cuni.cz/repository/xmlui/page/license-corefud-0.2
dc.source.uri https://ufal.mff.cuni.cz/corefud
dc.subject dependency
dc.subject treebank
dc.subject coreference
dc.subject bridging relations
dc.subject harmonized annotation
dc.title Coreference in Universal Dependencies 1.0 (CorefUD 1.0)
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
dc.rights.label PUB
has.files yes
branding LINDAT / CLARIAH-CZ
contact.person Daniel Zeman zeman@ufal.mff.cuni.cz Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
sponsor Ministerstvo školství, mládeže a tělovýchovy České republiky LM2018101 LINDAT/CLARIAH-CZ: Digitální výzkumná infrastruktura pro jazykové technologie, umění a humanitní vědy nationalFunds
sponsor Grantová agentura České republiky GX20-16819X LUSyD – Language Understanding: from Syntax to Discourse nationalFunds
sponsor Grantová agentura České Republiky 19-14534S Popis slovotvorné struktury českých slov na základě jazykových dat nationalFunds
sponsor European Union EC/H2020/825303 Bergamot - Browser-based Multilingual Translation euFunds info:eu-repo/grantAgreement/EC/H2020/825303
size.info 194344 sentences
size.info 4061606 words
size.info 4112513 tokens
files.size 73320262
files.count 1
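
The description above states that coreference and bridging information is stored as attribute-value pairs in the MISC column of the CoNLL-U files. The following is a minimal sketch of how such pairs could be read, assuming only the standard CoNLL-U layout (10 tab-separated columns, MISC last, pairs separated by "|"). The helper names parse_misc and read_misc and the sample sentence, including the Entity=(e1) value, are illustrative assumptions and not taken from the data; the actual attributes are documented in doc/corefud-1.0-format.pdf inside the archive.

```python
def parse_misc(misc_field):
    """Split a CoNLL-U MISC field into a dict of attribute-value pairs.

    An underscore means the field is empty. Items without "=" are kept
    as keys with the value None.
    """
    if misc_field == "_":
        return {}
    attrs = {}
    for item in misc_field.split("|"):
        key, sep, value = item.partition("=")
        attrs[key] = value if sep else None
    return attrs


def read_misc(lines):
    """Yield (token_id, form, misc_dict) for every token line.

    Comment lines (starting with "#") and blank sentence separators
    are skipped.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError("expected 10 columns, got %d" % len(cols))
        yield cols[0], cols[1], parse_misc(cols[9])


# Hypothetical sample sentence; the attribute values are illustrative only.
sample = [
    "# sent_id = example-1",
    "1\tMary\tMary\tPROPN\t_\t_\t2\tnsubj\t_\tEntity=(e1)",
    "2\tsleeps\tsleep\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No",
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_",
    "",
]
tokens = list(read_misc(sample))
```

To process a real file, an open file handle can be passed in place of the sample list, for example read_misc(open(path, encoding="utf-8")), where path points to one of the .conllu files listed below.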


 Files in this item

This item is Publicly Available and licensed under: Licence CorefUD v0.2
Distributed under Creative Commons
Name CorefUD-1.0-public.zip
Size 69.92 MB
Format application/zip
Description data
MD5 f1a4e2301bdc5e3896546c22c6a94852
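
The MD5 value listed for the archive can be used to check that a download is intact. The following is a minimal sketch using Python's standard hashlib module. The function names md5_of_file and verify are hypothetical, and the local path in the usage note is an assumption about where the file was saved.

```python
import hashlib

# MD5 checksum published in this record for CorefUD-1.0-public.zip.
EXPECTED_MD5 = "f1a4e2301bdc5e3896546c22c6a94852"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()
```

For example, verify("CorefUD-1.0-public.zip") would return True for an intact copy saved under that name in the current directory. MD5 is suitable here only as a check against transfer errors, not as protection against deliberate tampering.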
 File Preview  
  • CorefUD-1.0-public
    • data
      • CorefUD_Polish-PCC
        • README.md (1 kB)
        • pl_pcc-corefud-dev.conllu (6 MB)
        • LICENSE.txt (19 kB)
        • pl_pcc-corefud-train.conllu (48 MB)
      • CorefUD_Czech-PCEDT
        • cs_pcedt-corefud-train.conllu (105 MB)
        • cs_pcedt-corefud-dev.conllu (18 MB)
        • README.md (1 kB)
        • LICENSE.txt (21 kB)
      • CorefUD_French-Democrat
        • README.md (2 kB)
        • fr_democrat-corefud-dev.conllu (1 MB)
        • fr_democrat-corefud-train.conllu (14 MB)
        • LICENSE.txt (19 kB)
      • CorefUD_Lithuanian-LCC
        • README.md (1 kB)
        • lt_lcc-corefud-train.conllu (2 MB)
        • LICENSE.txt (1 kB)
        • lt_lcc-corefud-dev.conllu (298 kB)
      • CorefUD_German-PotsdamCC
        • de_potsdamcc-corefud-train.conllu (2 MB)
        • README.md (1 kB)
        • de_potsdamcc-corefud-dev.conllu (365 kB)
        • LICENSE.txt (20 kB)
      • CorefUD_English-ParCorFull
        • en_parcorfull-corefud-dev.conllu (69 kB)
        • README.md (2 kB)
        • en_parcorfull-corefud-train.conllu (536 kB)
        • LICENSE.txt (18 kB)
      • CorefUD_German-ParCorFull
        • README.md (2 kB)
        • LICENSE.txt (18 kB)
        • de_parcorfull-corefud-dev.conllu (88 kB)
        • de_parcorfull-corefud-train.conllu (692 kB)
      • CorefUD_Czech-PDT
        • README.md (1 kB)
        • cs_pdt-corefud-train.conllu (78 MB)
        • cs_pdt-corefud-dev.conllu (10 MB)
        • LICENSE.txt (20 kB)
      • CorefUD_Russian-RuCor
        • ru_rucor-corefud-train.conllu (10 MB)
        • README.md (1 kB)
        • LICENSE.txt (19 kB)
        • ru_rucor-corefud-dev.conllu (1 MB)
      • CorefUD_English-GUM
        • en_gum-corefud-train.conllu (10 MB)
        • README.md (1 kB)
        • en_gum-corefud-dev.conllu (1 MB)
        • LICENSE.txt (3 kB)
      • CorefUD_Spanish-AnCora
        • README.md (2 kB)
        • es_ancora-corefud-train.conllu (36 MB)
        • es_ancora-corefud-dev.conllu (4 MB)
        • LICENSE.txt (189 B)
      • CorefUD_Catalan-AnCora
        • ca_ancora-corefud-train.conllu (33 MB)
        • README.md (2 kB)
        • ca_ancora-corefud-dev.conllu (4 MB)
        • LICENSE.txt (189 B)
      • CorefUD_Hungarian-SzegedKoref
        • hu_szegedkoref-corefud-train.conllu (9 MB)
        • README.md (1 kB)
        • LICENSE.txt (18 kB)
        • hu_szegedkoref-corefud-dev.conllu (1 MB)
    • doc
      • corefud-1.0-format.pdf (160 kB)
      • README.txt (8 kB)
