This is a Czech Named Entity Corpus 2.0 transformed into the CoNLL format. The original corpus can be downloaded from: http://hdl.handle.net/11858/00-097C-0000-0023-1B22-8. The CoNLL transformation is described in this publication: https://link.springer.com/chapter/10.1007/978-3-642-40585-3_20.
CorefUD is a collection of previously existing datasets annotated with coreference, which we converted into a common annotation scheme. In total, CorefUD in its current version 1.0 consists of 17 datasets for 11 languages. The datasets are enriched with automatic morphological and syntactic annotations that are fully compliant with the standards of the Universal Dependencies project. All the datasets are stored in the CoNLL-U format, with coreference- and bridging-specific information captured by attribute-value pairs located in the MISC column. The collection is divided into a public edition and a non-public (ÚFAL-internal) edition. The publicly available edition is distributed via LINDAT-CLARIAH-CZ and contains 13 datasets for 10 languages (1 dataset for Catalan, 2 for Czech, 2 for English, 1 for French, 2 for German, 1 for Hungarian, 1 for Lithuanian, 1 for Polish, 1 for Russian, and 1 for Spanish), excluding the test data. The non-public edition is available internally to ÚFAL members and contains additional 4 datasets for 2 languages (1 dataset for Dutch, and 3 for English), which we are not allowed to distribute due to their original license limitations. It also contains the test data portions for all datasets. When using any of the harmonized datasets, please get acquainted with its license (placed in the same directory as the data) and cite the original data resource too. Version 1.0 consists of the same corpora and languages as the previous version 0.2; however, the English GUM dataset has been updated to a newer and larger version, and in the Czech/English PCEDT dataset, the train-dev-test split has been changed to be compatible with OntoNotes. Nevertheless, the main change is in the file format (the MISC attributes have new form and interpretation).
CorefUD is a collection of previously existing datasets annotated with coreference, which we converted into a common annotation scheme. In total, CorefUD in its current version 1.2 consists of 25 datasets for 16 languages. The datasets are enriched with automatic morphological and syntactic annotations that are fully compliant with the standards of the Universal Dependencies project. All the datasets are stored in the CoNLL-U format, with coreference- and bridging-specific information captured by attribute-value pairs located in the MISC column. The collection is divided into a public edition and a non-public (ÚFAL-internal) edition. The publicly available edition is distributed via LINDAT-CLARIAH-CZ and contains 21 datasets for 15 languages (1 dataset for Ancient Greek, 1 for Ancient Hebrew, 1 for Catalan, 2 for Czech, 3 for English, 1 for French, 2 for German, 2 for Hungarian, 1 for Lithuanian, 2 for Norwegian, 1 for Old Church Slavonic, 1 for Polish, 1 for Russian, 1 for Spanish, and 1 for Turkish), excluding the test data. The non-public edition is available internally to ÚFAL members and contains additional 4 datasets for 2 languages (1 dataset for Dutch, and 3 for English), which we are not allowed to distribute due to their original license limitations. It also contains the test data portions for all datasets. When using any of the harmonized datasets, please get acquainted with its license (placed in the same directory as the data) and cite the original data resource, too. Compared to the previous version 1.1, the version 1.2 comprises new languages and corpora, namely Ancient_Greek-PROIEL, Ancient_Hebrew-PTNK, English-LitBank, and Old_Church_Slavonic-PROIEL. In addition, English-GUM and Turkish-ITCC have been updated to newer versions, conversion of zeros in Polish-PCC has been improved, and the conversion pipelines for multiple other datasets have been refined (a list of all changes in each dataset can be found in the corresponding README file).
Czech models for MorphoDiTa, providing morphological analysis, morphological generation and part-of-speech tagging.
The morphological dictionary is created from MorfFlex CZ 2.0, DeriNet 2.1 and the PoS tagger is trained on Prague Dependency Treebank - Consolidated 1.0. and This work has been using language resources developed and/or stored and/or distributed by the LINDAT/CLARIN project of the Ministry of Education of the Czech Republic (project LM2010013).
The Czech morphologic system was devised by Jan Hajič.
The MorfFlex CZ dictionary was created by Jan Hajič and Jaroslava Hlaváčová.
The morphologic guesser research was supported by the projects 1ET101120503 and 1ET101120413 of Academy of Sciences of the Czech Republic and 100008/2008 of Charles University Grant Agency. The research was performed by Jan Hajič, Jaroslava Hlaváčová and David Kolovratník.
The tagger algorithm and feature set research was supported by the projects MSM0021620838 and LC536 of Ministry of Education, Youth and Sports of the Czech Republic, GA405/09/0278 of the Grant Agency of the Czech Republic and 1ET101120503 of Academy of Sciences of the Czech Republic. The research was performed by Drahomíra "johanka" Spoustová, Jan Hajič, Jan Raab and Miroslav Spousta.
The tagger is trained on morphological layer of Prague Dependency Treebank PDT 2.5, which was supported by the projects LM2010013, LC536, LN00A063 and MSM0021620838 of Ministry of Education, Youth and Sports of the Czech Republic, and developed by Martin Buben, Jan Hajič, Jiří Hana, Hana Hanová, Barbora Hladká, Emil Jeřábek, Lenka Kebortová, Kristýna Kupková, Pavel Květoň, Jiří Mírovský, Andrea Pfimpfrová, Jan Štěpánek and Daniel Zeman.
The Czech translation of SQuAD 2.0 and SQuAD 1.1 datasets contains automatically translated texts, questions and answers from the training set and the development set of the respective datasets.
The test set is missing, because it is not publicly available.
The data is released under the CC BY-NC-SA 4.0 license.
If you use the dataset, please cite the following paper (the exact format was not available during the submission of the dataset): Kateřina Macková and Straka Milan: Reading Comprehension in Czech via Machine Translation and Cross-lingual Transfer, presented at TSD 2020, Brno, Czech Republic, September 8-11 2020.
POS Tagger and Lemmatizer models for EvaLatin2020 data (https://github.com/CIRCSE/LT4HALA). The model documentation including performance can be found at https://ufal.mff.cuni.cz/udpipe/2/models#evalatin20_models .
To use these models, you need UDPipe version at least 2.0, which you can download from https://ufal.mff.cuni.cz/udpipe/2 .
Grammar Error Correction Corpus for Czech (GECCC) consists of 83 058 sentences and covers four diverse domains, including essays written by native students, informal website texts, essays written by Romani ethnic minority children and teenagers and essays written by nonnative speakers. All domains are professionally annotated for GEC errors in a unified manner, and errors were automatically categorized with a Czech-specific version of ERRANT released at https://github.com/ufal/errant_czech
The dataset was introduced in the paper Czech Grammar Error Correction with a Large and Diverse Corpus that was accepted to TACL. Until published in TACL, see the arXiv version: https://arxiv.org/pdf/2201.05590.pdf
Grammar Error Correction Corpus for Czech (GECCC) consists of 83 058 sentences and covers four diverse domains, including essays written by native students, informal website texts, essays written by Romani ethnic minority children and teenagers and essays written by nonnative speakers. All domains are professionally annotated for GEC errors in a unified manner, and errors were automatically categorized with a Czech-specific version of ERRANT released at https://github.com/ufal/errant_czech
The dataset was introduced in the paper Czech Grammar Error Correction with a Large and Diverse Corpus that was accepted to TACL. Until published in TACL, see the arXiv version: https://arxiv.org/pdf/2201.05590.pdf
This version fixes double annotation errors in train and dev M2 files, and also contains more metadata information.
MorfFlex CZ 2.0 is the Czech morphological dictionary developed originally by Jan Hajič as a spelling checker and lemmatization dictionary. MorfFlex is a flat list of lemma-tag-wordform triples. For each wordform, full inflectional information is coded in a positional tag. Wordforms are organized into entries (paradigm instances or paradigms in short) according to their formal morphological behavior. The paradigm (set of wordforms) is identified by a unique lemma. Apart from traditional morphological categories, the description also contains some semantic, stylistic and derivational information. For more details see a comprehensive specification of the Czech morphological annotation http://ufal.mff.cuni.cz/techrep/tr64.pdf .
NER models for NameTag 2, named entity recognition tool, for English, German, Dutch, Spanish and Czech. Model documentation including performance can be found here: https://ufal.mff.cuni.cz/nametag/2/models . These models are for NameTag 2, named entity recognition tool, which can be found here: https://ufal.mff.cuni.cz/nametag/2 .
Ministerstvo školství, mládeže a tělovýchovy České republiky@@LM2018101@@LINDAT/CLARIAH-CZ: Digitální výzkumná infrastruktura pro jazykové technologie, umění a humanitní vědy@@nationalFunds@@✖[remove]24