AKCES-GEC is a grammar error correction corpus for Czech generated from a subset of AKCES. It contains train, dev and test files annotated in M2 format.
Note that in comparison to the CZESL-GEC dataset, this dataset provides separated edits together with their type annotations in M2 format and contains twice as many sentences.
If you use this dataset, please use the following citation:
@article{naplava2019wnut,
  title={Grammatical Error Correction in Low-Resource Scenarios},
  author={N{\'a}plava, Jakub and Straka, Milan},
  journal={arXiv preprint arXiv:1910.00353},
  year={2019}
}
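As a rough illustration of the M2 format mentioned above, the sketch below reads one M2 sentence block and applies its edits. It assumes the standard M2 layout (an "S" line with the tokenized source sentence, followed by "A" lines of the form `A start end|||type|||correction|||...`); the sample block and its error types are invented for illustration and are not taken from AKCES-GEC.

```python
# Minimal sketch of parsing and applying M2 edits. The sample block and
# its error-type labels are invented for illustration only.
SAMPLE_M2 = """\
S I likes this books .
A 1 2|||VERB:SVA|||like|||REQUIRED|||-NONE-|||0
A 3 4|||NOUN:NUM|||book|||REQUIRED|||-NONE-|||0
"""

def parse_m2_block(block):
    """Return (source_tokens, edits) for one M2 sentence block."""
    lines = block.strip().split("\n")
    tokens = lines[0].split()[1:]               # drop the leading "S"
    edits = []
    for line in lines[1:]:
        span, etype, correction = line[2:].split("|||")[:3]
        start, end = map(int, span.split())
        edits.append((start, end, etype, correction))
    return tokens, edits

def apply_edits(tokens, edits):
    """Apply non-overlapping edits right to left so indices stay valid."""
    out = list(tokens)
    for start, end, _etype, corr in sorted(edits, reverse=True):
        out[start:end] = corr.split() if corr and corr != "-NONE-" else []
    return out

tokens, edits = parse_m2_block(SAMPLE_M2)
print(" ".join(apply_edits(tokens, edits)))     # I like this book .
```

Applying the edits right to left keeps earlier token indices valid even when a correction changes the sentence length.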
Etalon is a manually annotated corpus of contemporary Czech. The corpus contains 1,885,589 words (2,265,722 tokens) and is annotated in the same way as SYN2020 of the Czech National Corpus. The corpus includes fiction (ca 24%), professional and scientific literature (ca 40%) and newspapers (ca 36%).
The corpus is provided in a vertical format, where sentence boundaries are marked with a blank line. Every word form is written on a separate line, followed by five tab-separated attributes: syntactic word, lemma, sublemma, tag and verbtag. The texts are shuffled in random chunks of 100 words at maximum (respecting sentence boundaries).
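The vertical format described above can be read with a few lines of code. The sketch below is a minimal reader, assuming one token per non-blank line with the five tab-separated attributes listed above and a blank line marking a sentence boundary; the sample tokens and their tag values are invented for illustration, not real SYN2020 annotations.

```python
# Minimal sketch of reading the vertical format: form plus five
# tab-separated attributes, blank line = sentence boundary. The sample
# lines and tag/verbtag values are invented for illustration only.
SAMPLE = (
    "Pes\tPes\tpes\tpes\tNNMS1-----A----\t-\n"
    "štěká\tštěká\tštěkat\tštěkat\tVB-S---3P-AA---\tA\n"
    ".\t.\t.\t.\tZ:-------------\t-\n"
    "\n"
)

FIELDS = ("form", "word", "lemma", "sublemma", "tag", "verbtag")

def read_vertical(text):
    """Yield sentences as lists of per-token attribute dicts."""
    sentence = []
    for line in text.split("\n"):
        if not line.strip():
            if sentence:
                yield sentence
                sentence = []
            continue
        sentence.append(dict(zip(FIELDS, line.split("\t"))))
    if sentence:
        yield sentence

for sent in read_vertical(SAMPLE):
    print([token["lemma"] for token in sent])   # ['pes', 'štěkat', '.']
```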
FicTree is a dependency treebank of Czech fiction manually annotated in the format of the analytical layer of the Prague Dependency Treebank. The treebank consists of 12,760 sentences (166,432 tokens). The texts come from eight literary works published in the Czech Republic between 1991 and 2007. The syntactic annotation was first produced by two different parsers (MSTParser and MaltParser) trained on the PDT training data and then manually corrected; any differences between the two corrected versions were resolved by another annotator.
The corpus is provided in a vertical format, where sentence boundaries are marked with a blank line. Every word form is written on a separate line, followed by five tab-separated attributes: lemma, tag, ID (word index in the sentence), head and deprel (analytical function, afun in the PDT formalism). The texts are shuffled in random chunks of at most 100 words (respecting sentence boundaries). Each chunk is provided as a separate file, with the suggested division into train, dev and test sets indicated by the file-name prefix.
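Since head indices refer to word IDs within the sentence (with 0 conventionally denoting the artificial root), the dependency tree can be reconstructed directly from the vertical lines. The sketch below is a minimal reader under those assumptions; the sample sentence and its tag and afun values are invented for illustration.

```python
# Minimal sketch of reading the FicTree vertical format and rebuilding
# the dependency tree. Head 0 is assumed to be the artificial root; the
# sample lines and their tag/afun values are invented for illustration.
from collections import defaultdict

SAMPLE = (
    "Pes\tpes\tNNMS1-----A----\t1\t2\tSb\n"
    "štěká\tštěkat\tVB-S---3P-AA---\t2\t0\tPred\n"
    ".\t.\tZ:-------------\t3\t0\tAuxK\n"
    "\n"
)

FIELDS = ("form", "lemma", "tag", "id", "head", "deprel")

def read_sentences(text):
    """Yield sentences as lists of per-token attribute dicts."""
    sentence = []
    for line in text.split("\n"):
        if not line.strip():
            if sentence:
                yield sentence
                sentence = []
            continue
        token = dict(zip(FIELDS, line.split("\t")))
        token["id"], token["head"] = int(token["id"]), int(token["head"])
        sentence.append(token)
    if sentence:
        yield sentence

def tree(sentence):
    """Map each head index (0 = artificial root) to its dependent IDs."""
    deps = defaultdict(list)
    for token in sentence:
        deps[token["head"]].append(token["id"])
    return dict(deps)

for sent in read_sentences(SAMPLE):
    print(tree(sent))                           # {2: [1], 0: [2, 3]}
```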