Publisher: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL) / Rights: http://creativecommons.org/licenses/by-sa/3.0/


1. CzeSL Grammatical Error Correction Dataset (CzeSL-GEC)

Creator:: Šebesta, Karel, Bedřichová, Zuzanna, Šormová, Kateřina, Štindlová, Barbora, Hrdlička, Milan, Hrdličková, Tereza, Hana, Jiří, Petkevič, Vladimír, Jelínek, Tomáš, Škodová, Svatava, Janeš, Petr, Lundáková, Kateřina, Skoumalová, Hana, Sládek, Šimon, Pierscieniak, Piotr, Toufarová, Dagmar, Straka, Milan, Rosen, Alexandr, Náplava, Jakub, and Poláčková, Marie
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: natural language correction and grammatical error correction
Language:: Czech
Description:: CzeSL-GEC is a corpus of sentence pairs, each consisting of an original Czech sentence and its corrected version. The sentences were collected from essays written by non-native learners of Czech and by Czech pupils with a Romani background. The corpus was created from the unreleased CzeSL-man corpus (http://utkl.ff.cuni.cz/learncorp/). All sentences in the corpus are word-tokenized.
Rights:: Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0), http://creativecommons.org/licenses/by-sa/3.0/, and PUB

2. Many Czech References for 50 Sentences Selected from WMT11 Data

Creator:: Bojar, Ondřej, Macháček, Matouš, Tamchyna, Aleš, and Zeman, Daniel
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: machine translation, automatic machine translation evaluation, and reference translation
Language:: Czech
Description:: This dataset contains the complete set of Czech translations produced for 50 English source sentences from the WMT11 test set (http://www.statmt.org/wmt11). In total, there are 15431447 Czech sentences, i.e. roughly 300k reference translations per English source sentence on average, although the exact number varies greatly across sentences. More details are given in the included README file. If you use this dataset, please cite the following paper, which describes the technique used to construct the Czech translations: Bojar Ondřej, Macháček Matouš, Tamchyna Aleš, Zeman Daniel: Scratching the Surface of Possible Translations. Lecture Notes in Computer Science, Vol. 8082, Text, Speech and Dialogue: 16th International Conference, TSD 2013, Proceedings. Springer Verlag, Berlin / Heidelberg, ISBN 978-3-642-40584-6, ISSN 0302-9743, pp. 465-474, 2013, DOI: 10.1007/978-3-642-40585-3_59. Grants: P406/11/1499 of the Grant Agency of the Czech Republic, FP7-ICT-2011-7-288487 (MosesCore) of the European Union, and 1356213 of the Grant Agency of the Charles University.
Rights:: Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0), http://creativecommons.org/licenses/by-sa/3.0/, and PUB

6. W2C – Web to Corpus – tool

Creator:: Majliš, Martin
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: toolService and suiteOfTools
Subject:: web data, wikipedia, and corpus creation
Description:: A tool for building multilingual corpora from Wikipedia. It downloads the web pages, converts them to plain text, identifies the language, etc. A set of 120 corpora collected using this tool is available at https://ufal-point.mff.cuni.cz/xmlui/handle/11858/00-097C-0000-0022-6133-9
Rights:: Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0), http://creativecommons.org/licenses/by-sa/3.0/, and PUB

1. CzeSL Grammatical Error Correction Dataset (CzeSL-GEC)

2. Many Czech References for 50 Sentences Selected from WMT11 Data

3. Plaintext Wikipedia dump 2018

4. Spoken corpus of Karel Makoň

5. W2C – Web to Corpus – Corpora

6. W2C – Web to Corpus – tool
