Creator: Barančíková, Petra / Publisher: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)

Creator:: Barančíková, Petra and Tamchyna, Aleš
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: machine translation, automatic evaluation, and paraphrasing
Language:: Czech
Description:: This dataset contains automatic paraphrases of the official Czech reference translations for the Workshop on Statistical Machine Translation shared task. The data covers the years 2011, 2013 and 2014. For each sentence, at most 10,000 paraphrases were included (randomly selected from the full set). The goal of this dataset is to improve automatic evaluation of machine translation outputs. If you use this work, please cite the following paper: Tamchyna Aleš, Barančíková Petra: Automatic and Manual Paraphrases for MT Evaluation. In Proceedings of LREC, 2016.
Rights:: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0), http://creativecommons.org/licenses/by-sa/4.0/, and PUB

Creator:: Barančíková, Petra and Bojar, Ondřej
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: sentences, sentence embeddings, paraphrases, and semantic relations
Language:: Czech
Description:: COSTRA 1.0 is a dataset of Czech complex sentence transformations. The dataset is intended for the study of sentence-level embeddings beyond simple word alternations or standard paraphrasing. The dataset consists of 4,262 unique sentences with an average length of 10 words, illustrating 15 types of modifications such as simplification, generalization, or formal and informal language variation. With this dataset, we hope to test semantic properties of sentence embeddings and perhaps even find some topologically interesting “skeleton” in the sentence embedding space.
Rights:: Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB

Creator:: Barančíková, Petra and Bojar, Ondřej
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: paraphrases, sentence embeddings, evaluation, and sentence
Language:: Czech
Description:: Costra 1.1 is a new dataset for testing geometric properties of sentence embedding spaces. In particular, it concentrates on examining how well sentence embeddings capture complex phenomena such as paraphrases, tense or generalization. The dataset is a direct expansion of Costra 1.0, which was extended with more sentences and sentence comparisons.
Rights:: Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB

Creator:: Barančíková, Petra and Kettnerová, Václava
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text, machineReadableDictionary, and lexicalConceptualResource
Subject:: multiword expressions, light verb construction, paraphrases, and idioms
Language:: Czech
Description:: ParaDi 2.0 is a dictionary of single-verb paraphrases of Czech verbal multiword expressions: light verb constructions and idiomatic verb constructions. Moreover, it provides an elaborate set of morphological, syntactic and semantic features, including information on aspectual counterparts of verbs and paraphrasability conditions of given verbs. The format of ParaDi has been designed with respect to both human and machine readability: the dictionary is represented as a plain table in TSV format, as it is a flexible and language-independent data format.
Rights:: Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB

Creator:: Barančíková, Petra and Kettnerová, Václava
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text, machineReadableDictionary, and lexicalConceptualResource
Subject:: light verb construction and paraphrases
Language:: Czech
Description:: Dictionary of single verb paraphrases of Czech light verb constructions.
Rights:: Public Domain Mark (PD), http://creativecommons.org/publicdomain/mark/1.0/, and PUB
