Contributor: European Union, EC/H2020/825303, Bergamot - Browser-based Multilingual Translation (euFunds, info:eu-repo/grantAgreement/EC/H2020/825303) / Creator: Bojar, Ondřej / Rights: PUB

1. COSTRA 1.0: A Dataset of Complex Sentence Transformations

Creator:: Barančíková, Petra and Bojar, Ondřej
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: sentences, sentence embeddings, paraphrases, and semantic relations
Language:: Czech
Description:: COSTRA 1.0 is a dataset of Czech complex sentence transformations. The dataset is intended for the study of sentence-level embeddings beyond simple word alternations or standard paraphrasing. The dataset consists of 4,262 unique sentences with an average length of 10 words, illustrating 15 types of modifications such as simplification, generalization, or formal and informal language variation. The hope is that this dataset will make it possible to test semantic properties of sentence embeddings and perhaps even to find some topologically interesting “skeleton” in the sentence embedding space.
Rights:: Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB

2. COSTRA 1.1: A Dataset of Complex Sentence Transformations and Comparisons

Creator:: Barančíková, Petra and Bojar, Ondřej
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: paraphrases, sentence embeddings, evaluation, and sentence
Language:: Czech
Description:: Costra 1.1 is a new dataset for testing geometric properties of sentence embedding spaces. In particular, it concentrates on examining how well sentence embeddings capture complex phenomena such as paraphrases, tense, or generalization. The dataset is a direct expansion of Costra 1.0, which was extended with more sentences and sentence comparisons.
Rights:: Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB

3. EMMT (Eyetracked Multi-Modal Translation)

Creator:: Bhattacharya, Sunit, Kloudová, Věra, Zouhar, Vilém, and Bojar, Ondřej
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: audio and corpus
Subject:: sight translation and multi-modal
Language:: English and Czech
Description:: Eyetracked Multi-Modal Translation (EMMT) is a simultaneous eye-tracking, 4-electrode EEG and audio corpus for multi-modal reading and translation scenarios. It contains monocular eye movement recordings, audio data and 4-electrode wearable electroencephalogram (EEG) data of 43 participants, recorded while they were engaged in sight translation supported by an image. Details about the experiment and the dataset can be found in the README file.
Rights:: Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB

4. Ptakopět data: the dataset for experiments on outbound translation

Creator:: Novák, Michal, Zouhar, Vilém, and Bojar, Ondřej
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: machine translation, interactive, and web forms
Language:: English and Czech
Description:: This is the dataset used for the Ptakopět experiment on outbound machine translation. It consists of screenshots of web forms with user queries entered. The queries are also available in text form. The dataset comprises two language versions: English and Czech. Whereas the English version has been fully post-processed (screenshots cropped, queries within the screenshots highlighted, dataset split based on its quality, etc.), the Czech version is raw, as it was collected by the annotators.
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
