Rights: PUB / Subject: machine translation - LINDAT/CLARIAH-CZ Catalog Search Results

1. A Small Dataset for English-to-Czech Speech Translation in the Travel Domain

Creator:: Cífka, Ondřej and Bojar, Ondřej
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: audio and corpus
Subject:: speech corpus, ASR, and machine translation
Language:: English and Czech
Description:: This small dataset contains 3 speech corpora collected using the Alex Translate telephone service (https://ufal.mff.cuni.cz/alex#alex-translate). The "part1" and "part2" corpora contain English speech with transcriptions and Czech translations. These recordings were collected from users of the service. Part 1 contains earlier recordings, filtered to include only clean speech; Part 2 contains later recordings with no filtering applied. The "cstest" corpus contains recordings of artificially created sentences, each containing one or more Czech names of places in the Czech Republic. These were recorded by a multinational group of students studying in Prague.
Rights:: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0), http://creativecommons.org/licenses/by-sa/4.0/, and PUB

2. APE Shared Task WMT17: Human Post-edits Test Data DE-EN

Creator:: Turchi, Marco, Chatterjee, Rajen, and Negri, Matteo
Publisher:: Fondazione Bruno Kessler, Trento, Italy
Type:: text and corpus
Subject:: Human post-edits, machine translation, shared task, automatic post-editing, and post-editing
Language:: English
Description:: Human post-edited test sentences for the WMT 2017 Automatic post-editing task. This consists in 2,000 English sentences belonging to the IT domain and already tokenized. Source and target segments can be downloaded from: https://lindat.mff.cuni.cz/repository/xmlui/handle/11372/LRT-2132. All data is provided by the EU project QT21 (http://www.qt21.eu/).
Rights:: AGREEMENT ON THE USE OF DATA IN QT21 APE Task, https://lindat.mff.cuni.cz/repository/xmlui/page/licence-TAUS_QT21, and PUB

3. APE Shared Task WMT17: Human Post-edits Test Data EN-DE

Creator:: Turchi, Marco, Chatterjee, Rajen, and Negri, Matteo
Publisher:: Fondazione Bruno Kessler, Trento, Italy
Type:: text and corpus
Subject:: machine translation, human post-edits, shared task, automatic post-editing, and post-editing
Language:: German
Description:: Human post-edited test sentences for the WMT 2017 Automatic post-editing task. This consists in 2,000 German sentences belonging to the IT domain and already tokenized. Source and target segments can be downloaded from: https://lindat.mff.cuni.cz/repository/xmlui/handle/11372/LRT-2133. All data is provided by the EU project QT21 (http://www.qt21.eu/).
Rights:: AGREEMENT ON THE USE OF DATA IN QT21 APE Task, https://lindat.mff.cuni.cz/repository/xmlui/page/licence-TAUS_QT21, and PUB

4. Automatic Paraphrases of Czech Reference Sentences for WMT11, 13 and 14

Creator:: Barančíková, Petra and Tamchyna, Aleš
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: machine translation, automatic evaluation, and paraphrasing
Language:: Czech
Description:: This dataset contains automatic paraphrases of Czech official reference translations for the Workshop on Statistical Machine Translation shared task. The data covers the years 2011, 2013 and 2014. For each sentence, at most 10000 paraphrases were included (randomly selected from the full set). The goal of using this dataset is to improve automatic evaluation of machine translation outputs. If you use this work, please cite the following paper: Tamchyna Aleš, Barančíková Petra: Automatic and Manual Paraphrases for MT Evaluation. In proceedings of LREC, 2016.
Rights:: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0), http://creativecommons.org/licenses/by-sa/4.0/, and PUB

5. Česílko

Creator:: Hajič, Jan, Kuboň, Vladislav, and Homola, Petr
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: toolService
Subject:: machine translation and Czech-Slovak translation
Language:: Czech
Description:: Česílko is a tool enabling the fast and efficient translation from one source language into many target languages, which are mutually related.
Rights:: Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0), http://creativecommons.org/licenses/by-nc-nd/3.0/, and PUB

6. CUBBITT Translation Models (en-cs) (v1.0)

Creator:: Popel, Martin, Tomková, Markéta, Tomek, Jakub, Kaiser, Łukasz, Uszkoreit, Jakob, Bojar, Ondřej, and Žabokrtský, Zdeněk
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: tool and toolService
Subject:: machine translation, neural machine translation, transformer, and cubbitt
Language:: English and Czech
Description:: CUBBITT En-Cs translation models, exported via TensorFlow Serving, available in the Lindat translation service (https://lindat.mff.cuni.cz/services/translation/). Models are compatible with Tensor2tensor version 1.6.6. For details about the model training (data, model hyper-parameters), please contact the archive maintainer. Evaluation on newstest2014 (BLEU): en->cs: 27.6 cs->en: 34.4 (Evaluated using multeval: https://github.com/jhclark/multeval)
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

7. CUBBITT Translation Models (en-fr) (v1.0)

Creator:: Popel, Martin, Tomková, Markéta, Tomek, Jakub, Kaiser, Łukasz, Uszkoreit, Jakob, Bojar, Ondřej, and Žabokrtský, Zdeněk
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: tool and toolService
Subject:: machine translation, neural machine translation, transformer, and cubbitt
Language:: English and French
Description:: CUBBITT En-Fr translation models, exported via TensorFlow Serving, available in the Lindat translation service (https://lindat.mff.cuni.cz/services/translation/). Models are compatible with Tensor2tensor version 1.6.6. For details about the model training (data, model hyper-parameters), please contact the archive maintainer. Evaluation on newstest2014 (BLEU): en->fr: 38.2 fr->en: 36.7 (Evaluated using multeval: https://github.com/jhclark/multeval)
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

8. CUBBITT Translation Models (en-pl) (v1.0)

Creator:: Popel, Martin, Tomková, Markéta, Tomek, Jakub, Kaiser, Łukasz, Uszkoreit, Jakob, Bojar, Ondřej, and Žabokrtský, Zdeněk
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: tool and toolService
Subject:: machine translation, neural machine translation, transformer, and cubbitt
Language:: English and Polish
Description:: CUBBITT En-Pl translation models, exported via TensorFlow Serving, available in the Lindat translation service (https://lindat.mff.cuni.cz/services/translation/). Models are compatible with Tensor2tensor version 1.6.6. For details about the model training (data, model hyper-parameters), please contact the archive maintainer. Evaluation on newstest2020 (BLEU): en->pl: 12.3 pl->en: 20.0 (Evaluated using multeval: https://github.com/jhclark/multeval)
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

9. Czech image captioning, machine translation, and sentiment analysis (Neural Monkey models)

Creator:: Libovický, Jindřich, Rosa, Rudolf, Helcl, Jindřich, and Popel, Martin
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: suiteOfTools and toolService
Subject:: sentiment analysis, machine translation, image captioning, neural networks, transformer, and Neural Monkey
Language:: Czech and English
Description:: This submission contains trained end-to-end models for the Neural Monkey toolkit for Czech and English, solving three NLP tasks: machine translation, image captioning, and sentiment analysis. The models are trained on standard datasets and achieve state-of-the-art or near state-of-the-art performance in the tasks. The models are described in the accompanying paper. The same models can also be invoked via the online demo: https://ufal.mff.cuni.cz/grants/lsd There are several separate ZIP archives here, each containing one model solving one of the tasks for one language. To use a model, you first need to install Neural Monkey: https://github.com/ufal/neuralmonkey To ensure correct functioning of the model, please use the exact version of Neural Monkey specified by the commit hash stored in the 'git_commit' file in the model directory. Each model directory contains a 'run.ini' Neural Monkey configuration file, to be used to run the model. See the Neural Monkey documentation to learn how to do that (you may need to update some paths to correspond to your filesystem organization). The 'experiment.ini' file, which was used to train the model, is also included. Then there are files containing the model itself, files containing the input and output vocabularies, etc. For the sentiment analyzers, you should tokenize your input data using the Moses tokenizer: https://pypi.org/project/mosestokenizer/ For the machine translation, you do not need to tokenize the data, as this is done by the model. For image captioning, you need to: - download a trained ResNet: http://download.tensorflow.org/models/resnet_v2_50_2017_04_14.tar.gz - clone the git repository with TensorFlow models: https://github.com/tensorflow/models - preprocess the input images with the Neural Monkey 'scripts/imagenet_features.py' script (https://github.com/ufal/neuralmonkey/blob/master/scripts/imagenet_features.py) -- you need to specify the path to ResNet and to the TensorFlow models to this script Feel free to contact the authors of this submission in case you run into problems!
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

10. Czech image captioning, machine translation, sentiment analysis and summarization (Neural Monkey models)

Creator:: Libovický, Jindřich, Rosa, Rudolf, Helcl, Jindřich, and Popel, Martin
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: suiteOfTools and toolService
Subject:: sentiment analysis, machine translation, image captioning, neural networks, transformer, Neural Monkey, and summarization
Language:: Czech and English
Description:: This submission contains trained end-to-end models for the Neural Monkey toolkit for Czech and English, solving four NLP tasks: machine translation, image captioning, sentiment analysis, and summarization. The models are trained on standard datasets and achieve state-of-the-art or near state-of-the-art performance in the tasks. The models are described in the accompanying paper. The same models can also be invoked via the online demo: https://ufal.mff.cuni.cz/grants/lsd In addition to the models presented in the referenced paper (developed and published in 2018), we include models for automatic news summarization for Czech and English developed in 2019. The Czech models were trained using the SumeCzech dataset (https://www.aclweb.org/anthology/L18-1551.pdf), the English models were trained using the CNN-Daily Mail corpus (https://arxiv.org/pdf/1704.04368.pdf) using the standard recurrent sequence-to-sequence architecture. There are several separate ZIP archives here, each containing one model solving one of the tasks for one language. To use a model, you first need to install Neural Monkey: https://github.com/ufal/neuralmonkey To ensure correct functioning of the model, please use the exact version of Neural Monkey specified by the commit hash stored in the 'git_commit' file in the model directory. Each model directory contains a 'run.ini' Neural Monkey configuration file, to be used to run the model. See the Neural Monkey documentation to learn how to do that (you may need to update some paths to correspond to your filesystem organization). The 'experiment.ini' file, which was used to train the model, is also included. Then there are files containing the model itself, files containing the input and output vocabularies, etc. For the sentiment analyzers, you should tokenize your input data using the Moses tokenizer: https://pypi.org/project/mosestokenizer/ For the machine translation, you do not need to tokenize the data, as this is done by the model. For image captioning, you need to: - download a trained ResNet: http://download.tensorflow.org/models/resnet_v2_50_2017_04_14.tar.gz - clone the git repository with TensorFlow models: https://github.com/tensorflow/models - preprocess the input images with the Neural Monkey 'scripts/imagenet_features.py' script (https://github.com/ufal/neuralmonkey/blob/master/scripts/imagenet_features.py) -- you need to specify the path to ResNet and to the TensorFlow models to this script The summarization models require input that is tokenized with Moses Tokenizer (https://github.com/alvations/sacremoses) and lower-cased. Feel free to contact the authors of this submission in case you run into problems!
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

11. Depfix: Automatic Post-editing of SMT

Creator:: Rosa, Rudolf
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: tool and toolService
Subject:: machine translation, post-editing, Treex, morphology, and parsing
Language:: English and Czech
Description:: Depfix, a tool for Automatic Post-editing of SMT. See the project website for more information.
Rights:: GNU General Public License, version 2, http://www.gnu.org/licenses/gpl-2.0.html, and PUB

12. DiscoMT 2015 Shared Task on Pronoun Translation

Creator:: Hardmeier, Christian, Tiedemann, Jörg, Nakov, Preslav, Stymne, Sara, and Versley, Yannick
Publisher:: Uppsala University
Type:: text and corpus
Subject:: machine translation, coreference resolution, anaphora resolution, and discourse
Language:: English and French
Description:: The data set includes training, development and test data from the shared tasks on pronoun-focused machine translation and cross-lingual pronoun prediction from the EMNLP 2015 workshop on Discourse in Machine Translation (DiscoMT2015). The release also contains the submissions to the pronoun-focused machine translation along with the manual annotations used for the official evaluation as well as gold-standard annotations of pronoun coreference for the shared task test set.
Rights:: Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0), http://creativecommons.org/licenses/by-nc-nd/4.0/, and PUB

13. DiscoMT 2016 Shared Task on Cross-lingual Pronoun Prediction

Creator:: Guillou, Liane, Hardmeier, Christian, Nakov, Preslav, Stymne, Sara, Tiedemann, Jörg, Versley, Yannick, Cettolo, Mauro, Webber, Bonnie, and Popescu-Belis, Andrei
Publisher:: Uppsala University
Type:: text and corpus
Subject:: machine translation, coreference, discourse, and pronouns
Language:: English, French, and German
Description:: Files for the DiscoMT 2016 shared task on cross-lingual pronoun prediction
Rights:: Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0), http://creativecommons.org/licenses/by-nc-nd/4.0/, and PUB

14. DiscoMT 2017 Shared Task on Cross-lingual Pronoun Prediction

Creator:: Loáiciga, Sharid, Stymne, Sara, Nakov, Preslav, Hardmeier, Christian, Tiedemann, Jörg, Cettolo, Mauro, and Versley, Yannick
Publisher:: Uppsala University
Type:: text and corpus
Subject:: machine translation, discourse, coreference, and pronouns
Language:: English, Spanish, German, and French
Description:: Data used in the 2017 shared task on cross-lingual pronoun prediction.
Rights:: Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0), http://creativecommons.org/licenses/by-nc-nd/4.0/, and PUB

15. English-Urdu Religious Parallel Corpus

Creator:: Jawaid, Bushra and Zeman, Daniel
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: parallel corpus, religious text, and machine translation
Language:: English and Urdu
Description:: English-Urdu parallel corpus is a collection of religious texts (Quran, Bible) in English and Urdu language with sentence alignments. The corpus can be used for experiments with statistical machine translation. Our modifications of crawled data include but are not limited to the following: 1- Manually corrected sentence alignment of the corpora. 2- Our data split (training-development-test) so that our published experiments can be reproduced. 3- Tokenization (optional, but needed to reproduce our experiments). 4- Normalization (optional) of e.g. European vs. Urdu numerals, European vs. Urdu punctuation, removal of Urdu diacritics.
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

16. Extended CLEF eHealth 2013-2015 IR Test Collection

Creator:: Pecina, Pavel and Saleh, Shadi
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: cross-lingual information retrieval and machine translation
Language:: English, Czech, French, German, Hungarian, Polish, Spanish, and Swedish
Description:: This package contains an extended version of the test collection used in the CLEF eHealth Information Retrieval tasks in 2013--2015. Compared to the original version, it provides complete query translations into Czech, French, German, Hungarian, Polish, Spanish and Swedish and additional relevance assessment.
Rights:: Creative Commons - Attribution-NonCommercial 4.0 International (CC BY-NC 4.0), http://creativecommons.org/licenses/by-nc/4.0/, and PUB

17. FAUST cs-en 0.5

Creator:: Hajič, Jan, Mareček, David, Fučíková, Eva, Cinková, Silvie, Štěpánek, Jan, Mikulová, Marie, and Popel, Martin
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: noisy texts, parallel corpus, and machine translation
Language:: English and Czech
Description:: This machine translation test set contains 2223 Czech sentences collected within the FAUST project (https://ufal.mff.cuni.cz/grants/faust, http://hdl.handle.net/11234/1-3308). Each original (noisy) sentence was normalized (clean1 and clean2) and translated to English independently by two translators.
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

18. Hausa Visual Genome 1.0

Creator:: Abdulmumin, Idris, Das, Satya Ranja, Dawud, Musa Abdullahi, Parida, Shantipriya, Muhammad, Shamsuddeen Hassan, Ahmad, Ibrahim Sa'id, Panda, Subhadarshi, Bojar, Ondřej, Galadanci, Bashir Shehu, and Bello, Bello Shehu
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: image and corpus
Subject:: multi-modal, machine translation, image captioning, image annotation, and neural machine translation
Language:: Hausa and English
Description:: Data ------- Hausa Visual Genome 1.0, a multimodal dataset consisting of text and images suitable for English-to-Hausa multimodal machine translation tasks and multimodal research. We follow the same selection of short English segments (captions) and the associated images from Visual Genome as the dataset Hindi Visual Genome 1.1 has. We automatically translated the English captions to Hausa and manually post-edited, taking the associated images into account. The training set contains 29K segments. Further 1K and 1.6K segments are provided in development and test sets, respectively, which follow the same (random) sampling from the original Hindi Visual Genome. Additionally, a challenge test set of 1400 segments is available for the multi-modal task. This challenge test set was created in Hindi Visual Genome by searching for (particularly) ambiguous English words based on the embedding similarity and manually selecting those where the image helps to resolve the ambiguity. Dataset Formats ----------------------- The multimodal dataset contains both text and images. The text parts of the dataset (train and test sets) are in simple tab-delimited plain text files. All the text files have seven columns as follows: Column1 - image_id Column2 - X Column3 - Y Column4 - Width Column5 - Height Column6 - English Text Column7 - Hausa Text The image part contains the full images with the corresponding image_id as the file name. The X, Y, Width, and Height columns indicate the rectangular region in the image described by the caption. Data Statistics -------------------- The statistics of the current release are given below. Parallel Corpus Statistics ----------------------------------- Dataset Segments English Words Hausa Words ---------- -------- ------------- ----------- Train 28930 143106 140981 Dev 998 4922 4857 Test 1595 7853 7736 Challenge Test 1400 8186 8752 ---------- -------- ------------- ----------- Total 32923 164067 162326 The word counts are approximate, prior to tokenization. Citation ----------- If you use this corpus, please cite the following paper: @InProceedings{abdulmumin-EtAl:2022:LREC, author = {Abdulmumin, Idris and Dash, Satya Ranjan and Dawud, Musa Abdullahi and Parida, Shantipriya and Muhammad, Shamsuddeen and Ahmad, Ibrahim Sa'id and Panda, Subhadarshi and Bojar, Ond{\v{r}}ej and Galadanci, Bashir Shehu and Bello, Bello Shehu}, title = "{Hausa Visual Genome: A Dataset for Multi-Modal English to Hausa Machine Translation}", booktitle = {Proceedings of the Language Resources and Evaluation Conference}, month = {June}, year = {2022}, address = {Marseille, France}, publisher = {European Language Resources Association}, pages = {6471--6479}, url = {https://aclanthology.org/2022.lrec-1.694} }
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

19. Hindi Visual Genome 1.0

Creator:: Parida, Shantipriya and Bojar, Ondřej
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: image and corpus
Subject:: parallel corpus, corpus, multilingual, machine translation, shared task, English-Hindi parallel corpus, image captioning, and multi-modal
Language:: English and Hindi
Description:: Data ---- Hindi Visual Genome 1.0, a multimodal dataset consisting of text and images suitable for English-to-Hindi multimodal machine translation task and multimodal research. We have selected short English segments (captions) from Visual Genome along with associated images and automatically translated them to Hindi with manual post-editing, taking the associated images into account. The training set contains 29K segments. Further 1K and 1.6K segments are provided in a development and test sets, respectively, which follow the same (random) sampling from the original Hindi Visual Genome. Additionally, a challenge test set of 1400 segments will be released for the WAT2019 multi-modal task. This challenge test set was created by searching for (particularly) ambiguous English words based on the embedding similarity and manually selecting those where the image helps to resolve the ambiguity. Dataset Formats -------------- The multimodal dataset contains both text and images. The text parts of the dataset (train and test sets) are in simple tab-delimited plain text files. All the text files have seven columns as follows: Column1 - image_id Column2 - X Column3 - Y Column4 - Width Column5 - Height Column6 - English Text Column7 - Hindi Text The image part contains the full images with the corresponding image_id as the file name. The X, Y, Width and Height columns indicate the rectangular region in the image described by the caption. Data Statistics ---------------- The statistics of the current release is given below. Parallel Corpus Statistics --------------------------- Dataset Segments English Words Hindi Words ------- --------- ---------------- ------------- Train 28932 143178 136722 Dev 998 4922 4695 Test 1595 7852 7535 Challenge Test 1400 8185 8665 (Released separately) ------- --------- ---------------- ------------- Total 32925 164137 157617 The word counts are approximate, prior to tokenization. Citation -------- If you use this corpus, please cite the following paper: @article{hindi-visual-genome:2019, title={{Hindi Visual Genome: A Dataset for Multimodal English-to-Hindi Machine Translation}}, author={Parida, Shantipriya and Bojar, Ond{\v{r}}ej and Dash, Satya Ranjan}, journal={Computaci{\'o}n y Sistemas}, note={In print. Presented at CICLing 2019, La Rochelle, France}, year={2019}, }
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

20. Khresmoi Query Translation Test Data 1.0

Creator:: Pecina, Pavel, Dušek, Ondřej, Hajič, Jan, and Urešová, Zdeňka
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: corpus, test data, medical, health, machine translation, Czech, French, German, and English
Language:: English, French, German, and Czech
Description:: This package contains data sets for development and testing of machine translation of medical search short queries between Czech, English, French, and German. The queries come from general public and medical experts. and This work was supported by the EU FP7 project Khresmoi (European Comission contract No. 257528). The language resources are distributed by the LINDAT/Clarin project of the Ministry of Education, Youth and Sports of the Czech Republic (project no. LM2010013). We thank Health on the Net Foundation for granting the license for the English general public queries, TRIP database for granting the license for the English medical expert queries, and three anonymous translators and three medical experts for translating amd revising the data.
Rights:: Attribution-NonCommercial 3.0 Unported (CC BY-NC 3.0), http://creativecommons.org/licenses/by-nc/3.0/, and PUB

21. Khresmoi Query Translation Test Data 2.0

Creator:: Pecina, Pavel, Dušek, Ondřej, Hajič, Jan, Libovický, Jindřich, and Urešová, Zdeňka
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: corpus, test data, medical, health, machine translation, Czech, English, French, German, Hungarian, Polish, Spanish, and Swedish
Language:: Czech, English, French, German, Hungarian, Polish, Spanish, and Swedish
Description:: This package contains data sets for development and testing of machine translation of medical queries between Czech, English, French, German, Hungarian, Polish, Spanish ans Swedish. The queries come from general public and medical experts. This is version 2.0 extending the previous version by adding Hungarian, Polish, Spanish, and Swedish translations.
Rights:: Creative Commons - Attribution-NonCommercial 4.0 International (CC BY-NC 4.0), http://creativecommons.org/licenses/by-nc/4.0/, and PUB

22. Khresmoi Summary Translation Test Data 1.1

Creator:: Dušek, Ondřej, Hajič, Jan, Hlaváčová, Jaroslava, Pecina, Pavel, Tamchyna, Aleš, and Urešová, Zdeňka
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: corpus, test data, medical, health, machine translation, Czech, French, German, and English
Language:: English, Czech, French, and German
Description:: This package contains data sets for development and testing of machine translation of sentences from summaries of medical articles between Czech, English, French, and German. and This work was supported by the EU FP7 project Khresmoi (European Comission contract No. 257528). The language resources are distributed by the LINDAT/Clarin project of the Ministry of Education, Youth and Sports of the Czech Republic (project no. LM2010013). We thank all the data providers and copyright holders for providing the source data and anonymous experts for translating the sentences.
Rights:: Attribution-NonCommercial 3.0 Unported (CC BY-NC 3.0), http://creativecommons.org/licenses/by-nc/3.0/, and PUB

23. Khresmoi Summary Translation Test Data 2.0

Creator:: Dušek, Ondřej, Hajič, Jan, Hlaváčová, Jaroslava, Libovický, Jindřich, Pecina, Pavel, Tamchyna, Aleš, and Urešová, Zdeňka
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: corpus, test data, medical, health, machine translation, Czech, English, French, German, Hungarian, Polish, Spanish, and Swedish
Language:: Czech, English, French, German, Hungarian, Polish, Spanish, and Swedish
Description:: This package contains data sets for development (Section dev) and testing (Section test) of machine translation of sentences from summaries of medical articles between Czech, English, French, German, Hungarian, Polish, Spanish and Swedish. Version 2.0 extends the previous version by adding Hungarian, Polish, Spanish, and Swedish translations.
Rights:: Creative Commons - Attribution-NonCommercial 4.0 International (CC BY-NC 4.0), http://creativecommons.org/licenses/by-nc/4.0/, and PUB

24. Large-Scale Colloquial Persian 0.5

Creator:: Abdi Khojasteh, Hadi, Ansari, Ebrahim, and Bohlouli, Mahdi
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL) and Institute for Advanced Studies in Basic Sciences (IASBS)
Type:: text and corpus
Subject:: PoS tagging, corpus, annotated corpus, multilingual, derivation, dependency parser, machine translation, informal language, spoken language, monolingual corpus, and bilingual corpus annotation
Language:: Persian, English, German, Czech, Italian, and Hindi
Description:: "Large Scale Colloquial Persian Dataset" (LSCP) is hierarchically organized in asemantic taxonomy that focuses on multi-task informal Persian language understanding as a comprehensive problem. LSCP includes 120M sentences from 27M casual Persian tweets with its dependency relations in syntactic annotation, Part-of-speech tags, sentiment polarity and automatic translation of original Persian sentences in five different languages (EN, CS, DE, IT, HI).
Rights:: Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0), http://creativecommons.org/licenses/by-nc-nd/4.0/, and PUB

25. LINDAT Translation service

Creator:: Košarko, Ondřej, Variš, Dušan, and Popel, Martin
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: service and toolService
Subject:: machine translation and frontend
Description:: Source code of the LINDAT Translation service frontend. The service provides a UI and a simple rest api that accesses machine translation models served by tensorflow serving. The most recent version of the code is available at https://github.com/ufal/lindat_translation.
Rights:: BSD 2-Clause "Simplified" or "FreeBSD" license, http://opensource.org/licenses/BSD-2-Clause, and PUB

26. LiStr: Linguistic Structure Induction Tookit

Creator:: Mareček, David and Straka, Milan
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: tool and toolService
Subject:: parsing, unsupervised machine learning, machine translation, and grammar induction
Language:: English
Description:: This toolkit comprises the tools and supporting scripts for unsupervised induction of dependency trees from raw texts or texts with already assigned part-of-speech tags. There are also scripts for simple machine translation based on unsupervised parsing and scripts for minimally supervised parsing into Universal-Dependencies style.
Rights:: GNU General Public Licence, version 3, http://opensource.org/licenses/GPL-3.0, and PUB

27. Machine Translation Testsuite for Gender-Consistent Translation

Creator:: Aires, João Paulo
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: machine translation, testsuite, evaluation, and gender
Language:: English and Czech
Description:: Document-level testsuite for evaluation of gender translation consistency. Our Document-Level test set consists of selected English documents from the WMT21 newstest annotated with gender information. Czech unnanotated references are also added for convenience. We semi-automatically annotated person names and pronouns to identify the gender of these elements as well as coreferences. Our proposed annotation consists of three elements: (1) an ID, (2) an element class, and (3) gender. The ID identifies a person's name and its occurrences (name and pronouns). The element class identifies whether the tag refers to a name or a pronoun. Finally, the gender information defines whether the element is masculine or feminine. We performed a series of NLP techniques to automatically identify person names and coreferences. This initial process resulted in a set containing 45 documents to be manually annotated. Thus, we started a manual annotation of these documents to make sure they are correctly tagged. See README.md for more details.
Rights:: Creative Commons - Attribution-NonCommercial 4.0 International (CC BY-NC 4.0), http://creativecommons.org/licenses/by-nc/4.0/, and PUB

28. Manual Re-evaluation of Translation Quality of WMT 2018 English-Czech systems

Creator:: Popel, Martin, Tomková, Markéta, and Tomek, Jakub
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: machine translation, manual evaluation, fluency, adequacy, and Translation Turing test
Language:: Czech and English
Description:: This data set contains four types of manual annotation of translation quality, focusing on the comparison of human and machine translation quality (aka human-parity). The machine translation system used is English-Czech CUNI Transformer (CUBBITT). The annotations distinguish adequacy, fluency and overall quality. One of the types is Translation Turing test - detecting whether the annotators can distinguish human from machine translation. All the sentences are taken from the English-Czech test set newstest2018 (WMT2018 News translation shared task www.statmt.org/wmt18/translation-task.html), but only from the half with originally English sentences translated to Czech by a professional agency.
Rights:: Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB

29. Manually Classified Errors in Cs->Sk Translation

Creator:: Galuščáková, Petra and Bojar, Ondřej
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: machine translation, errors classification, and CS-SK translation
Language:: Slovak and Czech
Description:: Manual classification of errors of Czech-Slovak translation according to the classification introduced by Vilar et al. [1]. First 50 sentences from WMT 2010 test set were translated by 5 MT systems (Česílko, Česílko2, Google Translate and two Moses setups) and MT errors were manually marked and classified. Classification was applied in MT systems comparison [3]. Reference translation is included. References: [1] David Vilar, Jia Xu, Luis Fernando D’Haro and Hermann Ney. Error Analysis of Machine Translation Output. In International Conference on Language Resources and Evaluation, pages 697-702. Genoa, Italy, May 2006. [2] http://matrix.statmt.org/test_sets/list [3] Ondřej Bojar, Petra Galuščáková, and Miroslav Týnovský. Evaluating Quality of Machine Translation from Czech to Slovak. In Markéta Lopatková, editor, Information Technologies - Applications and Theory, pages 3-9, September 2011 and This work has been supported by the grants Euro-MatrixPlus (FP7-ICT-2007-3-231720 of the EU and 7E09003 of the Czech Republic)
Rights:: Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0), http://creativecommons.org/licenses/by-nc-sa/3.0/, and PUB

30. Manually Classified Errors in En->Sk Translation

Creator:: Galuščáková, Petra and Bojar, Ondřej
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: machine translation, errors classification, and EN-SK translation
Language:: Slovak and English
Description:: Manual classification of errors of English-Slovak translation according to the classification introduced by Vilar et al. [1]. 50 sentences randomly selected from WMT 2011 test set [2] were translated by 3 MT systems described in [3] and MT errors were manually marked and classified. Reference translation is included. References: [1] David Vilar, Jia Xu, Luis Fernando D’Haro and Hermann Ney. Error Analysis of Machine Translation Output. In International Conference on Language Resources and Evaluation, pages 697-702. Genoa, Italy, May 2006. [2] http://www.statmt.org/wmt11/evaluation-task.html [3] Petra Galuščáková and Ondřej Bojar. Improving SMT by Using Parallel Data of a Closely Related Language. In Human Language Technologies - The Baltic Perspective - Proceedings of the Fifth International Conference Baltic HLT 2012, volume 247 of Frontiers in AI and Applications, pages 58-65, Amsterdam, Netherlands, October 2012. IOS Press. and This work has been supported by the grant Euro-MatrixPlus (FP7-ICT-2007-3-231720 of the EU and 7E09003 of the Czech Republic)
Rights:: Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0), http://creativecommons.org/licenses/by-nc-sa/3.0/, and PUB

31. Manually Ranked Translation Outputs

Creator:: Bojar, Ondřej and Galuščáková, Petra
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: machine translation, evaluation, and manual ranking
Language:: Slovak and Czech
Description:: Manually ranked outputs of Czech-Slovak translations. Three annotators manually ranked outputs of five MT systems (Česílko, Česílko2, Google Translate and two Moses setups) on three data sets (100 sentences randomly selected from books, 100 sentences randomly selected from Acquis corpus and 50 first sentences from WMT 2010 test set). Ranking was applied in MT systems comparison in [1]. References: [1] Ondřej Bojar, Petra Galuščáková, and Miroslav Týnovský. Evaluating Quality of Machine Translation from Czech to Slovak. In Markéta Lopatková, editor, Information Technologies - Applications and Theory, pages 3-9, September 2011 and This work has been supported by the grant Euro-MatrixPlus (FP7-ICT-2007-3-231720 of the EU and 7E09003 of the Czech Republic)
Rights:: Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0), http://creativecommons.org/licenses/by-nc-sa/3.0/, and PUB

32. Many Czech References for 50 Sentences Selected from WMT11 Data

Creator:: Bojar, Ondřej, Macháček, Matouš, Tamchyna, Aleš, and Zeman, Daniel
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: machine translation, automatic machine translation evaluation, and reference translation
Language:: Czech
Description:: This dataset contains the whole set of very many Czech translations for 50 English source sentences coming from WMT11 test set (http://www.statmt.org/wmt11). In total, there are 15431447 Czech sentences, i.e. 300k reference translations per source English sentence on average, but the exact number greatly varies across sentences. You can find more details in included README file. If you use this dataset, please cite the following paper which describes the technique used to construct the Czech translations: Bojar Ondřej, Macháček Matouš, Tamchyna Aleš, Zeman Daniel: Scratching the Surface of Possible Translations. Lecture Notes in Computer Science, Vol. 8082, Text, Speech and Dialogue: 16th International Conference, TSD 2013. Proceedings, Copyright © Springer Verlag, Berlin / Heidelberg, ISBN 978-3-642-40584-6, ISSN 0302-9743, pp. 465-474, 2013, DOI: 10.1007/978-3-642-40585-3_59 and P406/11/1499 of the Grant Agency of the Czech Republic, FP7-ICT-2011-7-288487 (MosesCore) of the European Union and 1356213 of the Grant Agency of the Charles University
Rights:: Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0), http://creativecommons.org/licenses/by-sa/3.0/, and PUB

33. MCSQ Translation Models (en-de) (v1.0)

Creator:: Variš, Dušan
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: tool and toolService
Subject:: machine translation, transformer, and neural machine translation
Language:: English and German
Description:: En-De translation models, exported via TensorFlow Serving, available in the Lindat translation service (https://lindat.mff.cuni.cz/services/translation/). The models were trained using the MCSQ social surveys dataset (available at https://repo.clarino.uib.no/xmlui/bitstream/handle/11509/142/mcsq_v3.zip). Their main use should be in-domain translation of social surveys. Models are compatible with Tensor2tensor version 1.6.6. For details about the model training (data, model hyper-parameters), please contact the archive maintainer. Evaluation on MCSQ test set (BLEU): en->de: 67.5 (train: genuine in-domain MCSQ data only) de->en: 75.0 (train: additional in-domain backtranslated MCSQ data) (Evaluated using multeval: https://github.com/jhclark/multeval)
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

34. MCSQ Translation Models (en-ru) (v1.0)

Creator:: Variš, Dušan
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: tool and toolService
Subject:: machine translation, neural machine translation, and transformer
Language:: English and Russian
Description:: En-Ru translation models, exported via TensorFlow Serving, available in the Lindat translation service (https://lindat.mff.cuni.cz/services/translation/). The models were trained using the MCSQ social surveys dataset (available at https://repo.clarino.uib.no/xmlui/bitstream/handle/11509/142/mcsq_v3.zip). Their main use should be in-domain translation of social surveys. Models are compatible with Tensor2tensor version 1.6.6. For details about the model training (data, model hyper-parameters), please contact the archive maintainer. Evaluation on MCSQ test set (BLEU): en->ru: 64.3 (train: genuine in-domain MCSQ data) ru->en: 74.7 (train: additional backtranslated in-domain MCSQ data) (Evaluated using multeval: https://github.com/jhclark/multeval)
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

35. MTMonkey

Creator:: Tamchyna, Aleš, Dušek, Ondřej, and Rosa, Rudolf
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: toolService and infrastructure
Subject:: machine translation, distributed computing, web service, and infrastructure
Description:: MTMonkey is a web service which handles and distributes JSON-encoded HTTP requests for machine translation (MT) among multiple machines running an MT system, including text pre- and post processing. It consists of an application server and remote workers which handle text processing and communicate translation requests to MT systems. The communication between the application server and the workers is based on the XML-RPC protocol. and The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement n° 257528 (KHRESMOI). This work has been using language resources developed and/or stored and/or distributed by the LINDAT-Clarin project of the Ministry of Education of the Czech Republic (project LM2010013). This work has been supported by the AMALACH grant (DF12P01OVV02) of the Ministry of Culture of the Czech Republic.
Rights:: Apache License 2.0, http://opensource.org/licenses/Apache-2.0, and PUB

36. OdiEnCorp 2.0

Creator:: Parida, Shantipriya and Bojar, Ondřej
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: parallel corpus, corpus, machine translation, and under-resourced language
Language:: Oriya (macrolanguage) and English
Description:: Data ----- We have collected English-Odia parallel data for the purposes of NLP research of the Odia language. The data for the parallel corpus was extracted from existing parallel corpora such as OdiEnCorp 1.0 and PMIndia, and books which contain both English and Odia text such as grammar and bilingual literature books. We also included parallel text from multiple public websites such as Odia Wikipedia, Odia digital library, and Odisha Government websites. The parallel corpus covers many domains: the Bible, other literature, Wiki data relating to many topics, Government policies, and general conversation. We have processed the raw data collected from the books, websites, performed sentence alignments (a mix of manual and automatic alignments) and released the corpus in a form suitable for various NLP tasks. Corpus Format ------------- OdiEnCorp 2.0 is stored in simple tab-delimited plain text files, each with three tab-delimited columns: - a coarse indication of the domain - the English sentence - the corresponding Odia sentence The corpus is shuffled at the level of sentence pairs. The coarse domains are: books ... prose text dict ... dictionaries and phrasebooks govt ... partially formal text odiencorp10 ... OdiEnCorp 1.0 (mix of domains) pmindia ... PMIndia (the original corpus) wikipedia ... sentences and phrases from Wikipedia Data Statistics --------------- The statistics of the current release are given below. Note that the statistics differ from those reported in the paper due to deduplication at the level of sentence pairs. The deduplication was performed within each of the dev set, test set and training set and taking the coarse domain indication into account. It is still possible that the same sentence pair appears more than once within the same set (dev/test/train) if it came from different domains, and it is also possible that a sentence pair appears in several sets (dev/test/train). Parallel Corpus Statistics -------------------------- Dev Dev Dev Test Test Test Train Train Train Sents # EN # OD Sents # EN # OD Sents # EN # OD books 3523 42011 36723 3895 52808 45383 3129 40461 35300 dict 3342 14580 13838 3437 14807 14110 5900 21591 20246 govt - - - - - - 761 15227 13132 odiencorp10 947 21905 19509 1259 28473 24350 26963 704114 602005 pmindia 3836 70282 61099 3836 68695 59876 30687 551657 486636 wikipedia 1896 9388 9385 1917 21381 20951 1930 7087 7122 Total 13544 158166 140554 14344 186164 164670 69370 1340137 1164441 "Sents" are the counts of the sentence pairs in the given set (dev/test/train) and domain (books/dict/...). "# EN" and "# OD" are approximate counts of words (simply space-delimited, without tokenization) in English and Odia The total number of sentence pairs (lines) is 13544+14344+69370=97258. Ignoring the set and domain and deduplicating again, this number drops to 94857. Citation -------- If you use this corpus, please cite the following paper: @inproceedings{parida2020odiencorp, title={OdiEnCorp 2.0: Odia-English Parallel Corpus for Machine Translation}, author={Parida, Shantipriya and Dash, Satya Ranjan and Bojar, Ond{\v{r}}ej and Motlicek, Petr and Pattnaik, Priyanka and Mallick, Debasish Kumar}, booktitle={Proceedings of the WILDRE5--5th Workshop on Indian Language Data: Resources and Evaluation}, pages={14--19}, year={2020} }
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

37. ParaCrawl Corpus version 1.0

Creator:: Koehn, Philipp, Heafield, Kenneth, Forcada, Mikel L., Esplà-Gomis, Miquel, Ortiz-Rojas, Sergio, Sánchez, Gema Ramírez, Cartagena, Víctor M. Sánchez, Haddow, Barry, Bañón, Marta, Střelec, Marek, Samiotou, Anna, and Kamran, Amir
Publisher:: ParaCrawl
Type:: text and corpus
Subject:: ParaCrawl, parallel corpus, CommonCrawl, machine translation, and text corpora
Language:: English, German, French, Spanish, Italian, Portuguese, Dutch, Polish, Czech, Romanian, Finnish, Latvian, Russian, and Estonian
Description:: The January 2018 release of the ParaCrawl is the first version of the corpus. It contains parallel corpora for 11 languages paired with English, crawled from a large number of web sites. The selection of websites is based on CommonCrawl, but ParaCrawl is extracted from a brand new crawl which has much higher coverage of these selected websites than CommonCrawl. Since the data is fairly raw, it is released with two quality metrics that can be used for corpus filtering. An official "clean" version of each corpus uses one of the metrics. For more details and raw data download please visit: http://paracrawl.eu/releases.html
Rights:: Public Domain Dedication (CC Zero), http://creativecommons.org/publicdomain/zero/1.0/, and PUB

38. Plain-Moses-Chimera

Creator:: Bojar, Ondřej and Tamchyna, Aleš
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: toolService and suiteOfTools
Subject:: moses and machine translation
Language:: English and Czech
Description:: Statistical component of Chimera, a state-of-the-art MT system. and Project DF12P01OVV022 of the Ministry of Culture of the Czech Republic (NAKI -- Amalach).
Rights:: Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0), http://creativecommons.org/licenses/by-nc-sa/3.0/, and PUB

39. Ptakopět data: the dataset for experiments on outbound translation

Creator:: Novák, Michal, Zouhar, Vilém, and Bojar, Ondřej
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: machine translation, interactive, and web forms
Language:: English and Czech
Description:: The dataset used for the Ptakopět experiment on outbound machine translation. It consists of screenshots of web forms with user queries entered. The queries are available also in a text form. The dataset comprises two language versions: English and Czech. Whereas the English version has been fully post-processed (screenshots cropped, queries within the screenshots highlighted, dataset split based on its quality etc.), the Czech version is raw as it was collected by the annotators.
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

40. QT21 Data

Creator:: Specia, Lucia
Publisher:: QT21 project
Type:: text and corpus
Subject:: machine translation, post-editing, error annotation, and mqm
Language:: English, German, Czech, and Latvian
Description:: Post-editing and MQM annotations produced by the QT21 project. As described in @InProceedings{specia-etal_MTSummit:2017, author = {Specia, Lucia and Kim Harris and Frédéric Blain and Aljoscha Burchardt and Viviven Macketanz and Inguna Skadiņa and Matteo Negri and and Marco Turchi}, title = {Translation Quality and Productivity: A Study on Rich Morphology Languages}, booktitle = {Proceedings of Machine Translation Summit XVI}, year = {2017}, pages = {55--71}, address = {Nagoya, Japan}, }
Rights:: AGREEMENT ON THE USE OF DATA IN QT21, https://lindat.mff.cuni.cz/repository/xmlui/page/licence-TAUS_QT21, and PUB

41. Tensor2tensor Translation for Docker

Creator:: Variš, Dušan
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: tool and toolService
Subject:: machine translation, neural machine translation, tensor2tensor, and docker
Description:: This submission contains Dockerfile for creating a Docker image with compiled Tensor2tensor backend with compatible (TensorFlow Serving) models available in the Lindat Translation service (https://lindat.mff.cuni.cz/services/transformer/). Additionally, the submission contains a web frontend for simple in-browser access to the dockerized backend service. Tensor2Tensor (https://github.com/tensorflow/tensor2tensor) is a library of deep learning models and datasets designed to make deep learning more accessible and accelerate ML research.
Rights:: Apache License 2.0, http://opensource.org/licenses/Apache-2.0, and PUB

42. Test Data DE-EN APE Shared Task WMT17

Creator:: Turchi, Marco, Chatterjee, Rajen, and Negri, Marco
Publisher:: Fondazione Bruno Kessler, Trento, Italy
Type:: text and corpus
Subject:: machine translation, shared task, automatic post-editing, and post-editing
Language:: English and German
Description:: Test data for the WMT 2017 Automatic post-editing task (the same used for the Sentence-level Quality Estimation task). They consist in German-English triplets (source and target) belonging to the pharmacological domain and already tokenized. Test set contains 2,000 pairs. All data is provided by the EU project QT21 (http://www.qt21.eu/).
Rights:: AGREEMENT ON THE USE OF DATA IN QT21 APE Task, https://lindat.mff.cuni.cz/repository/xmlui/page/licence-TAUS_QT21, and PUB

43. Test Data EN-DE APE Shared Task WMT17

Creator:: Turchi, Marco, Chatterjee, Rajen, and Negri, Matteo
Publisher:: Fondazione Bruno Kessler, Trento, Italy
Type:: text and corpus
Subject:: machine translation, shared task, automatic post-editing, and post-editing
Language:: English and German
Description:: Test data for the WMT 2017 Automatic post-editing task (the same used for the Sentence-level Quality Estimation task). They consist in 2,000 English-German pairs (source and target) belonging to the IT domain and already tokenized. All data is provided by the EU project QT21 (http://www.qt21.eu/).
Rights:: AGREEMENT ON THE USE OF DATA IN QT21 APE Task, https://lindat.mff.cuni.cz/repository/xmlui/page/licence-TAUS_QT21, and PUB

44. Test Data EN-DE MT_NMT APE Shared Task WMT18

Creator:: Chatterjee, Rajen, Negri, Matteo, and Turchi, Marco
Publisher:: Fondazione Bruno Kessler, Trento, Italy
Type:: text and corpus
Subject:: machine translation, shared task, automatic post-editing, post-editing, and neural machine translation
Language:: English and German
Description:: Test data for the WMT 2018 Automatic post-editing task. They consist in English-German pairs (source and target) belonging to the information technology domain and already tokenized. Test set contains 1,023 pairs. A neural machine translation system has been used to generate the target segments. All data is provided by the EU project QT21 (http://www.qt21.eu/).
Rights:: AGREEMENT ON THE USE OF DATA IN QT21 APE Task, https://lindat.mff.cuni.cz/repository/xmlui/page/licence-TAUS_QT21, and PUB

45. Test Data EN-DE MT_PBSMT APE Shared Task WMT18

Creator:: Turchi, Marco, Negri, Matteo, and Chatterjee, Rajen
Publisher:: Fondazione Bruno Kessler, Trento, Italy
Type:: text and corpus
Subject:: machine translation, shared task, automatic post-editing, post-editing, and phrase-based MT
Language:: English and German
Description:: Test data for the WMT 2018 Automatic post-editing task. They consist in English-German pairs (source and target) belonging to the information technology domain and already tokenized. Test set contains 2,000 pairs. A phrase-based machine translation system has been used to generate the target segments. This test set is sampled from the same dataset used for the 2016 and 2017 APE shared task editions. All data is provided by the EU project QT21 (http://www.qt21.eu/).
Rights:: AGREEMENT ON THE USE OF DATA IN QT21 APE Task, https://lindat.mff.cuni.cz/repository/xmlui/page/licence-TAUS_QT21, and PUB

46. TMODS:ENG-CZE -- query translation

Creator:: Tamchyna, Aleš and Bojar, Ondřej
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: suiteOfTools and toolService
Subject:: machine translation and query translationn
Language:: Czech and English
Description:: AMALACH project component TMODS:ENG-CZE; machine translation of queries from Czech to English. This archive contains models for the Moses decoder (binarized, pruned to allow for real-time translation) and configuration files for the MTMonkey toolkit. The aim of this package is to provide a full service for Czech->English translation which can be easily utilized as a component in a larger software solution. (The required tools are freely available and an installation guide is included in the package.) The translation models were trained on CzEng 1.0 corpus and Europarl. Monolingual data for LM estimation additionally contains WMT news crawls until 2013.
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

47. Translation Models (en-de) (v1.0)

Creator:: Variš, Dušan
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: tool and toolService
Subject:: machine translation, neural machine translation, and transformer
Language:: English and German
Description:: En-De translation models, exported via TensorFlow Serving, available in the Lindat translation service (https://lindat.mff.cuni.cz/services/translation/). Models are compatible with Tensor2tensor version 1.6.6. For details about the model training (data, model hyper-parameters), please contact the archive maintainer. Evaluation on newstest2020 (BLEU): en->de: 25.9 de->en: 33.4 (Evaluated using multeval: https://github.com/jhclark/multeval)
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

48. Translation Models (en-ru) (v1.0)

Creator:: Variš, Dušan
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: tool and toolService
Subject:: machine translation, neural machine translation, and transformer
Language:: English and Russian
Description:: En-Ru translation models, exported via TensorFlow Serving, available in the Lindat translation service (https://lindat.mff.cuni.cz/services/translation/). Models are compatible with Tensor2tensor version 1.6.6. For details about the model training (data, model hyper-parameters), please contact the archive maintainer. Evaluation on newstest2020 (BLEU): en->ru: 18.0 ru->en: 30.4 (Evaluated using multeval: https://github.com/jhclark/multeval)
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

49. UFAL Parallel Corpus of North Levantine 1.0

Creator:: Sellat, Hashem, Saleh, Shadi, Krubiński, Mateusz, Pospíšil, Adam, Zemánek, Petr, and Pecina, Pavel
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: multilingual, machine translation, parallel corpus, north levantine, and corpus
Language:: North Levantine Arabic, English, French, Spanish, Standard Arabic, Modern Greek (1453-), and German
Description:: This is the first release of the UFAL Parallel Corpus of North Levantine, compiled by the Institute of Formal and Applied Linguistics (ÚFAL) at Charles University within the Welcome project (https://welcome-h2020.eu/). The corpus consists of 120,600 multiparallel sentences in English, French, German, Greek, Spanish, and Standard Arabic selected from the OpenSubtitles2018 corpus [1] and manually translated into the North Levantine Arabic language. The corpus was created for the purpose of training machine translation for North Levantine and the other languages.
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

50. Ukrainian War Refugees as Self-Translators Dataset

Creator:: Agapova, Anna and Špačková, Stanislava
Publisher:: Taylor & Francis Online
Type:: TEXT and Spreadsheet
Subject:: machine translation, self-translation, migration, Ukrainian refugees, Russo-Ukrainian war, and Czech Republic
Language:: Ukrainian and Russian
Description:: Data from a questionnaire survey conducted from 2022-08-25 to 2022-11-15 and exploring the use of machine translation by Ukrainian refugees in the Czech Republic. The presented spreadsheet contains minimally processed data exported from the two questionnaires that were created in Google Forms in the Ukrainian and the Russian language. The links to these questionnaires were distributed by three methods: direct email to particular refugees whose contact details the authors obtained while volunteering; through a non-profit organisation helping refugees (Vesna women’s education institution) and on social networks by posting links to the survey in groups associating the Ukrainian community across Czech regions and towns. Since we asked potential respondents to spread the questionnaire further, we could not prevent it from reaching Ukrainians who had arrived in Czechia previously, or received temporary protection in other countries. Due to this fact, the textual answers to the question 1.5 "Which country are you in right now?" were replaced in the dataset by numbers (1 for Czech Republic, 2 for other countries) in order for us to be able to separate the data of respondents not located in the Czech Republic, which were irrelevant for our survey.
Rights:: Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB

51. WMT16 APE Shared Task Data

Creator:: Turchi, Marco, Chatterjee, Rajen, and Negri, Matteo
Publisher:: Fondazione Bruno Kessler, Trento, Italy
Type:: text and corpus
Subject:: machine translation, machine learning, automatic postediting, and shared task
Language:: English and German
Description:: Training, development and text data (the same used for the Sentence-level Quality Estimation task) consist in English-German triplets (source, target and post-edit) belonging to the IT domain and already tokenized. Training and development respectively contain 12,000 and 1,000 triplets, while the test set 2,000 instances. All data is provided by the EU project QT21 (http://www.qt21.eu/).
Rights:: AGREEMENT ON THE USE OF DATA IN QT21 APE Task, https://lindat.mff.cuni.cz/repository/xmlui/page/licence-TAUS_QT21, and PUB

52. WMT16 APE Shared Task Data - Reference sentences

Creator:: Turchi, Marco, Negri, Matteo, and Chatterjee, Rajen
Publisher:: Fondazione Bruno Kessler, Trento, Italy
Type:: text and corpus
Subject:: machine translation, machine learning, automatic post-editing, and shared task
Language:: German
Description:: Training, development and test data consist in German sentences belonging to the IT domain and already tokenized. These sentences are the references of the data released for the 2016 edition of the WMT APE shared task. Differently from the data previously released, these sentences are obtained by manually translating the source sentence without leveraging the raw mt outputs. Training and development respectively contain 12,000 and 1,000 segments, while the test set 2,000 items. All data is provided by the EU project QT21 (http://www.qt21.eu/).
Rights:: AGREEMENT ON THE USE OF DATA IN QT21 APE Task, https://lindat.mff.cuni.cz/repository/xmlui/page/licence-TAUS_QT21, and PUB

53. WMT16 Quality Estimation Shared Task Training and Development Data

Creator:: Specia, Lucia, Logacheva, Varvara, and Scarton, Carolina
Publisher:: University of Sheffield
Type:: text and corpus
Subject:: machine translation, quality estimation, and machine learning
Language:: English and German
Description:: Training and development data for the WMT16 QE task. Test data will be published as a separate item. This shared task will build on its previous four editions to further examine automatic methods for estimating the quality of machine translation output at run-time, without relying on reference translations. We include word-level, sentence-level and document-level estimation. The sentence and word-level tasks will explore a large dataset produced from post-editions by professional translators (as opposed to crowdsourced translations as in the previous year). For the first time, the data will be domain-specific (IT domain). The document-level task will use, for the first time, entire documents, which have been human annotated for quality indirectly in two ways: through reading comprehension tests and through a two-stage post-editing exercise. Our tasks have the following goals: - To advance work on sentence and word-level quality estimation by providing domain-specific, larger and professionally annotated datasets. - To study the utility of detailed information logged during post-editing (time, keystrokes, actual edits) for different levels of prediction. - To analyse the effectiveness of different types of quality labels provided by humans for longer texts in document-level prediction. This year's shared task provides new training and test datasets for all tasks, and allows participants to explore any additional data and resources deemed relevant. A in-house MT system was used to produce translations for the sentence and word-level tasks, and multiple MT systems were used to produce translations for the document-level task. Therefore, MT system-dependent information will be made available where possible.
Rights:: AGREEMENT ON THE USE OF DATA IN QT21, https://lindat.mff.cuni.cz/repository/xmlui/page/licence-TAUS_QT21, and PUB

54. WMT16 Tuning Shared Task Models (Czech-to-English)

Creator:: Kamran, Amir, Jawaid, Bushra, Bojar, Ondřej, and Stanojevic, Milos
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL) and University of Amsterdam, ILLC
Type:: text and corpus
Subject:: WMT16, machine translation, tuning, baseline models, and shared task
Language:: Czech and English
Description:: The item contains models to tune for the WMT16 Tuning shared task for Czech-to-English. CzEng 1.6pre (http://ufal.mff.cuni.cz/czeng/czeng16pre) corpus is used for the training of the translation models. The data is tokenized (using Moses tokenizer), lowercased and sentences longer than 60 words and shorter than 4 words are removed before training. Alignment is done using fast_align (https://github.com/clab/fast_align) and the standard Moses pipeline is used for training. Two 5-gram language models are trained using KenLM: one only using the CzEng English data and the other is trained using all available English mono data for WMT except Common Crawl. Also included are two lexicalized bidirectional reordering models, word based and hierarchical, with msd conditioned on both source and target of processed CzEng.
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

55. WMT16 Tuning Shared Task Models (English-to-Czech)

Creator:: Kamran, Amir, Jawaid, Bushra, Bojar, Ondřej, and Stanojevic, Milos
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL) and University of Amsterdam, ILLC
Type:: text and corpus
Subject:: WMT16, machine translation, tuning, baseline models, and shared task
Language:: English and Czech
Description:: This item contains models to tune for the WMT16 Tuning shared task for English-to-Czech. CzEng 1.6pre (http://ufal.mff.cuni.cz/czeng/czeng16pre) corpus is used for the training of the translation models. The data is tokenized (using Moses tokenizer), lowercased and sentences longer than 60 words and shorter than 4 words are removed before training. Alignment is done using fast_align (https://github.com/clab/fast_align) and the standard Moses pipeline is used for training. Two 5-gram language models are trained using KenLM: one only using the CzEng Czech data and the other is trained using all available Czech mono data for WMT except Common Crawl. Also included are two lexicalized bidirectional reordering models, word based and hierarchical, with msd conditioned on both source and target of processed CzEng.
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

56. WMT17 De-En APE Shared Task Data

Creator:: Turchi, Marco, Chatterjee, Rajen, and Negri, Matteo
Publisher:: Fondazione Bruno Kessler, Trento, Italy
Type:: text and corpus
Subject:: machine translation, shared task, automatic post-editing, and post-editing
Language:: German and English
Description:: Training and development data for the WMT 2017 Automatic post-editing task (the same used for the Sentence-level Quality Estimation task). They consist in German-English triplets (source, target and post-edit) belonging to the pharmacological domain and already tokenized. Training and development respectively contain 25,000 and 1,000 triplets. All data is provided by the EU project QT21 (http://www.qt21.eu/).
Rights:: AGREEMENT ON THE USE OF DATA IN QT21 APE Task, https://lindat.mff.cuni.cz/repository/xmlui/page/licence-TAUS_QT21, and PUB

57. WMT17 En-De APE Shared Task Data

Creator:: Turchi, Marco, Chatterjee, Rajen, and Negri, Matteo
Publisher:: Fondazione Bruno Kessler, Trento, Italy
Type:: text and corpus
Subject:: machine translation, shared task, post-editing, and automatic post-editing
Language:: English and German
Description:: Training data for the WMT 2017 Automatic post-editing task (the same used for the Sentence-level Quality Estimation task). They consist in 11,000 English-German triplets (source, target and post-edit) belonging to the IT domain and already tokenized. All data is provided by the EU project QT21 (http://www.qt21.eu/).
Rights:: AGREEMENT ON THE USE OF DATA IN QT21 APE Task, https://lindat.mff.cuni.cz/repository/xmlui/page/licence-TAUS_QT21, and PUB

58. WMT17 Quality Estimation Shared Task Training and Development Data

Creator:: Specia, Lucia and Logacheva, Varvara
Publisher:: University of Sheffield
Type:: text and corpus
Subject:: machine translation, quality estimation, and machine learning
Language:: English and German
Description:: Training and development data for the WMT17 QE task. Test data will be published as a separate item. This shared task will build on its previous five editions to further examine automatic methods for estimating the quality of machine translation output at run-time, without relying on reference translations. We include word-level, phrase-level and sentence-level estimation. All tasks will make use of a large dataset produced from post-editions by professional translators. The data will be domain-specific (IT and Pharmaceutical domains) and substantially larger than in previous years. In addition to advancing the state of the art at all prediction levels, our goals include: - To test the effectiveness of larger (domain-specific and professionally annotated) datasets. We will do so by increasing the size of one of last year's training sets. - To study the effect of language direction and domain. We will do so by providing two datasets created in similar ways, but for different domains and language directions. - To investigate the utility of detailed information logged during post-editing. We will do so by providing post-editing time, keystrokes, and actual edits. This year's shared task provides new training and test datasets for all tasks, and allows participants to explore any additional data and resources deemed relevant. A in-house MT system was used to produce translations for all tasks. MT system-dependent information can be made available under request. The data is publicly available but since it has been provided by our industry partners it is subject to specific terms and conditions. However, these have no practical implications on the use of this data for research purposes.
Rights:: AGREEMENT ON THE USE OF DATA IN QT21, https://lindat.mff.cuni.cz/repository/xmlui/page/licence-TAUS_QT21, and PUB

59. WMT17 Quality Estimation Shared Test Data

Creator:: Specia, Lucia and Logacheva, Varvara
Publisher:: University of Sheffield
Type:: text and corpus
Subject:: machine translation, quality estimation, and machine learning
Language:: English and German
Description:: Test data for the WMT17 QE task. Train data can be downloaded from http://hdl.handle.net/11372/LRT-1974 This shared task will build on its previous five editions to further examine automatic methods for estimating the quality of machine translation output at run-time, without relying on reference translations. We include word-level, phrase-level and sentence-level estimation. All tasks will make use of a large dataset produced from post-editions by professional translators. The data will be domain-specific (IT and Pharmaceutical domains) and substantially larger than in previous years. In addition to advancing the state of the art at all prediction levels, our goals include: - To test the effectiveness of larger (domain-specific and professionally annotated) datasets. We will do so by increasing the size of one of last year's training sets. - To study the effect of language direction and domain. We will do so by providing two datasets created in similar ways, but for different domains and language directions. - To investigate the utility of detailed information logged during post-editing. We will do so by providing post-editing time, keystrokes, and actual edits. This year's shared task provides new training and test datasets for all tasks, and allows participants to explore any additional data and resources deemed relevant. A in-house MT system was used to produce translations for all tasks. MT system-dependent information can be made available under request. The data is publicly available but since it has been provided by our industry partners it is subject to specific terms and conditions. However, these have no practical implications on the use of this data for research purposes.
Rights:: AGREEMENT ON THE USE OF DATA IN QT21, https://lindat.mff.cuni.cz/repository/xmlui/page/licence-TAUS_QT21, and PUB

60. WMT18 APE Shared Task: En-DE NMT Train and Dev Data

Creator:: Turchi, Marco, Negri, Matteo, and Chatterjee, Rajen
Publisher:: Fondazione Bruno Kessler, Trento, Italy
Type:: text and corpus
Subject:: machine translation, shared task, post-editing, and automatic post-editing
Language:: English and German
Description:: Training and development data for the WMT 2018 Automatic post-editing task. They consist in English-German triplets (source, target and post-edit) belonging to the information technology domain and already tokenized. Training and development respectively contain 13,442 and 1,000 triplets. A neural machine translation system has been used to generate the target segments. All data is provided by the EU project QT21 (http://www.qt21.eu/).
Rights:: AGREEMENT ON THE USE OF DATA IN QT21 APE Task, https://lindat.mff.cuni.cz/repository/xmlui/page/licence-TAUS_QT21, and PUB

61. WMT18 Quality Estimation Shared Task Test Data

Creator:: Specia, Lucia, Logacheva, Varvara, Blain, Frederic, Fernandez, Ramon, and Martins, André
Publisher:: University of Sheffield
Type:: text and corpus
Subject:: machine translation, quality estimation, and machine learning
Language:: English, German, Czech, and Latvian
Description:: Test data for the WMT18 QE task. Train data can be downloaded from http://hdl.handle.net/11372/LRT-2619. This shared task will build on its previous six editions to further examine automatic methods for estimating the quality of machine translation output at run-time, without relying on reference translations. We include word-level, phrase-level and sentence-level estimation. All tasks make use of datasets produced from post-editions by professional translators. The datasets are domain-specific (IT and life sciences/pharma domains) and extend from those used previous years with more instances and more languages. One important addition is that this year we also include datasets with neural MT outputs. In addition to advancing the state of the art at all prediction levels, our specific goals are: To study the performance of quality estimation approaches on the output of neural MT systems. We will do so by providing datasets for two language language pairs where the same source segments are translated by both a statistical phrase-based and a neural MT system. To study the predictability of deleted words, i.e. words that are missing in the MT output. TO do so, for the first time we provide data annotated for such errors at training time. To study the effectiveness of explicitly assigned labels for phrases. We will do so by providing a dataset where each phrase in the output of a phrase-based statistical MT system was annotated by human translators. To study the effect of different language pairs. We will do so by providing datasets created in similar ways for four language language pairs. To investigate the utility of detailed information logged during post-editing. We will do so by providing post-editing time, keystrokes, and actual edits. Measure progress over years at all prediction levels. We will do so by using last year's test set for comparative experiments. In-house statistical and neural MT systems were built to produce translations for all tasks. MT system-dependent information can be made available under request. The data is publicly available but since it has been provided by our industry partners it is subject to specific terms and conditions. However, these have no practical implications on the use of this data for research purposes. Participants are allowed to explore any additional data and resources deemed relevant.
Rights:: AGREEMENT ON THE USE OF DATA IN QT21, https://lindat.mff.cuni.cz/repository/xmlui/page/licence-TAUS_QT21, and PUB

62. WMT18 Quality Estimation Shared Task Training and Development Data

Creator:: Specia, Lucia, Logacheva, Varvara, Blain, Frederic, Fernandez, Ramon, and Martins, André
Publisher:: University of Sheffield
Type:: text and corpus
Subject:: machine translation, quality estimation, and machine learning
Language:: English, German, Czech, and Latvian
Description:: Training and development data for the WMT18 QE task. Test data will be published as a separate item. This shared task will build on its previous six editions to further examine automatic methods for estimating the quality of machine translation output at run-time, without relying on reference translations. We include word-level, phrase-level and sentence-level estimation. All tasks make use of datasets produced from post-editions by professional translators. The datasets are domain-specific (IT and life sciences/pharma domains) and extend from those used previous years with more instances and more languages. One important addition is that this year we also include datasets with neural MT outputs. In addition to advancing the state of the art at all prediction levels, our specific goals are: To study the performance of quality estimation approaches on the output of neural MT systems. We will do so by providing datasets for two language language pairs where the same source segments are translated by both a statistical phrase-based and a neural MT system. To study the predictability of deleted words, i.e. words that are missing in the MT output. TO do so, for the first time we provide data annotated for such errors at training time. To study the effectiveness of explicitly assigned labels for phrases. We will do so by providing a dataset where each phrase in the output of a phrase-based statistical MT system was annotated by human translators. To study the effect of different language pairs. We will do so by providing datasets created in similar ways for four language language pairs. To investigate the utility of detailed information logged during post-editing. We will do so by providing post-editing time, keystrokes, and actual edits. Measure progress over years at all prediction levels. We will do so by using last year's test set for comparative experiments. In-house statistical and neural MT systems were built to produce translations for all tasks. MT system-dependent information can be made available under request. The data is publicly available but since it has been provided by our industry partners it is subject to specific terms and conditions. However, these have no practical implications on the use of this data for research purposes. Participants are allowed to explore any additional data and resources deemed relevant.
Rights:: AGREEMENT ON THE USE OF DATA IN QT21, https://lindat.mff.cuni.cz/repository/xmlui/page/licence-TAUS_QT21, and PUB

63. WMT21 Marian translation model (ca-oc multi-task)

Creator:: Novák, Michal and Jon, Josef
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: other and toolService
Subject:: neural machine translation, machine translation, grapheme-to-phoneme conversion, and multi-task model
Language:: Catalan and Occitan (post 1500)
Description:: Marian NMT model for Catalan to Occitan translation. It is a multi-task model, producing also a phonemic transcription of the Catalan source. The model was submitted to WMT'21 Shared Task on Multilingual Low-Resource Translation for Indo-European Languages as a CUNI-Contrastive system for Catalan to Occitan.
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

64. WMT21 Marian translation model (ca-oc)

Creator:: Jon, Josef
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: other and toolService
Subject:: machine translation and neural machine translation
Language:: Catalan and Occitan (post 1500)
Description:: Marian NMT model for Catalan to Occitan translation. Primary CUNI submission for WMT21 Multilingual Low-Resource Translation for Indo-European Languages Shared Task.
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

Search

Search Constraints

Search Results

Limit your search

Contributor

Show values starting with

Creator

Show values starting with

Language

Show values starting with

Publisher

Rights

Show values starting with

Subject

Show values starting with

Type

Show values starting with

Date

Original context has metadata only

Harvested from