This data set contains four types of manual annotation of translation quality, focusing on the comparison of human and machine translation quality (aka human-parity). The machine translation system used is English-Czech CUNI Transformer (CUBBITT). The annotations distinguish adequacy, fluency and overall quality. One of the types is Translation Turing test - detecting whether the annotators can distinguish human from machine translation.
All the sentences are taken from the English-Czech test set newstest2018 (WMT2018 News translation shared task www.statmt.org/wmt18/translation-task.html), but only from the half with originally English sentences translated to Czech by a professional agency.
Manual classification of errors of Czech-Slovak translation according to the classification introduced by Vilar et al. [1]. First 50 sentences from WMT 2010 test set were translated by 5 MT systems (Česílko, Česílko2, Google Translate and two Moses setups) and MT errors were manually marked and classified. Classification was applied in MT systems comparison [3]. Reference translation is included.
References:
[1] David Vilar, Jia Xu, Luis Fernando D’Haro and Hermann Ney. Error Analysis of Machine Translation Output. In International Conference on Language Resources and Evaluation, pages 697-702. Genoa, Italy, May 2006.
[2] http://matrix.statmt.org/test_sets/list
[3] Ondřej Bojar, Petra Galuščáková, and Miroslav Týnovský. Evaluating Quality of Machine Translation from Czech to Slovak. In Markéta Lopatková, editor, Information Technologies - Applications and Theory, pages 3-9, September 2011 and This work has been supported by the grants Euro-MatrixPlus (FP7-ICT-2007-3-231720 of the EU and
7E09003 of the Czech Republic)
Manual classification of errors of English-Slovak translation according to the classification introduced by Vilar et al. [1]. 50 sentences randomly selected from WMT 2011 test set [2] were translated by 3 MT systems described in [3] and MT errors were manually marked and classified. Reference translation is included.
References:
[1] David Vilar, Jia Xu, Luis Fernando D’Haro and Hermann Ney. Error Analysis of Machine Translation Output. In International Conference on Language Resources and Evaluation, pages 697-702. Genoa, Italy, May 2006.
[2] http://www.statmt.org/wmt11/evaluation-task.html
[3] Petra Galuščáková and Ondřej Bojar. Improving SMT by Using Parallel Data of a Closely Related Language. In Human Language Technologies - The Baltic Perspective - Proceedings of the Fifth International Conference Baltic HLT 2012, volume 247 of Frontiers in AI and Applications, pages 58-65, Amsterdam, Netherlands, October 2012. IOS Press. and This work has been supported by the grant Euro-MatrixPlus (FP7-ICT-2007-3-231720 of the EU and
7E09003 of the Czech Republic)
Manually ranked outputs of Czech-Slovak translations. Three annotators manually ranked outputs of five MT systems (Česílko, Česílko2, Google Translate and two Moses setups) on three data sets (100 sentences randomly selected from books, 100 sentences randomly selected from Acquis corpus and 50 first sentences from WMT 2010 test set). Ranking was applied in MT systems comparison in [1].
References:
[1] Ondřej Bojar, Petra Galuščáková, and Miroslav Týnovský. Evaluating Quality of Machine Translation from Czech to Slovak. In Markéta Lopatková, editor, Information Technologies - Applications and Theory, pages 3-9, September 2011 and This work has been supported by the grant Euro-MatrixPlus (FP7-ICT-2007-3-231720 of the EU and
7E09003 of the Czech Republic)
En-De translation models, exported via TensorFlow Serving, available in the Lindat translation service (https://lindat.mff.cuni.cz/services/translation/).
The models were trained using the MCSQ social surveys dataset (available at https://repo.clarino.uib.no/xmlui/bitstream/handle/11509/142/mcsq_v3.zip).
Their main use should be in-domain translation of social surveys.
Models are compatible with Tensor2tensor version 1.6.6.
For details about the model training (data, model hyper-parameters), please contact the archive maintainer.
Evaluation on MCSQ test set (BLEU):
en->de: 67.5 (train: genuine in-domain MCSQ data only)
de->en: 75.0 (train: additional in-domain backtranslated MCSQ data)
(Evaluated using multeval: https://github.com/jhclark/multeval)
En-Ru translation models, exported via TensorFlow Serving, available in the Lindat translation service (https://lindat.mff.cuni.cz/services/translation/).
The models were trained using the MCSQ social surveys dataset (available at https://repo.clarino.uib.no/xmlui/bitstream/handle/11509/142/mcsq_v3.zip).
Their main use should be in-domain translation of social surveys.
Models are compatible with Tensor2tensor version 1.6.6.
For details about the model training (data, model hyper-parameters), please contact the archive maintainer.
Evaluation on MCSQ test set (BLEU):
en->ru: 64.3 (train: genuine in-domain MCSQ data)
ru->en: 74.7 (train: additional backtranslated in-domain MCSQ data)
(Evaluated using multeval: https://github.com/jhclark/multeval)
MTMonkey is a web service which handles and distributes JSON-encoded HTTP requests for machine translation (MT) among multiple machines running an MT system, including text pre- and post processing.
It consists of an application server and remote workers which handle text processing and communicate translation requests to MT systems. The communication between the application server and the workers is based on the XML-RPC protocol. and The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement n° 257528 (KHRESMOI). This work has been using language resources developed and/or stored and/or distributed by the LINDAT-Clarin project of the Ministry of Education of the Czech Republic (project LM2010013). This work has been supported by the AMALACH grant (DF12P01OVV02) of the Ministry of Culture of the Czech Republic.
Data
-----
We have collected English-Odia parallel data for the purposes of NLP
research of the Odia language.
The data for the parallel corpus was extracted from existing parallel
corpora such as OdiEnCorp 1.0 and PMIndia, and books which contain both
English and Odia text such as grammar and bilingual literature books. We
also included parallel text from multiple public websites such as Odia
Wikipedia, Odia digital library, and Odisha Government websites.
The parallel corpus covers many domains: the Bible, other literature,
Wiki data relating to many topics, Government policies, and general
conversation. We have processed the raw data collected from the books,
websites, performed sentence alignments (a mix of manual and automatic
alignments) and released the corpus in a form suitable for various NLP
tasks.
Corpus Format
-------------
OdiEnCorp 2.0 is stored in simple tab-delimited plain text files, each
with three tab-delimited columns:
- a coarse indication of the domain
- the English sentence
- the corresponding Odia sentence
The corpus is shuffled at the level of sentence pairs.
The coarse domains are:
books ... prose text
dict ... dictionaries and phrasebooks
govt ... partially formal text
odiencorp10 ... OdiEnCorp 1.0 (mix of domains)
pmindia ... PMIndia (the original corpus)
wikipedia ... sentences and phrases from Wikipedia
Data Statistics
---------------
The statistics of the current release are given below.
Note that the statistics differ from those reported in the paper due to
deduplication at the level of sentence pairs. The deduplication was
performed within each of the dev set, test set and training set and
taking the coarse domain indication into account. It is still possible
that the same sentence pair appears more than once within the same set
(dev/test/train) if it came from different domains, and it is also
possible that a sentence pair appears in several sets (dev/test/train).
Parallel Corpus Statistics
--------------------------
Dev Dev Dev Test Test Test Train Train Train
Sents # EN # OD Sents # EN # OD Sents # EN # OD
books 3523 42011 36723 3895 52808 45383 3129 40461 35300
dict 3342 14580 13838 3437 14807 14110 5900 21591 20246
govt - - - - - - 761 15227 13132
odiencorp10 947 21905 19509 1259 28473 24350 26963 704114 602005
pmindia 3836 70282 61099 3836 68695 59876 30687 551657 486636
wikipedia 1896 9388 9385 1917 21381 20951 1930 7087 7122
Total 13544 158166 140554 14344 186164 164670 69370 1340137 1164441
"Sents" are the counts of the sentence pairs in the given set (dev/test/train)
and domain (books/dict/...).
"# EN" and "# OD" are approximate counts of words (simply space-delimited,
without tokenization) in English and Odia
The total number of sentence pairs (lines) is 13544+14344+69370=97258. Ignoring
the set and domain and deduplicating again, this number drops to 94857.
Citation
--------
If you use this corpus, please cite the following paper:
@inproceedings{parida2020odiencorp,
title={OdiEnCorp 2.0: Odia-English Parallel Corpus for Machine Translation},
author={Parida, Shantipriya and Dash, Satya Ranjan and Bojar, Ond{\v{r}}ej and Motlicek, Petr and Pattnaik, Priyanka and Mallick, Debasish Kumar},
booktitle={Proceedings of the WILDRE5--5th Workshop on Indian Language Data: Resources and Evaluation},
pages={14--19},
year={2020}
}