CUBBITT En-Cs translation models, exported via TensorFlow Serving, available in the Lindat translation service (https://lindat.mff.cuni.cz/services/translation/).
Models are compatible with Tensor2Tensor version 1.6.6.
For details about the model training (data, model hyper-parameters), please contact the archive maintainer.
Evaluation on newstest2014 (BLEU):
en->cs: 27.6
cs->en: 34.4
(Evaluated using multeval: https://github.com/jhclark/multeval)
CUBBITT En-Fr translation models, exported via TensorFlow Serving, available in the Lindat translation service (https://lindat.mff.cuni.cz/services/translation/).
Models are compatible with Tensor2Tensor version 1.6.6.
For details about the model training (data, model hyper-parameters), please contact the archive maintainer.
Evaluation on newstest2014 (BLEU):
en->fr: 38.2
fr->en: 36.7
(Evaluated using multeval: https://github.com/jhclark/multeval)
CUBBITT En-Pl translation models, exported via TensorFlow Serving, available in the Lindat translation service (https://lindat.mff.cuni.cz/services/translation/).
Models are compatible with Tensor2Tensor version 1.6.6.
For details about the model training (data, model hyper-parameters), please contact the archive maintainer.
Evaluation on newstest2020 (BLEU):
en->pl: 12.3
pl->en: 20.0
(Evaluated using multeval: https://github.com/jhclark/multeval)
CzEng 1.0 is the fourth release of a sentence-parallel Czech-English corpus compiled at the Institute of Formal and Applied Linguistics (ÚFAL). It is freely available for non-commercial research purposes.
CzEng 1.0 contains 15 million parallel sentences (233 million English and 206 million Czech tokens) from seven different types of sources, automatically annotated at surface and deep (a- and t-) layers of syntactic representation.
EuroMatrix Plus (FP7-ICT-2007-3-231720 of the EU and 7E09003+7E11051 of the Ministry of Education, Youth and Sports of the Czech Republic),
Faust (FP7-ICT-2009-4-247762 of the EU and 7E11041 of the Ministry of Education, Youth and Sports of the Czech Republic),
GAČR P406/10/P259,
GAUK 116310,
GAUK 4226/2011
We define "optimal reference translation" as a translation thought to be the best possible that can be achieved by a team of human translators. Optimal reference translations can be used in assessments of excellent machine translations.
We selected 50 documents (online news articles, with 579 paragraphs in total) from the 130 English documents included in the WMT2020 news test (http://www.statmt.org/wmt20/), aiming to preserve the diversity (style, genre, etc.) of the selection. In addition to the official Czech reference translation provided by the WMT organizers (P1), we hired two additional translators (P2 and P3, native Czech speakers) via a professional translation agency, resulting in three independent translations. The main contribution of this dataset is two additional translations (i.e. the optimal reference translations N1 and N2), produced jointly by two translators-cum-theoreticians with extreme care for various aspects of translation quality, while taking into account the translations P1-P3. We also publish internal comments (in Czech) for some of the segments.
Translation N1 should be closer to the English original (with regard to meaning and linguistic structure), and female surnames use the Czech feminine suffix (e.g. "Mai" is translated as "Maiová"). Translation N2 is freer: it tries to be more creative, idiomatic and entertaining for readers and follows the typical style used in Czech media, while still preserving the rules of functional equivalence. Translation N2 is missing for the segments where it was not deemed necessary to provide two alternative translations. For applications or analyses that need a translation of every segment, a missing N2 should be interpreted as identical to N1 for that segment.
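The fallback for missing N2 segments can be sketched as follows. This is a minimal illustration only: the function name `fill_missing_n2` and the representation of a missing segment as `None` or a whitespace-only string are assumptions, not part of the dataset.

```python
from typing import List, Optional


def fill_missing_n2(n1: List[str], n2: List[Optional[str]]) -> List[str]:
    """Return N2 with every missing segment replaced by the aligned N1 segment."""
    if len(n1) != len(n2):
        raise ValueError("N1 and N2 must have the same number of segments")
    # Assumption: a missing N2 segment is None or contains only whitespace.
    return [
        seg2 if seg2 is not None and seg2.strip() else seg1
        for seg1, seg2 in zip(n1, n2)
    ]


# Toy segments (not taken from the dataset).
n1 = ["segment A", "segment B", "segment C"]
n2 = ["freer A", None, "  "]
filled = fill_missing_n2(n1, n2)
```

Here `filled` keeps the N2 version of the first segment and falls back to N1 for the other two.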
We provide the dataset in two formats: OpenDocument spreadsheet (odt) and plain text (one file for each translation and the English original). Some words were highlighted using different colors during the creation of optimal reference translations; this highlighting and comments are present only in the odt format (some comments refer to row numbers in the odt file). Documents are separated by empty lines and each document starts with a special line containing the document name (e.g. "# upi.205735"), which allows alignment with the original WMT2020 news test. For the segments where N2 translations are missing in the odt format, the respective N1 segments are used instead in the plain-text format.
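A minimal sketch of reading the plain-text layout described above (documents separated by empty lines, each starting with a "# name" line). The function name `read_documents` is hypothetical, and the sketch assumes that every non-header line holds exactly one segment, which the description does not state explicitly.

```python
from typing import Dict, Iterable, List, Optional


def read_documents(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Group segment lines by the document name given in each '# name' header."""
    docs: Dict[str, List[str]] = {}
    current: Optional[str] = None
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            current = None  # an empty line closes the current document
            continue
        if current is None:
            # The first non-empty line of a document must be its header.
            if not line.startswith("# "):
                raise ValueError(f"expected a '# name' header, got: {line!r}")
            current = line[2:].strip()
            docs[current] = []
        else:
            # Assumption: one segment per line.
            docs[current].append(line)
    return docs


# Toy input; "doc.2" is a made-up document name.
sample = [
    "# upi.205735\n",
    "First segment.\n",
    "Second segment.\n",
    "\n",
    "# doc.2\n",
    "Only segment.\n",
]
docs = read_documents(sample)
```

Because every translation file uses the same document names, the resulting dictionaries can be aligned across files, and with the WMT2020 news test, by key.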
This corpus contains annotations of English-to-Czech translation quality in seven categories, at both the segment and the document level. There are 20 documents in total, each with 4 translations (evaluated by each annotator in parallel) of 8 segments (a segment can be longer than one sentence). Apart from the evaluation, the annotators also proposed their own, improved versions of the translations.
There were 11 annotators in total, with expertise levels ranging from non-experts to professional translators.