dc.contributor.author Specia, Lucia
dc.contributor.author Logacheva, Varvara
dc.contributor.author Blain, Frederic
dc.contributor.author Fernandez, Ramon
dc.contributor.author Martins, André
dc.date.accessioned 2018-02-19T13:59:24Z
dc.date.available 2018-02-19T13:59:24Z
dc.date.issued 2018-02-19
dc.identifier.uri http://hdl.handle.net/11372/LRT-2619
dc.description Training and development data for the WMT18 QE task. Test data will be published as a separate item. This shared task will build on its previous six editions to further examine automatic methods for estimating the quality of machine translation output at run-time, without relying on reference translations. We include word-level, phrase-level and sentence-level estimation. All tasks make use of datasets produced from post-editions by professional translators. The datasets are domain-specific (IT and life sciences/pharma domains) and extend those used in previous years with more instances and more languages. One important addition is that this year we also include datasets with neural MT outputs. In addition to advancing the state of the art at all prediction levels, our specific goals are:
- To study the performance of quality estimation approaches on the output of neural MT systems. We will do so by providing datasets for two language pairs where the same source segments are translated by both a statistical phrase-based and a neural MT system.
- To study the predictability of deleted words, i.e. words that are missing in the MT output. To do so, for the first time we provide data annotated for such errors at training time.
- To study the effectiveness of explicitly assigned labels for phrases. We will do so by providing a dataset where each phrase in the output of a phrase-based statistical MT system was annotated by human translators.
- To study the effect of different language pairs. We will do so by providing datasets created in similar ways for four language pairs.
- To investigate the utility of detailed information logged during post-editing. We will do so by providing post-editing time, keystrokes, and actual edits.
- To measure progress over the years at all prediction levels. We will do so by using last year's test set for comparative experiments.
In-house statistical and neural MT systems were built to produce translations for all tasks.
MT system-dependent information can be made available upon request. The data is publicly available, but since it has been provided by our industry partners, it is subject to specific terms and conditions. However, these have no practical implications for the use of this data for research purposes. Participants are allowed to explore any additional data and resources deemed relevant.
dc.language.iso eng
dc.language.iso deu
dc.language.iso ces
dc.language.iso lav
dc.publisher University of Sheffield
dc.relation info:eu-repo/grantAgreement/EC/H2020/645452
dc.relation.replaces http://hdl.handle.net/11372/LRT-1974
dc.rights AGREEMENT ON THE USE OF DATA IN QT21
dc.rights.uri https://lindat.mff.cuni.cz/repository/xmlui/page/licence-TAUS_QT21
dc.source.uri http://www.statmt.org/wmt18/quality-estimation-task.html
dc.subject machine translation
dc.subject quality estimation
dc.subject machine learning
dc.title WMT18 Quality Estimation Shared Task Training and Development Data
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
dc.rights.label PUB
hidden false
hasMetadata false
has.files yes
branding LRT + Open Submissions
contact.person Lucia Specia l.specia@sheffield.ac.uk University of Sheffield
sponsor European Union H2020-ICT-2014-1-645452 QT21: Quality Translation 21 euFunds info:eu-repo/grantAgreement/EC/H2020/645452
files.size 47701641
files.count 4


Files in this item

Download all files in this item (45.49 MB)
Licence category: Publicly Available

Licence: AGREEMENT ON THE USE OF DATA IN QT21
Name: sentence_level_training.tar.gz
Size: 20.68 MB
Format: application/x-gzip
Description: Sentence-level Quality Estimation training data
MD5: bc5be3cd7950fb22ec728ce92ec8844d

Name: word_level_training.tar.gz
Size: 22.91 MB
Format: application/x-gzip
Description: Word-level Quality Estimation training data
MD5: d85f9eb198248a732c10fa645bf2fa08

Name: phrase_level_training.tar.gz
Size: 1.9 MB
Format: application/x-gzip
Description: Phrase-level Quality Estimation training data
MD5: 796069b684b6b8b7a51d08317b38f342

Name: phrase_level_training.tar.gz
Size: 360 bytes
Format: application/x-gzip
Description: Phrase-level Quality Estimation training data - version 2
MD5: 317ee0f9c898bfbded13c2d49506257e
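
The MD5 values listed for each file can be used to check a download for corruption. The following is a minimal sketch in Python, not part of the original record: the names EXPECTED_MD5, md5_of and verify are hypothetical helpers, and it assumes the archives have been saved locally under the file names shown in the list. Because two entries in the list share the name phrase_level_training.tar.gz, the sketch accepts either listed digest for that name.

```python
import hashlib
import sys
from pathlib import Path

# MD5 digests copied from the file list in this record. Two entries share the
# name phrase_level_training.tar.gz, so each name maps to a set of digests.
EXPECTED_MD5 = {
    "sentence_level_training.tar.gz": {"bc5be3cd7950fb22ec728ce92ec8844d"},
    "word_level_training.tar.gz": {"d85f9eb198248a732c10fa645bf2fa08"},
    "phrase_level_training.tar.gz": {
        "796069b684b6b8b7a51d08317b38f342",
        "317ee0f9c898bfbded13c2d49506257e",
    },
}


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path):
    """Return True if the file's MD5 matches a digest listed for its name."""
    path = Path(path)
    accepted = EXPECTED_MD5.get(path.name)
    if accepted is None:
        raise KeyError(f"no MD5 listed in this record for {path.name}")
    return md5_of(path) in accepted


if __name__ == "__main__":
    # Usage (hypothetical script name): python verify_md5.py *.tar.gz
    for name in sys.argv[1:]:
        status = "OK" if verify(name) else "MISMATCH"
        print(f"{name}: {status}")
```

A MISMATCH result most likely indicates an incomplete or corrupted download, in which case the file should be fetched again.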
