dc.contributor.author Kamran, Amir
dc.contributor.author Jawaid, Bushra
dc.contributor.author Bojar, Ondřej
dc.contributor.author Stanojevic, Milos
dc.date.accessioned 2016-03-22T12:33:39Z
dc.date.available 2016-03-22T12:33:39Z
dc.date.issued 2016-03-21
dc.identifier.uri http://hdl.handle.net/11372/LRT-1672
dc.description This item contains models to tune for the WMT16 Tuning shared task for English-to-Czech. The CzEng 1.6pre corpus (http://ufal.mff.cuni.cz/czeng/czeng16pre) is used to train the translation models. The data are tokenized (using the Moses tokenizer) and lowercased, and sentence pairs with a side longer than 60 words or shorter than 4 words are removed before training. Word alignment is done with fast_align (https://github.com/clab/fast_align), and the standard Moses pipeline is used for training. Two 5-gram language models are trained with KenLM: one uses only the Czech side of CzEng, the other all Czech monolingual data available for WMT except Common Crawl. Also included are two lexicalized bidirectional reordering models, word-based and hierarchical, with msd orientation conditioned on both source and target, trained on the processed CzEng data. (A command-level sketch of this pipeline is given below the metadata fields; the included Makefile is the authoritative recipe.)
dc.language.iso eng
dc.language.iso ces
dc.publisher Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
dc.publisher University of Amsterdam, ILLC
dc.relation info:eu-repo/grantAgreement/EC/H2020/645452
dc.rights Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
dc.rights.uri http://creativecommons.org/licenses/by-nc-sa/4.0/
dc.source.uri http://www.statmt.org/wmt16/tuning-task/
dc.subject WMT16
dc.subject machine translation
dc.subject tuning
dc.subject baseline models
dc.subject shared task
dc.title WMT16 Tuning Shared Task Models (English-to-Czech)
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
dc.rights.label PUB
has.files yes
branding LRT + Open Submissions
contact.person Amir Kamran amirkamran@msn.com University of Amsterdam, ILLC
sponsor European Union H2020-ICT-2014-1-645452 QT21: Quality Translation 21 euFunds info:eu-repo/grantAgreement/EC/H2020/645452
sponsor Technology Foundation STW 12271 Data-Powered Domain-Specific Translation Services On Demand (DatAptor) Other
files.size 70667353277
files.count 5
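As referenced in the description above, the following is a minimal command-level sketch of the preprocessing, alignment, and training steps; it is not the exact recipe used for this item. $MOSES is assumed to point to a mosesdecoder checkout, file names such as czeng.en/czeng.cs are placeholders, and the included Makefile remains the authoritative description of the actual pipeline.

    # Tokenize and lowercase both sides of the parallel data (placeholder file names).
    $MOSES/scripts/tokenizer/tokenizer.perl -l en < czeng.en > czeng.tok.en
    $MOSES/scripts/tokenizer/tokenizer.perl -l cs < czeng.cs > czeng.tok.cs
    $MOSES/scripts/tokenizer/lowercase.perl < czeng.tok.en > czeng.lc.en
    $MOSES/scripts/tokenizer/lowercase.perl < czeng.tok.cs > czeng.lc.cs

    # Drop sentence pairs with a side shorter than 4 or longer than 60 tokens.
    $MOSES/scripts/training/clean-corpus-n.perl czeng.lc en cs czeng.clean 4 60

    # Align with fast_align in both directions and symmetrize.
    paste czeng.clean.en czeng.clean.cs | sed 's/\t/ ||| /' > czeng.en-cs
    fast_align -i czeng.en-cs -d -o -v    > fwd.align
    fast_align -i czeng.en-cs -d -o -v -r > rev.align
    atools -i fwd.align -j rev.align -c grow-diag-final-and > aligned.grow-diag-final-and

    # Standard Moses training then extracts the phrase table and the two
    # lexicalized reordering models (word-based and hierarchical, msd,
    # bidirectional, conditioned on source and target). Using -first-step 4
    # assumes the symmetrized alignment has been placed where train-model.perl
    # expects it under the -root-dir.
    $MOSES/scripts/training/train-model.perl -root-dir train \
        -corpus czeng.clean -f en -e cs -first-step 4 \
        -alignment grow-diag-final-and \
        -reordering wbe-msd-bidirectional-fe,hier-msd-bidirectional-fe \
        -lm 0:5:$PWD/wmt16.czeng.blm.cs:8 -lm 0:5:$PWD/wmt16.mono.blm.cs:8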


 Files in this item

Name: en2cs_model.tgz
Size: 37.56 GB
Format: application/x-gzip
Description: Contains the lexical models (lex.e2f and lex.f2e), the phrase table (phrase-table.gz), the word-based reordering model (reordering-table.wbe-msd-bidirectional-fe.gz), the hierarchical reordering model (reordering-table.hier-msd-bidirectional-fe.gz), and the moses.ini file.
MD5: bed88bbeef3afc454c3f02845ab72769
File preview (archive contents):
    • reordering-table.wbe-msd-bidirectional-fe.gz (7 GB)
    • moses.ini (1 kB)
    • lex.f2e (668 MB)
    • reordering-table.hier-msd-bidirectional-fe.gz (7 GB)
    • lex.e2f (668 MB)
    • phrase-table.gz (21 GB)
Name: wmt16.czeng.blm.cs.tgz
Size: 9.28 GB
Format: application/x-gzip
Description: KenLM 5-gram language model (binarized), trained only on the Czech side of the CzEng parallel data used for the translation models.
MD5: de338fd4ba04b82631aab9488c468cd6
File preview (archive contents):
    • wmt16.czeng.blm.cs (15 GB)
Name: wmt16.mono.blm.cs.tgz
Size: 18.98 GB
Format: application/x-gzip
Description: KenLM 5-gram language model (binarized), trained on all Czech monolingual data available for WMT except Common Crawl (see the Makefile for details of the monolingual data used). A sketch of typical KenLM build commands follows this entry.
MD5: 6347aa8e420db60142cb1384ea1cab0d
File preview (archive contents):
    • wmt16.mono.blm.cs (32 GB)
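A minimal sketch of how 5-gram KenLM models like the two above are typically built and binarized; input file names are placeholders, and the exact data selection and lmplz/build_binary options used for this item are documented in the included Makefile.

    # Estimate a 5-gram LM (lmplz uses interpolated modified Kneser-Ney smoothing)
    # and binarize it for fast loading in Moses.
    lmplz -o 5 < czeng.lc.cs    > czeng.cs.arpa
    build_binary czeng.cs.arpa    wmt16.czeng.blm.cs

    lmplz -o 5 < wmt.mono.lc.cs > mono.cs.arpa
    build_binary mono.cs.arpa     wmt16.mono.blm.cs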
Name: Makefile
Size: 16.96 KB
Format: Unknown
Description: The models can be recreated using this Makefile.
MD5: 5f56434491ccb9591c35d8fe20fb8aa9
Name: moses.ini
Size: 1.29 KB
Format: Unknown
Description: The moses.ini configuration file for tuning. An illustrative fragment of this kind of configuration is sketched at the end of this record.
MD5: 8c60e67f303419ad03fee1fe00aef8cf
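For orientation, the fragment below illustrates how a moses.ini of this kind wires the packaged models together. It is not a copy of the distributed moses.ini: the paths are relative placeholders, the feature names (TranslationModel0, LM0, ...) follow Moses defaults, the [weights] values are arbitrary starting points, and the exact feature lines in the real file may differ.

    [input-factors]
    0

    [mapping]
    0 T 0

    [distortion-limit]
    6

    [feature]
    UnknownWordPenalty
    WordPenalty
    PhrasePenalty
    PhraseDictionaryMemory name=TranslationModel0 num-features=4 path=phrase-table.gz input-factor=0 output-factor=0
    LexicalReordering name=LexicalReordering0 num-features=6 type=wbe-msd-bidirectional-fe-allff input-factor=0 output-factor=0 path=reordering-table.wbe-msd-bidirectional-fe.gz
    LexicalReordering name=LexicalReordering1 num-features=6 type=hier-msd-bidirectional-fe-allff input-factor=0 output-factor=0 path=reordering-table.hier-msd-bidirectional-fe.gz
    Distortion
    KENLM name=LM0 factor=0 order=5 path=wmt16.czeng.blm.cs
    KENLM name=LM1 factor=0 order=5 path=wmt16.mono.blm.cs

    [weights]
    UnknownWordPenalty0= 1
    WordPenalty0= -1
    PhrasePenalty0= 0.2
    TranslationModel0= 0.2 0.2 0.2 0.2
    LexicalReordering0= 0.3 0.3 0.3 0.3 0.3 0.3
    LexicalReordering1= 0.3 0.3 0.3 0.3 0.3 0.3
    Distortion0= 0.3
    LM0= 0.5
    LM1= 0.5

Tuning (e.g., with mert-moses.pl on the task's development set) then replaces the [weights] section with optimized values, which is the subject of the shared task.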
