dc.contributor.author | Kamran, Amir |
dc.contributor.author | Jawaid, Bushra |
dc.contributor.author | Bojar, Ondřej |
dc.contributor.author | Stanojević, Miloš |
dc.date.accessioned | 2016-03-22T12:33:39Z |
dc.date.available | 2016-03-22T12:33:39Z |
dc.date.issued | 2016-03-21 |
dc.identifier.uri | http://hdl.handle.net/11372/LRT-1672 |
dc.description | This item contains models to tune for the WMT16 Tuning shared task for English-to-Czech. The CzEng 1.6pre corpus (http://ufal.mff.cuni.cz/czeng/czeng16pre) is used to train the translation models. The data is tokenized (using the Moses tokenizer) and lowercased, and sentences longer than 60 words or shorter than 4 words are removed before training. Word alignment is computed with fast_align (https://github.com/clab/fast_align) and the models are trained with the standard Moses pipeline. Two 5-gram language models are trained with KenLM: one on the Czech side of CzEng only, the other on all Czech monolingual data available for WMT except Common Crawl. Also included are two lexicalized bidirectional reordering models, word-based and hierarchical, with MSD orientations conditioned on both the source and target sides of the processed CzEng. |
dc.language.iso | eng |
dc.language.iso | ces |
dc.publisher | Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL) |
dc.publisher | University of Amsterdam, ILLC |
dc.relation | info:eu-repo/grantAgreement/EC/H2020/645452 |
dc.rights | Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) |
dc.rights.uri | http://creativecommons.org/licenses/by-nc-sa/4.0/ |
dc.source.uri | http://www.statmt.org/wmt16/tuning-task/ |
dc.subject | WMT16 |
dc.subject | machine translation |
dc.subject | tuning |
dc.subject | baseline models |
dc.subject | shared task |
dc.title | WMT16 Tuning Shared Task Models (English-to-Czech) |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | text |
dc.rights.label | PUB |
has.files | yes |
branding | LRT + Open Submissions |
contact.person | Amir Kamran amirkamran@msn.com University of Amsterdam, ILLC |
sponsor | European Union H2020-ICT-2014-1-645452 QT21: Quality Translation 21 euFunds info:eu-repo/grantAgreement/EC/H2020/645452 |
sponsor | Technology Foundation STW 12271 Data-Powered Domain-Specific Translation Services On Demand (DatAptor) Other |
files.size | 70667353277 |
files.count | 5 |
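The preprocessing described in dc.description (lowercasing plus removal of sentences longer than 60 or shorter than 4 words) can be sketched in Python. This is an illustrative sketch only, not the actual pipeline: the released models were built with the standard Moses tokenizer and cleaning scripts, and the helper names below are hypothetical.

```python
# Sketch of the sentence-length filter applied to the parallel data before
# training: keep only sentence pairs where BOTH sides are 4-60 tokens long,
# lowercasing the survivors. (Illustrative only; the actual preprocessing
# used the Moses tokenizer and cleaning tools, not this script.)

def keep_pair(src, tgt, min_len=4, max_len=60):
    """Return True if both sides fall within the length limits (inclusive)."""
    return all(min_len <= len(s.split()) <= max_len for s in (src, tgt))

def filter_corpus(pairs, min_len=4, max_len=60):
    """Yield the sentence pairs that pass the length filter, lowercased."""
    for src, tgt in pairs:
        if keep_pair(src, tgt, min_len, max_len):
            yield src.lower(), tgt.lower()

if __name__ == "__main__":
    pairs = [
        ("This sentence is long enough to keep .", "Tato věta je dost dlouhá ."),
        ("Too short .", "Moc krátké ."),
    ]
    # Only the first pair survives the 4-word minimum.
    print(list(filter_corpus(pairs)))
```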
Files in this item
This item is Publicly Available and licensed under: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
Name | Size | Format | Description | MD5 |
en2cs_model.tgz | 37.56 GB | application/x-gzip | Lexical models (lex.e2f and lex.f2e), the phrase table (phrase-table.gz), the word-based reordering model (reordering-table.wbe-msd-bidirectional-fe.gz), the hierarchical reordering model (reordering-table.hier-msd-bidirectional-fe.gz), and the moses.ini file | bed88bbeef3afc454c3f02845ab72769 |
wmt16.czeng.blm.cs.tgz | 9.28 GB | application/x-gzip | KenLM 5-gram language model (binarized) trained only on the Czech side of the CzEng parallel data | de338fd4ba04b82631aab9488c468cd6 |
wmt16.mono.blm.cs.tgz | 18.98 GB | application/x-gzip | KenLM 5-gram language model (binarized) trained on all Czech monolingual data available for WMT except Common Crawl (see the Makefile for details of the mono data used) | 6347aa8e420db60142cb1384ea1cab0d |
Makefile | 16.96 KB | Unknown | Makefile to recreate the models | 5f56434491ccb9591c35d8fe20fb8aa9 |
moses.ini | 1.29 KB | Unknown | The moses.ini file for tuning | 8c60e67f303419ad03fee1fe00aef8cf |
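The MD5 checksums listed for each file can be used to verify downloads before unpacking. A minimal sketch in Python (the chunked reading keeps memory constant even for the 37.56 GB archive):

```python
# Verify a downloaded file against the MD5 checksum published in the record.
import hashlib

def md5sum(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading in 1 MiB chunks so
    large archives such as en2cs_model.tgz fit in constant memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Example usage against the published checksum for moses.ini:
# md5sum("moses.ini") == "8c60e67f303419ad03fee1fe00aef8cf"
```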