WMT16 Tuning Shared Task Models (English-to-Czech)

Name: WMT16 Tuning Shared Task Models (English-to-Czech)
License: http://creativecommons.org/licenses/by-nc-sa/4.0/

Kamran, Amir; Jawaid, Bushra; Bojar, Ondřej; Stanojevic, Milos

dc.contributor.author	Kamran, Amir
dc.contributor.author	Jawaid, Bushra
dc.contributor.author	Bojar, Ondřej
dc.contributor.author	Stanojevic, Milos
dc.date.accessioned	2016-03-22T12:33:39Z
dc.date.available	2016-03-22T12:33:39Z
dc.date.issued	2016-03-21
dc.identifier.uri	http://hdl.handle.net/11372/LRT-1672
dc.description	This item contains the models to be tuned in the WMT16 Tuning shared task for English-to-Czech. The translation models are trained on the CzEng 1.6pre corpus (http://ufal.mff.cuni.cz/czeng/czeng16pre). The data is tokenized (using the Moses tokenizer) and lowercased, and sentences longer than 60 words or shorter than 4 words are removed before training. Alignment is done using fast_align (https://github.com/clab/fast_align), and the standard Moses pipeline is used for training. Two 5-gram language models are trained using KenLM: one on the Czech side of CzEng only, the other on all Czech monolingual data available for WMT except Common Crawl. Also included are two lexicalized bidirectional reordering models, word-based and hierarchical, with msd orientation conditioned on both the source and the target side of the processed CzEng data.
dc.language.iso	eng
dc.language.iso	ces
dc.publisher	Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
dc.publisher	University of Amsterdam, ILLC
dc.relation	info:eu-repo/grantAgreement/EC/H2020/645452
dc.rights	Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
dc.rights.uri	http://creativecommons.org/licenses/by-nc-sa/4.0/
dc.source.uri	http://www.statmt.org/wmt16/tuning-task/
dc.subject	WMT16
dc.subject	machine translation
dc.subject	tuning
dc.subject	baseline models
dc.subject	shared task
dc.title	WMT16 Tuning Shared Task Models (English-to-Czech)
dc.type	corpus
metashare.ResourceInfo#ContentInfo.mediaType	text
dc.rights.label	PUB
has.files	yes
branding	LRT + Open Submissions
contact.person	Amir Kamran amirkamran@msn.com University of Amsterdam, ILLC
sponsor	European Union H2020-ICT-2014-1-645452 QT21: Quality Translation 21 euFunds info:eu-repo/grantAgreement/EC/H2020/645452
sponsor	Technology Foundation STW 12271 Data-Powered Domain-Specific Translation Services On Demand (DatAptor) Other
files.size	70667353277
files.count	5
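
The lowercasing and length filtering mentioned in dc.description can be sketched roughly as follows. This is only an illustration, not the authors' pipeline (the Makefile listed among the files is the authoritative recipe). It assumes the input is already tokenized so that whitespace splitting yields words, that the 4 and 60 word limits are inclusive, and that a sentence pair is dropped when either side falls outside them. The function names keep_pair and preprocess are hypothetical.

```python
from typing import Iterable, Iterator, Tuple


def keep_pair(src: str, tgt: str, min_len: int = 4, max_len: int = 60) -> bool:
    """Return True if both sides have between min_len and max_len tokens.

    Assumption: limits are inclusive and apply to both sides of the pair.
    """
    for side in (src, tgt):
        n_tokens = len(side.split())
        if n_tokens < min_len or n_tokens > max_len:
            return False
    return True


def preprocess(pairs: Iterable[Tuple[str, str]]) -> Iterator[Tuple[str, str]]:
    """Lowercase already-tokenized sentence pairs and drop out-of-range ones."""
    for src, tgt in pairs:
        src, tgt = src.lower(), tgt.lower()
        if keep_pair(src, tgt):
            yield src, tgt
```

Tokenization itself is left out on purpose: the record states that the Moses tokenizer was used, and this sketch does not try to reproduce its behavior.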

Files in this item

License category:

Publicly Available

License: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)

Name: en2cs_model.tgz
Size: 37.56 GB
Format: application/x-gzip
Description: Contains the lexical models (lex.e2f and lex.f2e), the phrase table (phrase-table.gz), the word-based reordering model (reordering-table.wbe-msd-bidirectional-fe.gz), the hierarchical reordering model (reordering-table.hier-msd-bidirectional-fe.gz) and the moses.ini file
MD5: bed88bbeef3afc454c3f02845ab72769

File preview

- reordering-table.wbe-msd-bidirectional-fe.gz (7 GB)
- moses.ini (1 kB)
- lex.f2e (668 MB)
- reordering-table.hier-msd-bidirectional-fe.gz (7 GB)
- lex.e2f (668 MB)
- phrase-table.gz (21 GB)

Name: wmt16.czeng.blm.cs.tgz
Size: 9.28 GB
Format: application/x-gzip
Description: KenLM 5-gram language model (binarized), trained only on the Czech side of the CzEng parallel data
MD5: de338fd4ba04b82631aab9488c468cd6

File preview

- wmt16.czeng.blm.cs (15 GB)

Name: wmt16.mono.blm.cs.tgz
Size: 18.98 GB
Format: application/x-gzip
Description: KenLM 5-gram language model (binarized), trained on all Czech monolingual data available for WMT except Common Crawl (see the Makefile for details of the monolingual data used)
MD5: 6347aa8e420db60142cb1384ea1cab0d

File preview

- wmt16.mono.blm.cs (32 GB)

Name: Makefile
Size: 16.96 KB
Format: Unknown
Description: The models can be recreated using this Makefile
MD5: 5f56434491ccb9591c35d8fe20fb8aa9

Name: moses.ini
Size: 1.29 KB
Format: Unknown
Description: The moses.ini file for tuning
MD5: 8c60e67f303419ad03fee1fe00aef8cf
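
Since the archives are tens of gigabytes each, it may be worth checking a download against the MD5 values listed in this record before unpacking. The following is a minimal sketch using only the Python standard library. The checksums are copied from the entries above. The helper names md5_of_file and matches_record are hypothetical, and the caller supplies the local path of the downloaded file.

```python
import hashlib

# MD5 values as listed in this record, keyed by file name.
EXPECTED_MD5 = {
    "en2cs_model.tgz": "bed88bbeef3afc454c3f02845ab72769",
    "wmt16.czeng.blm.cs.tgz": "de338fd4ba04b82631aab9488c468cd6",
    "wmt16.mono.blm.cs.tgz": "6347aa8e420db60142cb1384ea1cab0d",
    "Makefile": "5f56434491ccb9591c35d8fe20fb8aa9",
    "moses.ini": "8c60e67f303419ad03fee1fe00aef8cf",
}


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks
    so that multi-gigabyte archives do not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path: str, name: str) -> bool:
    """Compare a local file against the checksum listed for `name`."""
    return md5_of_file(path) == EXPECTED_MD5[name]
```

For example, matches_record("en2cs_model.tgz", "en2cs_model.tgz") would return True only if the local archive hashes to the value listed for that file.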
