This is the Czech data collected during the `VYSTADIAL` project. It is an extension of the Czech part of the 'Vystadial 2013' data release. The dataset comprises telephone conversations in Czech and was developed for training acoustic models for automatic speech recognition in spoken dialogue systems.
A set of corpora for 120 languages automatically collected from Wikipedia and the web.
Collected using the W2C toolset: http://hdl.handle.net/11858/00-097C-0000-0022-60D6-1
Segment from Český zvukový týdeník Aktualita (Czech Aktualita Sound Newsreel) issue no. 26A from 1943 captures the Slavia vs. Pilsen water polo match that was part of the Provincial Youth Swimming Championship organised by the Board of Trustees for the Education of Youth in cooperation with the Czech Amateur Swimming Union and held at the swimming pool in Prague-Barrandov on 3 and 4 July.
Segment from Český zvukový týdeník Aktualita (Czech Aktualita Sound Newsreel) issue no. 29A from 1944 was shot during the Week of Czech Youth event organised by the Board of Trustees for the Education of Youth and held from 1 to 9 July. The programme included a concert held on Old Town Square on 8 July. The orchestra and choir consisted of several hundred young musicians and singers. Minister of Education and People's Enlightenment and Chairman of the Board Emanuel Moravec and Deputy Mayor Joseph Pfitzner watched the event from the balcony of the Old Town Hall. The Board of Trustees' youth set out from the square in a parade through the streets of Prague. The following day, a sports afternoon took place at Strahov Stadium. Guests of honour included Prime Minister Jaroslav Krejčí and the General Secretary of the Board František Teuner. Emanuel Moravec spoke to the participants. The programme included women's floor exercises, track and field races and women in stylised costumes dancing to folk songs. The event was concluded with the athletes and audience paying homage to Adolf Hitler.
The segment of Československý zvukový týdeník Aktualita (Czechoslovak Aktualita Sound Newsreel), 1938, issue no. 8A shows a speech delivered in Czech by Wenzel Jaksch, the MP for the German Social Democratic Workers' Party (DSAP), about the possible coexistence of, and understanding between, Czechs and Germans.
Segment from Český zvukový týdeník Aktualita (Czech Aktualita Sound Newsreel) issue no. 9B from 1944 covered one of the events of the mandatory youth service organised by the Board of Trustees for the Education of Youth. Specifically, it consisted of recreational winter sports activities (skiing, sleighing) with the aim of improving physical fitness and strengthening collective spirit.
We provide the Vietnamese version of the multi-lingual test set from the WMT 2013 [1] competition. The Vietnamese version was manually translated from English. For completeness, this record contains the 3000 sentences in all the original WMT 2013 languages (Czech, English, French, German, Russian and Spanish), extended with our Vietnamese version. The test set is used in [2] to evaluate translation between Czech, English and Vietnamese.
References
[1] http://www.statmt.org/wmt13/evaluation-task.html
[2] Duc Tam Hoang and Ondřej Bojar. The Prague Bulletin of Mathematical Linguistics, volume 104, issue 1, pages 75-86, ISSN 1804-0462, September 2015.
Test set from the WMT 2011 [1] competition, manually translated from Czech and English into Slovak. The test set contains 3003 sentences in Czech, Slovak and English. It is described in [2].
References:
[1] http://www.statmt.org/wmt11/evaluation-task.html
[2] Petra Galuščáková and Ondřej Bojar. Improving SMT by Using Parallel Data of a Closely Related Language. In Human Language Technologies - The Baltic Perspective - Proceedings of the Fifth International Conference Baltic HLT 2012, volume 247 of Frontiers in AI and Applications, pages 58-65, Amsterdam, Netherlands, October 2012. IOS Press.
The work on this project was supported by the grant EuroMatrixPlus (FP7-ICT-2007-3-231720 of the EU and 7E09003 of the Czech Republic).
The item contains models to tune for the WMT16 Tuning shared task for Czech-to-English.
The CzEng 1.6pre corpus (http://ufal.mff.cuni.cz/czeng/czeng16pre) is used for training the translation models. Before training, the data is tokenized (using the Moses tokenizer) and lowercased, and sentences longer than 60 words or shorter than 4 words are removed. Alignment is done using fast_align (https://github.com/clab/fast_align), and the standard Moses pipeline is used for training.
Two 5-gram language models are trained using KenLM: one uses only the CzEng English data, and the other uses all English monolingual data available for WMT except Common Crawl.
Also included are two lexicalized bidirectional reordering models, word-based and hierarchical, with msd orientation conditioned on both source and target of the processed CzEng.
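The lowercasing and length-filtering step described above can be sketched roughly as follows. This is only an illustration, not the script used to build the released models: the function names `keep_pair` and `preprocess` are hypothetical, the input is assumed to be already tokenized (so that splitting on whitespace counts words), and it is an assumption that a sentence pair is dropped when either side falls outside the 4 to 60 word range, since the description does not say which side the limits apply to.

```python
from typing import Iterable, Iterator, Tuple


def keep_pair(src: str, tgt: str, min_len: int = 4, max_len: int = 60) -> bool:
    """Return True if both sides have between min_len and max_len tokens.

    Assumption: the limits apply to both the source and the target side.
    """
    return all(min_len <= len(side.split()) <= max_len for side in (src, tgt))


def preprocess(pairs: Iterable[Tuple[str, str]]) -> Iterator[Tuple[str, str]]:
    """Lowercase already-tokenized sentence pairs and drop out-of-range ones."""
    for src, tgt in pairs:
        src, tgt = src.lower(), tgt.lower()
        if keep_pair(src, tgt):
            yield src, tgt


if __name__ == "__main__":
    sample = [
        ("To je jen test .", "This is only a test ."),  # kept
        ("Ano .", "Yes ."),                              # dropped: too short
    ]
    for cs, en in preprocess(sample):
        print(cs, "|||", en)
```

With these defaults, a sentence of exactly 4 or exactly 60 tokens is kept, matching the wording "longer than 60" and "shorter than 4".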
This item contains models to tune for the WMT16 Tuning shared task for English-to-Czech.
The CzEng 1.6pre corpus (http://ufal.mff.cuni.cz/czeng/czeng16pre) is used for training the translation models. Before training, the data is tokenized (using the Moses tokenizer) and lowercased, and sentences longer than 60 words or shorter than 4 words are removed. Alignment is done using fast_align (https://github.com/clab/fast_align), and the standard Moses pipeline is used for training.
Two 5-gram language models are trained using KenLM: one uses only the CzEng Czech data, and the other uses all Czech monolingual data available for WMT except Common Crawl.
Also included are two lexicalized bidirectional reordering models, word-based and hierarchical, with msd orientation conditioned on both source and target of the processed CzEng.