A tool used to build multilingual corpora from Wikipedia. It downloads the web pages, converts them to plain text, identifies the language, etc.
A set of 120 corpora collected using this tool is available at https://ufal-point.mff.cuni.cz/xmlui/handle/11858/00-097C-0000-0022-6133-9
Segment from Český zvukový týdeník Aktualita (Czech Aktualita Sound Newsreel) issue no. 26A from 1943 captures the Slavia vs. Pilsen water polo match that was part of the Provincial Youth Swimming Championship organised by the Board of Trustees for the Education of Youth in cooperation with the Czech Amateur Swimming Union and held at the swimming pool in Prague-Barrandov on 3 and 4 July.
Segment from Český zvukový týdeník Aktualita (Czech Aktualita Sound Newsreel) issue no. 29A from 1944 was shot during the Week of Czech Youth event organised by the Board of Trustees for the Education of Youth and held from 1 to 9 July. The programme included a concert held on Old Town Square on 8 July. The orchestra and choir consisted of several hundred young musicians and singers. Minister of Education and People's Enlightenment and Chairman of the Board Emanuel Moravec and Deputy Mayor Joseph Pfitzner watched the event from the balcony of the Old Town Hall. The Board of Trustees' youth set out from the square in a parade through the streets of Prague. The following day, a sports afternoon took place at Strahov Stadium. Guests of honour included Prime Minister Jaroslav Krejčí and the General Secretary of the Board František Teuner. Emanuel Moravec spoke to the participants. The programme included women's floor exercises, track and field races and women in stylised costumes dancing to folk songs. The event was concluded with the athletes and audience paying homage to Adolf Hitler.
The segment of Československý zvukový týdeník Aktualita (Czechoslovak Aktualita Sound Newsreel), 1938, issue no. 8A shows a speech delivered in Czech by Wenzel Jaksch, the MP for the German Social Democratic Workers' Party (DSAP), about the possible coexistence of, and understanding between, Czechs and Germans.
Segment from Český zvukový týdeník Aktualita (Czech Aktualita Sound Newsreel) issue no. 9B from 1944 covered one of the events of the mandatory youth service organised by the Board of Trustees for the Education of Youth. Specifically, it consisted of recreational winter sports activities (skiing, sleighing) with the aim of improving physical fitness and strengthening collective spirit.
We provide the Vietnamese version of the multilingual test set from the WMT 2013 [1] competition. The Vietnamese version was manually translated from English. For completeness, this record contains the 3000 sentences in all the original WMT 2013 languages (Czech, English, French, German, Russian and Spanish), extended with our Vietnamese version. The test set is used in [2] to evaluate translation between Czech, English and Vietnamese.
References
[1] http://www.statmt.org/wmt13/evaluation-task.html
[2] Duc Tam Hoang and Ondřej Bojar, The Prague Bulletin of Mathematical Linguistics. Volume 104, Issue 1, Pages 75--86, ISSN 1804-0462. 9/2015
Test set from the WMT 2011 [1] competition, manually translated from Czech and English into Slovak. The test set contains 3003 sentences in Czech, Slovak and English. The test set is described in [2].
References:
[1] http://www.statmt.org/wmt11/evaluation-task.html
[2] Petra Galuščáková and Ondřej Bojar. Improving SMT by Using Parallel Data of a Closely Related Language. In Human Language Technologies - The Baltic Perspective - Proceedings of the Fifth International Conference Baltic HLT 2012, volume 247 of Frontiers in AI and Applications, pages 58-65, Amsterdam, Netherlands, October 2012. IOS Press.
The work on this project was supported by the grant EuroMatrixPlus (FP7-ICT-2007-3-231720 of the EU and 7E09003 of the Czech Republic).
Training, development and test data (the same as used for the Sentence-level Quality Estimation task) consist of English-German triplets (source, target and post-edit) belonging to the IT domain and already tokenized.
The training and development sets contain 12,000 and 1,000 triplets respectively, while the test set contains 2,000 instances. All data is provided by the EU project QT21 (http://www.qt21.eu/).
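The triplet layout described above can be illustrated with a minimal loading sketch. It assumes, without confirmation from this record, that each split is distributed as three line-aligned plain-text files with one tokenized sentence per line; the file names in the usage note and the names `Triplet` and `read_triplets` are hypothetical.

```python
from typing import List, NamedTuple


class Triplet(NamedTuple):
    source: str     # English source sentence (already tokenized)
    target: str     # German target sentence (already tokenized)
    post_edit: str  # post-edited version of the target


def _read_lines(path: str) -> List[str]:
    """Return the lines of a UTF-8 text file without trailing newlines."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def read_triplets(src_path: str, tgt_path: str, pe_path: str) -> List[Triplet]:
    """Combine three line-aligned files into a list of Triplet records.

    Assumption: line i of each file belongs to the same triplet.
    """
    columns = [_read_lines(p) for p in (src_path, tgt_path, pe_path)]
    sizes = [len(column) for column in columns]
    if len(set(sizes)) != 1:
        raise ValueError(f"files are not line-aligned: {sizes}")
    return [Triplet(*row) for row in zip(*columns)]
```

With hypothetical file names, `read_triplets("train.src", "train.mt", "train.pe")` would be expected to return 12,000 records for the training split if the files follow the assumed layout; the length check guards against silently misaligned files.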
Training, development and test data consist of German sentences belonging to the IT domain and already tokenized. These sentences are the references of the data released for the 2016 edition of the WMT APE shared task. Unlike the previously released data, these sentences were obtained by manually translating the source sentences without using the raw MT outputs. The training and development sets contain 12,000 and 1,000 segments respectively, while the test set contains 2,000 items. All data is provided by the EU project QT21 (http://www.qt21.eu/).