Search Results
2232. Wmatrix
- Publisher:
- Lancaster University
- Type:
- toolService
- Language:
- English
- Description:
- Wmatrix is a web-based corpus comparison and annotation tool. It incorporates the CLAWS POS tagger and the USAS semantic tagger for English. It also generates frequency lists, concordances, key words and key semantic domains by comparative frequency profiling.
- Rights:
- Not specified
2233. WMT 13 Test Set
- Creator:
- Hoang, Duc Tam and Bojar, Ondřej
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- test data, parallel corpus, and Vietnamese
- Language:
- Vietnamese, Czech, English, German, French, Spanish, and Russian
- Description:
- We provide the Vietnamese version of the multilingual test set from the WMT 2013 [1] competition. The Vietnamese version was manually translated from English. For completeness, this record contains the 3000 sentences in all the original WMT 2013 languages (Czech, English, French, German, Russian and Spanish), extended with our Vietnamese version. The test set is used in [2] to evaluate translation between Czech, English and Vietnamese. References: [1] http://www.statmt.org/wmt13/evaluation-task.html [2] Duc Tam Hoang and Ondřej Bojar. The Prague Bulletin of Mathematical Linguistics, volume 104, issue 1, pages 75-86, ISSN 1804-0462, 9/2015.
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
2234. WMT 2011 Testing Set
- Creator:
- Galuščáková, Petra and Bojar, Ondřej
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- WMT, test data, and Slovak
- Language:
- Slovak, Czech, and English
- Description:
- Test set from the WMT 2011 [1] competition, manually translated from Czech and English into Slovak. The test set contains 3003 sentences in Czech, Slovak and English and is described in [2]. References: [1] http://www.statmt.org/wmt11/evaluation-task.html [2] Petra Galuščáková and Ondřej Bojar. Improving SMT by Using Parallel Data of a Closely Related Language. In Human Language Technologies - The Baltic Perspective - Proceedings of the Fifth International Conference Baltic HLT 2012, volume 247 of Frontiers in AI and Applications, pages 58-65, Amsterdam, Netherlands, October 2012. IOS Press. The work on this project was supported by the grant EuroMatrixPlus (FP7-ICT-2007-3-231720 of the EU and 7E09003 of the Czech Republic).
- Rights:
- Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0), http://creativecommons.org/licenses/by-nc-sa/3.0/, and PUB
2235. WMT16 APE Shared Task Data
- Creator:
- Turchi, Marco, Chatterjee, Rajen, and Negri, Matteo
- Publisher:
- Fondazione Bruno Kessler, Trento, Italy
- Type:
- text and corpus
- Subject:
- machine translation, machine learning, automatic post-editing, and shared task
- Language:
- English and German
- Description:
- Training, development and test data (the same used for the Sentence-level Quality Estimation task) consist of English-German triplets (source, target and post-edit) belonging to the IT domain and already tokenized. Training and development contain 12,000 and 1,000 triplets respectively, while the test set contains 2,000 instances. All data is provided by the EU project QT21 (http://www.qt21.eu/).
- Rights:
- AGREEMENT ON THE USE OF DATA IN QT21 APE Task, https://lindat.mff.cuni.cz/repository/xmlui/page/licence-TAUS_QT21, and PUB
2236. WMT16 APE Shared Task Data - Reference sentences
- Creator:
- Turchi, Marco, Negri, Matteo, and Chatterjee, Rajen
- Publisher:
- Fondazione Bruno Kessler, Trento, Italy
- Type:
- text and corpus
- Subject:
- machine translation, machine learning, automatic post-editing, and shared task
- Language:
- German
- Description:
- Training, development and test data consist of German sentences belonging to the IT domain and already tokenized. These sentences are the references for the data released for the 2016 edition of the WMT APE shared task. Unlike the previously released data, these sentences were obtained by manually translating the source sentences without using the raw MT outputs. Training and development contain 12,000 and 1,000 segments respectively, while the test set contains 2,000 items. All data is provided by the EU project QT21 (http://www.qt21.eu/).
- Rights:
- AGREEMENT ON THE USE OF DATA IN QT21 APE Task, https://lindat.mff.cuni.cz/repository/xmlui/page/licence-TAUS_QT21, and PUB
2237. WMT16 Quality Estimation Shared Task Training and Development Data
- Creator:
- Specia, Lucia, Logacheva, Varvara, and Scarton, Carolina
- Publisher:
- University of Sheffield
- Type:
- text and corpus
- Subject:
- machine translation, quality estimation, and machine learning
- Language:
- English and German
- Description:
- Training and development data for the WMT16 QE task. Test data will be published as a separate item. This shared task will build on its previous four editions to further examine automatic methods for estimating the quality of machine translation output at run-time, without relying on reference translations. We include word-level, sentence-level and document-level estimation. The sentence-level and word-level tasks will explore a large dataset produced from post-editions by professional translators (as opposed to crowdsourced translations as in the previous year). For the first time, the data will be domain-specific (IT domain). The document-level task will use, for the first time, entire documents, which have been human-annotated for quality indirectly in two ways: through reading comprehension tests and through a two-stage post-editing exercise. Our tasks have the following goals: (1) to advance work on sentence-level and word-level quality estimation by providing domain-specific, larger and professionally annotated datasets; (2) to study the utility of detailed information logged during post-editing (time, keystrokes, actual edits) for different levels of prediction; (3) to analyse the effectiveness of different types of quality labels provided by humans for longer texts in document-level prediction. This year's shared task provides new training and test datasets for all tasks, and allows participants to explore any additional data and resources deemed relevant. An in-house MT system was used to produce translations for the sentence-level and word-level tasks, and multiple MT systems were used to produce translations for the document-level task. Therefore, MT system-dependent information will be made available where possible.
- Rights:
- AGREEMENT ON THE USE OF DATA IN QT21, https://lindat.mff.cuni.cz/repository/xmlui/page/licence-TAUS_QT21, and PUB
2238. WMT16 Tuning Shared Task Models (Czech-to-English)
- Creator:
- Kamran, Amir, Jawaid, Bushra, Bojar, Ondřej, and Stanojevic, Milos
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL) and University of Amsterdam, ILLC
- Type:
- text and corpus
- Subject:
- WMT16, machine translation, tuning, baseline models, and shared task
- Language:
- Czech and English
- Description:
- This item contains models to tune for the WMT16 Tuning shared task for Czech-to-English. The CzEng 1.6pre corpus (http://ufal.mff.cuni.cz/czeng/czeng16pre) is used to train the translation models. Before training, the data is tokenized (using the Moses tokenizer) and lowercased, and sentences longer than 60 words or shorter than 4 words are removed. Alignment is done using fast_align (https://github.com/clab/fast_align) and the standard Moses pipeline is used for training. Two 5-gram language models are trained using KenLM: one using only the CzEng English data, and the other using all English monolingual data available for WMT except Common Crawl. Also included are two lexicalized bidirectional reordering models, word-based and hierarchical, with msd conditioned on both source and target of the processed CzEng.
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
2239. WMT16 Tuning Shared Task Models (English-to-Czech)
- Creator:
- Kamran, Amir, Jawaid, Bushra, Bojar, Ondřej, and Stanojevic, Milos
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL) and University of Amsterdam, ILLC
- Type:
- text and corpus
- Subject:
- WMT16, machine translation, tuning, baseline models, and shared task
- Language:
- English and Czech
- Description:
- This item contains models to tune for the WMT16 Tuning shared task for English-to-Czech. The CzEng 1.6pre corpus (http://ufal.mff.cuni.cz/czeng/czeng16pre) is used to train the translation models. Before training, the data is tokenized (using the Moses tokenizer) and lowercased, and sentences longer than 60 words or shorter than 4 words are removed. Alignment is done using fast_align (https://github.com/clab/fast_align) and the standard Moses pipeline is used for training. Two 5-gram language models are trained using KenLM: one using only the CzEng Czech data, and the other using all Czech monolingual data available for WMT except Common Crawl. Also included are two lexicalized bidirectional reordering models, word-based and hierarchical, with msd conditioned on both source and target of the processed CzEng.
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
2240. WMT17 De-En APE Shared Task Data
- Creator:
- Turchi, Marco, Chatterjee, Rajen, and Negri, Matteo
- Publisher:
- Fondazione Bruno Kessler, Trento, Italy
- Type:
- text and corpus
- Subject:
- machine translation, shared task, automatic post-editing, and post-editing
- Language:
- German and English
- Description:
- Training and development data for the WMT 2017 Automatic Post-Editing task (the same used for the Sentence-level Quality Estimation task). They consist of German-English triplets (source, target and post-edit) belonging to the pharmacological domain and already tokenized. Training and development contain 25,000 and 1,000 triplets respectively. All data is provided by the EU project QT21 (http://www.qt21.eu/).
- Rights:
- AGREEMENT ON THE USE OF DATA IN QT21 APE Task, https://lindat.mff.cuni.cz/repository/xmlui/page/licence-TAUS_QT21, and PUB