Testing set from WMT 2011 [1] competition, manually translated from Czech and English into Slovak. Test set contains 3003 sentences in Czech, Slovak and English. Test set is described in [2].
References:
[1] http://www.statmt.org/wmt11/evaluation-task.html
[2] Petra Galuščáková and Ondřej Bojar. Improving SMT by Using Parallel Data of a Closely Related Language. In Human Language Technologies - The Baltic Perspective - Proceedings of the Fifth International Conference Baltic HLT 2012, volume 247 of Frontiers in AI and Applications, pages 58-65, Amsterdam, Netherlands, October 2012. IOS Press. and The work on this project was supported by the grant EuroMatrixPlus (FP7-ICT-
2007-3-231720 of the EU and 7E09003 of the Czech Republic)
The item contains models to tune for the WMT16 Tuning shared task for Czech-to-English.
CzEng 1.6pre (http://ufal.mff.cuni.cz/czeng/czeng16pre) corpus is used for the training of the translation models. The data is tokenized (using Moses tokenizer), lowercased and sentences longer than 60 words and shorter than 4 words are removed before training. Alignment is done using fast_align (https://github.com/clab/fast_align) and the standard Moses pipeline is used for training.
Two 5-gram language models are trained using KenLM: one only using the CzEng English data and the other is trained using all available English mono data for WMT except Common Crawl.
Also included are two lexicalized bidirectional reordering models, word based and hierarchical, with msd conditioned on both source and target of processed CzEng.
This item contains models to tune for the WMT16 Tuning shared task for English-to-Czech.
CzEng 1.6pre (http://ufal.mff.cuni.cz/czeng/czeng16pre) corpus is used for the training of the translation models. The data is tokenized (using Moses tokenizer), lowercased and sentences longer than 60 words and shorter than 4 words are removed before training. Alignment is done using fast_align (https://github.com/clab/fast_align) and the standard Moses pipeline is used for training.
Two 5-gram language models are trained using KenLM: one only using the CzEng Czech data and the other is trained using all available Czech mono data for WMT except Common Crawl.
Also included are two lexicalized bidirectional reordering models, word based and hierarchical, with msd conditioned on both source and target of processed CzEng.
This dataset comprises a corpus of 50 text contexts, each about 60 words in length, sourced from five distinct domains. Each context has been evaluated by multiple annotators who identified and ranked the most important words—up to 10% of each text—according to their perceived significance. The annotators followed specific guidelines to ensure consistency in word selection and ranking. For further details, please refer to the cited source.
---
rankings_task.csv
- This csv contains information about the contexts which are to be annotated:
- id: A unique identifier for each task.
- content: The context to be ranked.
---
rankings_ranking.csv
- This csv includes ranking information for various assignments. It contains four columns:
- id: A unique identifier for each ranking entry.
- score: The score assigned to the entry.
- word_order: A JSON detailing the order of words positions. It is essentially the selected word positions and their ordering from an annotator.
- assignment_id: A reference ID linking to the assignments.
---
rankings_assignment.csv
- This csv tracks the completion status of tasks by users. It includes four columns:
- id: A unique identifier for each assignment entry.
- is_completed: A binary indicator (1 for completed, 0 for not completed).
- task_id: A reference ID linking to the tasks.
- user_id: The identifier for the user who should complete the task (rank the words).
---
Known Issues:
Please note that each annotator was intended to rank each context only once. However, due to a bug in the deployment of the annotation tool, some entries may be duplicated. Users of this dataset should be cautious of this issue and verify the uniqueness of the annotations where necessary.
---
This dataset is a part of work from a bachelor thesis:
OSUSKÝ, Adam. Predicting Word Importance Using Pre-Trained Language Models. Bachelor thesis, supervisor Javorský, Dávid. Prague: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics, 2024.
Czech translation of WordSim353. The Czech translation of English WordSim353 word pairs were obtained from four translators. All translation variants were scored according to the lexical similarity/relatedness annotation instructions for WordSim353 annotators, by 25 Czech annotators. The resulting data set consists of two annotation files: "WordSim353-cs.csv" and "WordSim-cs-Multi.csv". Both files are encoded in UTF-8, have a header, text is enclosed in double quotes, and columns are separated by commas. The rows are numbered. The WordSim-cs-Multi data set has rows numbered from 1 to 634, whereas the row indices in the WordSim353-cs data set reflect the corresponding row numbers in the WordSim-cs-Multi data set.
The WordSim353-cs file contains a one-to-one mapping selection of 353 Czech equivalent pairs whose judgments have proven to be most similar to the judgments of their corresponding English originals (compared by the absolute value of the difference between the means over all annotators in each language counterpart). In one case ("psychology-cognition"), two Czech equivalent pairs had identical means as well as confidence intervals, so we randomly selected one.
The "WordSim-cs-Multi.csv" file contains human judgments for all translation variants.
In both data sets, we preserved all 25 individual scores. In the WordSim353-cs data set, we added a column with their Czech means as well as a column containing the original English means and 95% confidence intervals in separate columns for each mean (computed by the CI function in the Rmisc R package). The WordSim-cs-Multi data set contains only the Czech means and confidence intervals. For the most convenient lexical search, we provided separate columns with the respective Czech and English single words, entire word pairs, and eventually an English-Czech quadruple in both data sets.
The data set also contains an xls table with the four translations and a preliminary selection of the best variants performed by an adjudicator.