This dataset contains automatic paraphrases of Czech official reference translations for the Workshop on Statistical Machine Translation shared task. The data covers the years 2011, 2013 and 2014.
For each sentence, at most 10000 paraphrases were included (randomly selected from the full set).
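The per-sentence cap described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `cap_paraphrases`, the `seed` parameter and the use of uniform sampling without replacement are hypothetical and are not taken from the dataset's actual tooling.

```python
import random


def cap_paraphrases(paraphrases, limit=10000, seed=None):
    """Return at most `limit` paraphrases, chosen uniformly at random.

    Hypothetical helper: the real selection procedure is not documented here.
    """
    rng = random.Random(seed)  # local generator so the global state is untouched
    if len(paraphrases) <= limit:
        # Nothing to cut: keep the full set in its original order.
        return list(paraphrases)
    # Sample without replacement so no paraphrase is included twice.
    return rng.sample(paraphrases, limit)
```

Passing a fixed `seed` makes the selection repeatable, which helps when the reduced set needs to be regenerated.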
The goal of using this dataset is to improve automatic evaluation of machine translation outputs.
If you use this work, please cite the following paper:
Tamchyna Aleš, Barančíková Petra: Automatic and Manual Paraphrases for MT Evaluation. In Proceedings of LREC, 2016.
The English-Urdu parallel corpus is a collection of religious texts (Quran, Bible) in English and Urdu with sentence alignments. The corpus can be used for experiments with statistical machine translation. Our modifications of the crawled data include, but are not limited to, the following:
1- Manually corrected sentence alignment of the corpora.
2- Our data split (training-development-test) so that our published experiments can be reproduced.
3- Tokenization (optional, but needed to reproduce our experiments).
4- Normalization (optional) of e.g. European vs. Urdu numerals, European vs. Urdu punctuation, removal of Urdu diacritics.
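The optional normalization in step 4 could look roughly like the sketch below. It is not the script used to build the corpus: the direction of the mapping (Urdu to European), the exact character sets and the name `normalize_urdu` are all assumptions made for illustration.

```python
# Assumption: Urdu numerals are the Extended Arabic-Indic digits U+06F0..U+06F9.
URDU_DIGITS = {0x06F0 + i: str(i) for i in range(10)}

# Assumption: a small set of Urdu/Arabic punctuation marks and their
# European counterparts (comma, semicolon, question mark, full stop).
URDU_PUNCT = {0x060C: ",", 0x061B: ";", 0x061F: "?", 0x06D4: "."}

# Assumption: "diacritics" means the short-vowel marks U+064B..U+0652
# plus the superscript alef U+0670.
URDU_DIACRITICS = {cp: None for cp in list(range(0x064B, 0x0653)) + [0x0670]}


def normalize_urdu(text, digits=True, punctuation=True, strip_diacritics=True):
    """Apply the selected normalizations to one line of Urdu text."""
    table = {}
    if digits:
        table.update(URDU_DIGITS)
    if punctuation:
        table.update(URDU_PUNCT)
    if strip_diacritics:
        table.update(URDU_DIACRITICS)  # mapping to None deletes the character
    return text.translate(table)
```

Each normalization is behind its own flag because the description marks the step as optional; experiments that need the original numerals or diacritics can switch the corresponding flag off.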
This package contains an extended version of the test collection used in the CLEF eHealth Information Retrieval tasks in 2013–2015. Compared to the original version, it provides complete query translations into Czech, French, German, Hungarian, Polish, Spanish and Swedish and additional relevance assessments.
This machine translation test set contains 2223 Czech sentences collected within the FAUST project (https://ufal.mff.cuni.cz/grants/faust, http://hdl.handle.net/11234/1-3308).
Each original (noisy) sentence was normalized (clean1 and clean2) and translated to English independently by two translators.
This package contains data sets for development and testing of machine translation of short medical search queries between Czech, English, French, and German. The queries come from the general public and medical experts. This work was supported by the EU FP7 project Khresmoi (European Commission contract No. 257528). The language resources are distributed by the LINDAT/Clarin project of the Ministry of Education, Youth and Sports of the Czech Republic (project no. LM2010013).
We thank Health on the Net Foundation for granting the license for the English general public queries, TRIP database for granting the license for the English medical expert queries, and three anonymous translators and three medical experts for translating and revising the data.
This package contains data sets for development and testing of machine translation of medical queries between Czech, English, French, German, Hungarian, Polish, Spanish and Swedish. The queries come from the general public and medical experts. This is version 2.0, which extends the previous version by adding Hungarian, Polish, Spanish, and Swedish translations.
This package contains data sets for development and testing of machine translation of sentences from summaries of medical articles between Czech, English, French, and German. This work was supported by the EU FP7 project Khresmoi (European Commission contract No. 257528). The language resources are distributed by the LINDAT/Clarin project of the Ministry of Education, Youth and Sports of the Czech Republic (project no. LM2010013). We thank all the data providers and copyright holders for providing the source data and anonymous experts for translating the sentences.
This package contains data sets for development (Section dev) and testing (Section test) of machine translation of sentences from summaries of medical articles between Czech, English, French, German, Hungarian, Polish, Spanish and Swedish. Version 2.0 extends the previous version by adding Hungarian, Polish, Spanish, and Swedish translations.
"Large Scale Colloquial Persian Dataset" (LSCP) is hierarchically organized in a semantic taxonomy that focuses on multi-task informal Persian language understanding as a comprehensive problem. LSCP includes 120M sentences from 27M casual Persian tweets with their dependency relations in syntactic annotation, part-of-speech tags, sentiment polarity and automatic translations of the original Persian sentences into five different languages (EN, CS, DE, IT, HI).
Document-level testsuite for evaluation of gender translation consistency.
Our document-level test set consists of selected English documents from the WMT21 newstest annotated with gender information. Unannotated Czech references are also added for convenience.
We semi-automatically annotated person names and pronouns to identify the gender of these elements as well as coreferences.
Our proposed annotation consists of three elements: (1) an ID, (2) an element class, and (3) gender.
The ID identifies a person's name and its occurrences (name and pronouns).
The element class identifies whether the tag refers to a name or a pronoun.
Finally, the gender information defines whether the element is masculine or feminine.
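The three annotation elements can be pictured as a small record type. The sketch below is only an illustration: the class name `Mention`, its field names, the extra `text` field and the example values are hypothetical, and the actual tag format is the one documented in README.md.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    person_id: int      # (1) ID shared by a person's name and its pronouns
    element_class: str  # (2) "name" or "pronoun"
    gender: str         # (3) "masculine" or "feminine"
    text: str           # surface form (hypothetical extra field)


def genders_by_person(mentions):
    """Collect the set of genders seen for every person ID."""
    genders = defaultdict(set)
    for mention in mentions:
        genders[mention.person_id].add(mention.gender)
    return dict(genders)


def inconsistent_ids(mentions):
    """Return the IDs whose mentions carry more than one gender."""
    grouped = genders_by_person(mentions)
    return sorted(pid for pid, seen in grouped.items() if len(seen) > 1)


# Made-up example: person 2 is referred to with conflicting genders.
mentions = [
    Mention(1, "name", "feminine", "Mary"),
    Mention(1, "pronoun", "feminine", "she"),
    Mention(2, "name", "masculine", "John"),
    Mention(2, "pronoun", "feminine", "her"),
]
print(inconsistent_ids(mentions))
```

Grouping by ID is what makes a consistency check possible: every occurrence linked to the same person should agree in gender across the whole document.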
We applied a series of NLP techniques to automatically identify person names and coreferences.
This initial process resulted in a set containing 45 documents to be manually annotated.
We then began manually annotating these documents to make sure they are correctly tagged.
See README.md for more details.