This package contains an extended version of the test collection used in the CLEF eHealth Information Retrieval tasks in 2013--2015. Compared to the original version, it provides complete query translations into Czech, French, German, Hungarian, Polish, Spanish and Swedish and additional relevance assessment.
Syntactic (including deep-syntactic - tectogrammatical) annotation of user-generated noisy sentences. The annotation was made on Czech-English and English-Czech Faust Dev/Test sets.
The English data includes manual annotations of English reference translations of Czech source texts. This texts were translated independently by two translators. After some necessary cleanings, 1000 segments were randomly selected for manual annotation. Both the reference translations were annotated, which means 2000 annotated segments in total.
The Czech data includes manual annotations of Czech reference translations of English source texts. This texts were translated independently by three translators. After some necessary cleanings, 1000 segments were randomly selected for manual annotation. All three reference translations were annotated, which means 3000 annotated segments in total.
Faust is part of PDT-C 1.0 (http://hdl.handle.net/11234/1-3185).
This package contains data sets for development and testing of machine translation of medical queries between Czech, English, French, German, Hungarian, Polish, Spanish ans Swedish. The queries come from general public and medical experts. This is version 2.0 extending the previous version by adding Hungarian, Polish, Spanish, and Swedish translations.
This package contains data sets for development (Section dev) and testing (Section test) of machine translation of sentences from summaries of medical articles between Czech, English, French, German, Hungarian, Polish, Spanish
and Swedish. Version 2.0 extends the previous version by adding Hungarian, Polish, Spanish, and Swedish translations.
Document-level testsuite for evaluation of gender translation consistency.
Our Document-Level test set consists of selected English documents from the WMT21 newstest annotated with gender information. Czech unnanotated references are also added for convenience.
We semi-automatically annotated person names and pronouns to identify the gender of these elements as well as coreferences.
Our proposed annotation consists of three elements: (1) an ID, (2) an element class, and (3) gender.
The ID identifies a person's name and its occurrences (name and pronouns).
The element class identifies whether the tag refers to a name or a pronoun.
Finally, the gender information defines whether the element is masculine or feminine.
We performed a series of NLP techniques to automatically identify person names and coreferences.
This initial process resulted in a set containing 45 documents to be manually annotated.
Thus, we started a manual annotation of these documents to make sure they are correctly tagged.
See README.md for more details.