Document-level testsuite for evaluation of gender translation consistency.
Our Document-Level test set consists of selected English documents from the WMT21 newstest annotated with gender information. Czech unnanotated references are also added for convenience.
We semi-automatically annotated person names and pronouns to identify the gender of these elements as well as coreferences.
Our proposed annotation consists of three elements: (1) an ID, (2) an element class, and (3) gender.
The ID identifies a person's name and its occurrences (name and pronouns).
The element class identifies whether the tag refers to a name or a pronoun.
Finally, the gender information defines whether the element is masculine or feminine.
We performed a series of NLP techniques to automatically identify person names and coreferences.
This initial process resulted in a set containing 45 documents to be manually annotated.
Thus, we started a manual annotation of these documents to make sure they are correctly tagged.
See README.md for more details.
Moroccan Dialect Electronic Dictionary (MDED) is an electronic lexicon containing almost 15000 MSA entries. They are written in Arabic letters and translated to Moroccan Arabic dialect. In addition, MDED entries are annotated useful metadata such as POS, Origin and root. MDED can be useful in some advanced NLP applications such as Machine translation and morphological analyzer.