This package contains an extended version of the test collection used in the CLEF eHealth Information Retrieval tasks in 2013--2015. Compared to the original version, it provides complete query translations into Czech, French, German, Hungarian, Polish, Spanish and Swedish and additional relevance assessment.
We have created test set for syntactic questions presented in the paper [1] which is more general than Mikolov's [2]. Since we were interested in morphosyntactic relations, we extended only the questions of the syntactic type with exception of nationality adjectives which is already covered completely in Mikolov's test set.
We constructed the pairs more or less manually, taking inspiration in the Czech side of the CzEng corpus [3], where explicit morphological annotation allows to identify various pairs of Czech words (different grades of adjectives, words and their negations, etc.). The word-aligned English words often shared the same properties. Another sources of pairs were acquired from various webpages usually written for learners of English. For example for verb tense, we relied on a freely available list of English verbs and their morphological variations.
We have included 100-1000 different pairs for each question set. The questions were constructed from the pairs similarly as by Mikolov: generating all possible pairs of pairs. This leads to millions of questions, so we randomly selected 1000 instances per question set, to keep the test set in the same order of magnitude. Additionally, we decided to extend set of questions on opposites to cover not only opposites of adjectives but also of nouns and verbs.
This package contains the eye-tracker recordings of 8 subjects evaluating English-to-Czech machine translation quality using the WMT-style ranking of sentences.
We provide the set of sentences evaluated, the exact screens presented to the annotators (including bounding box information for every area of interest and even for individual letters in the text) and finally the raw EyeLink II files with gaze trajectories.
The description of the experiment can be found in the paper:
Ondřej Bojar, Filip Děchtěrenko, Maria Zelenina. A Pilot Eye-Tracking Study of WMT-Style Ranking Evaluation.
Proceedings of the LREC 2016 Workshop “Translation Evaluation – From Fragmented Tools
and Data Sets to an Integrated Ecosystem”, Georg Rehm, Aljoscha Burchardt et al. (eds.). pp. 20-26. May 2016, Portorož, Slovenia.
This work has received funding from the European Union's Horizon 2020 research
and innovation programme under grant agreement no. 645452 (QT21). This work was
partially financially supported by the Government of Russian Federation, Grant
074-U01.
This work has been using language resources developed, stored and distributed
by the LINDAT/CLARIN project of the Ministry of Education, Youth and Sports of
the Czech Republic (project LM2010013).
Fairytale Child is a simple chatbot trying to simulate a curious child. It asks the user to tell a fairy tale, often interrupting to ask for details and clarifications. However, it remembers what it was told and tries to show it if possible.
The chatbot can communicate in Czech and in English. It analyzes the morphology of each sentence produced by the user with natural language processing tools, tries to identify potential questions to ask, and then asks one. A morphological generator is employed to generate correctly inflected sentences in Czech, so that the resulting sentences sound as natural as possible.
Pohádkové dítě je jednoduchý chatbot, simulující zvídavé dítě. Požádá uživatele, aby mu vyprávěl pohádku, ale často ho přerušuje, aby se zeptal na detaily a vysvětlení. Pamatuje si ale, co mu uživatel řekl, a snaží se to pokud možno dát najevo.
Chatbot umí komunikovat česky a anglicky. Analyzuje tvarosloví každé uživatelovy věty pomocí NLP nástrojů, pokusí se nalézt chodnou otázku, a tu pak položí. Aby tvořené české věty zněly co nejpřirozeněji, využívá se pro skloňování tvaroslovný generátor. and The work has been supported by GAUK 1572314 and SVV 260104.
It has been using language resources developed, stored and distributed by the LINDAT/CLARIN project of the Ministry of Education, Youth and Sports of the Czech Republic (project LM2010013).
Fairytale Child is a simple chatbot trying to simulate a curious child. It asks the user to tell a fairy tale, often interrupting to ask for details and clarifications. However, it remembers what it was told and tries to show it if possible.
The chatbot can communicate in Czech and in English. It analyzes the morphology of each sentence produced by the user with natural language processing tools, tries to identify potential questions to ask, and then asks one. A morphological generator is employed to generate correctly inflected sentences in Czech, so that the resulting sentences sound as natural as possible.
Pohádkové dítě je jednoduchý chatbot, simulující zvídavé dítě. Požádá uživatele, aby mu vyprávěl pohádku, ale často ho přerušuje, aby se zeptal na detaily a vysvětlení. Pamatuje si ale, co mu uživatel řekl, a snaží se to pokud možno dát najevo.
Chatbot umí komunikovat česky a anglicky. Analyzuje tvarosloví každé uživatelovy věty pomocí NLP nástrojů, pokusí se nalézt chodnou otázku, a tu pak položí. Aby tvořené české věty zněly co nejpřirozeněji, využívá se pro skloňování tvaroslovný generátor. and The work has been supported by GAUK 1572314 and SVV 260104.
It has been using language resources developed, stored and distributed by the LINDAT/CLARIN project of the Ministry of Education, Youth and Sports of the Czech Republic (project LM2010013).
Fairytale Child is a simple chatbot trying to simulate a curious child. It asks the user to tell a fairy tale, often interrupting to ask for details and clarifications. However, it remembers what it was told and tries to show it if possible.
The chatbot can communicate in Czech and in English. It analyzes the morphology of each sentence produced by the user with natural language processing tools, tries to identify potential questions to ask, and then asks one. A morphological generator is employed to generate correctly inflected sentences in Czech, so that the resulting sentences sound as natural as possible.
Pohádkové dítě je jednoduchý chatbot, simulující zvídavé dítě. Požádá uživatele, aby mu vyprávěl pohádku, ale často ho přerušuje, aby se zeptal na detaily a vysvětlení. Pamatuje si ale, co mu uživatel řekl, a snaží se to pokud možno dát najevo.
Chatbot umí komunikovat česky a anglicky. Analyzuje tvarosloví každé uživatelovy věty pomocí NLP nástrojů, pokusí se nalézt chodnou otázku, a tu pak položí. Aby tvořené české věty zněly co nejpřirozeněji, využívá se pro skloňování tvaroslovný generátor. and The work has been supported by GAUK 1572314 and SVV 260104.
It has been using language resources developed, stored and distributed by the LINDAT/CLARIN project of the Ministry of Education, Youth and Sports of the Czech Republic (project LM2010013).
Fairytale Child is a simple chatbot trying to simulate a curious child. It asks the user to tell a fairy tale, often interrupting to ask for details and clarifications. However, it remembers what it was told and tries to show it if possible.
The chatbot can communicate in Czech and in English. It analyzes the morphology of each sentence produced by the user with natural language processing tools, tries to identify potential questions to ask, and then asks one. A morphological generator is employed to generate correctly inflected sentences in Czech, so that the resulting sentences sound as natural as possible.
Pohádkové dítě je jednoduchý chatbot, simulující zvídavé dítě. Požádá uživatele, aby mu vyprávěl pohádku, ale často ho přerušuje, aby se zeptal na detaily a vysvětlení. Pamatuje si ale, co mu uživatel řekl, a snaží se to pokud možno dát najevo.
Chatbot umí komunikovat česky a anglicky. Analyzuje tvarosloví každé uživatelovy věty pomocí NLP nástrojů, pokusí se nalézt chodnou otázku, a tu pak položí. Aby tvořené české věty zněly co nejpřirozeněji, využívá se pro skloňování tvaroslovný generátor. and The work has been supported by GAUK 1572314 and SVV 260104.
It has been using language resources developed, stored and distributed by the LINDAT/CLARIN project of the Ministry of Education, Youth and Sports of the Czech Republic (project LM2010013).
Syntactic (including deep-syntactic - tectogrammatical) annotation of user-generated noisy sentences. The annotation was made on Czech-English and English-Czech Faust Dev/Test sets.
The English data includes manual annotations of English reference translations of Czech source texts. This texts were translated independently by two translators. After some necessary cleanings, 1000 segments were randomly selected for manual annotation. Both the reference translations were annotated, which means 2000 annotated segments in total.
The Czech data includes manual annotations of Czech reference translations of English source texts. This texts were translated independently by three translators. After some necessary cleanings, 1000 segments were randomly selected for manual annotation. All three reference translations were annotated, which means 3000 annotated segments in total.
Faust is part of PDT-C 1.0 (http://hdl.handle.net/11234/1-3185).
This machine translation test set contains 2223 Czech sentences collected within the FAUST project (https://ufal.mff.cuni.cz/grants/faust, http://hdl.handle.net/11234/1-3308).
Each original (noisy) sentence was normalized (clean1 and clean2) and translated to English independently by two translators.
HamleDT (HArmonized Multi-LanguagE Dependency Treebank) is a compilation of existing dependency treebanks (or dependency conversions of other treebanks), transformed so that they all conform to the same annotation style. This version uses Universal Dependencies as the common annotation style.
Update (November 1017): for a current collection of harmonized dependency treebanks, we recommend using the Universal Dependencies (UD). All of the corpora that are distributed in HamleDT in full are also part of the UD project; only some corpora from the Patch group (where HamleDT provides only the harmonizing scripts but not the full corpus data) are available in HamleDT but not in UD.