EngVallex is the English counterpart of the PDT-Vallex valency lexicon, using the same view of valency, valency frames and the description of a surface form of verbal arguments. EngVallex contains links also to PropBank and Verbnet, two existing English predicate-argument lexicons used, i.a., for the PropBank project. The EngVallex lexicon is fully linked to the English side of the PCEDT parallel treebank, which is in fact the PTB re-annotated using the Prague Dependency Treebank style of annotation. The EngVallex is available in an XML format in our repository, and also in a searchable form with examples from the PCEDT.
EngVallex 2.0 as a slightly updated version of EngVallex. It is the English counterpart of the PDT-Vallex valency lexicon, using the same view of valency, valency frames and the description of a surface form of verbal arguments. EngVallex contains links also to PropBank (English predicate-argument lexicon). The EngVallex lexicon is fully linked to the English side of the PCEDT parallel treebank(s), which is in fact the PTB re-annotated using the Prague Dependency Treebank style of annotation. The EngVallex is available in an XML format in our repository, and also in a searchable form with examples from the PCEDT. EngVallex 2.0 is the same dataset as the EngVallex lexicon packaged with the PCEDT 3.0 corpus, but published separately under a more permissive licence, avoiding the need for LDC licence which is tied to PCEDT 3.0 as a whole.
EnTam is a sentence aligned English-Tamil bilingual corpus from some of the publicly available websites that we have collected for NLP research involving Tamil. The standard set of processing has been applied on the the raw web data before the data became available in sentence aligned English-Tamil parallel corpus suitable for various NLP tasks. The parallel corpus includes texts from bible, cinema and news domains.
ESIC (Europarl Simultaneous Interpreting Corpus) is a corpus of 370 speeches (10 hours) in English, with manual transcripts, transcribed simultaneous interpreting into Czech and German, and parallel translations.
The corpus contains source English videos and audios. The interpreters' voices are not published within the corpus, but there is a tool that downloads them from the web of European Parliament, where they are publicly avaiable.
The transcripts are equipped with metadata (disfluencies, mixing voices and languages, read or spontaneous speech, etc.), punctuated, and with word-level timestamps.
The speeches in the corpus come from the European Parliament plenary sessions, from the period 2008-11. Most of the speakers are MEP, both native and non-native speakers of English. The corpus contains metadata about the speakers (name, surname, id, fraction) and about the speech (date, topic, read or spontaneous).
The current version of ESIC is v1.0. It has validation and evaluation parts.
ESIC (Europarl Simultaneous Interpreting Corpus) is a corpus of 370 speeches (10 hours) in English, with manual transcripts, transcribed simultaneous interpreting into Czech and German, and parallel translations.
The corpus contains source English videos and audios. The interpreters' voices are not published within the corpus, but there is a tool that downloads them from the web of European Parliament, where they are publicly avaiable.
The transcripts are equipped with metadata (disfluencies, mixing voices and languages, read or spontaneous speech, etc.), punctuated, and with word-level timestamps.
The speeches in the corpus come from the European Parliament plenary sessions, from the period 2008-11. Most of the speakers are MEP, both native and non-native speakers of English. The corpus contains metadata about the speakers (name, surname, id, fraction) and about the speech (date, topic, read or spontaneous).
ESIC has validation and evaluation parts.
The current version is ESIC v1.1, it extends v1.0 with manual sentence alignment of the tri-parallel texts, and with bi-parallel sentence alignment of English original transcripts and German interpreting.
This corpora is part of Deliverable 5.5 of the European Commission project QTLeap FP7-ICT-2013.4.1-610516 (http://qtleap.eu).
The texts are sentences from the Europarl parallel corpus (Koehn, 2005). We selected the monolingual sentences from parallel corpora for the following pairs: Bulgarian-English, Czech-English, Portuguese-English and Spanish-English. The English corpus is comprised by the English side of the Spanish-English corpus.
Basque is not in Europarl. In addition, it contains the Basque and English sides of the GNOME corpus.
The texts have been automatically annotated with NLP tools, including Word Sense Disambiguation, Named Entity Disambiguation and Coreference resolution. Please check deliverable D5.6 in http://qtleap.eu/deliverables for more information.
The file contains the charts, tables and figures serving to delineate the metaphor-metonymy cognitive mechanism behind English denominal verbs. The data was obtained by questionnaires and interviews, which was then documented into charts and tables. Figures submitted mainly provide clear outline and concise outline of the metaphor-metonymy models of denominalization.
This package contains an extended version of the test collection used in the CLEF eHealth Information Retrieval tasks in 2013--2015. Compared to the original version, it provides complete query translations into Czech, French, German, Hungarian, Polish, Spanish and Swedish and additional relevance assessment.
We have created test set for syntactic questions presented in the paper [1] which is more general than Mikolov's [2]. Since we were interested in morphosyntactic relations, we extended only the questions of the syntactic type with exception of nationality adjectives which is already covered completely in Mikolov's test set.
We constructed the pairs more or less manually, taking inspiration in the Czech side of the CzEng corpus [3], where explicit morphological annotation allows to identify various pairs of Czech words (different grades of adjectives, words and their negations, etc.). The word-aligned English words often shared the same properties. Another sources of pairs were acquired from various webpages usually written for learners of English. For example for verb tense, we relied on a freely available list of English verbs and their morphological variations.
We have included 100-1000 different pairs for each question set. The questions were constructed from the pairs similarly as by Mikolov: generating all possible pairs of pairs. This leads to millions of questions, so we randomly selected 1000 instances per question set, to keep the test set in the same order of magnitude. Additionally, we decided to extend set of questions on opposites to cover not only opposites of adjectives but also of nouns and verbs.
This package contains the eye-tracker recordings of 8 subjects evaluating English-to-Czech machine translation quality using the WMT-style ranking of sentences.
We provide the set of sentences evaluated, the exact screens presented to the annotators (including bounding box information for every area of interest and even for individual letters in the text) and finally the raw EyeLink II files with gaze trajectories.
The description of the experiment can be found in the paper:
Ondřej Bojar, Filip Děchtěrenko, Maria Zelenina. A Pilot Eye-Tracking Study of WMT-Style Ranking Evaluation.
Proceedings of the LREC 2016 Workshop “Translation Evaluation – From Fragmented Tools
and Data Sets to an Integrated Ecosystem”, Georg Rehm, Aljoscha Burchardt et al. (eds.). pp. 20-26. May 2016, Portorož, Slovenia.
This work has received funding from the European Union's Horizon 2020 research
and innovation programme under grant agreement no. 645452 (QT21). This work was
partially financially supported by the Government of Russian Federation, Grant
074-U01.
This work has been using language resources developed, stored and distributed
by the LINDAT/CLARIN project of the Ministry of Education, Youth and Sports of
the Czech Republic (project LM2010013).