A test set that contains manually annotated sentences with gapping.
The test set was compiled from SynTagRus (v. 2015), the dependency treebank for Russian that provides comprehensive, manually corrected morphological and syntactic annotation.
The presented data and metadata include answers to questions raised in the questionnaire focused on the experience of teaching practicums and their role in the practical preparation of English language teachers at the Faculty of Arts, Charles University, as well as a basic quantitative analysis of the answers.
The analysis of the questionnaires shows that trainees are, in most cases, prepared for their teaching practicum both professionally and in terms of pedagogy and psychology, and the use of reflective teaching methods seems very useful. The benefits of the teaching practicum include, in particular, getting to know the real situation of teaching in secondary schools and working with a larger group of pupils, getting to know oneself as a teacher, gaining self-confidence, and becoming aware of one's own limits and areas for improvement. The downsides of the current system of teaching practice are mainly the low time allocation, the lack of integration of the practice in the curriculum, the lack of involvement of the trainee in the daily running of the school (administrative work, supervision, meetings), and the lack of quality feedback from the faculty teacher.
The ACL RD-TEC 2.0 has been developed with the aim of providing a benchmark for the evaluation of methods for terminology extraction and classification as well as entity recognition tasks based on specialised text from the computational linguistics domain. This release of the corpus consists of 300 abstracts from articles in the ACL Anthology Reference Corpus, published between 1978 and 2006. In these abstracts, terms (i.e., single or multi-word lexical units with a specialised meaning) are manually annotated. In addition to their boundaries in running text, annotated terms are classified into one of seven categories: method, tool, language resource (LR), LR product, model, measures and measurements, and other. To assess the quality of the annotations and to determine the difficulty of this task, more than 171 of the abstracts are annotated twice, independently, by each of the two annotators. In total, 6,818 terms are identified and annotated, resulting in a specialised vocabulary made of 3,318 lexical forms, mapped to 3,471 concepts.
This is the first release of the UFAL Parallel Corpus of North Levantine, compiled by the Institute of Formal and Applied Linguistics (ÚFAL) at Charles University within the Welcome project (https://welcome-h2020.eu/). The corpus consists of 120,600 multiparallel sentences in English, French, German, Greek, Spanish, and Standard Arabic selected from the OpenSubtitles2018 corpus [1] and manually translated into the North Levantine Arabic language. The corpus was created for the purpose of training machine translation for North Levantine and the other languages.
VALLEX 3.0 provides information on the valency structure (combinatorial potential) of verbs in their particular senses, which are characterized by glosses and examples. VALLEX 3.0 describes almost 4 600 Czech verbs in more than 10 800 lexical units, i.e., given verbs in the given senses.
VALLEX 3.0 is a collection of linguistically annotated data and documentation, resulting from an attempt at formal description of valency frames of Czech verbs. In order to satisfy different needs of different potential users, the lexicon is distributed (i) in an HTML version (the data allows for easy and fast navigation through the lexicon) and (ii) in a machine-tractable form as a single XML file, so that the VALLEX data can be used in NLP applications.
VALLEX 4.0 provides information on the valency structure (combinatorial potential) of verbs in their particular senses; each sense is characterized by a gloss and examples. VALLEX 4.0 describes almost 4 700 Czech verbs in more than 11 000 lexical units, i.e., given verbs in the given senses. VALLEX 4.0 is a collection of linguistically annotated data and documentation, resulting from an attempt at formal description of valency frames of Czech verbs. In order to satisfy different needs of different potential users, the lexicon is distributed (i) in an HTML version (the data allows for easy and fast navigation through the lexicon) and (ii) in a machine-tractable form, so that the VALLEX data can be used in NLP applications. In addition to the information from previous versions, VALLEX 4.0 also provides characteristics of verbs expressing reciprocity and reflexivity.
The data is provided in two formats: XML and JSON.
VALLEX 4.5 provides information on the valency structure (combinatorial potential) of Czech verbs in their particular senses (almost 4 700 verbs in more than 11 080 lexical units, supplemented with more than 290 nouns in more than 350 lexical units forming complex predicates with light verbs). VALLEX 4.5 is an enhanced successor of VALLEX 3.0, 3.5, and 4.0. In addition to the information stored there, VALLEX 4.5 provides a detailed description of reflexive verbs, i.e., verbs with the reflexive "se" or "si" as an obligatory part of their verb lexemes. VALLEX 4.5 covers 1 525 reflexive verbs in 1 545 lexical units (2 501 when aspectual counterparts are counted separately). In order to satisfy different needs of different potential users, the lexicon is distributed (i) online in an HTML version (the data allows for easy and fast navigation through the lexicon) and (ii) in this distribution in a machine-tractable form, so that the VALLEX data can be used in NLP applications.
We provide the Vietnamese version of the multilingual test set from the WMT 2013 [1] competition. The Vietnamese version was manually translated from English. For completeness, this record contains the 3000 sentences in all the original WMT 2013 languages (Czech, English, French, German, Russian and Spanish), extended with our Vietnamese version. The test set is used in [2] to evaluate translation between Czech, English and Vietnamese.
References
1. http://www.statmt.org/wmt13/evaluation-task.html
2. Duc Tam Hoang and Ondřej Bojar, The Prague Bulletin of Mathematical Linguistics. Volume 104, Issue 1, Pages 75--86, ISSN 1804-0462. 9/2015
The item contains models to tune for the WMT16 Tuning shared task for Czech-to-English.
The CzEng 1.6pre corpus (http://ufal.mff.cuni.cz/czeng/czeng16pre) is used for training the translation models. Before training, the data is tokenized (using the Moses tokenizer) and lowercased, and sentences longer than 60 words or shorter than 4 words are removed. Alignment is done using fast_align (https://github.com/clab/fast_align), and the standard Moses pipeline is used for training.
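The preprocessing described above can be approximated with a minimal Python sketch. It is an illustration under stated assumptions, not the script used to build the models: whitespace splitting stands in for the Moses tokenizer, the length limits are assumed to apply to both sides of each sentence pair, and the function names `preprocess_pairs` and `to_fast_align_line` are hypothetical. The only external convention relied on is the fast_align input format, one `source ||| target` pair per line.

```python
def preprocess_pairs(pairs, min_len=4, max_len=60):
    """Lowercase sentence pairs and drop those outside the length limits.

    Assumption: text is already tokenized, so whitespace splitting
    approximates the Moses tokenizer output. A pair is kept only if
    BOTH sides have between min_len and max_len tokens (inclusive).
    """
    kept = []
    for src, tgt in pairs:
        src_tokens = src.lower().split()
        tgt_tokens = tgt.lower().split()
        if all(min_len <= len(toks) <= max_len
               for toks in (src_tokens, tgt_tokens)):
            kept.append((" ".join(src_tokens), " ".join(tgt_tokens)))
    return kept


def to_fast_align_line(src, tgt):
    """Format one sentence pair in the fast_align input format."""
    return f"{src} ||| {tgt}"


if __name__ == "__main__":
    sample = [
        ("To je malý test .", "This is a small test ."),
        ("Ahoj !", "Hello !"),  # too short, will be dropped
    ]
    for src, tgt in preprocess_pairs(sample):
        print(to_fast_align_line(src, tgt))
```

Whether the original filter checked one side or both is not stated in the description; checking both is the more conservative choice here.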
Two 5-gram language models are trained using KenLM: one using only the CzEng English data, and the other using all English monolingual data available for WMT except Common Crawl.
Also included are two lexicalized bidirectional reordering models, word-based and hierarchical, with msd conditioned on both source and target of the processed CzEng.
This item contains models to tune for the WMT16 Tuning shared task for English-to-Czech.
The CzEng 1.6pre corpus (http://ufal.mff.cuni.cz/czeng/czeng16pre) is used for training the translation models. Before training, the data is tokenized (using the Moses tokenizer) and lowercased, and sentences longer than 60 words or shorter than 4 words are removed. Alignment is done using fast_align (https://github.com/clab/fast_align), and the standard Moses pipeline is used for training.
Two 5-gram language models are trained using KenLM: one using only the CzEng Czech data, and the other using all Czech monolingual data available for WMT except Common Crawl.
Also included are two lexicalized bidirectional reordering models, word-based and hierarchical, with msd conditioned on both source and target of the processed CzEng.