231. Video699: lecture recordings and lecture materials

Creator:: Novotný, Vít
Publisher:: Faculty of Informatics, Masaryk University
Type:: video and corpus
Subject:: information retrieval, video, image, and XML
Language:: English and Czech
Description:: This is an XML dataset of 17 lecture recordings randomly sampled from the lectures recorded at the Faculty of Informatics, Brno, Czechia during 2010–2016. We drew a stratified sample of up to 25 video frames from each recording. In each video frame, we annotated lit projection screens and their condition. For each lit projection screen, we annotated lecture materials shown in the screen. The dataset contains 699 projection screen annotations, and 925 lecture materials.
Rights:: Open Data Commons Open Database License (ODbL), http://opendatacommons.org/licenses/odbl/summary/, and PUB

232. VPS-30-En

Creator:: Cinková, Silvie, Holub, Martin, Rambousek, Adam, and Smejkalová, Lenka
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text, lexicon, and lexicalConceptualResource
Subject:: corpus pattern analysis, clustering, lexical semantics, and verbs
Language:: English
Description:: VPS-30-En is a small lexical resource that contains the following 30 English verbs: access, ally, arrive, breathe, claim, cool, crush, cry, deny, enlarge, enlist, forge, furnish, hail, halt, part, plough, plug, pour, say, smash, smell, steer, submit, swell, tell, throw, trouble, wake and yield. We have created and have been using VPS-30-En to explore the interannotator agreement potential of the Corpus Pattern Analysis. VPS-30-En is a small snapshot of the Pattern Dictionary of English Verbs (Hanks and Pustejovsky, 2005), which we revised (both the entries and the annotated concordances) and enhanced with additional annotations. This work has been partly supported by the Ministry of Education of CR within the LINDAT-Clarin project LM2010013, and by the Czech Science Foundation under the projects P103/12/G084, P406/2010/0875 and P401/10/0792.
Rights:: Creative Commons - Attribution 3.0 Unported (CC BY 3.0), http://creativecommons.org/licenses/by/3.0/, and PUB

233. VPS-GradeUp (2016-10-10)

Creator:: Baisa, Vít, Cinková, Silvie, Krejčová, Ema, and Vernerová, Anna
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text, other, and lexicalConceptualResource
Subject:: Pattern Dictionary of English Verbs, usage patterns, lexical semantics, dictionaries, clustering, Corpus Pattern Analysis, verbs, graded decisions, Likert scale, and Word Sense Disambiguation
Language:: English
Description:: VPS-GradeUp is a collection of triple manual annotations of 29 English verbs based on the Pattern Dictionary of English Verbs (PDEV) and comprising the following lemmas: abolish, act, adjust, advance, answer, approve, bid, cancel, conceive, cultivate, cure, distinguish, embrace, execute, hire, last, manage, murder, need, pack, plan, point, praise, prescribe, sail, seal, see, talk, urge. It contains results from two different tasks: (1) graded decisions and (2) best-fit pattern (WSD). In both tasks, the annotators matched verb senses defined by the PDEV patterns with 50 actual uses of each verb (using concordances from the BNC [2]). The verbs were randomly selected from a list of completed PDEV lemmas with at least 3 patterns and at least 100 BNC concordances not previously annotated by PDEV’s own annotators. The selection also excluded verbs contained in VPS-30-En [3], a data set we developed earlier. This data set was built within the project Reviving Zellig S. Harris: more linguistic information for distributional lexical analysis of English and Czech and in connection with the SemEval-2015 CPA-related task.
Rights:: Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB

234. Vystadial 2013 – English data

Creator:: Korvas, Matěj, Plátek, Ondřej, Dušek, Ondřej, Žilka, Lukáš, and Jurčíček, Filip
Publisher:: Charles University, Faculty of Mathematics and Physics
Type:: audio and corpus
Subject:: acoustic data, speech corpus, spoken corpus, orthographic transcriptions, telephone speech, voip, and dialogue system
Language:: English
Description:: Vystadial 2013 is a dataset of telephone conversations in English and Czech, developed for training acoustic models for automatic speech recognition in spoken dialogue systems. It ships in three parts: Czech data, English data, and scripts. The data comprise over 41 hours of speech in English and over 15 hours in Czech, plus orthographic transcriptions. The scripts implement data pre-processing and building acoustic models using the HTK and Kaldi toolkits. This is the English data part of the dataset. This research was funded by the Ministry of Education, Youth and Sports of the Czech Republic under the grant agreement LK11221.
Rights:: Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0), http://creativecommons.org/licenses/by-sa/3.0/, and PUB

235. Vystadial 2013 – scripts

Creator:: Korvas, Matěj, Plátek, Ondřej, Dušek, Ondřej, Žilka, Lukáš, and Jurčíček, Filip
Publisher:: Charles University, Faculty of Mathematics and Physics
Type:: toolService and tool
Subject:: ASR, HTK, Kaldi, and acoustic model
Language:: English and Czech
Description:: Vystadial 2013 is a dataset of telephone conversations in English and Czech, developed for training acoustic models for automatic speech recognition in spoken dialogue systems. It ships in three parts: Czech data, English data, and scripts. The data comprise over 41 hours of speech in English and over 15 hours in Czech, plus orthographic transcriptions. The scripts implement data pre-processing and building acoustic models using the HTK and Kaldi toolkits. This is the scripts part of the dataset. This research was funded by the Ministry of Education, Youth and Sports of the Czech Republic under the grant agreement LK11221.
Rights:: Apache License 2.0, http://opensource.org/licenses/Apache-2.0, and PUB

236. W2C – Web to Corpus – Corpora

Creator:: Majliš, Martin
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: multilingual corpora
Language:: Afrikaans, Tosk Albanian, Amharic, Arabic, Aragonese, Egyptian Arabic, Asturian, Azerbaijani, Belarusian, Bengali, Bosnian, Bishnupriya, Breton, Buginese, Bulgarian, Catalan, Cebuano, Czech, Chuvash, Corsican, Welsh, Danish, German, Dimli (individual language), Modern Greek (1453-), English, Esperanto, Estonian, Basque, Faroese, Persian, Finnish, French, Western Frisian, Gan Chinese, Scottish Gaelic, Irish, Galician, Gilaki, Gujarati, Haitian, Serbo-Croatian, Hebrew, Fiji Hindi, Hindi, Croatian, Upper Sorbian, Hungarian, Armenian, Ido, Interlingua (International Auxiliary Language Association), Indonesian, Icelandic, Italian, Javanese, Japanese, Kannada, Georgian, Kazakh, Korean, Kurdish, Latin, Latvian, Limburgan, Lithuanian, Lombard, Luxembourgish, Malayalam, Marathi, Macedonian, Malagasy, Mongolian, Maori, Malay (macrolanguage), Burmese, Neapolitan, Low German, Nepali (macrolanguage), Newari, Dutch, Norwegian Nynorsk, Norwegian, Occitan (post 1500), Ossetian, Pampanga, Piemontese, Polish, Portuguese, Quechua, Romanian, Russian, Yakut, Sicilian, Scots, Slovak, Slovenian, Spanish, Albanian, Serbian, Sundanese, Swahili (macrolanguage), Swedish, Tamil, Tatar, Telugu, Tajik, Tagalog, Thai, Turkish, Ukrainian, Urdu, Uzbek, Venetian, Vietnamese, Volapük, Waray (Philippines), Walloon, Yiddish, Yoruba, and Chinese
Description:: A set of corpora for 120 languages, automatically collected from Wikipedia and the web. The corpora were collected using the W2C toolset: http://hdl.handle.net/11858/00-097C-0000-0022-60D6-1
Rights:: Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0), http://creativecommons.org/licenses/by-sa/3.0/, and PUB

237. WMT 13 Test Set

Creator:: Hoang, Duc Tam and Bojar, Ondřej
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: test data, parallel corpus, and Vietnamese
Language:: Vietnamese, Czech, English, German, French, Spanish, and Russian
Description:: We provide the Vietnamese version of the multi-lingual test set from the WMT 2013 [1] competition. The Vietnamese version was manually translated from English. For completeness, this record contains the 3000 sentences in all the WMT 2013 original languages (Czech, English, French, German, Russian and Spanish), extended with our Vietnamese version. The test set is used in [2] to evaluate translation between Czech, English and Vietnamese. References: [1] http://www.statmt.org/wmt13/evaluation-task.html [2] Duc Tam Hoang and Ondřej Bojar, The Prague Bulletin of Mathematical Linguistics. Volume 104, Issue 1, Pages 75-86, ISSN 1804-0462. 9/2015
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

238. WMT 2011 Testing Set

Creator:: Galuščáková, Petra and Bojar, Ondřej
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: WMT, test data, and Slovak
Language:: Slovak, Czech, and English
Description:: Testing set from the WMT 2011 [1] competition, manually translated from Czech and English into Slovak. The test set contains 3003 sentences in Czech, Slovak and English and is described in [2]. References: [1] http://www.statmt.org/wmt11/evaluation-task.html [2] Petra Galuščáková and Ondřej Bojar. Improving SMT by Using Parallel Data of a Closely Related Language. In Human Language Technologies - The Baltic Perspective - Proceedings of the Fifth International Conference Baltic HLT 2012, volume 247 of Frontiers in AI and Applications, pages 58-65, Amsterdam, Netherlands, October 2012. IOS Press. The work on this project was supported by the grant EuroMatrixPlus (FP7-ICT-2007-3-231720 of the EU and 7E09003 of the Czech Republic).
Rights:: Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0), http://creativecommons.org/licenses/by-nc-sa/3.0/, and PUB

239. WMT16 APE Shared Task Data

Creator:: Turchi, Marco, Chatterjee, Rajen, and Negri, Matteo
Publisher:: Fondazione Bruno Kessler, Trento, Italy
Type:: text and corpus
Subject:: machine translation, machine learning, automatic postediting, and shared task
Language:: English and German
Description:: Training, development and test data (the same used for the Sentence-level Quality Estimation task) consist of English-German triplets (source, target and post-edit) belonging to the IT domain and already tokenized. The training and development sets contain 12,000 and 1,000 triplets respectively, while the test set contains 2,000 instances. All data is provided by the EU project QT21 (http://www.qt21.eu/).
Rights:: AGREEMENT ON THE USE OF DATA IN QT21 APE Task, https://lindat.mff.cuni.cz/repository/xmlui/page/licence-TAUS_QT21, and PUB

240. WMT16 Quality Estimation Shared Task Training and Development Data

Creator:: Specia, Lucia, Logacheva, Varvara, and Scarton, Carolina
Publisher:: University of Sheffield
Type:: text and corpus
Subject:: machine translation, quality estimation, and machine learning
Language:: English and German
Description:: Training and development data for the WMT16 QE task. Test data will be published as a separate item. This shared task will build on its previous four editions to further examine automatic methods for estimating the quality of machine translation output at run-time, without relying on reference translations. We include word-level, sentence-level and document-level estimation. The sentence and word-level tasks will explore a large dataset produced from post-editions by professional translators (as opposed to crowdsourced translations as in the previous year). For the first time, the data will be domain-specific (IT domain). The document-level task will use, for the first time, entire documents, which have been human annotated for quality indirectly in two ways: through reading comprehension tests and through a two-stage post-editing exercise. Our tasks have the following goals: (1) to advance work on sentence and word-level quality estimation by providing domain-specific, larger and professionally annotated datasets; (2) to study the utility of detailed information logged during post-editing (time, keystrokes, actual edits) for different levels of prediction; (3) to analyse the effectiveness of different types of quality labels provided by humans for longer texts in document-level prediction. This year's shared task provides new training and test datasets for all tasks, and allows participants to explore any additional data and resources deemed relevant. An in-house MT system was used to produce translations for the sentence and word-level tasks, and multiple MT systems were used to produce translations for the document-level task. Therefore, MT system-dependent information will be made available where possible.
Rights:: AGREEMENT ON THE USE OF DATA IN QT21, https://lindat.mff.cuni.cz/repository/xmlui/page/licence-TAUS_QT21, and PUB
