« Previous |
1 - 10 of 228
|
Next »
Number of results to display per page
Search Results
2. A Human-Annotated Dataset for Language Modeling and Named Entity Recognition in Medieval Documents
- Creator:
- Novotný, Vít, Luger, Kristýna, Štefánik, Michal, Vrabcová, Tereza, and Horák, Aleš
- Publisher:
- Masaryk University, Brno
- Type:
- text and corpus
- Subject:
- NER, named entity recognition, and Medieval
- Language:
- Czech, English, German, and Latin
- Description:
- This is an open dataset of sentences from 19th and 20th century letterpress reprints of documents from the Hussite era. The dataset contains a corpus for language modeling and human annotations for named entity recognition (NER).
- Rights:
- Public Domain Dedication (CC Zero), http://creativecommons.org/publicdomain/zero/1.0/, and PUB
3. A Human-Annotated Dataset for Language Modeling and Named Entity Recognition in Medieval Documents (2023-01-05)
- Creator:
- Novotný, Vít, Luger, Kristýna, Štefánik, Michal, Vrabcová, Tereza, and Horák, Aleš
- Publisher:
- Masaryk University, Brno
- Type:
- text and corpus
- Subject:
- NER, named entity recognition, and Medieval
- Language:
- Czech, English, German, and Latin
- Description:
- This is an open dataset of sentences from 19th and 20th century letterpress reprints of documents from the Hussite era. The dataset contains a corpus for language modeling and human annotations for named entity recognition (NER).
- Rights:
- Public Domain Dedication (CC Zero), http://creativecommons.org/publicdomain/zero/1.0/, and PUB
4. A Human-Annotated Dataset of Scanned Images and OCR Texts from Medieval Documents
- Creator:
- Novotný, Vít, Seidlová, Kristýna, Vrabcová, Tereza, and Horák, Aleš
- Publisher:
- Masaryk University, Brno
- Type:
- image and corpus
- Subject:
- ocr, optical character recognition, language identification, image super-resolution, sr, and Medieval
- Language:
- German, Czech, Latin, and English
- Description:
- This is an open dataset of scanned images and OCR texts from 19th and 20th century letterpress reprints of documents from the Hussite era. The dataset contains human annotations for layout analysis, OCR evaluation, and language identification.
- Rights:
- Public Domain Dedication (CC Zero), http://creativecommons.org/publicdomain/zero/1.0/, and PUB
5. A Human-Annotated Dataset of Scanned Images and OCR Texts from Medieval Documents: Supplementary Materials
- Creator:
- Novotný, Vít and Horák, Aleš
- Publisher:
- Masaryk University, Brno
- Type:
- text and corpus
- Subject:
- ocr, optical character recognition, language identification, image super-resolution, sr, and Medieval
- Language:
- Czech, English, German, and Latin
- Description:
- These are supplementary materials for an open dataset of scanned images and OCR texts from 19th and 20th century letterpress reprints of documents from the Hussite era. The dataset contains human annotations for layout analysis, OCR evaluation, and language identification and is available at http://hdl.handle.net/11234/1-4615. These supplementary materials contain OCR texts from different OCR engines for book pages for which we have both high-resolution scanned images and annotations for OCR evaluation.
- Rights:
- Public Domain Dedication (CC Zero), http://creativecommons.org/publicdomain/zero/1.0/, and PUB
6. A Small Dataset for English-to-Czech Speech Translation in the Travel Domain
- Creator:
- Cífka, Ondřej and Bojar, Ondřej
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- audio and corpus
- Subject:
- speech corpus, ASR, and machine translation
- Language:
- English and Czech
- Description:
- This small dataset contains 3 speech corpora collected using the Alex Translate telephone service (https://ufal.mff.cuni.cz/alex#alex-translate). The "part1" and "part2" corpora contain English speech with transcriptions and Czech translations. These recordings were collected from users of the service. Part 1 contains earlier recordings, filtered to include only clean speech; Part 2 contains later recordings with no filtering applied. The "cstest" corpus contains recordings of artificially created sentences, each containing one or more Czech names of places in the Czech Republic. These were recorded by a multinational group of students studying in Prague.
- Rights:
- Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0), http://creativecommons.org/licenses/by-sa/4.0/, and PUB
7. A Speech Test Set of Practice Business Presentations with Additional Relevant Texts
- Creator:
- Macháček, Dominik, Kratochvíl, Jonáš, Vojtěchová, Tereza, and Bojar, Ondřej
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- audio and corpus
- Subject:
- ASR, ASR evaluation, speech corpus, non-native English, speech recognition, speech recognition evaluation, speech and relevant texts, and European non-native English
- Language:
- English
- Description:
- We present a test corpus of audio recordings and transcriptions of presentations of students' enterprises together with their slides and web-pages. The corpus is intended for evaluation of automatic speech recognition (ASR) systems, especially in conditions where the prior availability of in-domain vocabulary and named entities is benefitable. The corpus consists of 39 presentations in English, each up to 90 seconds long, and slides and web-pages in Czech, Slovak, English, German, Romanian, Italian or Spanish. The speakers are high school students from European countries with English as their second language. We benchmark three baseline ASR systems on the corpus and show their imperfection.
- Rights:
- Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB
8. Air Traffic Control Communication
- Creator:
- Šmídl, Luboš
- Publisher:
- University of West Bohemia, Department of Cybernetics
- Type:
- audio and corpus
- Subject:
- speech corpus and acoustic model
- Language:
- English
- Description:
- Corpus contains recordings of communication between air traffic controllers and pilots. The speech is manually transcribed and labeled with the information about the speaker (pilot/controller, not the full identity of the person). The corpus is currently small (20 hours) but we plan to search for additional data next year. The audio data format is: 8kHz, 16bit PCM, mono. and Technology Agency of the Czech Republic, project No. TA01030476.
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
9. Alex Context NLG Dataset
- Creator:
- Dušek, Ondřej and Jurčíček, Filip
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- dialogue system, natural language generation, dialogue alignment, and entrainment
- Language:
- English
- Description:
- A dataset intended for fully trainable natural language generation (NLG) systems in task-oriented spoken dialogue systems (SDS), covering the English public transport information domain. It includes preceding context (user utterance) along with each data instance (pair of source meaning representation and target natural language paraphrase to be generated). Taking the form of the previous user utterance into account for generating the system response allows NLG systems trained on this dataset to entrain (adapt) to the preceding utterance, i.e., reuse wording and syntactic structure. This should presumably improve the perceived naturalness of the output, and may even lead to a higher task success rate. Crowdsourcing has been used to obtain natural context user utterances as well as natural system responses to be generated.
- Rights:
- Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0), http://creativecommons.org/licenses/by-sa/4.0/, and PUB
10. Annotated corpora and tools of the PARSEME Shared Task on Automatic Identification of Verbal Multiword Expressions (edition 1.1)
- Creator:
- Ramisch, Carlos, Cordeiro, Silvio Ricardo, Savary, Agata, Vincze, Veronika, Barbu Mititelu, Verginica, Bhatia, Archna, Buljan, Maja, Candito, Marie, Gantar, Polona, Giouli, Voula, Güngör, Tunga, Hawwari, Abdelati, Iñurrieta, Uxoa, Kovalevskaitė, Jolanta, Krek, Simon, Lichte, Timm, Liebeskind, Chaya, Monti, Johanna, Parra Escartín, Carla, QasemiZadeh, Behrang, Ramisch, Renata, Schneider, Nathan, Stoyanova, Ivelina, Vaidya, Ashwini, Walsh, Abigail, Aceta, Cristina, Aduriz, Itziar, Antoine, Jean-Yves, Arhar Holdt, Špela, Berk, Gözde, Bielinskienė, Agnė, Blagus, Goranka, Boizou, Loic, Bonial, Claire, Caruso, Valeria, Čibej, Jaka, Constant, Matthieu, Cook, Paul, Diab, Mona, Dimitrova, Tsvetana, Ehren, Rafael, Elbadrashiny, Mohamed, Elyovich, Hevi, Erden, Berna, Estarrona, Ainara, Fotopoulou, Aggeliki, Foufi, Vassiliki, Geeraert, Kristina, van Gompel, Maarten, Gonzalez, Itziar, Gurrutxaga, Antton, Ha-Cohen Kerner, Yaakov, Ibrahim, Rehab, Ionescu, Mihaela, Jain, Kanishka, Jazbec, Ivo-Pavao, Kavčič, Teja, Klyueva, Natalia, Kocijan, Kristina, Kovács, Viktória, Kuzman, Taja, Leseva, Svetlozara, Ljubešić, Nikola, Malka, Ruth, Markantonatou, Stella, Martínez Alonso, Héctor, Matas, Ivana, McCrae, John, de Medeiros Caseli, Helena, Onofrei, Mihaela, Palka-Binkiewicz, Emilia, Papadelli, Stella, Parmentier, Yannick, Pascucci, Antonio, Pasquer, Caroline, Pia di Buono, Maria, Puri, Vandana, Raffone, Annalisa, Ratori, Shraddha, Riccio, Anna, Sangati, Federico, Shukla, Vishakha, Simkó, Katalin, Šnajder, Jan, Somers, Clarissa, Srivastava, Shubham, Stefanova, Valentina, Taslimipoor, Shiva, Theoxari, Natasa, Todorova, Maria, Urizar, Ruben, Villavicencio, Aline, and Zilio, Leonardo
- Publisher:
- PARSEME
- Type:
- text and corpus
- Subject:
- Multiword expressions, verbal multiword expressions, light-verb constructions, verb-particle constructions, inherently reflexive verbs, verbal idioms, and multi-verb constructions
- Language:
- Bulgarian, German, Modern Greek (1453-), Spanish, Persian, French, Hebrew, Hungarian, Italian, Lithuanian, Polish, Portuguese, Romanian, Slovenian, Turkish, Hindi, Basque, English, and Croatian
- Description:
- This multilingual resource contains corpora in which verbal MWEs have been manually annotated. VMWEs include idioms (let the cat out of the bag), light-verb constructions (make a decision), verb-particle constructions (give up), inherently reflexive verbs (help oneself), and multi-verb constructions (make do). VMWEs were annotated according to the universal guidelines in 19 languages. The corpora are provided in the cupt format, inspired by the CONLL-U format. The corpora were used in the 1.1 edition of the PARSEME Shared Task (2018). For most languages, morphological and syntactic information – not necessarily using UD tagsets – including parts of speech, lemmas, morphological features and/or syntactic dependencies are also provided. Depending on the language, the information comes from treebanks (e.g., Universal Dependencies) or from automatic parsers trained on treebanks (e.g., UDPipe). This item contains training, development and test data, as well as the evaluation tools used in the PARSEME Shared Task 1.1 (2018). The annotation guidelines are available online: http://parsemefr.lif.univ-mrs.fr/parseme-st-guidelines/1.1
- Rights:
- PARSEME Shared Task Data (v. 1.1) Agreement, https://lindat.mff.cuni.cz/repository/xmlui/page/licence-mwe-1.1, and PUB