Search Results
2. A Human-Annotated Dataset for Language Modeling and Named Entity Recognition in Medieval Documents
- Creator:
- Novotný, Vít, Luger, Kristýna, Štefánik, Michal, Vrabcová, Tereza, and Horák, Aleš
- Publisher:
- Masaryk University, Brno
- Type:
- text and corpus
- Subject:
- NER, named entity recognition, and Medieval
- Language:
- Czech, English, German, and Latin
- Description:
- This is an open dataset of sentences from 19th and 20th century letterpress reprints of documents from the Hussite era. The dataset contains a corpus for language modeling and human annotations for named entity recognition (NER).
- Rights:
- Public Domain Dedication (CC Zero), http://creativecommons.org/publicdomain/zero/1.0/, and PUB
3. A Human-Annotated Dataset for Language Modeling and Named Entity Recognition in Medieval Documents (2023-01-05)
- Creator:
- Novotný, Vít, Luger, Kristýna, Štefánik, Michal, Vrabcová, Tereza, and Horák, Aleš
- Publisher:
- Masaryk University, Brno
- Type:
- text and corpus
- Subject:
- NER, named entity recognition, and Medieval
- Language:
- Czech, English, German, and Latin
- Description:
- This is an open dataset of sentences from 19th and 20th century letterpress reprints of documents from the Hussite era. The dataset contains a corpus for language modeling and human annotations for named entity recognition (NER).
- Rights:
- Public Domain Dedication (CC Zero), http://creativecommons.org/publicdomain/zero/1.0/, and PUB
4. A Human-Annotated Dataset of Scanned Images and OCR Texts from Medieval Documents
- Creator:
- Novotný, Vít, Seidlová, Kristýna, Vrabcová, Tereza, and Horák, Aleš
- Publisher:
- Masaryk University, Brno
- Type:
- image and corpus
- Subject:
- ocr, optical character recognition, language identification, image super-resolution, sr, and Medieval
- Language:
- German, Czech, Latin, and English
- Description:
- This is an open dataset of scanned images and OCR texts from 19th and 20th century letterpress reprints of documents from the Hussite era. The dataset contains human annotations for layout analysis, OCR evaluation, and language identification.
- Rights:
- Public Domain Dedication (CC Zero), http://creativecommons.org/publicdomain/zero/1.0/, and PUB
5. A Human-Annotated Dataset of Scanned Images and OCR Texts from Medieval Documents: Supplementary Materials
- Creator:
- Novotný, Vít and Horák, Aleš
- Publisher:
- Masaryk University, Brno
- Type:
- text and corpus
- Subject:
- ocr, optical character recognition, language identification, image super-resolution, sr, and Medieval
- Language:
- Czech, English, German, and Latin
- Description:
- These are supplementary materials for an open dataset of scanned images and OCR texts from 19th and 20th century letterpress reprints of documents from the Hussite era. The dataset contains human annotations for layout analysis, OCR evaluation, and language identification and is available at http://hdl.handle.net/11234/1-4615. These supplementary materials contain OCR texts from different OCR engines for book pages for which we have both high-resolution scanned images and annotations for OCR evaluation.
- Rights:
- Public Domain Dedication (CC Zero), http://creativecommons.org/publicdomain/zero/1.0/, and PUB
6. A morphological layer for the German part of the SMULTRON corpus
- Creator:
- Müller, Thomas, Schütze, Hinrich, Caratti, Francesca, and Recknagel, Arne
- Publisher:
- Center for Information and Language Processing, University of Munich
- Type:
- text and corpus
- Subject:
- morphology, morphological tagging, and PoS tagging
- Language:
- German
- Description:
- A morphological layer for the German part of the SMULTRON corpus [0]. The layer was annotated according to the STTS tagset and the annotation guidelines of the Tiger corpus. Coordinator: Thomas Müller. Annotators: Francesca Caratti, Arne Recknagel. The annotation process is described in: @InProceedings{mueller2015, author = {M\"uller, Thomas and Sch\"utze, Hinrich}, title = {Robust Morphological Tagging with Word Representations}, booktitle = {Proceedings of NAACL}, year = {2015}, } [0] http://www.cl.uzh.ch/research/parallelcorpora/paralleltreebanks/smultron_en.html
- Rights:
- Creative Commons - Attribution 3.0 Unported (CC BY 3.0), http://creativecommons.org/licenses/by/3.0/, and PUB
7. A Small Dataset for English-to-Czech Speech Translation in the Travel Domain
- Creator:
- Cífka, Ondřej and Bojar, Ondřej
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- audio and corpus
- Subject:
- speech corpus, ASR, and machine translation
- Language:
- English and Czech
- Description:
- This small dataset contains 3 speech corpora collected using the Alex Translate telephone service (https://ufal.mff.cuni.cz/alex#alex-translate). The "part1" and "part2" corpora contain English speech with transcriptions and Czech translations. These recordings were collected from users of the service. Part 1 contains earlier recordings, filtered to include only clean speech; Part 2 contains later recordings with no filtering applied. The "cstest" corpus contains recordings of artificially created sentences, each containing one or more Czech names of places in the Czech Republic. These were recorded by a multinational group of students studying in Prague.
- Rights:
- Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0), http://creativecommons.org/licenses/by-sa/4.0/, and PUB
8. A Speech Test Set of Practice Business Presentations with Additional Relevant Texts
- Creator:
- Macháček, Dominik, Kratochvíl, Jonáš, Vojtěchová, Tereza, and Bojar, Ondřej
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- audio and corpus
- Subject:
- ASR, ASR evaluation, speech corpus, non-native English, speech recognition, speech recognition evaluation, speech and relevant texts, and European non-native English
- Language:
- English
- Description:
- We present a test corpus of audio recordings and transcriptions of presentations of students' enterprises together with their slides and web pages. The corpus is intended for evaluation of automatic speech recognition (ASR) systems, especially in conditions where the prior availability of in-domain vocabulary and named entities is beneficial. The corpus consists of 39 presentations in English, each up to 90 seconds long, and slides and web pages in Czech, Slovak, English, German, Romanian, Italian or Spanish. The speakers are high school students from European countries with English as their second language. We benchmark three baseline ASR systems on the corpus and show their imperfections.
- Rights:
- Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB
9. Additional German-Czech reference translations of the WMT'11 test set
- Creator:
- Bojar, Ondřej, Zeman, Daniel, Dušek, Ondřej, Břečková, Jana, Farkačová, Hana, Grošpic, Pavel, Kačenová, Kristýna, Knechtová, Eva, Koubová, Anna, Lukavská, Jana, Nováková, Petra, and Petrdlíková, Jana
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- reference translation, German-Czech, and parallel corpus
- Language:
- German and Czech
- Description:
- Three additional Czech reference translations of the whole WMT 2011 data set (http://www.statmt.org/wmt11/test.tgz), translated from the German originals. The original segmentation of the WMT 2011 data is preserved. This project has been sponsored by the grants GAČR P406/11/1499 and EuroMatrixPlus (FP7-ICT-2007-3-231720 of the EU and 7E09003+7E11051 of the Ministry of Education, Youth and Sports of the Czech Republic).
- Rights:
- Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0), http://creativecommons.org/licenses/by-nc-sa/3.0/, and PUB
10. Addressed Arabic Phonetic Rules
- Creator:
- Mustafa, Ebtihal and Bouzoubaa, Karim
- Publisher:
- languages journal
- Type:
- text, wordList, and lexicalConceptualResource
- Subject:
- phonetics and Arabic phonetic system
- Language:
- Arabic
- Description:
- This XML file describes the Arabic phonetic constraints to be applied to Arabic roots. The first rule category lists the letters that may not occur in the same root, regardless of their order. The second category lists the letters that may not be used together in a root word in a specific order. The third and fourth categories specify that contiguous letters must not be redundant. ISLRN: 991-445-325-823-5
- Rights:
- Creative Commons - Attribution-NonCommercial 4.0 International (CC BY-NC 4.0), http://creativecommons.org/licenses/by-nc/4.0/, and PUB
11. AdjDeriNet: Words Derived from Adjectives in Czech
- Creator:
- Ševčíková, Magda and Žabokrtský, Zdeněk
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text, wordList, and lexicalConceptualResource
- Subject:
- adjectives, derivation, word-formation, and derivational morphology
- Language:
- Czech
- Description:
- Lexical network AdjDeriNet consists of pairs of base adjectives and their derivatives. It contains nearly 18 thousand base adjectives that are base words for more than 26 thousand lexemes of several parts of speech.
- Rights:
- Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0), http://creativecommons.org/licenses/by-nc-sa/3.0/, and PUB
12. Air Traffic Control Communication
- Creator:
- Šmídl, Luboš
- Publisher:
- University of West Bohemia, Department of Cybernetics
- Type:
- audio and corpus
- Subject:
- speech corpus and acoustic model
- Language:
- English
- Description:
- The corpus contains recordings of communication between air traffic controllers and pilots. The speech is manually transcribed and labeled with information about the speaker (pilot/controller, not the full identity of the person). The corpus is currently small (20 hours), but we plan to search for additional data next year. The audio data format is 8 kHz, 16-bit PCM, mono. Acknowledgement: Technology Agency of the Czech Republic, project No. TA01030476.
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
13. AKCES 1
- Creator:
- Šebesta, Karel, Goláňová, Hana, Letafková, Jana, and Jelínková, Blanka
- Publisher:
- Charles University in Prague, ÚČJTK
- Type:
- text and corpus
- Subject:
- youth language and written language
- Language:
- Czech
- Description:
- Corpus AKCES 1 includes texts written in Czech by youth (native speakers); it is the same data as the corpus SKRIPT 2012.
- Rights:
- Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0), http://creativecommons.org/licenses/by-nc-nd/3.0/, and PUB
14. AKCES 2
- Creator:
- Šebesta, Karel and Goláňová, Hana
- Publisher:
- Charles University in Prague, ÚČJTK
- Type:
- text and corpus
- Subject:
- youth language, classroom, language acquisition corpus, and AKCES
- Language:
- Czech
- Description:
- Corpus AKCES 2 consists of transcripts of recordings of classes at Czech elementary and secondary schools (AKCES/CLAC - Czech Language Acquisition Corpora). It is the same data as the corpus "Schola 2010" (see the link for search), but all the proper names have been removed in order to protect the privacy of participants. Acknowledgement: MŠMT (MSM0021620825), UK (PRVOUK P 10)
- Rights:
- Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0), http://creativecommons.org/licenses/by-nc-nd/3.0/, and PUB
15. AKCES 2 ver. 2
- Creator:
- Šebesta, Karel and Goláňová, Hana
- Publisher:
- Charles University in Prague, ÚČJTK
- Type:
- text and corpus
- Subject:
- youth language, classroom, language acquisition corpus, and AKCES
- Language:
- Czech
- Description:
- Corpus AKCES 2 ver. 2 consists of full, unabridged transcripts of recordings of classes at Czech elementary and secondary schools (AKCES/CLAC - Czech Language Acquisition Corpora). It is the same data as the corpus "Schola 2010" (see the link for search), but all the proper names have been removed in order to protect the privacy of participants. Acknowledgement: UK, PRVOUK P10
- Rights:
- Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0), http://creativecommons.org/licenses/by-nc-nd/3.0/, and PUB
16. AKCES 3
- Creator:
- Šebesta, Karel, Bedřichová, Zuzanna, Šormová, Kateřina, Štindlová, Barbora, Hrdlička, Milan, Hrdličková, Tereza, Hana, Jiří, Rosen, Alexandr, Petkevič, Vladimír, Jelínek, Tomáš, Škodová, Svatava, Poláčková, Marie, Janeš, Petr, Lundáková, Kateřina, Skoumalová, Hana, Šťastný, Klement, Sládek, Šimon, and Pierscieniak, Piotr
- Publisher:
- Charles University in Prague, ÚČJTK
- Type:
- text and corpus
- Subject:
- Czech as a foreign language, Czech language acquisition corpora, non-native speakers, AKCES, and second language acquisition
- Language:
- Czech
- Description:
- Corpus AKCES 3 includes texts written in Czech by non-native speakers (AKCES/CLAC - Czech Language Acquisition Corpora). Acknowledgement: ESF (OPVK CZ.1.07/2.2.00/07.0259), MŠMT (MSM0021620825), UK (P10)
- Rights:
- Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0), http://creativecommons.org/licenses/by-nc-nd/3.0/, and PUB
17. AKCES 4
- Creator:
- Šebesta, Karel, Bedřichová, Zuzanna, Štindlová, Barbora, Hrdlička, Milan, Hrdličková, Tereza, Hana, Jiří, Rosen, Alexandr, Petkevič, Vladimír, Jelínek, Tomáš, Škodová, Svatava, Janeš, Petr, Lundáková, Kateřina, Skoumalová, Hana, Šťastný, Klement, and Sládek, Šimon
- Publisher:
- Charles University
- Type:
- text and corpus
- Subject:
- language of children, Czech language acquisition, adolescents, and AKCES
- Language:
- Czech
- Description:
- Corpus AKCES 4 includes texts written in Czech by youth growing up in locations at risk of social exclusion (AKCES/CLAC - Czech Language Acquisition Corpora). Acknowledgement: ESF (OPVK CZ.1.07/2.2.00/07.0259), MŠMT (MSM0021620825), UK (P10)
- Rights:
- Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0), http://creativecommons.org/licenses/by-nc-nd/3.0/, and PUB
18. AKCES 5 (CzeSL-SGT)
- Creator:
- Šebesta, Karel, Bedřichová, Zuzanna, Šormová, Kateřina, Štindlová, Barbora, Hrdlička, Milan, Hrdličková, Tereza, Hana, Jiří, Petkevič, Vladimír, Jelínek, Tomáš, Škodová, Svatava, Poláčková, Marie, Janeš, Petr, Lundáková, Kateřina, Skoumalová, Hana, Sládek, Šimon, Pierscieniak, Piotr, Toufarová, Dagmar, Richter, Michal, Straka, Milan, and Rosen, Alexandr
- Publisher:
- Charles University
- Type:
- text and corpus
- Subject:
- learner corpus, Czech as a foreign language, Czech language acquisition corpora, AKCES, non-native speakers, and second language acquisition
- Language:
- Czech
- Description:
- Essays written by non-native learners of Czech, a part of AKCES/CLAC – Czech Language Acquisition Corpora. CzeSL-SGT stands for Czech as a Second Language with Spelling, Grammar and Tags. Extends the “foreign” (ciz) part of AKCES 3 (CzeSL-plain) by texts collected in 2013. Original forms and automatic corrections are tagged, lemmatized and assigned error labels. Most texts have metadata attributes (30 items) about the author and the text.
- Rights:
- Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0), http://creativecommons.org/licenses/by-sa/3.0/, and PUB
19. AKCES 5 (CzeSL-SGT) Release 2
- Creator:
- Šebesta, Karel, Bedřichová, Zuzanna, Šormová, Kateřina, Štindlová, Barbora, Hrdlička, Milan, Hrdličková, Tereza, Hana, Jiří, Petkevič, Vladimír, Jelínek, Tomáš, Škodová, Svatava, Poláčková, Marie, Janeš, Petr, Lundáková, Kateřina, Skoumalová, Hana, Sládek, Šimon, Pierscieniak, Piotr, Toufarová, Dagmar, Richter, Michal, Straka, Milan, and Rosen, Alexandr
- Publisher:
- Charles University
- Type:
- text and corpus
- Subject:
- learner corpus, Czech as a foreign language, Czech language acquisition corpora, AKCES, non-native speakers, and second language acquisition
- Language:
- Czech
- Description:
- Essays written by non-native learners of Czech, a part of AKCES/CLAC – Czech Language Acquisition Corpora. CzeSL-SGT stands for Czech as a Second Language with Spelling, Grammar and Tags. Extends the “foreign” (ciz) part of AKCES 3 (CzeSL-plain) by texts collected in 2013. Original forms and automatic corrections are tagged, lemmatized and assigned error labels. Most texts have metadata attributes (30 items) about the author and the text. Release 2 fixes a few minor bugs and a critical issue in Release 1: the native speakers of Ukrainian (s_L1:"uk") were wrongly labelled as speakers of "other European languages" (s_L1_group="IE"), instead of speakers of a Slavic language (s_L1_group="S"). The file is now a regular XML document, with all annotation represented as XML attributes.
- Rights:
- Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0), http://creativecommons.org/licenses/by-sa/3.0/, and PUB
20. AKCES-GEC Grammatical Error Correction Dataset for Czech
- Creator:
- Šebesta, Karel, Bedřichová, Zuzanna, Šormová, Kateřina, Štindlová, Barbora, Hrdlička, Milan, Hrdličková, Tereza, Hana, Jiří, Petkevič, Vladimír, Jelínek, Tomáš, Škodová, Svatava, Janeš, Petr, Lundáková, Kateřina, Skoumalová, Hana, Sládek, Šimon, Pierscieniak, Piotr, Toufarová, Dagmar, Straka, Milan, Rosen, Alexandr, Náplava, Jakub, and Poláčková, Marie
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- natural language correction, grammatical error correction, and gec
- Language:
- Czech
- Description:
- AKCES-GEC is a grammatical error correction corpus for Czech generated from a subset of AKCES. It contains train, dev and test files annotated in M2 format. Note that, in comparison to the CZESL-GEC dataset, this dataset contains separated edits together with their type annotations in M2 format and also has twice as many sentences. If you use this dataset, please use the following citation: @article{naplava2019wnut, title={Grammatical Error Correction in Low-Resource Scenarios}, author={N{\'a}plava, Jakub and Straka, Milan}, journal={arXiv preprint arXiv:1910.00353}, year={2019} }
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
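The AKCES-GEC record above mentions files annotated in M2 format. As a rough illustration only, the sketch below reads M2-style text under the assumption that it follows the layout commonly used in grammatical error correction: an `S` line with the tokenized sentence, followed by `A` lines of the form `start end|||type|||correction|||...|||annotator`. The sample sentence, the error type label `SVA`, and the names `Edit`, `Sentence`, `parse_m2`, and `apply_edits` are all hypothetical and are not taken from the dataset.

```python
from dataclasses import dataclass, field


@dataclass
class Edit:
    start: int        # index of the first token covered by the edit
    end: int          # index one past the last covered token
    error_type: str   # annotation label (label set depends on the dataset)
    correction: str   # replacement tokens, space-separated; empty means deletion
    annotator: int


@dataclass
class Sentence:
    tokens: list
    edits: list = field(default_factory=list)


def parse_m2(text: str) -> list:
    """Parse M2-style text into sentences with their attached edits."""
    sentences = []
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("S "):
            sentences.append(Sentence(tokens=line[2:].split()))
        elif line.startswith("A ") and sentences:
            fields = line[2:].split("|||")
            start, end = (int(x) for x in fields[0].split())
            sentences[-1].edits.append(
                Edit(start, end, fields[1], fields[2], int(fields[-1]))
            )
    return sentences


def apply_edits(sentence: Sentence, annotator: int = 0) -> list:
    """Apply one annotator's edits right to left so earlier offsets stay valid."""
    tokens = list(sentence.tokens)
    chosen = [e for e in sentence.edits if e.annotator == annotator and e.start >= 0]
    for e in sorted(chosen, key=lambda e: e.start, reverse=True):
        tokens[e.start:e.end] = e.correction.split()
    return tokens


# Hypothetical sample, written in English for readability.
SAMPLE = "S This are a sentence .\nA 1 2|||SVA|||is|||REQUIRED|||-NONE-|||0\n"

if __name__ == "__main__":
    for sent in parse_m2(SAMPLE):
        print(" ".join(apply_edits(sent)))
```

Edits with a negative start offset (used by some tools to mark "no change") are skipped here; whether the AKCES-GEC files use that convention is not stated in the record.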