Search Results
2. A Human-Annotated Dataset for Language Modeling and Named Entity Recognition in Medieval Documents
- Creator:
- Novotný, Vít, Luger, Kristýna, Štefánik, Michal, Vrabcová, Tereza, and Horák, Aleš
- Publisher:
- Masaryk University, Brno
- Type:
- text and corpus
- Subject:
- NER, named entity recognition, and Medieval
- Language:
- Czech, English, German, and Latin
- Description:
- This is an open dataset of sentences from 19th and 20th century letterpress reprints of documents from the Hussite era. The dataset contains a corpus for language modeling and human annotations for named entity recognition (NER).
- Rights:
- Public Domain Dedication (CC Zero), http://creativecommons.org/publicdomain/zero/1.0/, and PUB
3. A Human-Annotated Dataset for Language Modeling and Named Entity Recognition in Medieval Documents (2023-01-05)
- Creator:
- Novotný, Vít, Luger, Kristýna, Štefánik, Michal, Vrabcová, Tereza, and Horák, Aleš
- Publisher:
- Masaryk University, Brno
- Type:
- text and corpus
- Subject:
- NER, named entity recognition, and Medieval
- Language:
- Czech, English, German, and Latin
- Description:
- This is an open dataset of sentences from 19th and 20th century letterpress reprints of documents from the Hussite era. The dataset contains a corpus for language modeling and human annotations for named entity recognition (NER).
- Rights:
- Public Domain Dedication (CC Zero), http://creativecommons.org/publicdomain/zero/1.0/, and PUB
4. A Human-Annotated Dataset of Scanned Images and OCR Texts from Medieval Documents
- Creator:
- Novotný, Vít, Seidlová, Kristýna, Vrabcová, Tereza, and Horák, Aleš
- Publisher:
- Masaryk University, Brno
- Type:
- image and corpus
- Subject:
- ocr, optical character recognition, language identification, image super-resolution, sr, and Medieval
- Language:
- German, Czech, Latin, and English
- Description:
- This is an open dataset of scanned images and OCR texts from 19th and 20th century letterpress reprints of documents from the Hussite era. The dataset contains human annotations for layout analysis, OCR evaluation, and language identification.
- Rights:
- Public Domain Dedication (CC Zero), http://creativecommons.org/publicdomain/zero/1.0/, and PUB
5. A Human-Annotated Dataset of Scanned Images and OCR Texts from Medieval Documents: Supplementary Materials
- Creator:
- Novotný, Vít and Horák, Aleš
- Publisher:
- Masaryk University, Brno
- Type:
- text and corpus
- Subject:
- ocr, optical character recognition, language identification, image super-resolution, sr, and Medieval
- Language:
- Czech, English, German, and Latin
- Description:
- These are supplementary materials for an open dataset of scanned images and OCR texts from 19th and 20th century letterpress reprints of documents from the Hussite era. The dataset contains human annotations for layout analysis, OCR evaluation, and language identification and is available at http://hdl.handle.net/11234/1-4615. These supplementary materials contain OCR texts from different OCR engines for book pages for which we have both high-resolution scanned images and annotations for OCR evaluation.
- Rights:
- Public Domain Dedication (CC Zero), http://creativecommons.org/publicdomain/zero/1.0/, and PUB
6. A morphological layer for the German part of the SMULTRON corpus
- Creator:
- Müller, Thomas, Schütze, Hinrich, Caratti, Francesca, and Recknagel, Arne
- Publisher:
- Center for Information and Language Processing, University of Munich
- Type:
- text and corpus
- Subject:
- morphology, morphological tagging, and PoS tagging
- Language:
- German
- Description:
- A morphological layer for the German part of the SMULTRON corpus [0]. The layer was annotated according to the STTS tagset and the annotation guidelines of the Tiger corpus. Coordinator: Thomas Müller. Annotators: Francesca Caratti, Arne Recknagel. The annotation process is described in: @InProceedings{mueller2015, author = {M\"uller, Thomas and Sch\"utze, Hinrich}, title = {Robust Morphological Tagging with Word Representations}, booktitle = {Proceedings of NAACL}, year = {2015}, } [0] http://www.cl.uzh.ch/research/parallelcorpora/paralleltreebanks/smultron_en.html
- Rights:
- Creative Commons - Attribution 3.0 Unported (CC BY 3.0), http://creativecommons.org/licenses/by/3.0/, and PUB
7. A Small Dataset for English-to-Czech Speech Translation in the Travel Domain
- Creator:
- Cífka, Ondřej and Bojar, Ondřej
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- audio and corpus
- Subject:
- speech corpus, ASR, and machine translation
- Language:
- English and Czech
- Description:
- This small dataset contains 3 speech corpora collected using the Alex Translate telephone service (https://ufal.mff.cuni.cz/alex#alex-translate). The "part1" and "part2" corpora contain English speech with transcriptions and Czech translations. These recordings were collected from users of the service. Part 1 contains earlier recordings, filtered to include only clean speech; Part 2 contains later recordings with no filtering applied. The "cstest" corpus contains recordings of artificially created sentences, each containing one or more Czech names of places in the Czech Republic. These were recorded by a multinational group of students studying in Prague.
- Rights:
- Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0), http://creativecommons.org/licenses/by-sa/4.0/, and PUB
8. A Speech Test Set of Practice Business Presentations with Additional Relevant Texts
- Creator:
- Macháček, Dominik, Kratochvíl, Jonáš, Vojtěchová, Tereza, and Bojar, Ondřej
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- audio and corpus
- Subject:
- ASR, ASR evaluation, speech corpus, non-native English, speech recognition, speech recognition evaluation, speech and relevant texts, and European non-native English
- Language:
- English
- Description:
- We present a test corpus of audio recordings and transcriptions of presentations of students' enterprises together with their slides and web-pages. The corpus is intended for evaluation of automatic speech recognition (ASR) systems, especially in conditions where the prior availability of in-domain vocabulary and named entities is beneficial. The corpus consists of 39 presentations in English, each up to 90 seconds long, and slides and web-pages in Czech, Slovak, English, German, Romanian, Italian or Spanish. The speakers are high school students from European countries with English as their second language. We benchmark three baseline ASR systems on the corpus and show their imperfection.
- Rights:
- Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB
9. Additional German-Czech reference translations of the WMT'11 test set
- Creator:
- Bojar, Ondřej, Zeman, Daniel, Dušek, Ondřej, Břečková, Jana, Farkačová, Hana, Grošpic, Pavel, Kačenová, Kristýna, Knechtová, Eva, Koubová, Anna, Lukavská, Jana, Nováková, Petra, and Petrdlíková, Jana
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- reference translation, German-Czech, and parallel corpus
- Language:
- German and Czech
- Description:
- Three additional Czech reference translations of the whole WMT 2011 data set (http://www.statmt.org/wmt11/test.tgz), translated from the German originals. The original segmentation of the WMT 2011 data is preserved. This project has been sponsored by the grants GAČR P406/11/1499 and EuroMatrixPlus (FP7-ICT-2007-3-231720 of the EU and 7E09003+7E11051 of the Ministry of Education, Youth and Sports of the Czech Republic).
- Rights:
- Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0), http://creativecommons.org/licenses/by-nc-sa/3.0/, and PUB
10. Aging effects in an evolving phonological network
- Creator:
- Luef, Eva Maria
- Publisher:
- Charles University
- Type:
- text and corpus
- Subject:
- network aging, English as a second language, network evolution, phonological network, and preferential attachment
- Language:
- English
- Description:
- Phonological networks are representations of word forms and their phonological relationships with other words in a given language lexicon. A principle underlying the growth (or evolution) of those networks is preferential attachment, or the ‘rich-gets-richer’ mechanism, according to which words with many phonological neighbors (or links) are the main beneficiaries of future growth opportunities. Due to their limited number of words, language lexica constitute node-constrained networks where growth cannot keep increasing in a linear way; hence, preferential attachment is likely mitigated by certain factors. The present study investigated aging effects (i.e., a word’s finite time span of being active in terms of growth) in an evolving phonological network of English as a second language. It was found that phonological neighborhoods are constructed by one large initial lexical spurt, followed by sublinear growth spurts that eventually lead to very limited growth in later lexical spurts during network evolution, all the while obeying the law of preferential attachment. An analysis of the strength of phonological relationships between phonological word forms revealed a tendency to attach more distant phonological neighbors in the lower proficiency levels, while phonologically more similar neighbors enter phonological neighborhoods at more advanced levels of English as a second language. Overall, the findings suggest an aging effect in growth that favors younger words. In addition, beginning learners seem to prefer the acquisition of phonological neighbors that are easier to discriminate. Implications for the second language lexicon include leveraged learning mechanisms and learning bouts focussed on a smaller range of phonological segments, and raise questions concerning lexical processing in aging networks.
- Rights:
- Creative Commons - Attribution-NoDerivatives 4.0 International (CC BY-ND 4.0), http://creativecommons.org/licenses/by-nd/4.0/, and PUB
11. Air Traffic Control Communication
- Creator:
- Šmídl, Luboš
- Publisher:
- University of West Bohemia, Department of Cybernetics
- Type:
- audio and corpus
- Subject:
- speech corpus and acoustic model
- Language:
- English
- Description:
- The corpus contains recordings of communication between air traffic controllers and pilots. The speech is manually transcribed and labeled with information about the speaker (pilot/controller, not the full identity of the person). The corpus is currently small (20 hours) but we plan to search for additional data next year. The audio data format is 8 kHz, 16-bit PCM, mono. Acknowledgement: Technology Agency of the Czech Republic, project No. TA01030476.
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
12. AKCES 1
- Creator:
- Šebesta, Karel, Goláňová, Hana, Letafková, Jana, and Jelínková, Blanka
- Publisher:
- Charles University in Prague, ÚČJTK
- Type:
- text and corpus
- Subject:
- youth language and written language
- Language:
- Czech
- Description:
- Corpus AKCES 1 includes texts written in Czech by youth (native speakers); it is the same data as the corpus SKRIPT 2012.
- Rights:
- Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0), http://creativecommons.org/licenses/by-nc-nd/3.0/, and PUB
13. AKCES 2
- Creator:
- Šebesta, Karel and Goláňová, Hana
- Publisher:
- Charles University in Prague, ÚČJTK
- Type:
- text and corpus
- Subject:
- youth language, classroom, language acquisition corpus, and AKCES
- Language:
- Czech
- Description:
- Corpus AKCES 2 consists of transcripts of recordings of classes at Czech elementary and secondary schools (AKCES/CLAC - Czech Language Acquisition Corpora). It is the same data as the corpus "Schola 2010" (see the link for search), but all the proper names have been removed in order to protect the privacy of participants. Acknowledgement: MŠMT (MSM0021620825), UK (PRVOUK P 10).
- Rights:
- Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0), http://creativecommons.org/licenses/by-nc-nd/3.0/, and PUB
14. AKCES 2 ver. 2
- Creator:
- Šebesta, Karel and Goláňová, Hana
- Publisher:
- Charles University in Prague, ÚČJTK
- Type:
- text and corpus
- Subject:
- youth language, classroom, language acquisition corpus, and AKCES
- Language:
- Czech
- Description:
- Corpus AKCES 2 ver. 2 consists of full, unabridged transcripts of recordings of classes at Czech elementary and secondary schools (AKCES/CLAC - Czech Language Acquisition Corpora). It is the same data as the corpus "Schola 2010" (see the link for search), but all the proper names have been removed in order to protect the privacy of participants. Acknowledgement: UK, PRVOUK P10.
- Rights:
- Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0), http://creativecommons.org/licenses/by-nc-nd/3.0/, and PUB
15. AKCES 3
- Creator:
- Šebesta, Karel, Bedřichová, Zuzanna, Šormová, Kateřina, Štindlová, Barbora, Hrdlička, Milan, Hrdličková, Tereza, Hana, Jiří, Rosen, Alexandr, Petkevič, Vladimír, Jelínek, Tomáš, Škodová, Svatava, Poláčková, Marie, Janeš, Petr, Lundáková, Kateřina, Skoumalová, Hana, Šťastný, Klement, Sládek, Šimon, and Pierscieniak, Piotr
- Publisher:
- Charles University in Prague, ÚČJTK
- Type:
- text and corpus
- Subject:
- Czech as a foreign language, Czech language acquisition corpora, non-native speakers, AKCES, and second language acquisition
- Language:
- Czech
- Description:
- Corpus AKCES 3 includes texts written in Czech by non-native speakers (AKCES/CLAC - Czech Language Acquisition Corpora). Acknowledgement: ESF (OPVK CZ.1.07/2.2.00/07.0259), MŠMT (MSM0021620825), UK (P10).
- Rights:
- Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0), http://creativecommons.org/licenses/by-nc-nd/3.0/, and PUB
16. AKCES 4
- Creator:
- Šebesta, Karel, Bedřichová, Zuzanna, Štindlová, Barbora, Hrdlička, Milan, Hrdličková, Tereza, Hana, Jiří, Rosen, Alexandr, Petkevič, Vladimír, Jelínek, Tomáš, Škodová, Svatava, Janeš, Petr, Lundáková, Kateřina, Skoumalová, Hana, Šťastný, Klement, and Sládek, Šimon
- Publisher:
- Charles University
- Type:
- text and corpus
- Subject:
- language of children, Czech language acquisition, adolescents, and AKCES
- Language:
- Czech
- Description:
- Corpus AKCES 4 includes texts written in Czech by youth growing up in locations at risk of social exclusion (AKCES/CLAC - Czech Language Acquisition Corpora). Acknowledgement: ESF (OPVK CZ.1.07/2.2.00/07.0259), MŠMT (MSM0021620825), UK (P10).
- Rights:
- Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0), http://creativecommons.org/licenses/by-nc-nd/3.0/, and PUB
17. AKCES 5 (CzeSL-SGT)
- Creator:
- Šebesta, Karel, Bedřichová, Zuzanna, Šormová, Kateřina, Štindlová, Barbora, Hrdlička, Milan, Hrdličková, Tereza, Hana, Jiří, Petkevič, Vladimír, Jelínek, Tomáš, Škodová, Svatava, Poláčková, Marie, Janeš, Petr, Lundáková, Kateřina, Skoumalová, Hana, Sládek, Šimon, Pierscieniak, Piotr, Toufarová, Dagmar, Richter, Michal, Straka, Milan, and Rosen, Alexandr
- Publisher:
- Charles University
- Type:
- text and corpus
- Subject:
- learner corpus, Czech as a foreign language, Czech language acquisition corpora, AKCES, non-native speakers, and second language acquisition
- Language:
- Czech
- Description:
- Essays written by non-native learners of Czech, a part of AKCES/CLAC – Czech Language Acquisition Corpora. CzeSL-SGT stands for Czech as a Second Language with Spelling, Grammar and Tags. Extends the “foreign” (ciz) part of AKCES 3 (CzeSL-plain) by texts collected in 2013. Original forms and automatic corrections are tagged, lemmatized and assigned error labels. Most texts have metadata attributes (30 items) about the author and the text.
- Rights:
- Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0), http://creativecommons.org/licenses/by-sa/3.0/, and PUB
18. AKCES 5 (CzeSL-SGT) Release 2
- Creator:
- Šebesta, Karel, Bedřichová, Zuzanna, Šormová, Kateřina, Štindlová, Barbora, Hrdlička, Milan, Hrdličková, Tereza, Hana, Jiří, Petkevič, Vladimír, Jelínek, Tomáš, Škodová, Svatava, Poláčková, Marie, Janeš, Petr, Lundáková, Kateřina, Skoumalová, Hana, Sládek, Šimon, Pierscieniak, Piotr, Toufarová, Dagmar, Richter, Michal, Straka, Milan, and Rosen, Alexandr
- Publisher:
- Charles University
- Type:
- text and corpus
- Subject:
- learner corpus, Czech as a foreign language, Czech language acquisition corpora, AKCES, non-native speakers, and second language acquisition
- Language:
- Czech
- Description:
- Essays written by non-native learners of Czech, a part of AKCES/CLAC – Czech Language Acquisition Corpora. CzeSL-SGT stands for Czech as a Second Language with Spelling, Grammar and Tags. Extends the “foreign” (ciz) part of AKCES 3 (CzeSL-plain) by texts collected in 2013. Original forms and automatic corrections are tagged, lemmatized and assigned error labels. Most texts have metadata attributes (30 items) about the author and the text. In addition to fixing a few minor bugs, this release fixes a critical issue in Release 1: the native speakers of Ukrainian (s_L1:"uk") were wrongly labelled as speakers of "other European languages" (s_L1_group="IE"), instead of speakers of a Slavic language (s_L1_group="S"). The file is now a regular XML document, with all annotation represented as XML attributes.
- Rights:
- Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0), http://creativecommons.org/licenses/by-sa/3.0/, and PUB
19. AKCES-GEC Grammatical Error Correction Dataset for Czech
- Creator:
- Šebesta, Karel, Bedřichová, Zuzanna, Šormová, Kateřina, Štindlová, Barbora, Hrdlička, Milan, Hrdličková, Tereza, Hana, Jiří, Petkevič, Vladimír, Jelínek, Tomáš, Škodová, Svatava, Janeš, Petr, Lundáková, Kateřina, Skoumalová, Hana, Sládek, Šimon, Pierscieniak, Piotr, Toufarová, Dagmar, Straka, Milan, Rosen, Alexandr, Náplava, Jakub, and Poláčková, Marie
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- natural language correction, grammatical error correction, and gec
- Language:
- Czech
- Description:
- AKCES-GEC is a grammar error correction corpus for Czech generated from a subset of AKCES. It contains train, dev and test files annotated in M2 format. Note that in comparison to the CZESL-GEC dataset, this dataset contains separated edits together with their type annotations in M2 format and also has two times more sentences. If you use this dataset, please use the following citation: @article{naplava2019wnut, title={Grammatical Error Correction in Low-Resource Scenarios}, author={N{\'a}plava, Jakub and Straka, Milan}, journal={arXiv preprint arXiv:1910.00353}, year={2019} }
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
20. AlbMoRe Movie Reviews in Albanian
- Creator:
- Çano, Erion
- Publisher:
- University of Vienna
- Type:
- text and corpus
- Subject:
- sentiment analysis, under-resourced language, and albanian language
- Language:
- Albanian
- Description:
- AlbMoRe is a sentiment analysis corpus of movie reviews in Albanian, consisting of 800 records in CSV format. Each record includes a text review retrieved from IMDb and translated into Albanian by the author. It also contains a 0 (negative) or 1 (positive) label added by the author. The corpus is fully balanced, consisting of 400 positive and 400 negative reviews about 67 movies of different genres. The AlbMoRe corpus is released under the CC-BY license (https://creativecommons.org/licenses/by/4.0/). If using the data, please cite the following paper: Çano Erion. AlbMoRe: A Corpus of Movie Reviews for Sentiment Analysis in Albanian. CoRR, abs/2306.08526, 2023. URL https://arxiv.org/abs/2306.08526.
- Rights:
- Creative Commons - Attribution 4.0 International (CC BY 4.0), PUB, and http://creativecommons.org/licenses/by/4.0/
21. AlbNER Named Entity Recognition in Albanian
- Creator:
- Çano, Erion
- Publisher:
- University of Vienna
- Type:
- text and corpus
- Subject:
- named entity recognition, under-resourced languages, and albanian language
- Language:
- Albanian
- Description:
- AlbNER is a Named Entity Recognition corpus of Wikipedia sentences in Albanian, consisting of 900 records. The sentence tokens are manually labeled following the CoNLL-2003 shared task annotation scheme, explained at https://aclanthology.org/W03-0419.pdf, which uses the I-ORG, B-ORG, I-PER, B-PER, I-LOC, B-LOC, I-MISC, B-MISC and O tags. The AlbNER data are released under the CC-BY license (https://creativecommons.org/licenses/by/4.0/). If using the AlbNER corpus, please cite the following paper: Çano Erion. AlbNER: A Corpus for Named Entity Recognition in Albanian. CoRR, abs/2309.08741, 2023. URL https://arxiv.org/abs/2309.08741.
- Rights:
- Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB
22. AlbNews Albanian Topic Modeling
- Creator:
- Çano, Erion
- Publisher:
- University of Vienna
- Type:
- text and corpus
- Subject:
- under-resourced language, albanian language, and topic modeling
- Language:
- Albanian
- Description:
- AlbNews is a topic modeling corpus of news headlines in Albanian, consisting of 600 labeled samples and 2600 unlabeled samples. Each labeled sample includes a headline text retrieved from Albanian online news portals. It also contains one of the four labels: 'pol' for politics, 'cul' for culture, 'eco' for economy, and 'spo' for sport. Each of the unlabeled samples contains a headline text only. AlbTopic corpus is released under CC-BY 4.0 license (https://creativecommons.org/licenses/by/4.0/). If using the data, please cite the following paper: Çano Erion, Lamaj Dario. AlbNews: A Corpus of Headlines for Topic Modeling in Albanian. CoRR, abs/2402.04028, 2024. URL: https://arxiv.org/abs/2402.04028.
- Rights:
- Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB
23. Alex Context NLG Dataset
- Creator:
- Dušek, Ondřej and Jurčíček, Filip
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- dialogue system, natural language generation, dialogue alignment, and entrainment
- Language:
- English
- Description:
- A dataset intended for fully trainable natural language generation (NLG) systems in task-oriented spoken dialogue systems (SDS), covering the English public transport information domain. It includes preceding context (user utterance) along with each data instance (pair of source meaning representation and target natural language paraphrase to be generated). Taking the form of the previous user utterance into account for generating the system response allows NLG systems trained on this dataset to entrain (adapt) to the preceding utterance, i.e., reuse wording and syntactic structure. This should presumably improve the perceived naturalness of the output, and may even lead to a higher task success rate. Crowdsourcing has been used to obtain natural context user utterances as well as natural system responses to be generated.
- Rights:
- Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0), http://creativecommons.org/licenses/by-sa/4.0/, and PUB
24. Alpino Treebank
- Publisher:
- Center for Language and Cognition
- Format:
- application/xml
- Type:
- corpus
- Language:
- Dutch
- Description:
- A database of 7,000 syntactically analyzed Dutch sentences.
- Rights:
- Not specified
25. ALTWEB
- Type:
- corpus
- Language:
- Italian
- Description:
- Dialect (Tuscan); 380,000 entries; written; DBT tagset
- Rights:
- Not specified
26. Amara - universal subtitles
- Type:
- corpus
- Language:
- Arabic, Danish, Dutch, English, German, Modern Greek (1453-), Italian, Japanese, Korean, Portuguese, Russian, Spanish, and Turkish
- Description:
- Large set of subtitles available for download in multiple languages. Can be used as parallel corpus.
- Rights:
- Not specified
27. Amharic Web Corpus
- Creator:
- Suchomel, Vít and Rychlý, Pavel
- Publisher:
- Masaryk University, NLP Centre
- Type:
- text and corpus
- Subject:
- Amharic, text corpus, Web corpus, under-resourced language, corpus annotation, and morphological tagger
- Language:
- Amharic
- Description:
- Amharic web corpus. Crawled by SpiderLing in August 2013, October 2015, and January 2016. Encoded in UTF-8, cleaned, deduplicated. Tagged by TreeTagger trained on Amharic WIC corpus.
- Rights:
- NLP Centre Web Corpus License, https://lindat.mff.cuni.cz/repository/xmlui/page/license-NLPC-WeC, and ACA
28. Amharic WIC Corpus
- Creator:
- Rychlý, Pavel
- Publisher:
- Masaryk University, NLP Centre
- Type:
- text and corpus
- Subject:
- text corpora, Ethiopian languages, web corpora, under-resourced languages, and Amharic
- Language:
- Amharic
- Description:
- Substantially cleaned version of existing morphologically annotated WIC Corpus.
- Rights:
- Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB
29. Anglo-Saxon charters
- Publisher:
- King's College London
- Format:
- application/tei+xml
- Type:
- corpus
- Language:
- English
- Description:
- Charters written in Anglo-Saxon England before A.D. 900, marked up in TEI XML. Browsable online.
- Rights:
- Not specified
30. Annotated corpora and tools of the PARSEME Shared Task on Automatic Identification of Verbal Multiword Expressions (edition 1.0)
- Creator:
- Savary, Agata, Ramisch, Carlos, Cordeiro, Silvio Ricardo, Sangati, Federico, Vincze, Veronika, QasemiZadeh, Behrang, Candito, Marie, Cap, Fabienne, Giouli, Voula, Stoyanova, Ivelina, Doucet, Antoine, Adalı, Kübra, Barbu Mititelu, Verginica, Bejček, Eduard, El Maarouf, Ismail, Eryiğit, Gülşen, Galea, Luke, Ha-Cohen Kerner, Yaakov, Liebeskind, Chaya, Monti, Johanna, Parra Escartín, Carla, Kovalevskaitė, Jolanta, Krek, Simon, van der Plas, Lonneke, Aceta, Cristina, Aduriz, Itziar, Antoine, Jean-Yves, Attard, Greta, Azzopardi, Kirsty, Boizou, Loic, Bonnici, Janice, Boz, Mert, Bumbulienė, Ieva, Busuttil, Jael, Caruso, Valeria, Cherchi, Manuela, Constant, Matthieu, Czerepowicka, Monika, De Santis, Anna, Dimitrova, Tsvetana, Dinç, Tutkum, Elyovich, Hevi, Fabri, Ray, Farrugia, Alison, Findlay, Jamie, Fotopoulou, Aggeliki, Foufi, Vassiliki, Galea, Sara Anne, Gantar, Polona, Gatt, Albert, Gatt, Anabelle, Herrero, Carlos, Iñurrieta, Uxoa, Jagfeld, Glorianna, Hnátková, Milena, Ionescu, Mihaela, Klyueva, Natalia, Koeva, Svetla, Kovács, Viktória, Kuzman, Taja, Leseva, Svetlozara, Louisou, Sevi, Lynn, Teresa, Malka, Ruth, Martínez Alonso, Héctor, McCrae, John, de Medeiros Caseli, Helena, Miral, Ayşenur, Muscat, Amanda, Nivre, Joakim, Oakes, Michael, Onofrei, Mihaela, Parmentier, Yannick, Pasquer, Caroline, Pia di Buono, Maria, Priego Sanchez, Belem, Raffone, Annalisa, Ramisch, Renata, Rimkutė, Erika, Rizea, Monica-Mihaela, Simkó, Katalin, Spagnol, Michael, Stefanova, Valentina, Stymne, Sara, Sulubacak, Umut, Tabone, Nicole, Tanti, Marc, Todorova, Maria, Urešová, Zdenka, Villavicencio, Aline, and Zilio, Leonardo
- Publisher:
- PARSEME
- Type:
- text and corpus
- Subject:
- Multiword expressions, verbal multiword expressions, idioms, light-verb constructions, verb-particle constructions, and inherently reflexive verbs
- Language:
- Bulgarian, Czech, German, Modern Greek (1453-), Spanish, Persian, French, Hebrew, Hungarian, Italian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovenian, Swedish, and Turkish
- Description:
- The PARSEME shared task aims at identifying verbal MWEs in running texts. Verbal MWEs include idioms (let the cat out of the bag), light verb constructions (make a decision), verb-particle constructions (give up), and inherently reflexive verbs (se suicider 'to suicide' in French). VMWEs were annotated according to the universal guidelines in 18 languages. The corpora are provided in the parsemetsv format, inspired by the CONLL-U format. For most languages, paired files in the CONLL-U format - not necessarily using UD tagsets - containing parts of speech, lemmas, morphological features and/or syntactic dependencies are also provided. Depending on the language, the information comes from treebanks (e.g., Universal Dependencies) or from automatic parsers trained on treebanks (e.g., UDPipe). This item contains training and test data, tools and the universal guidelines file.
- Rights:
- PARSEME Shared Task Data (v. 1.0) Agreement, https://lindat.mff.cuni.cz/repository/xmlui/page/licence-mwe-1.0, and PUB
31. Annotated corpora and tools of the PARSEME Shared Task on Automatic Identification of Verbal Multiword Expressions (edition 1.1)
- Creator:
- Ramisch, Carlos, Cordeiro, Silvio Ricardo, Savary, Agata, Vincze, Veronika, Barbu Mititelu, Verginica, Bhatia, Archna, Buljan, Maja, Candito, Marie, Gantar, Polona, Giouli, Voula, Güngör, Tunga, Hawwari, Abdelati, Iñurrieta, Uxoa, Kovalevskaitė, Jolanta, Krek, Simon, Lichte, Timm, Liebeskind, Chaya, Monti, Johanna, Parra Escartín, Carla, QasemiZadeh, Behrang, Ramisch, Renata, Schneider, Nathan, Stoyanova, Ivelina, Vaidya, Ashwini, Walsh, Abigail, Aceta, Cristina, Aduriz, Itziar, Antoine, Jean-Yves, Arhar Holdt, Špela, Berk, Gözde, Bielinskienė, Agnė, Blagus, Goranka, Boizou, Loic, Bonial, Claire, Caruso, Valeria, Čibej, Jaka, Constant, Matthieu, Cook, Paul, Diab, Mona, Dimitrova, Tsvetana, Ehren, Rafael, Elbadrashiny, Mohamed, Elyovich, Hevi, Erden, Berna, Estarrona, Ainara, Fotopoulou, Aggeliki, Foufi, Vassiliki, Geeraert, Kristina, van Gompel, Maarten, Gonzalez, Itziar, Gurrutxaga, Antton, Ha-Cohen Kerner, Yaakov, Ibrahim, Rehab, Ionescu, Mihaela, Jain, Kanishka, Jazbec, Ivo-Pavao, Kavčič, Teja, Klyueva, Natalia, Kocijan, Kristina, Kovács, Viktória, Kuzman, Taja, Leseva, Svetlozara, Ljubešić, Nikola, Malka, Ruth, Markantonatou, Stella, Martínez Alonso, Héctor, Matas, Ivana, McCrae, John, de Medeiros Caseli, Helena, Onofrei, Mihaela, Palka-Binkiewicz, Emilia, Papadelli, Stella, Parmentier, Yannick, Pascucci, Antonio, Pasquer, Caroline, Pia di Buono, Maria, Puri, Vandana, Raffone, Annalisa, Ratori, Shraddha, Riccio, Anna, Sangati, Federico, Shukla, Vishakha, Simkó, Katalin, Šnajder, Jan, Somers, Clarissa, Srivastava, Shubham, Stefanova, Valentina, Taslimipoor, Shiva, Theoxari, Natasa, Todorova, Maria, Urizar, Ruben, Villavicencio, Aline, and Zilio, Leonardo
- Publisher:
- PARSEME
- Type:
- text and corpus
- Subject:
- Multiword expressions, verbal multiword expressions, light-verb constructions, verb-particle constructions, inherently reflexive verbs, verbal idioms, and multi-verb constructions
- Language:
- Bulgarian, German, Modern Greek (1453-), Spanish, Persian, French, Hebrew, Hungarian, Italian, Lithuanian, Polish, Portuguese, Romanian, Slovenian, Turkish, Hindi, Basque, English, and Croatian
- Description:
- This multilingual resource contains corpora in which verbal MWEs have been manually annotated. VMWEs include idioms (let the cat out of the bag), light-verb constructions (make a decision), verb-particle constructions (give up), inherently reflexive verbs (help oneself), and multi-verb constructions (make do). VMWEs were annotated according to the universal guidelines in 19 languages. The corpora are provided in the cupt format, inspired by the CONLL-U format. The corpora were used in the 1.1 edition of the PARSEME Shared Task (2018). For most languages, morphological and syntactic information – not necessarily using UD tagsets – including parts of speech, lemmas, morphological features and/or syntactic dependencies are also provided. Depending on the language, the information comes from treebanks (e.g., Universal Dependencies) or from automatic parsers trained on treebanks (e.g., UDPipe). This item contains training, development and test data, as well as the evaluation tools used in the PARSEME Shared Task 1.1 (2018). The annotation guidelines are available online: http://parsemefr.lif.univ-mrs.fr/parseme-st-guidelines/1.1
- Rights:
- PARSEME Shared Task Data (v. 1.1) Agreement, https://lindat.mff.cuni.cz/repository/xmlui/page/licence-mwe-1.1, and PUB
32. Annotated corpora and tools of the PARSEME Shared Task on Semi-Supervised Identification of Verbal Multiword Expressions (edition 1.2)
- Creator:
- Ramisch, Carlos, Guillaume, Bruno, Savary, Agata, Waszczuk, Jakub, Candito, Marie, Vaidya, Ashwini, Barbu Mititelu, Verginica, Bhatia, Archna, Iñurrieta, Uxoa, Giouli, Voula, Güngör, Tunga, Jiang, Menghan, Lichte, Timm, Liebeskind, Chaya, Monti, Johanna, Ramisch, Renata, Stymme, Sara, Walsh, Abigail, Xu, Hongzhi, Palka-Binkiewicz, Emilia, Ehren, Rafael, Stymne, Sara, Constant, Matthieu, Pasquer, Caroline, Parmentier, Yannick, Antoine, Jean-Yves, Carlino, Carola, Caruso, Valeria, Di Buono, Maria Pia, Pascucci, Antonio, Raffone, Annalisa, Riccio, Anna, Sangati, Federico, Speranza, Giulia, Cordeiro, Silvio Ricardo, de Medeiros Caseli, Helena, Miranda, Isaac, Rademaker, Alexandre, Vale, Oto, Villavicencio, Aline, Wick Pedro, Gabriela, Wilkens, Rodrigo, Zilio, Leonardo, Rizea, Monica-Mihaela, Ionescu, Mihaela, Onofrei, Mihaela, Chen, Jia, Ge, Xiaomin, Hu, Fangyuan, Hu, Sha, Li, Minli, Liu, Siyuan, Qin, Zhenzhen, Sun, Ruilong, Wang, Chenweng, Xiao, Huangyang, Yan, Peiyi, Yih, Tsy, Yu, Ke, Yu, Songping, Zeng, Si, Zhang, Yongchen, Zhao, Yun, Foufi, Vassiliki, Fotopoulou, Aggeliki, Markantonatou, Stella, Papadelli, Stella, Louizou, Sevasti, Aduriz, Itziar, Estarrona, Ainara, Gonzalez, Itziar, Gurrutxaga, Antton, Uria, Larraitz, Urizar, Ruben, Foster, Jennifer, Lynn, Teresa, Elyovitch, Hevi, Ha-Cohen Kerner, Yaakov, Malka, Ruth, Jain, Kanishka, Puri, Vandana, Ratori, Shraddha, Shukla, Vishakha, Srivastava, Shubham, Berk, Gozde, Erden, Berna, and Yirmibeşoğlu, Zeynep
- Publisher:
- PARSEME
- Type:
- text and corpus
- Subject:
- multiword expressions, verbal multiword expressions, light verb construction, verb-particle constructions, inherently reflexive verbs, verbal idioms, and multi-verb constructions
- Language:
- German, Modern Greek (1453-), Basque, French, Irish, Hebrew, Hindi, Italian, Polish, Portuguese, Romanian, Swedish, Turkish, and Chinese
- Description:
- This multilingual resource contains corpora in which verbal MWEs (VMWEs) have been manually annotated, gathered on the occasion of the 1.2 edition of the PARSEME Shared Task on Semi-Supervised Identification of Verbal MWEs (2020). VMWEs include idioms (let the cat out of the bag), light-verb constructions (make a decision), verb-particle constructions (give up), inherently reflexive verbs (help oneself), and multi-verb constructions (make do). For the 1.2 shared task edition, the data covers 14 languages, for which VMWEs were annotated according to the universal guidelines. The corpora are provided in the cupt format, inspired by the CoNLL-U format. Morphological and syntactic information (not necessarily using UD tagsets), including parts of speech, lemmas, morphological features and/or syntactic dependencies, is also provided. Depending on the language, the information comes from treebanks (e.g., Universal Dependencies) or from automatic parsers trained on treebanks (e.g., UDPipe). This item contains training, development and test data, as well as the evaluation tools used in the PARSEME Shared Task 1.2 (2020). The annotation guidelines are available online: http://parsemefr.lif.univ-mrs.fr/parseme-st-guidelines/1.2
- Rights:
- PARSEME Shared Task Data (v. 1.2) Agreement, https://lindat.mff.cuni.cz/repository/xmlui/page/licence-mwe-1.2, and PUB
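A minimal sketch of reading VMWE annotations from cupt-style lines. It assumes the layout is CoNLL-U's tab-separated columns plus a final PARSEME:MWE column, in which `*` marks a token outside any VMWE, `_` marks an unspecified value, and codes such as `1:VPC.full` (first token) and `1` (later tokens) tie tokens to one expression. The function name `read_vmwes`, the sample sentence and the `VPC.full` label are illustrative assumptions, not taken from the corpora; the shared task documentation remains the authoritative description of the format.

```python
from collections import defaultdict


def read_vmwes(cupt_lines):
    """Group token forms by VMWE id for one sentence in cupt-like format."""
    forms = defaultdict(list)   # VMWE id -> token forms in sentence order
    categories = {}             # VMWE id -> category label, if given
    for line in cupt_lines:
        if not line.strip() or line.startswith("#"):
            continue            # skip blank lines and comment/metadata lines
        cols = line.rstrip("\n").split("\t")
        form, mwe = cols[1], cols[-1]   # FORM column and the last (MWE) column
        if mwe in ("*", "_"):
            continue            # token is outside any VMWE or unspecified
        for code in mwe.split(";"):     # assumed: ";" separates several VMWEs
            mwe_id, _, category = code.partition(":")
            forms[mwe_id].append(form)
            if category:
                categories[mwe_id] = category
    return {i: (categories.get(i), forms[i]) for i in forms}


# Hypothetical sentence, written by hand for illustration only.
sample = [
    "# text = She gave up",
    "1\tShe\tshe\tPRON\t_\t_\t2\tnsubj\t_\t_\t*",
    "2\tgave\tgive\tVERB\t_\t_\t0\troot\t_\t_\t1:VPC.full",
    "3\tup\tup\tADP\t_\t_\t2\tcompound:prt\t_\t_\t1",
]
vmwes = read_vmwes(sample)
```

Here `vmwes` maps each expression id to its category and token forms, which is enough to list the annotated expressions of a sentence without a full CoNLL-U parser.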
33. Annotated Corpus of Czech Case Law for Reference Recognition Tasks
- Creator:
- Harašta, Jakub, Šavelka, Jaromír, Kasl, František, Kotková, Adéla, Loutocký, Pavel, Míšek, Jakub, Procházková, Daniela, Pullmannová, Helena, Semenišín, Petr, Šejnová, Tamara, Šimková, Nikola, Vosinek, Michal, Zavadilová, Lucie, and Zibner, Jan
- Publisher:
- Masaryk University, Brno
- Type:
- text and corpus
- Subject:
- reference recognition and legal texts
- Language:
- Czech
- Description:
- Annotated corpus of 350 decisions of Czech top-tier courts (Supreme Court, Supreme Administrative Court, Constitutional Court). Every decision was annotated by two trained annotators and then manually adjudicated by one trained curator to resolve possible disagreements between annotators. Adjudication was conducted non-destructively, so the dataset contains all original annotations. The corpus was developed as training and testing material for reference recognition tasks. The dataset contains references to other court decisions and literature. All references consist of basic units (identifier of court decision, identification of court issuing referred decision, author of book or article, title of book or article, point of interest in referred document, etc.) and values (polarity, depth of discussion, etc.).
- Rights:
- Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB
34. Annotated Corpus of Czech Case Law for Reference Recognition Tasks (2019-06-25)
- Creator:
- Harašta, Jakub, Šavelka, Jaromír, Kasl, František, Kotková, Adéla, Loutocký, Pavel, Míšek, Jakub, Procházková, Daniela, Pullmannová, Helena, Semenišín, Petr, Šejnová, Tamara, Šimková, Nikola, Vosinek, Michal, Zavadilová, Lucie, and Zibner, Jan
- Publisher:
- Masaryk University, Brno
- Type:
- text and corpus
- Subject:
- reference recognition and legal texts
- Language:
- Czech
- Description:
- Annotated corpus of 350 decisions of Czech top-tier courts (Supreme Court, Supreme Administrative Court, Constitutional Court). Every decision was annotated by two trained annotators and then manually adjudicated by one trained curator to resolve possible disagreements between annotators. Adjudication was conducted non-destructively, so the corpus (raw) contains all original annotations. The corpus was developed as training and testing material for reference recognition tasks. The dataset contains references to other court decisions and literature. All references consist of basic units (identifier of court decision, identification of court issuing referred decision, author of book or article, title of book or article, point of interest in referred document, etc.) and values (polarity, depth of discussion, etc.).
- Rights:
- Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB
35. Annotated Corpus of Czech Case Law for Segmentation Tasks
- Creator:
- Harašta, Jakub, Šavelka, Jaromír, Kasl, František, and Míšek, Jakub
- Publisher:
- Masaryk University, Brno
- Type:
- text and corpus
- Subject:
- document segmentation and legal texts
- Language:
- Czech
- Description:
- Annotated corpus of 350 decisions of Czech top-tier courts (Supreme Court, Supreme Administrative Court, Constitutional Court). 280 decisions were annotated by one trained annotator and then manually adjudicated by one trained curator. 70 decisions were annotated by two trained annotators and then manually adjudicated by one trained curator. Adjudication was conducted destructively, so the dataset contains only the correct annotations and not all original annotations. The corpus was developed as training and testing material for text segmentation tasks. The dataset contains decisions segmented into Header, Procedural History, Submission/Rejoinder, Court Argumentation, Footer, Footnotes, and Dissenting Opinion. Segmentation allows different parts of a text to be treated differently even if they contain similar linguistic or other features.
- Rights:
- Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB
36. Annotation of Dramatic Situations in Theater Play Scripts
- Creator:
- Mareček, David, Nováková, Marie, Vosecká, Klára, Doležal, Josef, and Rosa, Rudolf
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL) and The Academy of Performing Arts in Prague, Theatre Faculty (DAMU)
- Type:
- text and corpus
- Subject:
- theatre, play script, and dramatic situation
- Language:
- Czech
- Description:
- We defined 58 dramatic situations and annotated them in 19 play scripts. Then we selected only 5 well-recognized dramatic situations and annotated a further 33 play scripts. In this version of the data, we release only the 9 play scripts that can be freely distributed. One play is annotated independently by three annotators.
- Rights:
- Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB
37. Annotation of Dramatic Situations in Theater Play Scripts (2023)
- Creator:
- Mareček, David, Nováková, Marie, Vosecká, Klára, Doležal, Josef, and Rosa, Rudolf
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL) and The Academy of Performing Arts in Prague, Theatre Faculty (DAMU)
- Type:
- text and corpus
- Subject:
- theatre, play script, and dramatic situation
- Language:
- Czech
- Description:
- We defined 58 dramatic situations and annotated them in 19 play scripts. Then we selected only 5 well-recognized dramatic situations and annotated a further 33 play scripts. In the previous (first) version, we released 9 play scripts that could be freely distributed. In this (second) version of the data, we are adding another 10 plays for which we have obtained licenses from the authors. In total, 19 play scripts are available, and one of them is annotated three times, independently by three annotators.
- Rights:
- THEAITRE AI research only license, https://lindat.mff.cuni.cz/repository/xmlui/page/theaitre-license, and ACA
38. APE Shared Task WMT17: Human Post-edits Test Data DE-EN
- Creator:
- Turchi, Marco, Chatterjee, Rajen, and Negri, Matteo
- Publisher:
- Fondazione Bruno Kessler, Trento, Italy
- Type:
- text and corpus
- Subject:
- Human post-edits, machine translation, shared task, automatic post-editing, and post-editing
- Language:
- English
- Description:
- Human post-edited test sentences for the WMT 2017 Automatic Post-Editing task. The set consists of 2,000 tokenized English sentences from the IT domain. Source and target segments can be downloaded from: https://lindat.mff.cuni.cz/repository/xmlui/handle/11372/LRT-2132. All data is provided by the EU project QT21 (http://www.qt21.eu/).
- Rights:
- AGREEMENT ON THE USE OF DATA IN QT21 APE Task, https://lindat.mff.cuni.cz/repository/xmlui/page/licence-TAUS_QT21, and PUB
39. APE Shared Task WMT17: Human Post-edits Test Data EN-DE
- Creator:
- Turchi, Marco, Chatterjee, Rajen, and Negri, Matteo
- Publisher:
- Fondazione Bruno Kessler, Trento, Italy
- Type:
- text and corpus
- Subject:
- machine translation, human post-edits, shared task, automatic post-editing, and post-editing
- Language:
- German
- Description:
- Human post-edited test sentences for the WMT 2017 Automatic Post-Editing task. The set consists of 2,000 tokenized German sentences from the IT domain. Source and target segments can be downloaded from: https://lindat.mff.cuni.cz/repository/xmlui/handle/11372/LRT-2133. All data is provided by the EU project QT21 (http://www.qt21.eu/).
- Rights:
- AGREEMENT ON THE USE OF DATA IN QT21 APE Task, https://lindat.mff.cuni.cz/repository/xmlui/page/licence-TAUS_QT21, and PUB
40. APE Shared Task WMT18: Human Post-edits and References Test Data EN-DE PBSMT
- Creator:
- Turchi, Marco, Negri, Matteo, and Chatterjee, Rajen
- Publisher:
- Fondazione Bruno Kessler, Trento, Italy
- Type:
- text and corpus
- Subject:
- automatic post-editing, post-editing, phrase-based MT, and reference translation
- Language:
- German
- Description:
- Human post-edited and reference test sentences for the En-De PBSMT WMT 2018 Automatic Post-Editing task. Each file consists of 2,000 tokenized German sentences from the IT domain. All data is provided by the EU project QT21 (http://www.qt21.eu/).
- Rights:
- AGREEMENT ON THE USE OF DATA IN QT21 APE Task, https://lindat.mff.cuni.cz/repository/xmlui/page/licence-TAUS_QT21, and PUB
41. Arabic ACL corpus
- Creator:
- Salah Elfahal Elebaed, Hoyam, Kasbi, Mohammed, Nasri, Mohammed, and Bouzoubaa, Karim
- Publisher:
- International Journal of Computer Science Trends and Technology (IJCST)
- Type:
- text and corpus
- Subject:
- Controlled Natural Language, Arabic CNL, ACL, Arabic Corpus, and TEI
- Language:
- Arabic
- Description:
- This corpus comprises all sentences representing the Arabic Controlled Language (ACL). It contains 551 sentences taken from four textbooks and websites dedicated to teaching the Arabic language to children: a) First grade book, Republic of Sudan (كتاب الصف الاول جمهورية السودان), b) Al Jazeera Educational Site (موقع الجزيرة التعليمي), c) Bella Preparatory School Girls Forum (منتدى مدرسة بيلا الاعدادية بنات), and d) Albahr website (موقع انا البحر). These sentences follow 52 ACL rules, with an average of 10.6 sentences per rule. All sentences in the corpus were analyzed with the Farasa syntactic parser, and the validity of the parses was checked manually by expert linguists. The corpus consists of a header and a body. The header holds metadata that describe the corpus, such as the corpus name, the authors, and the sources. The body contains the rules. Each rule has a code, a structure, and all sentences that follow that rule. For each sentence, we store an id, the vowelled and unvowelled text, and the result of parsing with Farasa.
- Rights:
- Not specified
42. Arborest
- Type:
- corpus
- Language:
- Estonian
- Description:
- 149 sentences, VISL tagset
- Rights:
- Not specified
43. Artificial Treebank with Ellipsis
- Creator:
- Droganova, Kira, Zeman, Daniel, Kanerva, Jenna, and Ginter, Filip
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- universal dependencies, ellipsis, and gapping
- Language:
- English, Czech, Finnish, Russian, and Slovak
- Description:
- Artificially created treebank of elliptical constructions (gapping), in the annotation style of Universal Dependencies. Data taken from UD 2.1 release, and from large web corpora parsed by two parsers. Input data are filtered, sentences are identified where gapping could be applied, then those sentences are transformed, one or more words are omitted, resulting in a sentence with gapping. Details in Droganova et al.: Parse Me if You Can: Artificial Treebanks for Parsing Experiments on Elliptical Constructions, LREC 2018, Miyazaki, Japan.
- Rights:
- Licence Universal Dependencies v2.1, https://lindat.mff.cuni.cz/repository/xmlui/page/licence-UD-2.1, and PUB
44. Arts and Humanities Data Service Literature, Languages and Linguistics
- Type:
- corpus
- Language:
- English
- Description:
- Electronic texts, corpora, lexicons, and other resources
- Rights:
- Not specified
45. Aspect-Term Annotated Customer Reviews in Czech
- Creator:
- Fiala, Ondřej
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- sentiment analysis, opinion target, and customer review
- Language:
- Czech
- Description:
- This dataset contains user product reviews that are publicly available on the website of an established Czech online shop selling electronic devices. Each review consists of negative and positive aspects of the product. This setting pushes the customer to rate important characteristics. We selected 2000 positive and negative segments from these reviews and manually tagged their targets. Additionally, we selected 200 of the longest reviews and annotated them in the same way. The targets were either aspects of the evaluated product or some general attributes (e.g. price, ease of use).
- Rights:
- Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0), http://creativecommons.org/licenses/by-nc-sa/3.0/, and PUB
46. Audio and video database of Latvian folklore
- Publisher:
- Archives of Latvian Folklore, Institute of Literature, Folklore and Art, University of Latvia
- Format:
- application/octet-stream
- Type:
- corpus
- Language:
- Latvian
- Description:
- The database contains audio and video material related to traditional culture - songs, folktales, legends, life stories and various collective or individual folklore related performances. The content has been either specifically contributed to the Archives of Latvian Folklore or collected by its staff members.
- Rights:
- Not specified
47. Audio Recordings Archive
- Publisher:
- The Research Institute for the Languages of Finland
- Type:
- corpus
- Language:
- Finnish
- Description:
- The Audio Recordings Archive (Suomen kielen nauhoitearkisto) holds over 23,000 hours of recordings collected since 1959, providing authentic samples of Finnish dialects, languages related to Finnish, and other world languages. The collection additionally includes samples of Finnish dialects spoken in Sweden, Norway, Ingria, the United States and Australia. Digitisation of the audio bank was undertaken in 1999. Over half of its content has been digitised, totalling about 13,000 hours of recordings.
- Rights:
- Not specified
48. AudioPSP 24.01: Audio recordings of proceedings of the Chamber of Deputies of the Parliament of the Czech Republic
- Creator:
- Kopp, Matyáš
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- audio and corpus
- Subject:
- Parliament of the Czech Republic
- Language:
- Czech
- Description:
- This record contains audio recordings of proceedings of the Chamber of Deputies of the Parliament of the Czech Republic. The recordings have been provided by the official websites of the Chamber of Deputies, and the set contains them in their original format with no further processing. Recordings cover all available audio files from 2013-11-25 to 2023-07-26. Audio files are packed by year (2013-2023) and quarter (Q1-Q4) in tar archives audioPSP-YYYY-QN.tar. Furthermore, there are two TSV files: audioPSP-meta.quarterArchive.tsv contains metadata about archives, and audioPSP-meta.audioFile.tsv contains metadata about individual audio files.
- Rights:
- Public Domain Dedication (CC Zero), http://creativecommons.org/publicdomain/zero/1.0/, and PUB
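The archive naming scheme described above can be sketched as follows. This is an illustration only: the helper name `quarter_archives` is hypothetical, and it assumes that `QN` expands to `Q1` through `Q4` and that every quarter between the first and last recording date has an archive, which the record does not state explicitly.

```python
from datetime import date


def quarter_archives(start: date, end: date) -> list[str]:
    """List expected archive names audioPSP-YYYY-QN.tar between two dates."""
    names = []
    year, quarter = start.year, (start.month - 1) // 3 + 1
    last = (end.year, (end.month - 1) // 3 + 1)
    while (year, quarter) <= last:
        names.append(f"audioPSP-{year}-Q{quarter}.tar")
        quarter += 1
        if quarter > 4:          # roll over to the first quarter of the next year
            year, quarter = year + 1, 1
    return names


# Date range taken from the record's description.
archives = quarter_archives(date(2013, 11, 25), date(2023, 7, 26))
```

The two TSV files named in the description remain the reliable source for which archives and audio files actually exist; a generated list like this one is only useful for cross-checking a download.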
49. Automatic Paraphrases of Czech Reference Sentences for WMT11, 13 and 14
- Creator:
- Barančíková, Petra and Tamchyna, Aleš
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- machine translation, automatic evaluation, and paraphrasing
- Language:
- Czech
- Description:
- This dataset contains automatic paraphrases of Czech official reference translations for the Workshop on Statistical Machine Translation shared task. The data covers the years 2011, 2013 and 2014. For each sentence, at most 10000 paraphrases were included (randomly selected from the full set). The goal of using this dataset is to improve automatic evaluation of machine translation outputs. If you use this work, please cite the following paper: Tamchyna Aleš, Barančíková Petra: Automatic and Manual Paraphrases for MT Evaluation. In proceedings of LREC, 2016.
- Rights:
- Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0), http://creativecommons.org/licenses/by-sa/4.0/, and PUB
50. Automatically generated spelling correction corpus for Czech (Czech-SEC-AG)
- Creator:
- Hajič, Jan, Náplava, Jakub, and Straka, Milan
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- spelling correction and natural language correction
- Language:
- Czech
- Description:
- Automatically generated spelling correction corpus for Czech (Czesl-SEC-AG) is a corpus containing text with automatically generated spelling errors. To create the spelling errors, a character error model is used that contains probabilities of character substitution, insertion, and deletion, and of swapping two adjacent characters. Probabilities of changing character casing are also considered. The original clean text on which the spelling errors were generated is PDT3.0 (http://hdl.handle.net/11858/00-097C-0000-0023-1AAF-3). The original train/dev/test sentence split of the PDT3.0 corpus is preserved in this dataset. Besides the data with artificial spelling errors, we also publish the texts from which the character error model was created. These are the original manual transcript of the audiobook Švejk and its corrected version, produced by the authors of Korektor (http://ufal.mff.cuni.cz/korektor). Like the CzeSL Grammatical Error Correction Dataset (CzeSL-GEC: http://hdl.handle.net/11234/1-2143), these data are split into four sets based on the difficulty of the errors present.
- Rights:
- Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0), http://creativecommons.org/licenses/by-nc-sa/3.0/, and PUB
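To illustrate the kind of character error model the description mentions, here is a minimal sketch that injects substitutions, insertions, deletions, adjacent swaps and casing changes. It is not the authors' implementation: the function name `add_spelling_errors`, the uniform per-character probabilities and the replacement alphabet are all assumptions made for the example, whereas the published model uses probabilities estimated from the Švejk transcripts.

```python
import random


def add_spelling_errors(text, p_sub=0.01, p_ins=0.01, p_del=0.01,
                        p_swap=0.01, p_case=0.01,
                        alphabet="abcdefghijklmnopqrstuvwxyz", seed=None):
    """Return a noisy copy of text; all probabilities are illustrative."""
    rng = random.Random(seed)
    # Cumulative thresholds so that one random draw selects one operation.
    t_sub = p_sub
    t_ins = t_sub + p_ins
    t_del = t_ins + p_del
    t_swap = t_del + p_swap
    t_case = t_swap + p_case
    out = []
    i = 0
    while i < len(text):
        c = text[i]
        r = rng.random()
        if r < t_sub:                       # substitute the character
            out.append(rng.choice(alphabet))
        elif r < t_ins:                     # insert a character before it
            out.append(rng.choice(alphabet))
            out.append(c)
        elif r < t_del:                     # delete the character
            pass
        elif r < t_swap:                    # swap with the next character
            if i + 1 < len(text):
                out.append(text[i + 1])
                out.append(c)
                i += 1                      # the next character is consumed
            else:
                out.append(c)               # nothing to swap with at the end
        elif r < t_case:                    # change character casing
            out.append(c.swapcase())
        else:                               # keep the character unchanged
            out.append(c)
        i += 1
    return "".join(out)


noisy = add_spelling_errors("Osudy dobrého vojáka Švejka", seed=0)
```

Passing a fixed `seed` makes the noise reproducible, which matters when the same corrupted text has to be regenerated for a fixed train/dev/test split.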