Search Results
2. A Human-Annotated Dataset for Language Modeling and Named Entity Recognition in Medieval Documents
- Creator:
- Novotný, Vít, Luger, Kristýna, Štefánik, Michal, Vrabcová, Tereza, and Horák, Aleš
- Publisher:
- Masaryk University, Brno
- Type:
- text and corpus
- Subject:
- NER, named entity recognition, and Medieval
- Language:
- Czech, English, German, and Latin
- Description:
- This is an open dataset of sentences from 19th and 20th century letterpress reprints of documents from the Hussite era. The dataset contains a corpus for language modeling and human annotations for named entity recognition (NER).
- Rights:
- Public Domain Dedication (CC Zero), http://creativecommons.org/publicdomain/zero/1.0/, and PUB
3. A Human-Annotated Dataset for Language Modeling and Named Entity Recognition in Medieval Documents (2023-01-05)
- Creator:
- Novotný, Vít, Luger, Kristýna, Štefánik, Michal, Vrabcová, Tereza, and Horák, Aleš
- Publisher:
- Masaryk University, Brno
- Type:
- text and corpus
- Subject:
- NER, named entity recognition, and Medieval
- Language:
- Czech, English, German, and Latin
- Description:
- This is an open dataset of sentences from 19th and 20th century letterpress reprints of documents from the Hussite era. The dataset contains a corpus for language modeling and human annotations for named entity recognition (NER).
- Rights:
- Public Domain Dedication (CC Zero), http://creativecommons.org/publicdomain/zero/1.0/, and PUB
4. A Human-Annotated Dataset of Scanned Images and OCR Texts from Medieval Documents: Supplementary Materials
- Creator:
- Novotný, Vít and Horák, Aleš
- Publisher:
- Masaryk University, Brno
- Type:
- text and corpus
- Subject:
- ocr, optical character recognition, language identification, image super-resolution, sr, and Medieval
- Language:
- Czech, English, German, and Latin
- Description:
- These are supplementary materials for an open dataset of scanned images and OCR texts from 19th and 20th century letterpress reprints of documents from the Hussite era. The dataset contains human annotations for layout analysis, OCR evaluation, and language identification and is available at http://hdl.handle.net/11234/1-4615. These supplementary materials contain OCR texts from different OCR engines for book pages for which we have both high-resolution scanned images and annotations for OCR evaluation.
- Rights:
- Public Domain Dedication (CC Zero), http://creativecommons.org/publicdomain/zero/1.0/, and PUB
5. Aging effects in an evolving phonological network
- Creator:
- Luef, Eva Maria
- Publisher:
- Charles University
- Type:
- text and corpus
- Subject:
- network aging, English as a second language, network evolution, phonological network, and preferential attachment
- Language:
- English
- Description:
- Phonological networks are representations of word forms and their phonological relationships with other words in a given language lexicon. A principle underlying the growth (or evolution) of those networks is preferential attachment, or the ‘rich-gets-richer’ mechanism, according to which words with many phonological neighbors (or links) are the main beneficiaries of future growth opportunities. Due to their limited number of words, language lexica constitute node-constrained networks where growth cannot keep increasing in a linear way; hence, preferential attachment is likely mitigated by certain factors. The present study investigated aging effects (i.e., a word’s finite time span of being active in terms of growth) in an evolving phonological network of English as a second language. It was found that phonological neighborhoods are constructed by one large initial lexical spurt, followed by sublinear growth spurts that eventually lead to very limited growth in later lexical spurts during network evolution, all the while obeying the law of preferential attachment. An analysis of the strength of phonological relationships between phonological word forms revealed a tendency to attach more distant phonological neighbors in the lower proficiency levels, while phonologically more similar neighbors enter phonological neighborhoods at more advanced levels of English as a second language. Overall, the findings suggest an aging effect in growth that favors younger words. In addition, beginning learners seem to prefer the acquisition of phonological neighbors that are easier to discriminate. Implications for the second language lexicon include leveraged learning mechanisms and learning bouts focussed on a smaller range of phonological segments, and raise questions concerning lexical processing in aging networks.
- Rights:
- Creative Commons - Attribution-NoDerivatives 4.0 International (CC BY-ND 4.0), http://creativecommons.org/licenses/by-nd/4.0/, and PUB
6. Alex Context NLG Dataset
- Creator:
- Dušek, Ondřej and Jurčíček, Filip
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- dialogue system, natural language generation, dialogue alignment, and entrainment
- Language:
- English
- Description:
- A dataset intended for fully trainable natural language generation (NLG) systems in task-oriented spoken dialogue systems (SDS), covering the English public transport information domain. It includes preceding context (user utterance) along with each data instance (pair of source meaning representation and target natural language paraphrase to be generated). Taking the form of the previous user utterance into account for generating the system response allows NLG systems trained on this dataset to entrain (adapt) to the preceding utterance, i.e., reuse wording and syntactic structure. This should presumably improve the perceived naturalness of the output, and may even lead to a higher task success rate. Crowdsourcing has been used to obtain natural context user utterances as well as natural system responses to be generated.
- Rights:
- Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0), http://creativecommons.org/licenses/by-sa/4.0/, and PUB
7. Annotated corpora and tools of the PARSEME Shared Task on Automatic Identification of Verbal Multiword Expressions (edition 1.1)
- Creator:
- Ramisch, Carlos, Cordeiro, Silvio Ricardo, Savary, Agata, Vincze, Veronika, Barbu Mititelu, Verginica, Bhatia, Archna, Buljan, Maja, Candito, Marie, Gantar, Polona, Giouli, Voula, Güngör, Tunga, Hawwari, Abdelati, Iñurrieta, Uxoa, Kovalevskaitė, Jolanta, Krek, Simon, Lichte, Timm, Liebeskind, Chaya, Monti, Johanna, Parra Escartín, Carla, QasemiZadeh, Behrang, Ramisch, Renata, Schneider, Nathan, Stoyanova, Ivelina, Vaidya, Ashwini, Walsh, Abigail, Aceta, Cristina, Aduriz, Itziar, Antoine, Jean-Yves, Arhar Holdt, Špela, Berk, Gözde, Bielinskienė, Agnė, Blagus, Goranka, Boizou, Loic, Bonial, Claire, Caruso, Valeria, Čibej, Jaka, Constant, Matthieu, Cook, Paul, Diab, Mona, Dimitrova, Tsvetana, Ehren, Rafael, Elbadrashiny, Mohamed, Elyovich, Hevi, Erden, Berna, Estarrona, Ainara, Fotopoulou, Aggeliki, Foufi, Vassiliki, Geeraert, Kristina, van Gompel, Maarten, Gonzalez, Itziar, Gurrutxaga, Antton, Ha-Cohen Kerner, Yaakov, Ibrahim, Rehab, Ionescu, Mihaela, Jain, Kanishka, Jazbec, Ivo-Pavao, Kavčič, Teja, Klyueva, Natalia, Kocijan, Kristina, Kovács, Viktória, Kuzman, Taja, Leseva, Svetlozara, Ljubešić, Nikola, Malka, Ruth, Markantonatou, Stella, Martínez Alonso, Héctor, Matas, Ivana, McCrae, John, de Medeiros Caseli, Helena, Onofrei, Mihaela, Palka-Binkiewicz, Emilia, Papadelli, Stella, Parmentier, Yannick, Pascucci, Antonio, Pasquer, Caroline, Pia di Buono, Maria, Puri, Vandana, Raffone, Annalisa, Ratori, Shraddha, Riccio, Anna, Sangati, Federico, Shukla, Vishakha, Simkó, Katalin, Šnajder, Jan, Somers, Clarissa, Srivastava, Shubham, Stefanova, Valentina, Taslimipoor, Shiva, Theoxari, Natasa, Todorova, Maria, Urizar, Ruben, Villavicencio, Aline, and Zilio, Leonardo
- Publisher:
- PARSEME
- Type:
- text and corpus
- Subject:
- Multiword expressions, verbal multiword expressions, light-verb constructions, verb-particle constructions, inherently reflexive verbs, verbal idioms, and multi-verb constructions
- Language:
- Bulgarian, German, Modern Greek (1453-), Spanish, Persian, French, Hebrew, Hungarian, Italian, Lithuanian, Polish, Portuguese, Romanian, Slovenian, Turkish, Hindi, Basque, English, and Croatian
- Description:
- This multilingual resource contains corpora in which verbal MWEs (VMWEs) have been manually annotated. VMWEs include idioms (let the cat out of the bag), light-verb constructions (make a decision), verb-particle constructions (give up), inherently reflexive verbs (help oneself), and multi-verb constructions (make do). VMWEs were annotated according to the universal guidelines in 19 languages. The corpora are provided in the cupt format, inspired by the CoNLL-U format. The corpora were used in the 1.1 edition of the PARSEME Shared Task (2018). For most languages, morphological and syntactic information – not necessarily using UD tagsets – is also provided, including parts of speech, lemmas, morphological features and/or syntactic dependencies. Depending on the language, the information comes from treebanks (e.g., Universal Dependencies) or from automatic parsers trained on treebanks (e.g., UDPipe). This item contains training, development and test data, as well as the evaluation tools used in the PARSEME Shared Task 1.1 (2018). The annotation guidelines are available online: http://parsemefr.lif.univ-mrs.fr/parseme-st-guidelines/1.1
- Rights:
- PARSEME Shared Task Data (v. 1.1) Agreement, https://lindat.mff.cuni.cz/repository/xmlui/page/licence-mwe-1.1, and PUB
8. APE Shared Task WMT17: Human Post-edits Test Data DE-EN
- Creator:
- Turchi, Marco, Chatterjee, Rajen, and Negri, Matteo
- Publisher:
- Fondazione Bruno Kessler, Trento, Italy
- Type:
- text and corpus
- Subject:
- Human post-edits, machine translation, shared task, automatic post-editing, and post-editing
- Language:
- English
- Description:
- Human post-edited test sentences for the WMT 2017 Automatic post-editing task. The data consist of 2,000 English sentences belonging to the IT domain, already tokenized. Source and target segments can be downloaded from: https://lindat.mff.cuni.cz/repository/xmlui/handle/11372/LRT-2132. All data is provided by the EU project QT21 (http://www.qt21.eu/).
- Rights:
- AGREEMENT ON THE USE OF DATA IN QT21 APE Task, https://lindat.mff.cuni.cz/repository/xmlui/page/licence-TAUS_QT21, and PUB
9. Artificial Treebank with Ellipsis
- Creator:
- Droganova, Kira, Zeman, Daniel, Kanerva, Jenna, and Ginter, Filip
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- universal dependencies, ellipsis, and gapping
- Language:
- English, Czech, Finnish, Russian, and Slovak
- Description:
- Artificially created treebank of elliptical constructions (gapping), in the annotation style of Universal Dependencies. The data are taken from the UD 2.1 release and from large web corpora parsed by two parsers. The input data are filtered to identify sentences where gapping could be applied; those sentences are then transformed by omitting one or more words, resulting in a sentence with gapping. Details in Droganova et al.: Parse Me if You Can: Artificial Treebanks for Parsing Experiments on Elliptical Constructions, LREC 2018, Miyazaki, Japan.
- Rights:
- Licence Universal Dependencies v2.1, https://lindat.mff.cuni.cz/repository/xmlui/page/licence-UD-2.1, and PUB
10. ATCC: Pronunciation lexicon and n-gram counts for ASR module
- Creator:
- Šmídl, Luboš
- Publisher:
- University of West Bohemia, Department of Cybernetics
- Type:
- text, lexicalConceptualResource, and other
- Subject:
- pronunciation lexicon, n-gram counts, and language model
- Language:
- English
- Description:
- The corpus contains a pronunciation lexicon and n-gram counts (unigrams, bigrams and trigrams) that can be used for constructing a language model for the air traffic control communication domain. It can be used together with the Air Traffic Control Communication corpus (http://hdl.handle.net/11858/00-097C-0000-0001-CCA1-0). and Technology Agency of the Czech Republic, project No. TA01030476
- Rights:
- Attribution-NonCommercial 3.0 Unported (CC BY-NC 3.0), http://creativecommons.org/licenses/by-nc/3.0/, and PUB