Harvested from: LINDAT/CLARIAH-CZ repository / Type: text - LINDAT/CLARIAH-CZ Catalog Search Results

1. Amharic Web Corpus

Creator:: Suchomel, Vít and Rychlý, Pavel
Publisher:: Masaryk University, NLP Centre
Type:: text and corpus
Subject:: Amharic, text corpus, Web corpus, under-resourced language, corpus annotation, and morphological tagger
Language:: Amharic
Description:: Amharic web corpus. Crawled by SpiderLing in August 2013, October 2015, and January 2016. Encoded in UTF-8, cleaned, deduplicated. Tagged by TreeTagger trained on the Amharic WIC corpus.
Rights:: NLP Centre Web Corpus License, https://lindat.mff.cuni.cz/repository/xmlui/page/license-NLPC-WeC, and ACA

2. Amharic WIC Corpus

Creator:: Rychlý, Pavel
Publisher:: Masaryk University, NLP Centre
Type:: text and corpus
Subject:: text corpora, Ethiopian languages, web corpora, under-resourced languages, and Amharic
Language:: Amharic
Description:: Substantially cleaned version of existing morphologically annotated WIC Corpus.
Rights:: Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB

3. Czech Verbal MWEs

Creator:: Bejček, Eduard
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text, lexicon, and lexicalConceptualResource
Subject:: lexicon, verbs, multiword expressions, forms, and lemmatization
Language:: Czech
Description:: Lexicon of Czech verbal multiword expressions (VMWEs) used in the Parseme Shared Task 2017 (https://typo.uni-konstanz.de/parseme/index.php/2-general/142-parseme-shared-task-on-automatic-detection-of-verbal-mwes). The lexicon consists of 4785 VMWEs, classified into four categories according to the Parseme Shared Task (PST) typology: IReflV (inherently reflexive verbs), LVC (light verb constructions), ID (idiomatic expressions), and OTH (other VMWEs with a syntactic head other than a verb). Verbal multiword expressions, as well as deverbative variants of VMWEs, were annotated during the preparation phase of PST. These data were published as http://hdl.handle.net/11372/LRT-2282. The Czech part includes 14,536 VMWE occurrences:
- 1611 ID
- 10000 IReflV
- 2923 LVC
- 2 OTH
This lexicon was created from the Czech data. Each lexicon entry is represented by one line of the form:
type lemmas frequency PoS [used form 1; used form 2; ... ]
(columns are separated by tabs), where:
- type is the type of VMWE in the PST typology;
- lemmas are the space-separated lemmatized forms of all words that constitute the VMWE;
- frequency is the absolute frequency of this item in the PST data;
- PoS is a space-separated list of the parts of speech of the individual words (in the same order as in "lemmas");
- the final field contains a list of all (1 to 18) used forms found in the data (since Czech is an inflectional language).
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
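
The line format described for the Czech Verbal MWEs lexicon can be read with a few lines of Python. This is a minimal sketch, not the authors' tooling: it assumes the five tab-separated columns appear exactly as described, that the final field is wrapped in square brackets, and that used forms are separated by semicolons. The `VmweEntry` class, the `parse_vmwe_line` function, and the sample line (including its PoS labels) are invented for illustration and are not taken from the lexicon.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VmweEntry:
    """One lexicon line: type, lemmas, frequency, PoS list, used forms."""
    vmwe_type: str
    lemmas: List[str]
    frequency: int
    pos: List[str]
    forms: List[str]


def parse_vmwe_line(line: str) -> VmweEntry:
    """Parse one tab-separated lexicon line (hypothetical helper)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 5:
        raise ValueError(f"expected 5 tab-separated fields, got {len(fields)}")
    vmwe_type, lemmas, frequency, pos, forms = fields
    forms = forms.strip()
    # Assumption: the list of used forms is enclosed in square brackets.
    if forms.startswith("[") and forms.endswith("]"):
        forms = forms[1:-1]
    return VmweEntry(
        vmwe_type=vmwe_type,
        lemmas=lemmas.split(),        # space-separated lemmas
        frequency=int(frequency),     # absolute frequency in the PST data
        pos=pos.split(),              # space-separated parts of speech
        forms=[f.strip() for f in forms.split(";") if f.strip()],
    )


# Invented sample line for illustration only; the PoS labels are placeholders.
sample = "IReflV\tbát se\t3\tVERB PRON\t[bojí se; bál se]"
entry = parse_vmwe_line(sample)
print(entry.vmwe_type, entry.lemmas, entry.frequency, entry.forms)
```

If the distributed file differs from these assumptions (for example, no brackets around the final field), only the bracket-stripping and splitting steps should need adjusting.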

4. Dictionary of Bavarian Dialects

Creator:: Schamberger-Hirt, Andrea, Erhard, Felicitas, Schnabel, Michael, Funk, Edith, Rowley, Anthony, and Schwab, Vincenz
Publisher:: Bayerisches Wörterbuch and Bayerische Akademie der Wissenschaften
Type:: text and corpus
Subject:: Bavaria, Bayern, Dialektologie, Dialekt, Dialectology, Bavarian, Bairisch, Bayerisch, dialect variation, Germanistik, German, Historical Linguistics, and History of German Language and Literature
Language:: Bavarian and German
Description:: The database offers access to over 6 million items of dialect evidence from the project "Dictionary of Bavarian Dialects" (German: Das Bayerische Wörterbuch) as image snippets, which are partially lemmatized, with lemmatization ongoing. The area covered by the Dictionary of Bavarian Dialects (Bayerisches Wörterbuch) comprises Upper Bavaria, Lower Bavaria, the Upper Palatinate and neighbouring regions of Bavarian Swabia, Middle Franconia and Upper Franconia. In addition to the vernaculars spoken today, Bavaria’s literary tradition since its beginnings in the 8th century is also taken into account. Starting in 1913, language material was collected from all Bavarian-speaking regions in Bavaria. Questionnaires were sent out to local informants throughout Bavaria, and contemporary and historical literary sources were excerpted. Today the collection comprises around nine million dialect examples. With the exception of the “Wörterlisten” (word lists), which can be digitally searched and edited, this material consists of index cards, to which corresponding standard German or quasi-standard German keywords have been added, filed alphabetically. For detailed information, please see https://www.bwb.badw.de/en/the-project.html and https://www.bwb.badw.de/en/digital-platform.html
Rights:: Not specified

5. Dictionnaire de l'occitan médiéval (DOM)

Creator:: Claudia, Kraus, Stempel, Wolf-Dieter, Tausend, Monika, and Peter, Renate
Publisher:: Bavarian Academy of Sciences and Humanities and Bayerische Akademie der Wissenschaften
Type:: text, lexicon, and lexicalConceptualResource
Subject:: Emil Levy, Petit Levy, Lexique Roman, DOM, Occitan language, Medieval Occitan, Occitan, Old Occitan, Old Provençal, Romance languages, dictionary, etymology, Middle Ages, troubadours, lexicography, and Supplementwörterbuch
Language:: French and Old Provençal (to 1500)
Description:: In the Middle Ages, Old Occitan (formerly "Old Provençal"), the language of the troubadours, was a literary and cultural language, the influence of which extended far beyond the frontiers of Southern France. The only comprehensive portrayal of the Old Occitan vocabulary to have appeared up to now is the "Lexique roman" by François Raynouard (6 vols., 1836–1845). It was supplemented by Emil Levy’s "Provenzalisches Supplementwörterbuch" (8 vols., 1894–1924). An updated dictionary, taking account of progress in research over the last 100 years, has been the desideratum of literary scholars, linguists, and historians ever since. Under the direction of Wolf-Dieter Stempel, the publication of a new dictionary of Old Occitan, the "Dictionnaire de l'occitan médiéval (DOM)", began in 1996. This appeared in print until 2013, directed from 2012 on by Maria Selig. Since then it has been available as an alphabetically complete digital dictionary, the "DOM en ligne". This comprises the newly written articles of the DOM together with the articles from the dictionaries of Raynouard and Levy for those parts of the alphabet not yet covered by the new work and is enriched by entries for words absent till now from Old Occitan lexicography. Its content is available for free at https://dom-en-ligne.de/dom.php
Rights:: Not specified

6. Engineering job ads corpus

Creator:: Cardenas Acosta, Ronald, Bello Medina, Kevin, Coronado, Alberto, and Villota, Elizabeth
Publisher:: National University of Engineering, Peru
Type:: text and corpus
Subject:: job-advertisement, PoS tagging, and text corpora
Language:: Spanish
Description:: The corpus consists of job ads in Spanish for engineering positions in Peru. The documents were preprocessed and annotated for POS tagging, NER, and topic modeling tasks. The corpus is divided into two components:
- POS tagging/NER training data: 800 job ads, each tokenized and manually annotated with POS tag information (EAGLE format) and entity labels in BIO format.
- Topic modeling training data: 9000 documents with stopwords removed, provided in two formats:
  * Whole-text documents: containing all the information originally posted in the ad.
  * Extracted-chunk documents: containing chunks extracted by custom NER models (expected skills, tasks to perform, and preferred major), as described in "Improving Topic Coherence Using Entity Extraction Denoising" (to appear).
Rights:: Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB
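
The entity labels in the Engineering job ads corpus are said to use the BIO format, in which `B-` marks the first token of an entity, `I-` a continuation, and `O` a token outside any entity. The sketch below shows one common way to collapse such tags into entity spans. It does not reflect the corpus's actual file layout or label inventory, which the record does not specify: the function name `bio_to_spans`, the label names `MAJOR` and `SKILL`, and the sample sentence are all hypothetical.

```python
from typing import List, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collapse parallel token/BIO-tag lists into (label, text) spans."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    spans: List[Tuple[str, str]] = []
    label, words = None, []
    for token, tag in zip(tokens, tags):
        prefix, _, name = tag.partition("-")
        # An I- tag with the same label extends the currently open span.
        if prefix == "I" and name == label:
            words.append(token)
            continue
        # Anything else closes the open span, if there is one.
        if label is not None:
            spans.append((label, " ".join(words)))
        # B- opens a new span; a stray I- is leniently treated the same way.
        if prefix in ("B", "I") and name:
            label, words = name, [token]
        else:
            label, words = None, []
    if label is not None:
        spans.append((label, " ".join(words)))
    return spans


# Invented example; the label names are placeholders, not the corpus tag set.
tokens = ["Se", "requiere", "ingeniero", "civil", "con", "AutoCAD"]
tags = ["O", "O", "B-MAJOR", "I-MAJOR", "O", "B-SKILL"]
spans = bio_to_spans(tokens, tags)
print(spans)
```

Treating a stray `I-` tag as the start of a new span is a design choice made here for robustness; a stricter reader could raise an error instead.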

7. NAFIS Arabic Stemming Gold Standard Corpus

Creator:: Namly, Driss
Publisher:: Ibtikarat team
Type:: text, wordList, and lexicalConceptualResource
Subject:: corpus, stemming, and Gold Standard Corpus
Language:: Arabic
Description:: Normalized Arabic Fragments for Inestimable Stemming (NAFIS) is an Arabic stemming gold standard corpus composed of a collection of texts, selected to be representative of Arabic stemming tasks and manually annotated.
Rights:: Creative Commons - Attribution-NonCommercial 4.0 International (CC BY-NC 4.0), http://creativecommons.org/licenses/by-nc/4.0/, and PUB

8. Oromo web corpus

Creator:: Suchomel, Vít and Rychlý, Pavel
Publisher:: Masaryk University, NLP Centre
Type:: text and corpus
Subject:: text corpora, Ethiopian languages, Oromo, Web corpus, and under-resourced language
Language:: Oromo
Description:: Oromo web corpus. Crawled by SpiderLing in January 2016. Encoded in UTF-8, cleaned, deduplicated.
Rights:: NLP Centre Web Corpus License, https://lindat.mff.cuni.cz/repository/xmlui/page/license-NLPC-WeC, and ACA

9. Prague Dependency Treebank of Spoken Czech 2.0 (PDTSC 2.0)

Creator:: Mikulová, Marie, Bémová, Alevtina, Hajič, Jan, Hajičová, Eva, Ircing, Pavel, Kolářová, Veronika, Lopatková, Markéta, Mareček, David, Mírovský, Jiří, Nedoluzhko, Anna, Pajas, Petr, Panevová, Jarmila, Peterek, Nino, Romportl, Jan, Sgall, Petr, Ševčíková, Magda, Štěpánek, Jan, Urešová, Zdeňka, and Žabokrtský, Zdeněk
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: spoken corpus, speech reconstruction, speech recognition, syntax, semantics, coreference, and audio
Language:: Czech
Description:: The Prague Dependency Treebank of Spoken Czech 2.0 (PDTSC 2.0) is a corpus of spoken language, consisting of 742,316 tokens and 73,835 sentences, representing 7,324 minutes (over 120 hours) of spontaneous dialogs. The dialogs have been recorded, transcribed and edited in several interlinked layers: audio recordings, automatic and manual transcripts, and manually reconstructed text. These layers were part of the first version of the corpus (PDTSC 1.0). Version 2.0 adds automatic dependency parsing at the analytical layer and manual annotation of “deep” syntax at the tectogrammatical layer, which contains semantic roles and relations as well as annotation of coreference.
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

10. Question Dialogs Dataset

Creator:: Vodolán, Miroslav and Jurčíček, Filip
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text, other, and lexicalConceptualResource
Subject:: question dialogs and interactive learning
Language:: English
Description:: Dataset collected from natural dialogs that makes it possible to test the ability of dialog systems to learn new facts interactively from user utterances throughout the dialog. The dataset, consisting of 1900 dialogs, allows simulation of the interactive acquisition of denotations and question explanations from users, which can be used for interactive learning.
Rights:: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0), http://creativecommons.org/licenses/by-sa/4.0/, and PUB
