Amharic web corpus. Crawled by SpiderLing in August 2013, October 2015, and January 2016. Encoded in UTF-8, cleaned, and deduplicated. Tagged by TreeTagger trained on the Amharic WIC corpus.
Lexicon of Czech verbal multiword expressions (VMWEs) used in Parseme Shared Task 2017. https://typo.uni-konstanz.de/parseme/index.php/2-general/142-parseme-shared-task-on-automatic-detection-of-verbal-mwes
The lexicon consists of 4785 VMWEs, classified into four categories according to the Parseme Shared Task (PST) typology: IReflV (inherently reflexive verbs), LVC (light verb constructions), ID (idiomatic expressions) and OTH (other VMWEs with other than verbal syntactic head).
Verbal multiword expressions as well as deverbative variants of VMWEs were annotated during the preparation phase of PST. These data were published as http://hdl.handle.net/11372/LRT-2282. The Czech part includes 14,536 VMWE occurrences:
1611 ID
10000 IReflV
2923 LVC
2 OTH
This lexicon was created from the Czech data. Each lexicon entry is represented by one line in the form:
type lemmas frequency PoS [used form 1; used form 2; ... ]
(columns are separated by tabs) where:
type ... is the type of VMWE in PST typology
lemmas ... are space separated lemmatized forms of all words that constitute the VMWE
frequency ... is the absolute frequency of this item in PST data
PoS ... is a space separated list of parts of speech of individual words (in the same order as in "lemmas")
final field ... contains a list of all used forms (1 to 18) found in the data, since Czech is an inflectional language.
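The line format described above can be read with a few lines of Python. This is only a minimal sketch, not part of the resource: the names `LexiconEntry` and `parse_entry` are hypothetical, the sample line is invented for illustration, and it assumes that the square brackets and semicolons shown in the format description appear literally in the file.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LexiconEntry:
    vmwe_type: str      # PST type: IReflV, LVC, ID or OTH
    lemmas: List[str]   # lemmatized words of the VMWE
    frequency: int      # absolute frequency in the PST data
    pos: List[str]      # parts of speech, same order as lemmas
    forms: List[str]    # used forms found in the data


def parse_entry(line: str) -> LexiconEntry:
    """Parse one tab-separated lexicon line into a LexiconEntry."""
    # Five columns: type, lemmas, frequency, PoS, used forms.
    vmwe_type, lemmas, frequency, pos, forms = line.rstrip("\n").split("\t", 4)
    # Assumption: the used forms are wrapped in [ ] and separated by ";".
    forms = forms.strip().strip("[]")
    return LexiconEntry(
        vmwe_type=vmwe_type,
        lemmas=lemmas.split(),
        frequency=int(frequency),
        pos=pos.split(),
        forms=[f.strip() for f in forms.split(";") if f.strip()],
    )


# Invented sample line (not taken from the lexicon; the PoS labels are placeholders).
sample = "IReflV\tbát se\t12\tVERB PRON\t[bál se; bojí se]"
entry = parse_entry(sample)
print(entry.vmwe_type, entry.lemmas, entry.frequency, entry.forms)
```

If the actual file omits the brackets, the `strip("[]")` call is harmless and the rest of the parsing is unchanged.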
The database offers access to over 6 million items of dialectal linguistic evidence from the project "Dictionary of Bavarian Dialects" (German: Das Bayerische Wörterbuch) as image snippets. Part of the material is lemmatized, and lemmatization is ongoing.
The area covered by the Dictionary of Bavarian Dialects (Bayerisches Wörterbuch) comprises Upper Bavaria, Lower Bavaria, the Upper Palatinate and neighbouring regions of Bavarian Swabia, Middle Franconia and Upper Franconia. Over and above the vernaculars spoken today, Bavaria’s literary tradition since its beginnings in the 8th century is also taken into account.
Starting in 1913, language material was collected from all Bavarian-speaking regions in Bavaria. Questionnaires were sent out to local informants throughout Bavaria, and contemporary and historical literary sources were excerpted. Today the collection comprises around nine million dialect examples. With the exception of the “Wörterlisten” (word lists), which can be digitally searched and edited, this material consists of index cards, to which corresponding standard German or quasi-standard German keywords have been added, filed alphabetically (see link below for more information).
For detailed information, please see https://www.bwb.badw.de/en/the-project.html and https://www.bwb.badw.de/en/digital-platform.html
In the Middle Ages, Old Occitan (formerly "Old Provençal"), the language of the troubadours, was a literary and cultural language, the influence of which extended far beyond the frontiers of Southern France.
The only comprehensive portrayal of the Old Occitan vocabulary to have appeared up to now is the "Lexique roman" by François Raynouard (6 vols., 1836–1845). It was supplemented by Emil Levy’s "Provenzalisches Supplementwörterbuch" (8 vols., 1894–1924). An updated dictionary, taking account of progress in research over the last 100 years, has been the desideratum of literary scholars, linguists, and historians ever since.
Under the direction of Wolf-Dieter Stempel, the publication of a new dictionary of Old Occitan, the "Dictionnaire de l'occitan médiéval (DOM)", began in 1996. This appeared in print until 2013, directed from 2012 on by Maria Selig. Since then it has been available as an alphabetically complete digital dictionary, the "DOM en ligne". This comprises the newly written articles of the DOM together with the articles from the dictionaries of Raynouard and Levy for those parts of the alphabet not yet covered by the new work and is enriched by entries for words absent till now from Old Occitan lexicography.
Its content is available for free at https://dom-en-ligne.de/dom.php
The corpus presented consists of job ads in Spanish related to Engineering positions in Peru.
The documents were preprocessed and annotated for POS tagging, NER, and topic modeling tasks.
The corpus is divided into two components:
- POS tagging/NER training data: consisting of 800 job ads, each tokenized and manually annotated with POS tag information (EAGLE format) and entity labels in BIO format.
- Topic modeling training data: containing 9000 documents stripped of stopwords. Comes in two formats:
* Whole text documents: containing all the information originally posted in the ad.
* Extracted chunks documents: containing chunks extracted by custom NER models (expected skills, tasks to perform, and preferred major), as described in "Improving Topic Coherence Using Entity Extraction Denoising" (to appear).
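Entity labels in BIO format, as used in the NER training data, mark the first token of an entity with B-, continuation tokens with I-, and all other tokens with O. The following is a minimal sketch of how such labels can be grouped into entity chunks. It is not the authors' extraction code: the function name `bio_to_spans`, the label names MAJOR and SKILL, and the example sentence are assumptions made for illustration, and the corpus file layout is not assumed here.

```python
from typing import List, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (label, text) entity chunks."""
    spans: List[Tuple[str, str]] = []
    label = None
    chunk: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; close any entity that is still open.
            if label is not None:
                spans.append((label, " ".join(chunk)))
            label, chunk = tag[2:], [token]
        elif tag.startswith("I-") and label == tag[2:]:
            # Continuation of the currently open entity.
            chunk.append(token)
        else:
            # "O", or an I- tag that does not continue the open entity:
            # close the open entity and treat this token as outside.
            if label is not None:
                spans.append((label, " ".join(chunk)))
            label, chunk = None, []
    if label is not None:
        spans.append((label, " ".join(chunk)))
    return spans


# Hypothetical example; the labels are placeholders, not the corpus tag set.
tokens = ["Se", "requiere", "ingeniero", "civil", "con", "experiencia", "en", "AutoCAD"]
tags = ["O", "O", "B-MAJOR", "I-MAJOR", "O", "O", "O", "B-SKILL"]
spans = bio_to_spans(tokens, tags)
print(spans)
```

Dropping I- tags that do not follow a matching B- tag is one possible design choice; a more lenient decoder could instead treat them as the start of a new entity.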
Normalized Arabic Fragments for Inestimable Stemming (NAFIS) is an Arabic stemming gold standard corpus composed of a collection of texts, selected to be representative of Arabic stemming tasks and manually annotated.
The Prague Dependency Treebank of Spoken Czech 2.0 (PDTSC 2.0) is a corpus of spoken language, consisting of 742,316 tokens and 73,835 sentences, representing 7,324 minutes (over 120 hours) of spontaneous dialogs. The dialogs have been recorded, transcribed and edited in several interlinked layers: audio recordings, automatic and manual transcripts and manually reconstructed text. These layers were part of the first version of the corpus (PDTSC 1.0). Version 2.0 is extended by automatic dependency parsing at the analytical layer and by manual annotation of “deep” syntax at the tectogrammatical layer, which contains semantic roles and relations as well as annotation of coreference.
A dataset collected from natural dialogs that makes it possible to test the ability of dialog systems to interactively learn new facts from user utterances throughout the dialog. The dataset, consisting of 1900 dialogs, allows simulating the interactive acquisition of denotations and question explanations from users, which can be used for interactive learning.