Amharic web corpus. Crawled by SpiderLing in August 2013, October 2015, and January 2016. Encoded in UTF-8, cleaned, deduplicated. Tagged by TreeTagger trained on the Amharic WIC corpus.
The system Česílko (language data and software tools) was first developed as an answer to a growing need for translation and localisation from one source language into many target languages. The initial system belonged to the Shallow Parse, Shallow Transfer Rule-Based Machine Translation (RBMT) paradigm and was designed primarily for translation between related languages. The latest implementation of the system uses a stochastic ranker, so technically it belongs to the hybrid machine translation paradigm, combining stochastic methods with the traditional Shallow Transfer RBMT methods. The system has been stripped of the accompanying language resources due to copyright restrictions. The data that is available is for demonstration purposes only.
The database offers access to over 6 million items of dialectal linguistic evidence from the project "Dictionary of Bavarian Dialects" (German: Das Bayerische Wörterbuch) as image snippets. Part of the material is already lemmatized, and lemmatization is ongoing.
The area covered by the Dictionary of Bavarian Dialects (Bayerisches Wörterbuch) comprises Upper Bavaria, Lower Bavaria, the Upper Palatinate and neighbouring regions of Bavarian Swabia, Middle Franconia and Upper Franconia. Over and above the vernaculars spoken today, Bavaria’s literary tradition since its beginnings in the 8th century is also taken into account.
Starting in 1913, language material was collected from all Bavarian-speaking regions in Bavaria. Questionnaires were sent out to local informants throughout Bavaria, and contemporary and historical literary sources were excerpted. Today the collection comprises around nine million dialect examples. With the exception of the “Wörterlisten” (word lists), which can be digitally searched and edited, this material consists of index cards, to which corresponding standard German or quasi-standard German keywords have been added, filed alphabetically (see link below for more information).
For detailed information, please see https://www.bwb.badw.de/en/the-project.html and https://www.bwb.badw.de/en/digital-platform.html
In the Middle Ages, Old Occitan (formerly "Old Provençal"), the language of the troubadours, was a literary and cultural language, the influence of which extended far beyond the frontiers of Southern France.
The only comprehensive portrayal of the Old Occitan vocabulary to have appeared up to now is the "Lexique roman" by François Raynouard (6 vols., 1836–1845). It was supplemented by Emil Levy’s "Provenzalisches Supplementwörterbuch" (8 vols., 1894–1924). An updated dictionary, taking account of progress in research over the last 100 years, has been the desideratum of literary scholars, linguists, and historians ever since.
Under the direction of Wolf-Dieter Stempel, the publication of a new dictionary of Old Occitan, the "Dictionnaire de l'occitan médiéval (DOM)", began in 1996. This appeared in print until 2013, directed from 2012 on by Maria Selig. Since then it has been available as an alphabetically complete digital dictionary, the "DOM en ligne". This comprises the newly written articles of the DOM together with the articles from the dictionaries of Raynouard and Levy for those parts of the alphabet not yet covered by the new work and is enriched by entries for words absent till now from Old Occitan lexicography.
Its content is available for free at https://dom-en-ligne.de/dom.php
The corpus presented consists of job ads in Spanish related to Engineering positions in Peru.
The documents were preprocessed and annotated for POS tagging, NER, and topic modeling tasks.
The corpus is divided into two components:
- POS tagging/NER training data: consisting of 800 job ads, each one tokenized and manually annotated with POS tag information (EAGLE format) and entity labels in BIO format.
- Topic modeling training data: containing 9000 documents stripped of stopwords. Comes in two formats:
* Whole text documents: containing all the information originally posted in the ad.
* Extracted chunks documents: containing chunks extracted by custom NER models (expected skills, tasks to perform, and preferred major), as described in "Improving Topic Coherence Using Entity Extraction Denoising" (to appear).
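To illustrate how entity labels in BIO format can be turned into extracted chunks, here is a minimal Python sketch. It is not the corpus authors' tooling: the function name `bio_to_spans`, the `SKILL` label, and the sample sentence are hypothetical, chosen only to mirror the "expected skills" category mentioned above. The actual label inventory and file layout of the corpus may differ.

```python
from typing import List, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (label, text) entity spans.

    Hypothetical helper: assumes tags look like "B-LABEL", "I-LABEL", or "O".
    """
    spans: List[Tuple[str, str]] = []
    current_label = None
    current_tokens: List[str] = []

    def close_span() -> None:
        # Emit the span collected so far, if any.
        if current_label is not None:
            spans.append((current_label, " ".join(current_tokens)))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new span.
            close_span()
            current_label = tag[2:]
            current_tokens = [token]
        elif tag.startswith("I-") and current_label == tag[2:]:
            # An I- tag with a matching label continues the open span.
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open span,
            # closes the span; the token itself is ignored in this sketch.
            close_span()
            current_label = None
            current_tokens = []

    close_span()
    return spans


# Hypothetical example sentence and labels (not taken from the corpus).
tokens = ["Se", "requiere", "experiencia", "en", "AutoCAD", "y",
          "gestión", "de", "proyectos"]
tags = ["O", "O", "O", "O", "B-SKILL", "O",
        "B-SKILL", "I-SKILL", "I-SKILL"]

for label, text in bio_to_spans(tokens, tags):
    print(label, text)
```

Treating an inconsistent `I-` tag like `O` is one possible design choice. Other conventions start a new span in that case, so the behaviour should be matched to whatever the annotation guidelines specify.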
Normalized Arabic Fragments for Inestimable Stemming (NAFIS) is an Arabic stemming gold standard corpus composed of a collection of texts selected to be representative of Arabic stemming tasks and manually annotated.
A dataset collected from natural dialogs that makes it possible to test the ability of dialog systems to interactively learn new facts from user utterances throughout the dialog. The dataset, consisting of 1900 dialogs, allows simulation of interactively acquiring denotations and question explanations from users, which can be used for interactive learning.
Restaurant Reviews CZ ABSA - 2.15k reviews with their related targets and categories.
The work done is described in the paper: https://doi.org/10.13053/CyS-20-3-2469