CzeSL-GEC is a corpus of sentence pairs, each consisting of an original Czech sentence and its corrected version, collected from essays written by non-native learners of Czech and by Czech pupils with a Romani background. The corpus was created from the unreleased CzeSL-man corpus (http://utkl.ff.cuni.cz/learncorp/). All sentences in the corpus are word-tokenized.
Newly typeset text and facsimile of the ten-volume edition (Leipzig, 1834-1838); word-accurate page concordance with the printed edition; presentation of the subject areas of social conversation (aimed specifically at a female readership)
We present DaMuEL, a large Multilingual Dataset for Entity Linking containing data in 53 languages. DaMuEL consists of two components: a knowledge base that contains language-agnostic information about entities, including their claims from Wikidata and named entity types (PER, ORG, LOC, EVENT, BRAND, WORK_OF_ART, MANUFACTURED); and Wikipedia texts with entity mentions linked to the knowledge base, along with language-specific text from Wikidata such as labels, aliases, and descriptions, stored separately for each language. The Wikidata QID is used as a persistent, language-agnostic identifier, enabling the combination of the knowledge base with language-specific texts and information for each entity. Wikipedia documents deliberately annotate only a single mention for every entity present; we further automatically detect all mentions of named entities linked from each document. The dataset contains 27.9M named entities in the knowledge base and 12.3G tokens from Wikipedia texts. The dataset is published under the CC BY-SA licence.
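The role of the QID as a join key can be illustrated with a minimal sketch. The record layouts below (`Entity`, `LanguageText`, and their field names) are hypothetical and chosen only for illustration; they are not the dataset's actual file format, and the description string is made up. The sketch only shows how a language-agnostic knowledge base and per-language texts could be combined through a shared QID.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """Language-agnostic knowledge-base record (hypothetical layout)."""
    qid: str
    ne_type: str                      # e.g. PER, ORG, LOC
    claims: dict = field(default_factory=dict)


@dataclass
class LanguageText:
    """Language-specific text for one entity (hypothetical layout)."""
    qid: str
    label: str
    aliases: list = field(default_factory=list)
    description: str = ""


def join_by_qid(kb, texts):
    """Pair each language-specific record with its entity via the shared QID.

    Records whose QID is missing from the knowledge base are skipped.
    """
    joined = {}
    for qid, text in texts.items():
        entity = kb.get(qid)
        if entity is not None:
            joined[qid] = (entity, text)
    return joined


# Illustrative data: Q42 is the Wikidata item for Douglas Adams,
# P31 ("instance of") -> Q5 ("human").
kb = {"Q42": Entity(qid="Q42", ne_type="PER", claims={"P31": ["Q5"]})}

# Hypothetical Czech-language records; the second QID is absent from the KB.
cs_texts = {
    "Q42": LanguageText(qid="Q42", label="Douglas Adams",
                        description="illustrative description"),
    "Q999999999": LanguageText(qid="Q999999999", label="placeholder"),
}

joined = join_by_qid(kb, cs_texts)
for qid, (entity, text) in joined.items():
    print(qid, entity.ne_type, text.label)
```

Because the key is independent of language, the same `kb` dictionary could be joined with the records of any of the 53 languages without modification.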
Danish Fungi 2020 (DF20) is a fine-grained dataset and benchmark. The dataset, constructed from observations submitted to the Danish Fungal Atlas, is unique in its taxonomy-accurate class labels, small number of errors, highly unbalanced long-tailed class distribution, rich observation metadata, and well-defined class hierarchy. DF20 has zero overlap with ImageNet, allowing unbiased comparison of models fine-tuned from publicly available ImageNet checkpoints.
The dataset has 1,604 different classes, with 248,466 training images and 27,608 test images.
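As a quick sanity check on these figures, the following sketch derives the total image count, the share of images held out for testing, and the mean number of training images per class. The variable names are illustrative. Only the three counts come from the description; the mean says nothing about the long-tailed per-class distribution mentioned above.

```python
# Counts taken from the dataset description.
n_classes = 1_604
n_train = 248_466
n_test = 27_608

n_total = n_train + n_test
test_share = n_test / n_total
mean_train_per_class = n_train / n_classes

print(f"total images: {n_total}")
print(f"test share: {test_share:.1%}")
print(f"mean training images per class: {mean_train_per_class:.1f}")
```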
Register of decrees (regesta) and texts on the history of Prussia and the Teutonic Order
The database contains about 5 million dialect attestations collected in different projects within the Free State of Bavaria, covering the Bavarian, Franconian, and Swabian dialects.
In 1984, linguists at the University of Augsburg began to collect dialect data for the research and documentation project "Linguistic Map of Swabia" (German: "Sprachatlas von Bayerisch-Schwaben (SBS)"). In 1986, the University of Bayreuth followed with preparations for the "Linguistic Map of North- and East-Bavaria" (German: "Sprachatlas von Nordostbayern (SNOB)"). In the following years, partner projects in the other regions also began collecting data in their respective areas. All six projects then formed the "Research Association of the Bavarian Linguistic Map" (German: "Bayerischer Sprachatlas (BSA)"), which was funded by the DFG and the Bavarian State Ministry of Science, Research and the Arts.
BayDat was first published digitally by Ralf Zimmermann at the University of Würzburg in 2007 (see linked paper) and was redesigned in 2019 by Manuel Raaf at the Bavarian Academy of Sciences and Humanities.
For detailed information, please see https://baydat.badw.de/info