CzeSL-GEC is a corpus containing sentence pairs of original and corrected versions of Czech sentences collected from essays written by both non-native learners of Czech and Czech pupils with a Romani background. The corpus was created from the unreleased CzeSL-man corpus (http://utkl.ff.cuni.cz/learncorp/). All sentences in the corpus are word-tokenized.
We present DaMuEL, a large Multilingual Dataset for Entity Linking containing data in 53 languages. DaMuEL consists of two components: a knowledge base that contains language-agnostic information about entities, including their claims from Wikidata and named entity types (PER, ORG, LOC, EVENT, BRAND, WORK_OF_ART, MANUFACTURED); and Wikipedia texts with entity mentions linked to the knowledge base, along with language-specific text from Wikidata such as labels, aliases, and descriptions, stored separately for each language. The Wikidata QID is used as a persistent, language-agnostic identifier, enabling the combination of the knowledge base with language-specific texts and information for each entity. Wikipedia documents deliberately annotate only a single mention for every entity present; we further automatically detect all mentions of named entities linked from each document. The dataset contains 27.9M named entities in the knowledge base and 12.3G tokens from Wikipedia texts. The dataset is published under the CC BY-SA licence.
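The role of the QID as a join key between the two components can be illustrated with a minimal sketch. The record layout, the field names (`type`, `claims`, `label`, `aliases`, `description`) and the helper `entity_view` are assumptions made for illustration only and do not reflect the actual DaMuEL file format; the sample records are toy data.

```python
from typing import Optional

# Hypothetical language-agnostic knowledge base, keyed by Wikidata QID.
# The layout is an assumption, not the DaMuEL file format.
knowledge_base = {
    "Q1085": {"type": "LOC", "claims": {"P17": ["Q213"]}},
    "Q213": {"type": "LOC", "claims": {}},
}

# Hypothetical language-specific texts, stored separately for each language.
texts = {
    "cs": {"Q1085": {"label": "Praha", "aliases": [],
                     "description": "hlavní město Česka"}},
    "en": {"Q1085": {"label": "Prague", "aliases": ["Praha"],
                     "description": "capital of the Czech Republic"}},
}


def entity_view(qid: str, lang: str) -> Optional[dict]:
    """Combine language-agnostic data with texts for one language via the QID."""
    entity = knowledge_base.get(qid)
    if entity is None:
        return None  # unknown QID
    # An entity may have no text in the requested language.
    local = texts.get(lang, {}).get(qid, {})
    return {"qid": qid, **entity, **local}
```

Because the QID is the only shared key, adding another language in this sketch means adding one more per-language mapping without touching the knowledge base.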
Danish Fungi 2020 (DF20) is a fine-grained dataset and benchmark. The dataset, constructed from observations submitted to the Danish Fungal Atlas, is unique in its taxonomy-accurate class labels, small number of errors, highly unbalanced long-tailed class distribution, rich observation metadata, and well-defined class hierarchy. DF20 has zero overlap with ImageNet, allowing unbiased comparison of models fine-tuned from publicly available ImageNet checkpoints.
The dataset has 1,604 different classes, with 248,466 training images and 27,608 test images.
The corpus contains Czech speech of laryngectomy patients, recorded before the surgery that causes the loss of their voice, in order to preserve the voice so that it can later be used in a personalized text-to-speech system. Individual utterances were selected by a special algorithm to cover as many phonetic and prosodic features of the language as possible.
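The description does not name the selection algorithm. As an illustration only, a greedy set-cover style selection is one common way to pick utterances that cover many features with a limited recording budget. The function name `greedy_select`, the `budget` parameter and the feature labels are hypothetical and are not taken from the corpus documentation.

```python
from typing import Dict, List, Set


def greedy_select(candidates: Dict[str, Set[str]], budget: int) -> List[str]:
    """Greedily pick up to `budget` sentences, each adding the most new features.

    `candidates` maps a sentence ID to the set of feature labels it contains.
    This is a generic greedy heuristic, not the algorithm used for the corpus.
    """
    covered: Set[str] = set()
    chosen: List[str] = []
    remaining = dict(candidates)
    while remaining and len(chosen) < budget:
        # Sentence contributing the largest number of not-yet-covered features.
        best = max(remaining, key=lambda s: len(remaining[s] - covered))
        gain = remaining[best] - covered
        if not gain:
            break  # nothing new can be covered any more
        chosen.append(best)
        covered |= gain
        del remaining[best]
    return chosen


# Toy input: feature labels here are invented placeholders.
toy = {
    "s1": {"a", "b"},
    "s2": {"b", "c", "d"},
    "s3": {"a"},
    "s4": {"e"},
}
selection = greedy_select(toy, budget=2)
```

The greedy choice is not guaranteed to be optimal, but it is simple and tends to work well when the recording time available to a patient is short.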
Segment from Český zvukový týdeník Aktualita (Czech Aktualita Sound Newsreel) issue no. 38A from 1943 contains footage from the Days of Czech Youth event organised by the Board of Trustees for the Education of Youth from 11 to 12 September. A concert of three brass bands, led by Miloš Kuba, and the Kühn Children's Choir was held on Peace Square at 5 pm on 11 September. A procession of the Board's members set out from Peace Square and continued through the streets of Prague. The event culminated with a track and field championship at Strahov Stadium where the winners of district rounds competed against each other. The spectators were welcomed by General Secretary of the Board František Teuner. The programme included a dance performance by girls in folk costumes. The event concluded with a speech by Minister of Education and People's Enlightenment and Chairman of the Board Emanuel Moravec, followed by a solemn oath "to the Führer and to the Fatherland".
Segment from Český zvukový týdeník Aktualita (Czech Aktualita Sound Newsreel) issue no. 35A from 1943 captures the mood of the District Youth Track and Field Championship for Ages 10-18, which was organised by the Board of Trustees for the Education of Youth in eighty towns of the Protectorate as part of the Days of Czech Youth event held from 28 to 29 August 1943. At the A. F. K. Stadium in Kolín nad Labem, approximately 1,500 athletes qualified for the Track and Field Championship of Bohemia and Moravia.
This corpus contains the text of De Latinae Linguae Reparatione authored by Marcus Antonius Sabellicus (1436–1506), annotated with respect to lemmas, part-of-speech tags, morphological features and syntactic dependencies according to the typological formalism of Universal Dependencies (UD).
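UD annotation is conventionally exchanged in the ten-column CoNLL-U format. Assuming this corpus is distributed that way (the description does not say so), the annotation layers listed above could be read with a small parser such as the sketch below. The function `parse_conllu` and the sample sentence are illustrative and are not taken from the corpus.

```python
from typing import Dict, List


def parse_conllu(text: str) -> List[List[Dict]]:
    """Parse CoNLL-U text into sentences of token dictionaries."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():  # a blank line ends a sentence
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):  # sentence-level comment
            continue
        cols = line.split("\t")
        # Skip multiword-token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        feats = {} if cols[5] == "_" else dict(
            pair.split("=", 1) for pair in cols[5].split("|"))
        current.append({
            "id": int(cols[0]), "form": cols[1], "lemma": cols[2],
            "upos": cols[3], "feats": feats,
            "head": int(cols[6]), "deprel": cols[7],
        })
    if current:
        sentences.append(current)
    return sentences


# Invented toy sentence, not from the corpus.
SAMPLE = "\n".join([
    "# text = Roma urbs est.",
    "1\tRoma\tRoma\tPROPN\t_\tCase=Nom|Gender=Fem|Number=Sing\t2\tnsubj\t_\t_",
    "2\turbs\turbs\tNOUN\t_\tCase=Nom|Gender=Fem|Number=Sing\t0\troot\t_\t_",
    "3\test\tsum\tAUX\t_\tMood=Ind|Number=Sing|Person=3\t2\tcop\t_\t_",
    "4\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_",
    "",
])
sentences = parse_conllu(SAMPLE)
```

Each token dictionary then exposes the lemma, the part-of-speech tag, the morphological features and the dependency head and relation.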
Segment from Český zvukový týdeník Aktualita (Czech Aktualita Sound Newsreel) issue no. 17B from 1945 shows a competition for the best decorated Easter egg, which was organised by girls from the Moravian Slovak branch of the Board of Trustees for the Education of Youth as part of the youth service of honour. Local women artisans, skilled in the traditional techniques, helped them with painting and etching patterns on Easter eggs.