The dataset contains 4731 frozen continuous Czech multiword expressions. Inflectional word forms are generated for those MWEs where applicable. In total, the dataset contains 24,807 MWE forms.
Czech Named Entity Corpus 1.1 fixes some issues of the Czech Named Entity Corpus 1.0: misannotated entities are fixed, all formats contain the same data, tmt format is replaced with treex format, all formats contain splitting into training, development and testing portion of the data. and SVV 267 314 (Teoretické základy informatiky a výpočetní lingvistiky), LM2010013 (LINDAT-CLARIN: Institut pro analýzu, zpracování a distribuci lingvistických dat), GPP406/12/P175 (Vybrané derivační vztahy pro automatické zpracování češtiny), PRVOUK (PRVOUK)
Czech Named Entity Corpus 2.0 is a corpus of 8993 Czech sentences with manually annotated 35220 Czech named entities, classified according to a two-level hierarchy of 46 named entities. and SVV 267 314 (Teoretické základy informatiky a výpočetní lingvistiky), LM2010013 (LINDAT-CLARIN: Institut pro analýzu, zpracování a distribuci lingvistických dat), GPP406/12/P175 (Vybrané derivační vztahy pro automatické zpracování češtiny), PRVOUK (PRVOUK)
The corpus consists of recordings from the Chamber of Deputies of the Parliament of the Czech Republic. It currently consists of 88 hours of speech data, which corresponds roughly to 0.5 million tokens. The annotation process is semi-automatic, as we are able to perform the speech recognition on the data with high accuracy (over 90%) and consequently align the resulting automatic transcripts with the speech. The annotator’s task is then to check the transcripts, correct errors, add proper punctuation and label speech sections with information about the speaker. The resulting corpus is therefore suitable for both acoustic model training for ASR purposes and training of speaker identification and/or verification systems. The archive contains 18 sound files (WAV PCM, 16-bit, 44.1 kHz, mono) and corresponding transcriptions in XML-based standard Transcriber format (http://trans.sourceforge.net)
The date of airing of a particular recording is encoded in the filename in the form SOUND_YYMMDD_*. Note that the recordings are usually aired in the early morning on the day following the actual Parliament session. If the recording is too long to fit in the broadcasting scheme, it is divided into several parts and aired on the consecutive days.
Tokenizer, POS Tagger, Lemmatizer, and Parser model based on the PDT-C 1.0 treebank (https://hdl.handle.net/11234/1-3185). The model documentation including performance can be found at https://ufal.mff.cuni.cz/udpipe/2/models#czech_pdtc1.0_model . To use these models, you need UDPipe version 2.1, which you can download from https://ufal.mff.cuni.cz/udpipe/2 .
CERED (Czech Relationship Dataset) is a family of datasets created via distant supervision on Czech Wikipedia and Wikidata. It was created as part of a thesis on Relationship Extraction (2020).
CERED0 is the largest dataset, it lacks negative relation and its relation inventory is huge.
CERED*n* is a subset of CERED*n-1* that satisfies some conditions. The methodology of curating the datasets is detailed in the thesis.
The format of the data is jsonL and the tools used to generate the dataset is python.
This is a dataset for natural language generation (NLG) in task-oriented spoken dialogue systems with Czech as the target language. It originated as a translation of the English San Francisco Restaurants dataset by Wen et al. (2015).
It includes input dialogue acts and the corresponding output natural language paraphrases in Czech. Since the dataset is intended for recurrent neural network based NLG systems using delexicalization, inflection tables for all slot values appearing verbatim in the text are provided.
The Czech RST Discourse Treebank 1.0 (CzRST-DT 1.0) is a dataset of 54 Czech journalistic texts manually annotated using the Rhetorical Structure Theory (RST). Each text document in the treebank is represented as a single tree-like structure, the nodes (discourse units) are interconnected through hierarchical rhetorical relations.
The dataset also contains concurrent annotations of five double-annotated documents.
The original texts are a part of the data annotated in the Prague Dependency Treebank, although the two projects are independent.
The corpus contains Czech expressive speech recorded using scenario-based approach by a professional female speaker. The scenario was created on the basis of previously recorded natural dialogues between a computer and seniors. and European Commission Sixth Framework Programme
Information Society Technologies Integrated Project IST-34434
Selected research articles and essays published in Czech Sociological Review from 1993 to 2016. Originally Czech, non-translated material only. 522 documents in total.
In terms of linguistic annotation, the corpus is lemmatised and tagged with morphosyntactic descriptors (MSDs).