The dataset contains two parts: the original Stanford Natural Language Inference (SNLI) dataset with automatic translations into Czech and, for some SNLI items, annotations of the Czech content together with explanations.
The Czech SNLI data contain both Czech and English premise-hypothesis pairs. The SNLI split into train/test/dev is preserved.
- CZtrainSNLI.csv: 550152 pairs
- CZtestSNLI.csv: 10000 pairs
- CZdevSNLI.csv: 10000 pairs
The explanation dataset contains batches of premise-hypothesis pairs. Each batch contains 1499 pairs. Each pair contains:
- reference to original SNLI example
- English premise and English hypothesis
- English gold label (one of Entailment, Contradiction, Neutral)
- automatically translated premise and hypothesis to Czech
- Czech gold label (one of entailment, contradiction, neutral, bad translation)
- explanations for Czech label
Example record:
CSNLI ID: 4857558207.jpg#4r1e
English premise: A mother holds her newborn baby.
English hypothesis: A person holding a child.
English gold label: entailment
Czech premise: Matka drží své novorozené dítě.
Czech hypothesis: Osoba, která drží dítě.
Czech gold label: Entailment
Explanation-hypothesis: Matka
Explanation-premise: Osoba
Explanation-relation: generalization
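A record like the one above can be read into a dictionary for quick inspection. The following is a minimal sketch that assumes the "key: value" layout of the example shown here; the files in the distribution may use a different layout, and the helper names parse_record and labels_agree are hypothetical.

```python
def parse_record(text):
    """Parse 'key: value' lines into a dict (layout assumed from the example record)."""
    record = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip lines that do not look like 'key: value'
        key, value = line.split(":", 1)  # split on the first colon only
        record[key.strip()] = value.strip()
    return record


def labels_agree(record):
    """Compare the English and Czech gold labels, ignoring letter case."""
    return record["English gold label"].lower() == record["Czech gold label"].lower()


EXAMPLE = """\
CSNLI ID: 4857558207.jpg#4r1e
English premise: A mother holds her newborn baby.
English hypothesis: A person holding a child.
English gold label: entailment
Czech premise: Matka drží své novorozené dítě.
Czech hypothesis: Osoba, která drží dítě.
Czech gold label: Entailment
"""

record = parse_record(EXAMPLE)
```

The comparison is case-insensitive because the example record writes the English label in lower case and the Czech label capitalized.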
Size of the explanations dataset:
- train: 159650
- dev: 2860
- test: 2880
Inter-Annotator Agreement (IAA)
Packages 1 and 12 annotate the same data. The IAA measured by the kappa score is 0.67 (substantial agreement).
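For orientation, agreement between two annotators can be computed as sketched below. The description does not say which kappa variant was used; this sketch assumes Cohen's kappa for two annotators, and the function name and sample labels are illustrative only, not taken from the dataset.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items labelled identically.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Illustrative labels only (not real annotations from packages 1 and 12).
annotator_1 = ["entailment", "entailment", "neutral", "contradiction"]
annotator_2 = ["entailment", "neutral", "neutral", "contradiction"]
kappa = cohen_kappa(annotator_1, annotator_2)
```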
The translation was performed via the LINDAT Translation Service.
Next, the translated pairs were checked manually. Annotators had no access to the original English gold label, but they could consult the original English pair.
Explanations were annotated as follows:
- if there is a part of the premise or hypothesis that is relevant for the annotator's decision, it is marked
- if there are two such parts and there exists a relation between them, the relation is marked
Possible relation types:
- generalization: white long skirt - skirt
- specification: dog - bulldog
- similar: couch - sofa
- independence: they have no instruments - they belong to the group
- exclusion: man - woman
Original SNLI dataset: https://nlp.stanford.edu/projects/snli/
LINDAT Translation Service: https://lindat.mff.cuni.cz/services/translation/
Czech OOV Inflection Dataset is a Czech inflection dataset of nouns, focused on evaluation in out-of-vocabulary (OOV) conditions. It consists of two parts: a standard lemma-disjoint train-dev-test split of a subset of noun paradigms from the existing morphological dictionary Czech MorfFlex 2.0 (files train, dev and test-MorfFlex); and a small set of neologisms from Čeština 2.0, annotated for inflected forms (file test-neologisms).
The corpus consists of recordings from the Chamber of Deputies of the Parliament of the Czech Republic. It currently consists of 88 hours of speech data, which corresponds roughly to 0.5 million tokens. The annotation process is semi-automatic, as we are able to perform the speech recognition on the data with high accuracy (over 90%) and consequently align the resulting automatic transcripts with the speech. The annotator’s task is then to check the transcripts, correct errors, add proper punctuation and label speech sections with information about the speaker. The resulting corpus is therefore suitable for both acoustic model training for ASR purposes and training of speaker identification and/or verification systems. The archive contains 18 sound files (WAV PCM, 16-bit, 44.1 kHz, mono) and corresponding transcriptions in XML-based standard Transcriber format (http://trans.sourceforge.net)
The airing date of a particular recording is encoded in the filename in the form SOUND_YYMMDD_*. Note that the recordings are usually aired in the early morning on the day following the actual Parliament session. If a recording is too long to fit in the broadcasting scheme, it is divided into several parts that are aired on consecutive days.
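The filename convention above can be decoded as in the following sketch. The sample filename in the usage note is hypothetical, and resolving the two-digit year relies on Python's default %y handling (values 00-68 map to 2000-2068), which is an assumption about the recordings rather than something stated in the description.

```python
import re
from datetime import date, datetime

# Matches the SOUND_YYMMDD_* prefix described above.
_SOUND_PREFIX = re.compile(r"^SOUND_(\d{6})_")


def airing_date(filename: str) -> date:
    """Return the airing date encoded in a SOUND_YYMMDD_* filename."""
    match = _SOUND_PREFIX.match(filename)
    if match is None:
        raise ValueError(f"unexpected filename: {filename!r}")
    # %y resolves the century using Python's default rule (an assumption here).
    return datetime.strptime(match.group(1), "%y%m%d").date()
```

For a hypothetical file named SOUND_110412_part1.wav, airing_date returns 12 April 2011. Since recordings are only usually aired the day after the session, and long recordings are split over several days, the session date cannot be derived reliably from the filename alone.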
Tokenizer, POS tagger, lemmatizer, and parser models based on the PDT-C 1.0 treebank (https://hdl.handle.net/11234/1-3185). The model documentation, including performance figures, can be found at https://ufal.mff.cuni.cz/udpipe/2/models#czech_pdtc1.0_model . To use these models, you need UDPipe version 2.1, which you can download from https://ufal.mff.cuni.cz/udpipe/2 .
CERED (Czech Relationship Dataset) is a family of datasets created via distant supervision on Czech Wikipedia and Wikidata. It was created as part of a thesis on Relationship Extraction (2020).
CERED0 is the largest dataset; it lacks a negative relation, and its relation inventory is huge.
CERED*n* is a subset of CERED*n-1* that satisfies some conditions. The methodology of curating the datasets is detailed in the thesis.
The data are in JSON Lines (JSONL) format, and the tools used to generate the dataset are written in Python.
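Since JSONL stores one JSON object per line, the files can be streamed without loading them whole. The sketch below makes no assumption about the field names in CERED; the helper name read_jsonl and the file name in the comment are hypothetical.

```python
import json
from typing import Any, Dict, Iterable, Iterator


def read_jsonl(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Yield one parsed JSON object per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:  # tolerate blank lines, e.g. at the end of the file
            yield json.loads(line)


# Usage with a hypothetical file name:
# with open("cered.jsonl", encoding="utf-8") as handle:
#     for example in read_jsonl(handle):
#         ...  # inspect example.keys() to discover the actual fields
```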
This is a dataset for natural language generation (NLG) in task-oriented spoken dialogue systems with Czech as the target language. It originated as a translation of the English San Francisco Restaurants dataset by Wen et al. (2015).
It includes input dialogue acts and the corresponding output natural language paraphrases in Czech. Since the dataset is intended for recurrent neural network based NLG systems using delexicalization, inflection tables for all slot values appearing verbatim in the text are provided.
The Czech RST Discourse Treebank 1.0 (CzRST-DT 1.0) is a dataset of 54 Czech journalistic texts manually annotated using Rhetorical Structure Theory (RST). Each text document in the treebank is represented as a single tree-like structure whose nodes (discourse units) are interconnected through hierarchical rhetorical relations.
The dataset also contains concurrent annotations of five double-annotated documents.
The original texts are a part of the data annotated in the Prague Dependency Treebank, although the two projects are independent.
The corpus contains Czech expressive speech recorded by a professional female speaker using a scenario-based approach. The scenario was created on the basis of previously recorded natural dialogues between a computer and seniors.
Acknowledgement: European Commission Sixth Framework Programme, Information Society Technologies Integrated Project IST-34434.
Selected research articles and essays published in Czech Sociological Review from 1993 to 2016. Originally Czech, non-translated material only. 522 documents in total.
In terms of linguistic annotation, the corpus is lemmatised and tagged with morphosyntactic descriptors (MSDs).
Czech subjectivity lexicon, i.e. a list of subjectivity clues for sentiment analysis in Czech. The list contains 4626 evaluative items (1672 positive and 2954 negative) together with their part of speech tags, polarity orientation and source information.
The core of the Czech subjectivity lexicon was obtained by automatic translation of a freely available English subjectivity lexicon downloaded from http://www.cs.pitt.edu/mpqa/subj_lexicon.html. To translate the data into Czech, we used the parallel corpus CzEng 1.0, which contains 15 million parallel sentences (233 million English and 206 million Czech tokens) from seven different types of sources, automatically annotated at surface and deep layers of syntactic representation. Afterwards, the lexicon was manually refined by an experienced annotator. The work on this project has been supported by the GAUK 3537/2011 grant and by SVV project number 267 314.