A lexicographical project that aims to digitize and align two Czech onomasiological dictionaries (Haller 1969–77; Klégr 2007) to create an integrated, multi-purpose digital lexico-semantic database of Czech.
The dataset contains two parts: the original Stanford Natural Language Inference (SNLI) dataset automatically translated into Czech, and, for a subset of the SNLI items, annotations of the Czech content together with explanations.
The Czech SNLI data contain both the Czech and the English premise-hypothesis pairs. The SNLI split into train/test/dev is preserved.
- CZtrainSNLI.csv: 550152 pairs
- CZtestSNLI.csv: 10000 pairs
- CZdevSNLI.csv: 10000 pairs
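The split sizes listed above can be checked after download. The following is a minimal sketch, assuming each CSV file has a single header row (the column layout is not documented here); the function names `count_pairs` and `check_split` are hypothetical.

```python
import csv
import os

# Split sizes as stated in the description above.
EXPECTED_PAIRS = {
    "CZtrainSNLI.csv": 550152,
    "CZtestSNLI.csv": 10000,
    "CZdevSNLI.csv": 10000,
}


def count_pairs(path, has_header=True):
    """Count data rows in a CSV file.

    `has_header=True` is an assumption about the file layout; pass False
    if the files turn out to have no header row.
    """
    with open(path, newline="", encoding="utf-8") as f:
        rows = sum(1 for _ in csv.reader(f))
    return rows - 1 if has_header and rows > 0 else rows


def check_split(path, expected=EXPECTED_PAIRS):
    """Return True if the file holds the documented number of pairs."""
    return count_pairs(path) == expected[os.path.basename(path)]
```

The `csv` module is used instead of counting raw lines so that quoted fields containing line breaks are counted as one record.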
The explanation dataset contains batches of premise-hypothesis pairs. Each batch contains 1499 pairs. Each pair contains:
- a reference to the original SNLI example
- the English premise and English hypothesis
- the English gold label (one of Entailment, Contradiction, Neutral)
- the premise and hypothesis automatically translated into Czech
- the Czech gold label (one of entailment, contradiction, neutral, bad translation)
- explanations for the Czech label
Example record:
CSNLI ID: 4857558207.jpg#4r1e
English premise: A mother holds her newborn baby.
English hypothesis: A person holding a child.
English gold label: entailment
Czech premise: Matka drží své novorozené dítě.
Czech hypothesis: Osoba, která drží dítě.
Czech gold label: Entailment
Explanation-hypothesis: Osoba
Explanation-premise: Matka
Explanation-relation: generalization
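A record of this shape can be represented as a small data structure. The sketch below assumes the "Label: value" layout shown in the example record; the actual file format may differ, and the attribute names, `FIELD_MAP`, and `parse_record` are hypothetical. Gold labels are lowercased because their casing differs between the field list and the example.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical mapping from the labels in the example record to attribute names.
FIELD_MAP = {
    "CSNLI ID": "snli_id",
    "English premise": "en_premise",
    "English hypothesis": "en_hypothesis",
    "English gold label": "en_label",
    "Czech premise": "cs_premise",
    "Czech hypothesis": "cs_hypothesis",
    "Czech gold label": "cs_label",
    "Explanation-hypothesis": "expl_hypothesis",
    "Explanation-premise": "expl_premise",
    "Explanation-relation": "expl_relation",
}


@dataclass
class ExplanationRecord:
    snli_id: str
    en_premise: str
    en_hypothesis: str
    en_label: str
    cs_premise: str
    cs_hypothesis: str
    cs_label: str
    # Explanation fields are optional: not every pair has marked parts.
    expl_hypothesis: Optional[str] = None
    expl_premise: Optional[str] = None
    expl_relation: Optional[str] = None


def parse_record(text: str) -> ExplanationRecord:
    """Parse 'Label: value' lines into an ExplanationRecord."""
    values = {}
    for line in text.strip().splitlines():
        # Split at the first colon only, so colons inside sentences survive.
        key, sep, value = line.partition(":")
        key = key.strip()
        if sep and key in FIELD_MAP:
            values[FIELD_MAP[key]] = value.strip()
    # Normalize label casing (assumption: casing carries no meaning).
    for name in ("en_label", "cs_label"):
        if name in values:
            values[name] = values[name].lower()
    return ExplanationRecord(**values)


sample = """\
CSNLI ID: 4857558207.jpg#4r1e
English premise: A mother holds her newborn baby.
English hypothesis: A person holding a child.
English gold label: entailment
Czech premise: Matka drží své novorozené dítě.
Czech hypothesis: Osoba, která drží dítě.
Czech gold label: Entailment
Explanation-relation: generalization
"""
record = parse_record(sample)
print(record.cs_label, record.expl_relation)  # prints "entailment generalization"
```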
Size of the explanations dataset:
- train: 159650
- dev: 2860
- test: 2880
Inter-Annotator Agreement (IAA)
Packages 1 and 12 contain annotations of the same data. The IAA, measured by the kappa score, is 0.67 (substantial agreement).
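For illustration, agreement between two annotators can be computed as sketched below. The text does not say which kappa variant was used; this sketch assumes Cohen's kappa over the gold labels of two annotators, and the function name `cohen_kappa` is hypothetical. On the commonly used Landis and Koch scale, values from 0.61 to 0.80 are read as substantial agreement.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items where both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label proportions.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


annotator_1 = ["entailment", "entailment", "contradiction", "neutral"]
annotator_2 = ["entailment", "contradiction", "contradiction", "neutral"]
print(round(cohen_kappa(annotator_1, annotator_2), 4))  # prints 0.6364
```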
The translation was performed via the LINDAT Translation Service.
The translated pairs were then checked manually (without access to the original English gold label); the original English pair could be consulted where needed.
Explanations were annotated as follows:
- if there is a part of the premise or hypothesis that is relevant for the annotator's decision, it is marked
- if there are two such parts and there exists a relation between them, the relation is marked
Possible relation types:
- generalization: white long skirt - skirt
- specification: dog - bulldog
- similar: couch - sofa
- independence: they have no instruments - they belong to the group
- exclusion: man - woman
Original SNLI dataset: https://nlp.stanford.edu/projects/snli/
LINDAT Translation Service: https://lindat.mff.cuni.cz/services/translation/
Czech OOV Inflection Dataset is a Czech inflection dataset of nouns, focused on evaluation in out-of-vocabulary (OOV) conditions. It consists of two parts: a standard lemma-disjoint train-dev-test split of a subset of noun paradigms from the existing morphological dictionary Czech MorfFlex 2.0 (files train, dev and test-MorfFlex); and a small set of neologisms from Čeština 2.0, annotated for inflected forms (file test-neologisms).
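A lemma-disjoint split means that no lemma appears in more than one of the train, dev and test parts, so the test forms always belong to unseen lemmas. The sketch below illustrates the idea only; the split fractions, the seed, and the function name `lemma_disjoint_split` are assumptions and do not describe how the released files were produced.

```python
import random


def lemma_disjoint_split(paradigms, dev_frac=0.1, test_frac=0.1, seed=0):
    """Split paradigms so that each lemma lands in exactly one part.

    `paradigms` maps a lemma to its list of inflected forms.
    The fractions and seed are illustrative assumptions.
    """
    lemmas = sorted(paradigms)  # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(lemmas)
    n_dev = int(len(lemmas) * dev_frac)
    n_test = int(len(lemmas) * test_frac)
    parts = {
        "dev": lemmas[:n_dev],
        "test": lemmas[n_dev:n_dev + n_test],
        "train": lemmas[n_dev + n_test:],
    }
    # Whole paradigms move together, which keeps the parts lemma-disjoint.
    return {
        name: {lemma: paradigms[lemma] for lemma in part}
        for name, part in parts.items()
    }


# Toy input with placeholder forms; real paradigms would list all case forms.
toy = {f"lemma{i}": [f"lemma{i}-form{j}" for j in range(3)] for i in range(10)}
split = lemma_disjoint_split(toy)
print({name: len(part) for name, part in split.items()})
```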