The dataset contains 4,731 frozen continuous Czech multiword expressions (MWEs). Inflectional word forms are generated for the MWEs where applicable. In total, the dataset contains 24,807 MWE forms.
The dataset consists of two parts: the original Stanford Natural Language Inference (SNLI) dataset with automatic translations to Czech, and, for a subset of the SNLI items, annotations of the Czech content together with explanations.
The Czech SNLI data contain both the Czech and the English premise-hypothesis pairs. The SNLI split into train/test/dev is preserved:
- CZtrainSNLI.csv: 550152 pairs
- CZtestSNLI.csv: 10000 pairs
- CZdevSNLI.csv: 10000 pairs
The explanation dataset is organized into batches of premise-hypothesis pairs. Each batch contains 1499 pairs. Each pair contains:
- reference to original SNLI example
- English premise and English hypothesis
- English gold label (one of Entailment, Contradiction, Neutral)
- premise and hypothesis automatically translated to Czech
- Czech gold label (one of entailment, contradiction, neutral, bad translation)
- explanations for the Czech label
Example record:
CSNLI ID: 4857558207.jpg#4r1e
English premise: A mother holds her newborn baby.
English hypothesis: A person holding a child.
English gold label: entailment
Czech premise: Matka drží své novorozené dítě.
Czech hypothesis: Osoba, která drží dítě.
Czech gold label: Entailment
Explanation-hypothesis: Matka
Explanation-premise: Osoba
Explanation-relation: generalization
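A record in the layout shown above can be read into a dictionary with a few lines of Python. This is only a minimal sketch: it assumes each field sits on its own line as "Key: value", which mirrors the example record but may not match the file format of the distributed data. The helper parse_record is hypothetical and not part of the dataset.

```python
from typing import Dict

# The example record from the text, one "Key: value" field per line
# (this layout is an assumption made for illustration).
RECORD_TEXT = """\
CSNLI ID: 4857558207.jpg#4r1e
English premise: A mother holds her newborn baby.
English hypothesis: A person holding a child.
English gold label: entailment
Czech premise: Matka drží své novorozené dítě.
Czech hypothesis: Osoba, která drží dítě.
Czech gold label: Entailment
Explanation-hypothesis: Matka
Explanation-premise: Osoba
Explanation-relation: generalization
"""


def parse_record(text: str) -> Dict[str, str]:
    """Split each non-empty line at the first ': ' into a key and a value."""
    record: Dict[str, str] = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, sep, value = line.partition(": ")
        if not sep:
            raise ValueError(f"line has no 'Key: value' shape: {line!r}")
        record[key.strip()] = value.strip()
    return record


record = parse_record(RECORD_TEXT)

# The label casing differs between fields in the example, so compare
# the English and Czech gold labels case-insensitively.
labels_agree = (
    record["English gold label"].lower() == record["Czech gold label"].lower()
)
print(record["Explanation-relation"], labels_agree)
```

Comparing the two gold labels in this way is one simple route to finding pairs whose label changed after translation, including those marked as bad translation.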
Size of the explanations dataset:
- train: 159650
- dev: 2860
- test: 2880
Inter-Annotator Agreement (IAA)
Packages 1 and 12 contain annotations of the same data. The IAA, measured by the kappa score, is 0.67 (substantial agreement).
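For readers who want to reproduce an agreement figure of this kind, the sketch below computes Cohen's kappa for two annotators. The text does not state which kappa variant was used, so Cohen's kappa is an assumption here, and the label lists are made-up toy data, not taken from packages 1 and 12.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty items")
    n = len(labels_a)

    # Observed agreement: share of items given the same label by both.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of each annotator's label proportions,
    # summed over all labels.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    labels = set(counts_a) | set(counts_b)
    p_e = sum(counts_a[lab] * counts_b[lab] for lab in labels) / (n * n)

    if p_e == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


# Toy annotations (invented for illustration only).
annotator_1 = ["entailment", "entailment", "neutral", "contradiction"]
annotator_2 = ["entailment", "neutral", "neutral", "contradiction"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

Kappa corrects the raw agreement rate for the agreement expected by chance, which is why it is lower than the plain proportion of matching labels.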
The translation was performed via the LINDAT translation service.
The translated pairs were then checked manually, without access to the original English gold label; annotators could optionally consult the original pair.
Explanations were annotated as follows:
- if there is a part of the premise or hypothesis that is relevant for the annotator's decision, it is marked
- if there are two such parts and there exists a relation between them, the relation is marked
Possible relation types:
- generalization: white long skirt - skirt
- specification: dog - bulldog
- similar: couch - sofa
- independence: they have no instruments - they belong to the group
- exclusion: man - woman
Original SNLI dataset: https://nlp.stanford.edu/projects/snli/
LINDAT Translation Service: https://lindat.mff.cuni.cz/services/translation/
The aim of the course is to introduce digital humanities and to describe various aspects of digital content processing.
The course consists of 10 lessons with video material and a PowerPoint presentation with the same content.
Every lesson contains a practical session: either a Jupyter Notebook for working in Python or a text file with a short description of the task. Most of the practical tasks consist of running the programme and analysing the results.
Although the course does not focus on programming, the code can be reused easily in individual projects.
Some experience in running Python code is desirable but not required.
Annotation of named entities added to the existing Parallel Global Voices resource, ces-eng language pair. The named entity annotations distinguish four classes: Person, Organization, Location, Misc. The annotation uses the IOB schema (one annotation per token, marking the beginning and the inside of multi-word annotations). The named entity linking (NEL) annotation contains Wikidata Qnames.
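The IOB schema can be decoded into entity spans as sketched below. The exact tag strings used in the data are not given here, so the form "B-Person" / "I-Person" / "O" is an assumption, and the sentence is an invented example rather than a line from Parallel Global Voices.

```python
from typing import List, Sequence, Tuple


def iob_to_spans(
    tokens: Sequence[str], tags: Sequence[str]
) -> List[Tuple[str, str]]:
    """Group per-token IOB tags into (entity text, class) pairs."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")

    spans: List[Tuple[str, str]] = []
    current: List[str] = []
    current_label = ""

    def flush() -> None:
        # Close the entity collected so far, if any.
        if current:
            spans.append((" ".join(current), current_label))
            current.clear()

    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "B" or (prefix == "I" and (not current or label != current_label)):
            # A B- tag, or an I- tag with no matching open entity, starts a new one.
            flush()
            current.append(token)
            current_label = label
        elif prefix == "I":
            # Continuation of the open entity.
            current.append(token)
        else:
            # "O": the token is outside any entity.
            flush()
    flush()
    return spans


# Invented example sentence with assumed tag names.
tokens = ["Václav", "Havel", "visited", "Prague", "."]
tags = ["B-Person", "I-Person", "O", "B-Location", "O"]
entities = iob_to_spans(tokens, tags)
print(entities)
```

Treating a stray I- tag as the start of a new entity is a lenient design choice; a stricter reader could raise an error instead.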