dc.contributor.author | Víta, Martin |
dc.contributor.author | Nevěřilová, Zuzana |
dc.date.accessioned | 2024-07-10T09:34:28Z |
dc.date.available | 2024-07-10T09:34:28Z |
dc.date.issued | 2024 |
dc.identifier.uri | http://hdl.handle.net/11234/1-5548 |
dc.description | The dataset has two parts: (1) the original Stanford Natural Language Inference (SNLI) dataset with automatic translations into Czech, and (2) for some SNLI items, annotations of the Czech content together with explanations.
The Czech SNLI data contain both the Czech and the English premise-hypothesis pairs. The SNLI split into train/test/dev is preserved.
- CZtrainSNLI.csv: 550152 pairs
- CZtestSNLI.csv: 10000 pairs
- CZdevSNLI.csv: 10000 pairs
The explanation dataset consists of batches of premise-hypothesis pairs. Each batch contains 1499 pairs. Each pair contains:
- a reference to the original SNLI example
- the English premise and English hypothesis
- the English gold label (one of Entailment, Contradiction, Neutral)
- the premise and hypothesis automatically translated into Czech
- the Czech gold label (one of entailment, contradiction, neutral, bad translation)
- explanations for the Czech label
Example record:
CSNLI ID: 4857558207.jpg#4r1e
English premise: A mother holds her newborn baby.
English hypothesis: A person holding a child.
English gold label: entailment
Czech premise: Matka drží své novorozené dítě.
Czech hypothesis: Osoba, která drží dítě.
Czech gold label: Entailment
Explanation-hypothesis: Matka
Explanation-premise: Osoba
Explanation-relation: generalization
Size of the explanations dataset:
- train: 159650
- dev: 2860
- test: 2880
Inter-Annotator Agreement (IAA): Packages 1 and 12 annotate the same data. The IAA measured by the kappa score is 0.67 (substantial agreement).
The translation was performed via the LINDAT translation service. The translated pairs were then checked manually (without access to the original English gold label), with the option of checking the original pair.
Explanations were annotated as follows:
- if a part of the premise or hypothesis is relevant to the annotator's decision, it is marked
- if there are two such parts and a relation exists between them, the relation is marked
Possible relation types:
- generalization: white long skirt - skirt
- specification: dog - bulldog
- similar: couch - sofa
- independence: they have no instruments - they belong to the group
- exclusion: man - woman
Original SNLI dataset: https://nlp.stanford.edu/projects/snli/
LINDAT Translation Service: https://lindat.mff.cuni.cz/services/translation/ |
dc.language.iso | ces |
dc.publisher | Masaryk University, NLP Centre |
dc.rights | Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) |
dc.rights.uri | http://creativecommons.org/licenses/by-sa/4.0/ |
dc.subject | natural language inference |
dc.subject | textual entailment |
dc.title | Czech Natural Language Inference Dataset with Explanations |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | text |
dc.rights.label | PUB |
has.files | yes |
branding | LINDAT / CLARIAH-CZ |
contact.person | Zuzana Nevěřilová xpopelk@fi.muni.cz Masaryk University, NLP Centre |
size.info | 17210 entries |
files.size | 148973204 |
files.count | 16 |
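The description reports inter-annotator agreement as a kappa score of 0.67 for packages 1 and 12. The record does not say which kappa variant was used, so the sketch below assumes Cohen's kappa for two annotators. The function name `cohen_kappa` and the toy label lists are illustrative only and are not taken from the dataset.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    Assumption: the record only says "kappa score"; Cohen's variant
    is used here purely for illustration.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independence, from each annotator's label counts.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    categories = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy labels (hypothetical, not real annotations from packages 1 and 12).
annotator_1 = ["entailment", "neutral", "contradiction",
               "neutral", "entailment", "bad translation"]
annotator_2 = ["entailment", "neutral", "contradiction",
               "entailment", "entailment", "bad translation"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

To apply this to the real data, the Czech gold labels from the two overlapping packages would have to be aligned by SNLI example reference first; the column names needed for that are not given in this record.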
Files in this item
Download all files in this item (142.07 MB)
Licence category:
Licence: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
Publicly Available
Name | Size | Format | Description | MD5 |
CZtrainSNLI.csv | 132.19 MB | Unknown | Unknown | 33b2b79cbfe81c6a22eff078b4c51ade |
CZdevSNLI.csv | 2.53 MB | Unknown | Unknown | 90961cb149fc6fbac1b810f20f11afbc |
CZtestSNLI.csv | 2.52 MB | Unknown | Unknown | 328bfc8aba8f10ca68502d7bbf3ca1d6 |
README.md | 3.8 KB | Unknown | Unknown | bfd9189b667e1c14f0ac9e7be0c200d3 |
CSSNLI_relations_1.csv | 407.42 KB | Unknown | Unknown | 32cf8f558b5d1c2fd88e84bcbd2f95fa |
CSSNLI_relations_2.csv | 477.63 KB | Unknown | Unknown | b670098020d71c016ffa48d16c6b5592 |
CSSNLI_relations_3.csv | 467.25 KB | Unknown | Unknown | 50cbf348d7b06f6fe3397d83418fa28b |
CSSNLI_relations_4.csv | 365.28 KB | Unknown | Unknown | 45ad1d9920009ec5fb1f113364d08a03 |
CSSNLI_relations_5.csv | 331.37 KB | Unknown | Unknown | f6cbd2ecb1dbf67d6c70e0b0821b8524 |
CSSNLI_relations_6.csv | 491.68 KB | Unknown | Unknown | fba0eea1354708e2a5701b3f7d58893d |
CSSNLI_relations_7.csv | 599.63 KB | Unknown | Unknown | 1759ff46a7980687cd2aaf5142474e1b |
CSSNLI_relations_8.csv | 333.67 KB | Unknown | Unknown | 3cc6dd4eec76c6f1843e7153b3e86c16 |
CSSNLI_relations_9.csv | 407.17 KB | Unknown | Unknown | f8974890895b2885ba05a443641a7372 |
CSSNLI_relations_10.csv | 256.53 KB | Unknown | Unknown | 8c9c43bca68e613b207cc39782da5dbe |
CSSNLI_relations_11.csv | 414.12 KB | Unknown | Unknown | 97cce46e20aac1dc7553a28236073a0b |
CSSNLI_relations_12.csv | 396.54 KB | Unknown | Unknown | 0af176ef75b7b689b30f086b3e6b8c1c |
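The MD5 values listed for each file can be used to check that a download is complete and uncorrupted. The following is a minimal sketch using Python's standard `hashlib`; the helper names `md5_of_file` and `verify`, and the assumption that the files sit in one local directory, are illustrative. Only two of the published checksums are included to keep the example short.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Checksums copied from the file listing of this record (subset only).
EXPECTED_MD5 = {
    "CZdevSNLI.csv": "90961cb149fc6fbac1b810f20f11afbc",
    "CZtestSNLI.csv": "328bfc8aba8f10ca68502d7bbf3ca1d6",
}


def verify(directory, expected=EXPECTED_MD5):
    """Map each expected file name to True if it exists and its MD5 matches."""
    results = {}
    for name, checksum in expected.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5_of_file(path) == checksum
    return results


if __name__ == "__main__":
    # Hypothetical location: the current directory holds the downloaded files.
    for name, ok in verify(".").items():
        print(f"{name}: {'OK' if ok else 'MISSING OR MISMATCH'}")
```

A file reported as a mismatch should be downloaded again before use; the remaining checksums from the listing can be added to `EXPECTED_MD5` in the same way.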