Rights: PUB - LINDAT/CLARIAH-CZ Catalog Search Results

Search constraints: Rights PUB, Date 2024

1. Czech Natural Language Inference Dataset with Explanations

Creator:: Víta, Martin and Nevěřilová, Zuzana
Publisher:: Masaryk University, NLP Centre
Type:: text and corpus
Subject:: natural language inference and textual entailment
Language:: Czech
Description:: The dataset contains two parts: the original Stanford Natural Language Inference (SNLI) dataset with automatic translations to Czech and, for some SNLI items, annotations of the Czech content with explanations.

The Czech SNLI data contain both Czech and English premise-hypothesis pairs. The SNLI split into train/test/dev is preserved.
- CZtrainSNLI.csv: 550152 pairs
- CZtestSNLI.csv: 10000 pairs
- CZdevSNLI.csv: 10000 pairs

The explanation dataset contains batches of premise-hypothesis pairs. Each batch contains 1499 pairs. Each pair contains:
- reference to original SNLI example
- English premise and English hypothesis
- English gold label (one of Entailment, Contradiction, Neutral)
- automatically translated premise and hypothesis to Czech
- Czech gold label (one of entailment, contradiction, neutral, bad translation)
- explanations for Czech label

Example record:
CSNLI ID: 4857558207.jpg#4r1e
English premise: A mother holds her newborn baby.
English hypothesis: A person holding a child.
English gold label: entailment
Czech premise: Matka drží své novorozené dítě.
Czech hypothesis: Osoba, která drží dítě.
Czech gold label: Entailment
Explanation-hypothesis: Matka
Explanation-premise: Osoba
Explanation-relation: generalization

Size of the explanations dataset:
- train: 159650
- dev: 2860
- test: 2880

Inter-Annotator Agreement (IAA)
Packages 1 and 12 annotate the same data. The IAA measured by the kappa score is 0.67 (substantial agreement).

The translation was performed via the LINDAT translation service. The translated pairs were then checked manually, without access to the original English gold label but with the option of consulting the original pair.

Explanations were annotated as follows:
- if there is a part of the premise or hypothesis that is relevant for the annotator's decision, it is marked
- if there are two such parts and there exists a relation between them, the relation is marked

Possible relation types:
- generalization: white long skirt - skirt
- specification: dog - bulldog
- similar: couch - sofa
- independence: they have no instruments - they belong to the group
- exclusion: man - woman

Original SNLI dataset: https://nlp.stanford.edu/projects/snli/
LINDAT Translation Service: https://lindat.mff.cuni.cz/services/translation/
Rights:: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0), http://creativecommons.org/licenses/by-sa/4.0/, and PUB
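
Note on the IAA figure: the record reports a kappa score but does not say which variant was used. The sketch below assumes Cohen's kappa for two annotators and shows how such a score could be computed from two packages that label the same pairs. The function name and the toy labels are hypothetical and are not taken from the dataset.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(labels_a)
    # Observed agreement: share of items on which both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, estimated from each annotator's label distribution.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used the same single label throughout
    return (observed - expected) / (1.0 - expected)


# Toy labels, invented for illustration only.
package_1 = ["entailment", "neutral", "contradiction",
             "neutral", "bad translation", "entailment"]
package_12 = ["entailment", "neutral", "contradiction",
              "entailment", "bad translation", "entailment"]
kappa = cohen_kappa(package_1, package_12)
print(round(kappa, 2))
```

On the conventional Landis and Koch scale, values between 0.61 and 0.80 are read as substantial agreement, which matches the wording in the description.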

2. Czech OOV Inflection Dataset

Creator:: Sourada, Tomáš
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text, computationalLexicon, and lexicalConceptualResource
Subject:: morphological generation, morphology, neologisms database, and Czech
Language:: Czech
Description:: Czech OOV Inflection Dataset is a Czech inflection dataset of nouns, focused on evaluation in out-of-vocabulary (OOV) conditions. It consists of two parts: a standard lemma-disjoint train-dev-test split of a subset of noun paradigms from the existing morphological dictionary Czech MorfFlex 2.0 (files train, dev and test-MorfFlex); and a small set of neologisms from Čeština 2.0, annotated for inflected forms (file test-neologisms).
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
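
Note on the lemma-disjoint split: the record does not describe how the split was produced or how the files are laid out. As a minimal sketch of the general idea, the code below assigns whole lemmas (rather than individual forms) to train, dev and test, so that no lemma seen in training appears in evaluation. The function name, the (lemma, form) tuple layout, the share parameters and the toy entries are all assumptions made for illustration.

```python
import random


def lemma_disjoint_split(entries, dev_share=0.1, test_share=0.1, seed=0):
    """Split (lemma, form) entries so that each lemma lands in exactly one part."""
    lemmas = sorted({lemma for lemma, _ in entries})
    random.Random(seed).shuffle(lemmas)  # reproducible shuffle of the lemma list
    n_dev = int(len(lemmas) * dev_share)
    n_test = int(len(lemmas) * test_share)
    dev_lemmas = set(lemmas[:n_dev])
    test_lemmas = set(lemmas[n_dev:n_dev + n_test])
    parts = {"train": [], "dev": [], "test": []}
    for lemma, form in entries:
        if lemma in dev_lemmas:
            parts["dev"].append((lemma, form))
        elif lemma in test_lemmas:
            parts["test"].append((lemma, form))
        else:
            parts["train"].append((lemma, form))
    return parts


# Toy paradigm fragments, invented for illustration only.
entries = [
    ("pes", "pes"), ("pes", "psa"),
    ("žena", "žena"), ("žena", "ženy"),
    ("město", "město"), ("město", "města"),
    ("hrad", "hrad"), ("hrad", "hradu"),
]
parts = lemma_disjoint_split(entries, dev_share=0.25, test_share=0.25)
for name, rows in parts.items():
    print(name, sorted({lemma for lemma, _ in rows}))
```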

3. The Use of Machine Translation by Ukrainian War Refugees in Czechia

Creator:: Agapova, Anna and Špačková, Stanislava
Publisher:: Oxford University Press
Type:: TEXT and Spreadsheet
Subject:: machine translation, migration, Ukrainian refugees, Russo-Ukrainian war, and Czech Republic
Language:: Ukrainian and Russian
Description:: Data from a questionnaire survey conducted from 2022-08-25 to 2022-11-15 and exploring the use of machine translation by Ukrainian refugees in the Czech Republic. The presented spreadsheet contains minimally processed data exported from the two questionnaires that were created in Google Forms in the Ukrainian and the Russian language. The links to these questionnaires were distributed by three methods: by direct email to particular refugees whose contact details the authors obtained while volunteering; through a non-profit organisation helping refugees (Vesna women’s education institution); and on social networks, by posting links to the survey in groups associating the Ukrainian community across Czech regions and towns. Since we asked potential respondents to spread the questionnaire further, we could not prevent it from reaching Ukrainians who had arrived in Czechia previously or who had received temporary protection in other countries. For this reason, the textual answers to question 1.5 "Which country are you in right now?" were replaced in the dataset by numbers (1 for the Czech Republic, 2 for other countries), so that we could separate out the data of respondents not located in the Czech Republic, which were irrelevant for our survey. Also, in this version of the dataset, the textual answers to question 1.6 "How many months have you been to this country?" were replaced by numbers, so that we could separate the data of respondents who arrived in the Czech Republic in February 2022 or later from the other data (0 for those staying in Czechia before February 2022, 1 for those staying in Czechia since February 2022 or later, 2 for those staying in other countries).
Rights:: Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB
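
Note on the numeric codes: the description gives the coding of questions 1.5 and 1.6 but not the column names used in the spreadsheet. The sketch below shows one way the target group (respondents in the Czech Republic who arrived in February 2022 or later) could be selected once the rows are loaded. The column names `q1_5` and `q1_6`, the helper names and the toy rows are hypothetical.

```python
def to_code(value):
    """Return a spreadsheet cell as an integer code, or None if it is not numeric."""
    try:
        return int(float(str(value).strip()))
    except (ValueError, OverflowError):
        return None


def select_target_respondents(rows, country_col="q1_5", arrival_col="q1_6"):
    """Keep rows coded 1 for question 1.5 and 1 for question 1.6.

    The column names are assumptions; adjust them to the actual spreadsheet header.
    """
    return [
        row for row in rows
        if to_code(row.get(country_col)) == 1    # 1 = located in the Czech Republic
        and to_code(row.get(arrival_col)) == 1   # 1 = in Czechia since February 2022 or later
    ]


# Toy rows, invented for illustration only.
rows = [
    {"q1_5": "1", "q1_6": "1"},   # kept
    {"q1_5": "1", "q1_6": "0"},   # in Czechia before February 2022
    {"q1_5": "2", "q1_6": "2"},   # staying in another country
    {"q1_5": 1.0, "q1_6": 1.0},   # kept; numeric cells as some spreadsheet readers return them
]
selected = select_target_respondents(rows)
print(len(selected))
```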

4. Treebanks for Unified Taxonomy of Deep Syntactic Relations

Creator:: Droganova, Kira and Zeman, Daniel
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: treebank and semantic roles
Language:: Czech, Spanish, Catalan, and Finnish
Description:: The datasets described in Droganova, Kira, and Daniel Zeman. "Towards a Unified Taxonomy of Deep Syntactic Relations." Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). 2024. Four languages are included in this release. English PropBank is omitted due to its license terms.
Rights:: Licence Universal Dependencies v2.14, https://lindat.mff.cuni.cz/repository/xmlui/page/license-ud-2.14, and PUB

5. Word Importance Dataset

Creator:: Osuský, Adam and Javorský, Dávid
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: word importance, ranking, and importance ranking
Language:: English
Description:: This dataset comprises a corpus of 50 text contexts, each about 60 words in length, sourced from five distinct domains. Each context has been evaluated by multiple annotators who identified and ranked the most important words (up to 10% of each text) according to their perceived significance. The annotators followed specific guidelines to ensure consistency in word selection and ranking. For further details, please refer to the cited source.

rankings_task.csv
This CSV contains information about the contexts to be annotated:
- id: A unique identifier for each task.
- content: The context to be ranked.

rankings_ranking.csv
This CSV contains the rankings submitted for the assignments. It has four columns:
- id: A unique identifier for each ranking entry.
- score: The score assigned to the entry.
- word_order: A JSON value giving the word positions selected by an annotator, in the order the annotator ranked them.
- assignment_id: A reference ID linking to the assignments.

rankings_assignment.csv
This CSV tracks the completion status of tasks by users. It has four columns:
- id: A unique identifier for each assignment entry.
- is_completed: A binary indicator (1 for completed, 0 for not completed).
- task_id: A reference ID linking to the tasks.
- user_id: The identifier for the user who should complete the task (rank the words).

Known Issues: Each annotator was intended to rank each context only once. However, due to a bug in the deployment of the annotation tool, some entries may be duplicated. Users of this dataset should be aware of this issue and verify the uniqueness of the annotations where necessary.

This dataset is part of the work on a bachelor thesis: OSUSKÝ, Adam. Predicting Word Importance Using Pre-Trained Language Models. Bachelor thesis, supervisor Javorský, Dávid. Prague: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics, 2024.
Rights:: Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB
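
Note on the duplicate entries: the description does not say in which file the duplicates appear. As a minimal sketch, the code below assumes they show up as more than one ranking for the same (user_id, task_id) pair, joins rankings to assignments through assignment_id, and keeps the first ranking per pair. The column names come from the description; the function name, the keep-first policy and the inline toy data are assumptions.

```python
import csv
import io
import json

# Toy file contents, invented for illustration only.
ASSIGNMENTS_CSV = """id,is_completed,task_id,user_id
1,1,10,100
2,1,10,100
3,1,11,100
4,0,10,101
"""
RANKINGS_CSV = """id,score,word_order,assignment_id
1,0.5,"[3, 7, 1]",1
2,0.5,"[3, 7, 1]",2
3,0.8,"[2, 5]",3
"""


def unique_rankings(assignment_rows, ranking_rows):
    """Keep the first ranking for each (user_id, task_id) pair."""
    assignments = {row["id"]: row for row in assignment_rows}
    seen = set()
    kept = []
    for ranking in ranking_rows:
        assignment = assignments.get(ranking["assignment_id"])
        if assignment is None:
            continue  # ranking points at an unknown assignment
        key = (assignment["user_id"], assignment["task_id"])
        if key in seen:
            continue  # same annotator already ranked this context
        seen.add(key)
        kept.append({
            "user_id": assignment["user_id"],
            "task_id": assignment["task_id"],
            "word_order": json.loads(ranking["word_order"]),
        })
    return kept


assignment_rows = list(csv.DictReader(io.StringIO(ASSIGNMENTS_CSV)))
ranking_rows = list(csv.DictReader(io.StringIO(RANKINGS_CSV)))
deduplicated = unique_rankings(assignment_rows, ranking_rows)
for entry in deduplicated:
    print(entry)
```

With the real files, the two `io.StringIO` sources would be replaced by the opened rankings_assignment.csv and rankings_ranking.csv. Whether to keep the first or the last duplicate is a judgment call that the description leaves open.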
