Transcripts of longitudinal audio recordings of 7 Czech typical monolingual children between 1;7 to 3;9. Files are in plain text with UTF-8 encoding. Each file represents one recording session of one of the target children and is named with the presudonym of the child and her age at the given session in form YMMDD. Transcription rules and other details are to find on the homepage coczefla.ff.cuni.cz.
A new version of the previously published corpus Chroma. The version 2023.04 includes six children. Two transcripts (Julie20221, Klara30424) were removed since they did not meet the criteria on the dialogical format. The transcripts were revised (eliminating typing errors and inconsistencies in the transcription format) and morphologically annotated by the automatic tool MorphoDiTa. Detailed manual control of the annotation was performed on children's utterances; the annotation of adult data was not checked yet. Files are in plain text with UTF-8 encoding. Each file represents one recording session of one of the target children and is named with the alias of the child and their age at the given session in form YMMDD. Transcription rules and other details can be found on the homepage coczefla.ff.cuni.cz.
A new version of the previously published corpus Chroma wih morphological annotation. The version 2023.07 differs from 2023.04 in that it includes all seven children and it went through an additional careful check of consistency and conformity to the CHAT transcription principles.
Two transcripts (Julie20221, Klara30424) from the previous versions (2022.07, 2019.07) were removed since they did not meet our criteria on dialogical format. All transcripts of recordings made during one day were split into one file. Thus, version 2023.07 consists of 183 files/transcripts. The number of utterances and tokens given here in LINDAT corresponds to children's lines only.
Files are in plain text with UTF-8 encoding. Each file represents one recording session of one of the target children and is named with the alias of the child and their age at the given session in form YMMDD. Transcription rules and other details can be found on the homepage coczefla.ff.cuni.cz.
The Czech translation of SQuAD 2.0 and SQuAD 1.1 datasets contains automatically translated texts, questions and answers from the training set and the development set of the respective datasets.
The test set is missing, because it is not publicly available.
The data is released under the CC BY-NC-SA 4.0 license.
If you use the dataset, please cite the following paper (the exact format was not available during the submission of the dataset): Kateřina Macková and Straka Milan: Reading Comprehension in Czech via Machine Translation and Cross-lingual Transfer, presented at TSD 2020, Brno, Czech Republic, September 8-11 2020.
ELITR Minuting Corpus consists of transcripts of meetings in Czech and English, their manually created summaries ("minutes") and manual alignments between the two.
Czech meetings are in the computer science and public administration domains and English meetings are in the computer science domain.
Each transcript has one or multiple corresponding minutes files. Alignments are only provided for a portion of the data.
This corpus contains 59 Czech and 120 English meeting transcripts, consisting of 71097 and 87322 dialogue turns respectively. For Czech meetings, we provide 147 total minutes with 55 of them aligned. For English meetings, it is 256 total minutes with 111 of them aligned.
Please find a more detailed description of the data in the included README and stats.tsv files.
If you use this corpus, please cite:
Nedoluzhko, A., Singh, M., Hledíková, M., Ghosal, T., and Bojar, O.
(2022). ELITR Minuting Corpus: A novel dataset for automatic minuting
from multi-party meetings in English and Czech. In Proceedings of the
13th International Conference on Language Resources and Evaluation
(LREC-2022), Marseille, France, June. European Language Resources
Association (ELRA). In print.
@inproceedings{elitr-minuting-corpus:2022,
author = {Anna Nedoluzhko and Muskaan Singh and Marie
Hled{\'{\i}}kov{\'{a}} and Tirthankar Ghosal and Ond{\v{r}}ej Bojar},
title = {{ELITR} {M}inuting {C}orpus: {A} Novel Dataset for
Automatic Minuting from Multi-Party Meetings in {E}nglish and {C}zech},
booktitle = {Proceedings of the 13th International Conference
on Language Resources and Evaluation (LREC-2022)},
year = 2022,
month = {June},
address = {Marseille, France},
publisher = {European Language Resources Association (ELRA)},
note = {In print.}
}
This machine translation test set contains 2223 Czech sentences collected within the FAUST project (https://ufal.mff.cuni.cz/grants/faust, http://hdl.handle.net/11234/1-3308).
Each original (noisy) sentence was normalized (clean1 and clean2) and translated to English independently by two translators.
Data
-------
Hausa Visual Genome 1.0, a multimodal dataset consisting of text and images suitable for English-to-Hausa multimodal machine translation tasks and multimodal research. We follow the same selection of short English segments (captions) and the associated images from Visual Genome as the dataset Hindi Visual Genome 1.1 has. We automatically translated the English captions to Hausa and manually post-edited, taking the associated images into account.
The training set contains 29K segments. Further 1K and 1.6K segments are provided in development and test sets, respectively, which follow the same (random) sampling from the original Hindi Visual Genome.
Additionally, a challenge test set of 1400 segments is available for the multi-modal task. This challenge test set was created in Hindi Visual Genome by searching for (particularly) ambiguous English words based on the embedding similarity and manually selecting those where the image helps to resolve the ambiguity.
Dataset Formats
-----------------------
The multimodal dataset contains both text and images.
The text parts of the dataset (train and test sets) are in simple tab-delimited plain text files.
All the text files have seven columns as follows:
Column1 - image_id
Column2 - X
Column3 - Y
Column4 - Width
Column5 - Height
Column6 - English Text
Column7 - Hausa Text
The image part contains the full images with the corresponding image_id as the file name. The X, Y, Width, and Height columns indicate the rectangular region in the image described by the caption.
Data Statistics
--------------------
The statistics of the current release are given below.
Parallel Corpus Statistics
-----------------------------------
Dataset Segments English Words Hausa Words
---------- -------- ------------- -----------
Train 28930 143106 140981
Dev 998 4922 4857
Test 1595 7853 7736
Challenge Test 1400 8186 8752
---------- -------- ------------- -----------
Total 32923 164067 162326
The word counts are approximate, prior to tokenization.
Citation
-----------
If you use this corpus, please cite the following paper:
@InProceedings{abdulmumin-EtAl:2022:LREC,
author = {Abdulmumin, Idris
and Dash, Satya Ranjan
and Dawud, Musa Abdullahi
and Parida, Shantipriya
and Muhammad, Shamsuddeen
and Ahmad, Ibrahim Sa'id
and Panda, Subhadarshi
and Bojar, Ond{\v{r}}ej
and Galadanci, Bashir Shehu
and Bello, Bello Shehu},
title = "{Hausa Visual Genome: A Dataset for Multi-Modal English to Hausa Machine Translation}",
booktitle = {Proceedings of the Language Resources and Evaluation Conference},
month = {June},
year = {2022},
address = {Marseille, France},
publisher = {European Language Resources Association},
pages = {6471--6479},
url = {https://aclanthology.org/2022.lrec-1.694}
}
A richly annotated and genre-diversified language resource, The Prague Dependency Treebank – Consolidated 1.0 (PDT-C 1.0, or PDT-C in short in the sequel) is a consolidated release of the existing PDT-corpora of Czech data, uniformly annotated using the standard PDT scheme. PDT-corpora included in PDT-C: Prague Dependency Treebank (the original PDT contents, written newspaper and journal texts from three genres); Czech part of Prague Czech-English Dependency Treebank (translated financial texts, from English), Prague Dependency Treebank of Spoken Czech (spoken data, including audio and transcripts and multiple speech reconstruction annotation); PDT-Faust (user-generated texts). The difference from the separately published original treebanks can be briefly described as follows: it is published in one package, to allow easier data handling for all the datasets; the data is enhanced with a manual linguistic annotation at the morphological layer and new version of morphological dictionary is enclosed; a common valency lexicon for all four original parts is enclosed. Documentation provides two browsing and editing desktop tools (TrEd and MEd) and the corpus is also available online for searching using PML-TQ.
Ministerstvo školství, mládeže a tělovýchovy České republiky@@LM2018101@@LINDAT/CLARIAH-CZ: Digitální výzkumná infrastruktura pro jazykové technologie, umění a humanitní vědy@@nationalFunds@@✖[remove]8