This small dataset contains 3 speech corpora collected using the Alex Translate telephone service (https://ufal.mff.cuni.cz/alex#alex-translate).
The "part1" and "part2" corpora contain English speech with transcriptions and Czech translations. These recordings were collected from users of the service. Part 1 contains earlier recordings, filtered to include only clean speech; Part 2 contains later recordings with no filtering applied.
The "cstest" corpus contains recordings of artificially created sentences, each containing one or more Czech names of places in the Czech Republic. These were recorded by a multinational group of students studying in Prague.
We present a test corpus of audio recordings and transcriptions of presentations of students' enterprises, together with their slides and web-pages. The corpus is intended for evaluation of automatic speech recognition (ASR) systems, especially in conditions where prior availability of in-domain vocabulary and named entities is beneficial.
The corpus consists of 39 presentations in English, each up to 90 seconds long, and slides and web-pages in Czech, Slovak, English, German, Romanian, Italian or Spanish.
The speakers are high school students from European countries with English as their second language.
We benchmark three baseline ASR systems on the corpus and show their limitations.
A dataset intended for fully trainable natural language generation (NLG) systems in task-oriented spoken dialogue systems (SDS), covering the English public transport information domain. It includes preceding context (user utterance) along with each data instance (pair of source meaning representation and target natural language paraphrase to be generated).
Taking the form of the previous user utterance into account for generating the system response allows NLG systems trained on this dataset to entrain (adapt) to the preceding utterance, i.e., reuse wording and syntactic structure. This should presumably improve the perceived naturalness of the output, and may even lead to a higher task success rate.
Crowdsourcing was used to obtain both the natural context user utterances and the natural system responses to be generated.
An artificially created treebank of elliptical constructions (gapping), in the annotation style of Universal Dependencies. The data are taken from the UD 2.1 release and from large web corpora parsed by two parsers. The input data are filtered to identify sentences to which gapping could be applied; those sentences are then transformed by omitting one or more words, resulting in sentences with gapping. Details in Droganova et al.: Parse Me if You Can: Artificial Treebanks for Parsing Experiments on Elliptical Constructions, LREC 2018, Miyazaki, Japan.
Data
----
Bengali Visual Genome (BVG for short) 1.0 has the same goal as Hindi Visual Genome (HVG) 1.1: to support the Bengali language. Bengali Visual Genome 1.0 is a multimodal dataset of text and images suitable for English-to-Bengali multimodal machine translation, image captioning, and broader multimodal research. We follow the same selection of short English segments (captions) and the associated images from Visual Genome as HVG 1.1. For BVG, we manually translated these captions from English to Bengali, taking the associated images into account. The manual translation was performed by native Bengali speakers without referring to any machine translation system.
The training set contains 29K segments. A further 1K and 1.6K segments are provided as development and test sets, respectively, following the same (random) sampling as the original Hindi Visual Genome. A third test set, called the "challenge test set", consists of 1.4K segments. It was created for the WAT2019 multi-modal task by searching for (particularly) ambiguous English words based on embedding similarity and manually selecting those where the image helps to resolve the ambiguity. Note, however, that the surrounding words in the sentence often also provide sufficient cues to identify the correct meaning of the ambiguous word.
Dataset Formats
---------------
The multimodal dataset contains both text and images.
The text parts of the dataset (train and test sets) are in simple tab-delimited plain text files.
All the text files have seven columns as follows:
Column1 - image_id
Column2 - X
Column3 - Y
Column4 - Width
Column5 - Height
Column6 - English Text
Column7 - Bengali Text
The image part contains the full images with the corresponding image_id as the file name. The X, Y, Width and Height columns indicate the rectangular region in the image described by the caption.
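For illustration, a minimal Python sketch for reading one of the text files and cropping the described image region could look as follows. The file name, image directory and image file extension below are placeholders, not part of the release; adjust them to the actual layout of the downloaded data.

    # Minimal sketch: parse the seven tab-delimited columns and crop the
    # captioned region from the corresponding image. Paths are placeholders.
    from PIL import Image  # pip install pillow

    def read_segments(path):
        """Yield one dict per line with the seven columns described above."""
        with open(path, encoding="utf-8") as f:
            for line in f:
                image_id, x, y, w, h, english, bengali = line.rstrip("\n").split("\t")
                yield {"image_id": image_id, "x": int(x), "y": int(y),
                       "width": int(w), "height": int(h),
                       "english": english, "bengali": bengali}

    def crop_region(image_dir, seg):
        """Return the rectangular region described by the caption."""
        img = Image.open(f"{image_dir}/{seg['image_id']}.jpg")  # extension is an assumption
        return img.crop((seg["x"], seg["y"],
                         seg["x"] + seg["width"], seg["y"] + seg["height"]))

    for seg in read_segments("train.txt"):    # placeholder file name
        region = crop_region("images", seg)   # placeholder image directory
        print(seg["image_id"], seg["english"], "->", seg["bengali"])
        break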
Data Statistics
---------------
The statistics of the current release are given below.
Parallel Corpus Statistics
--------------------------
Dataset          Segments   English Words   Bengali Words
---------------  ---------  --------------  -------------
Train            28930      143115          113978
Dev              998        4922            3936
Test             1595       7853            6408
Challenge Test   1400       8186            6657
---------------  ---------  --------------  -------------
Total            32923      164076          130979
The word counts are approximate, prior to tokenization.
Citation
--------
If you use this corpus, please cite the following paper:
@inproceedings{bengali-visual-genome:2022,
  title = "{Bengali Visual Genome: A Multimodal Dataset for Machine Translation and Image Captioning}",
  author = {Sen, Arghyadeep
    and Parida, Shantipriya
    and Kotwal, Ketan
    and Panda, Subhadarshi
    and Bojar, Ond{\v{r}}ej
    and Dash, Satya Ranjan},
  editor = {Satapathy, Suresh Chandra
    and Peer, Peter
    and Tang, Jinshan
    and Bhateja, Vikrant
    and Ghosh, Anumoy},
  booktitle = {Intelligent Data Engineering and Analytics},
  publisher = {Springer Nature Singapore},
  address = {Singapore},
  year = {2022},
  pages = {63--70},
  isbn = {978-981-16-6624-7},
  doi = {10.1007/978-981-16-6624-7_7},
}
CoDipA UNSC 1.0 (Corpus of Diplomatic Attitudes of the United Nations Security Council) is a language resource manually annotated with the Attitude part of Appraisal theory. The speeches were selected according to topic-related and temporal criteria and are representative of 5 major international military conflicts that occurred between 1995 and 2020. The texts were annotated according to a predefined annotation scenario based on the original Appraisal theory and later commentaries on the specifics of its implementation.
The annotated texts are available in JSON Lines format. The corpus also contains double annotations of 8 selected speeches.
CoNLL 2017 and 2018 shared tasks:
Multilingual Parsing from Raw Text to Universal Dependencies
This package contains the test data in the form in which they were presented
to the participating systems: raw text files and files preprocessed by UDPipe.
The metadata.json files contain lists of files to process and to output;
README files in the respective folders describe the syntax of metadata.json.
For full training, development and gold standard test data, see
Universal Dependencies 2.0 (CoNLL 2017)
Universal Dependencies 2.2 (CoNLL 2018)
See the download links at http://universaldependencies.org/.
For more information on the shared tasks, see
http://universaldependencies.org/conll17/
http://universaldependencies.org/conll18/
Contents:
conll17-ud-test-2017-05-09 ... CoNLL 2017 test data
conll18-ud-test-2018-05-06 ... CoNLL 2018 test data
conll18-ud-test-2018-05-06-for-conll17 ... CoNLL 2018 test data with metadata
and filenames modified so that they are digestible by the 2017 systems.
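As a rough illustration only, a driver script could iterate over metadata.json along the following lines. The key names used here ("rawfile", "outfile") are hypothetical placeholders; the actual keys are defined by the README files in the respective folders.

    # Hypothetical sketch of iterating over one test set; consult the README
    # in the folder for the real metadata.json keys before relying on this.
    import json, os

    test_dir = "conll17-ud-test-2017-05-09"   # or conll18-ud-test-2018-05-06
    with open(os.path.join(test_dir, "metadata.json"), encoding="utf-8") as f:
        entries = json.load(f)

    for entry in entries:
        raw_path = os.path.join(test_dir, entry["rawfile"])  # "rawfile" / "outfile" are
        out_name = entry["outfile"]                          # illustrative key names only
        # ... run a parser on raw_path and write its CoNLL-U output under out_name ...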
CorefUD is a collection of previously existing datasets annotated with coreference, which we converted into a common annotation scheme. In total, CorefUD in its current version 0.1 consists of 17 datasets for 11 languages.
The datasets are enriched with automatic morphological and syntactic annotations that are fully compliant with the standards of the Universal Dependencies project. All the datasets are stored in the CoNLL-U format, with coreference- and bridging-specific information captured by attribute-value pairs located in the MISC column.
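As a minimal sketch (generic CoNLL-U handling only; the specific attribute names used for coreference and bridging are documented with the individual datasets), the MISC attribute-value pairs can be collected like this:

    # Minimal sketch: walk a CoNLL-U file and collect the attribute-value pairs
    # stored in the MISC column, where CorefUD keeps its coreference- and
    # bridging-specific information.
    def misc_attributes(conllu_path):
        with open(conllu_path, encoding="utf-8") as f:
            for line in f:
                line = line.rstrip("\n")
                if not line or line.startswith("#"):
                    continue              # skip blank lines and sentence-level comments
                cols = line.split("\t")
                misc = cols[9]            # MISC is the 10th CoNLL-U column
                if misc == "_":
                    continue
                attrs = dict(p.split("=", 1) for p in misc.split("|") if "=" in p)
                yield cols[0], cols[1], attrs   # ID, FORM, MISC attributes

    for token_id, form, attrs in misc_attributes("example.conllu"):  # placeholder path
        print(token_id, form, attrs)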
The collection is divided into a public edition and a non-public (ÚFAL-internal) edition. The publicly available edition is distributed via LINDAT-CLARIAH-CZ and contains 13 datasets for 10 languages (1 dataset for Catalan, 2 for Czech, 2 for English, 1 for French, 2 for German, 1 for Hungarian, 1 for Lithuanian, 1 for Polish, 1 for Russian, and 1 for Spanish), excluding the test data.
The non-public edition is available internally to ÚFAL members and contains 4 additional datasets for 2 languages (1 dataset for Dutch and 3 for English), which we are not allowed to distribute due to their original license limitations. It also contains the test data portions of all datasets.
When using any of the harmonized datasets, please get acquainted with its license (placed in the same directory as the data) and cite the original data resource too.
References to original resources whose harmonized versions are contained in the public edition of CorefUD 0.1:
- Catalan-AnCora:
Recasens, M. and Martí, M. A. (2010). AnCora-CO: Coreferentially Annotated Corpora for Spanish and Catalan. Language Resources and Evaluation, 44(4):315–345.
- Czech-PCEDT:
Nedoluzhko, A., Novák, M., Cinková, S., Mikulová, M., and Mírovský, J. (2016). Coreference in Prague Czech-English Dependency Treebank. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 169–176, Portorož, Slovenia. European Language Resources Association.
- Czech-PDT:
Hajič, J., Bejček, E., Hlaváčová, J., Mikulová, M., Straka, M., Štěpánek, J., and Štěpánková, B. (2020). Prague Dependency Treebank - Consolidated 1.0. In Proceedings of the 12th International Conference on Language Resources and Evaluation (LREC 2020), pages 5208–5218, Marseille, France. European Language Resources Association.
- English-GUM:
Zeldes, A. (2017). The GUM Corpus: Creating Multilayer Resources in the Classroom. Language Resources and Evaluation, 51(3):581–612.
- English-ParCorFull:
Lapshinova-Koltunski, E., Hardmeier, C., and Krielke, P. (2018). ParCorFull: a Parallel Corpus Annotated with Full Coreference. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association.
- French-Democrat:
Landragin, F. (2016). Description, modélisation et détection automatique des chaînes de référence (DEMOCRAT). Bulletin de l’Association Française pour l’Intelligence Artificielle, (92):11–15.
- German-ParCorFull:
Lapshinova-Koltunski, E., Hardmeier, C., and Krielke, P. (2018). ParCorFull: a Parallel Corpus Annotated with Full Coreference. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association.
- German-PotsdamCC:
Bourgonje, P. and Stede, M. (2020). The Potsdam Commentary Corpus 2.2: Extending annotations for shallow discourse parsing. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 1061–1066, Marseille, France. European Language Resources Association.
- Hungarian-SzegedKoref:
Vincze, V., Hegedűs, K., Sliz-Nagy, A., and Farkas, R. (2018). SzegedKoref: A Hungarian Coreference Corpus. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association.
- Lithuanian-LCC:
Žitkus, V. and Butkienė, R. (2018). Coreference Annotation Scheme and Corpus for Lithuanian Language. In Fifth International Conference on Social Networks Analysis, Management and Security, SNAMS 2018, Valencia, Spain, October 15-18, 2018, pages 243–250. IEEE.
- Polish-PCC:
Ogrodniczuk, M., Glowińska, K., Kopeć, M., Savary, A., and Zawisławska, M. (2013). Polish coreference corpus. In Human Language Technology. Challenges for Computer Science and Linguistics - 6th Language and Technology Conference, LTC 2013, Poznań, Poland, December 7-9, 2013. Revised Selected Papers, volume 9561 of Lecture Notes in Computer Science, pages 215–226. Springer.
- Russian-RuCor:
Toldova, S., Roytberg, A., Ladygina, A. A., Vasilyeva, M. D., Azerkovich, I. L., Kurzukov, M., Sim, G., Gorshkov, D. V., Ivanova, A., Nedoluzhko, A., and Grishina, Y. (2014). Evaluating Anaphora and Coreference Resolution for Russian. In Komp’juternaja lingvistika i intellektual’nye tehnologii. Po materialam ezhegodnoj Mezhdunarodnoj konferencii Dialog, pages 681–695.
- Spanish-AnCora:
Recasens, M. and Martí, M. A. (2010). AnCora-CO: Coreferentially Annotated Corpora for Spanish and Catalan. Language Resources and Evaluation, 44(4):315–345.
References to original resources whose harmonized versions are contained in the ÚFAL-internal edition of CorefUD 0.1:
- Dutch-COREA:
Hendrickx, I., Bouma, G., Coppens, F., Daelemans, W., Hoste, V., Kloosterman, G., Mineur, A.-M., Van Der Vloet, J., and Verschelde, J.-L. (2008). A coreference corpus and resolution system for Dutch. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC’08), Marrakech, Morocco. European Language Resources Association.
- English-ARRAU:
Uryupina, O., Artstein, R., Bristot, A., Cavicchio, F., Delogu, F., Rodriguez, K. J., and Poesio, M. (2020). Annotating a broad range of anaphoric phenomena, in a variety of genres: the ARRAU Corpus. Natural Language Engineering, 26(1):95–128.
- English-OntoNotes:
Weischedel, R., Hovy, E., Marcus, M., Palmer, M., Belvin, R., Pradhan, S., Ramshaw, L., and Xue, N. (2011). OntoNotes: A Large Training Corpus for Enhanced Processing. In Handbook of Natural Language Processing and Machine Translation: DARPA Global Autonomous Language Exploitation, pages 54–63, New York. Springer-Verlag.
- English-PCEDT:
Nedoluzhko, A., Novák, M., Cinková, S., Mikulová, M., and Mírovský, J. (2016). Coreference in Prague Czech-English Dependency Treebank. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), pages 169–176, Portorož, Slovenia. European Language Resources Association.