Search Results
12. DiscoMT 2015 Shared Task on Pronoun Translation
- Creator:
- Hardmeier, Christian, Tiedemann, Jörg, Nakov, Preslav, Stymne, Sara, and Versley, Yannick
- Publisher:
- Uppsala University
- Type:
- text and corpus
- Subject:
- machine translation, coreference resolution, anaphora resolution, and discourse
- Language:
- English and French
- Description:
- The data set includes training, development and test data from the shared tasks on pronoun-focused machine translation and cross-lingual pronoun prediction from the EMNLP 2015 workshop on Discourse in Machine Translation (DiscoMT2015). The release also contains the submissions to the pronoun-focused machine translation task, the manual annotations used for the official evaluation, and gold-standard annotations of pronoun coreference for the shared task test set.
- Rights:
- Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0), http://creativecommons.org/licenses/by-nc-nd/4.0/, and PUB
13. DiscoMT 2016 Shared Task on Cross-lingual Pronoun Prediction
- Creator:
- Guillou, Liane, Hardmeier, Christian, Nakov, Preslav, Stymne, Sara, Tiedemann, Jörg, Versley, Yannick, Cettolo, Mauro, Webber, Bonnie, and Popescu-Belis, Andrei
- Publisher:
- Uppsala University
- Type:
- text and corpus
- Subject:
- machine translation, coreference, discourse, and pronouns
- Language:
- English, French, and German
- Description:
- Files for the DiscoMT 2016 shared task on cross-lingual pronoun prediction
- Rights:
- Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0), http://creativecommons.org/licenses/by-nc-nd/4.0/, and PUB
14. DiscoMT 2017 Shared Task on Cross-lingual Pronoun Prediction
- Creator:
- Loáiciga, Sharid, Stymne, Sara, Nakov, Preslav, Hardmeier, Christian, Tiedemann, Jörg, Cettolo, Mauro, and Versley, Yannick
- Publisher:
- Uppsala University
- Type:
- text and corpus
- Subject:
- machine translation, discourse, coreference, and pronouns
- Language:
- English, Spanish, German, and French
- Description:
- Data used in the 2017 shared task on cross-lingual pronoun prediction.
- Rights:
- Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0), http://creativecommons.org/licenses/by-nc-nd/4.0/, and PUB
15. EdUKate translation software 1
- Creator:
- Popel, Martin, Novák, Michal, Balhar, Jiří, Košarko, Ondřej, Mayer, Jiří, Poláková, Lucie, Kloudová, Věra, and Anisimova, Mariia
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- tool and toolService
- Subject:
- machine translation and Ukrainian
- Language:
- Ukrainian and Czech
- Description:
- This software package includes three tools: a web frontend for machine translation featuring phonetic transcription of Ukrainian suitable for Czech speakers, an API server, and a tool for translating documents with markup (html, docx, odt, pptx, odp,...). These tools are used in the Charles Translator service (https://translator.cuni.cz). The software was developed within the EdUKate project, which aims to reduce the language barriers that non-Czech-speaking children in the Czech Republic face in the Czech school system. The project focuses on the development and dissemination of multilingual digital learning materials for students in primary and secondary schools.
- Rights:
- BSD 2-Clause "Simplified" or "FreeBSD" license, http://opensource.org/licenses/BSD-2-Clause, and PUB
16. English-Urdu Religious Parallel Corpus
- Creator:
- Jawaid, Bushra and Zeman, Daniel
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- parallel corpus, religious text, and machine translation
- Language:
- English and Urdu
- Description:
- The English-Urdu parallel corpus is a collection of religious texts (Quran, Bible) in English and Urdu with sentence alignments. The corpus can be used for experiments with statistical machine translation. Our modifications of the crawled data include, but are not limited to, the following: (1) manually corrected sentence alignment of the corpora; (2) our data split (training-development-test), so that our published experiments can be reproduced; (3) tokenization (optional, but needed to reproduce our experiments); (4) normalization (optional), e.g. of European vs. Urdu numerals and European vs. Urdu punctuation, and removal of Urdu diacritics.
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
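The optional normalization step mentioned in the description above (numerals, punctuation, diacritics) could look roughly like the following Python sketch. It is not the authors' script: the function name `normalize_urdu`, the direction of the mapping (Urdu forms to European forms), the small punctuation subset, and the choice to treat every Unicode nonspacing mark as a diacritic are all assumptions made for illustration.

```python
import unicodedata

# Extended Arabic-Indic digits (U+06F0..U+06F9), the digit forms used in Urdu,
# mapped to ASCII digits. The mapping direction is an assumption.
URDU_DIGITS = {0x06F0 + i: str(i) for i in range(10)}

# Illustrative subset of Urdu punctuation mapped to European counterparts.
URDU_PUNCT = {
    0x06D4: ".",  # ARABIC FULL STOP
    0x060C: ",",  # ARABIC COMMA
    0x061B: ";",  # ARABIC SEMICOLON
    0x061F: "?",  # ARABIC QUESTION MARK
}

TRANSLATION_TABLE = {**URDU_DIGITS, **URDU_PUNCT}


def normalize_urdu(text: str, strip_diacritics: bool = True) -> str:
    """Hypothetical helper: map Urdu digits and punctuation to European
    forms and optionally drop combining marks (category Mn)."""
    text = text.translate(TRANSLATION_TABLE)
    if strip_diacritics:
        # Assumption: all nonspacing marks count as diacritics to remove.
        text = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    return text
```

Whether diacritics should be stripped depends on the experiment, so the sketch keeps that step behind a flag, mirroring the "optional" wording in the description.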
17. Extended CLEF eHealth 2013-2015 IR Test Collection
- Creator:
- Pecina, Pavel and Saleh, Shadi
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- cross-lingual information retrieval and machine translation
- Language:
- English, Czech, French, German, Hungarian, Polish, Spanish, and Swedish
- Description:
- This package contains an extended version of the test collection used in the CLEF eHealth Information Retrieval tasks in 2013-2015. Compared to the original version, it provides complete query translations into Czech, French, German, Hungarian, Polish, Spanish and Swedish, and additional relevance assessments.
- Rights:
- Creative Commons - Attribution-NonCommercial 4.0 International (CC BY-NC 4.0), http://creativecommons.org/licenses/by-nc/4.0/, and PUB
18. FAUST cs-en 0.5
- Creator:
- Hajič, Jan, Mareček, David, Fučíková, Eva, Cinková, Silvie, Štěpánek, Jan, Mikulová, Marie, and Popel, Martin
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- noisy texts, parallel corpus, and machine translation
- Language:
- English and Czech
- Description:
- This machine translation test set contains 2223 Czech sentences collected within the FAUST project (https://ufal.mff.cuni.cz/grants/faust, http://hdl.handle.net/11234/1-3308). Each original (noisy) sentence was normalized (clean1 and clean2) and translated to English independently by two translators.
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
19. Hausa Visual Genome 1.0
- Creator:
- Abdulmumin, Idris, Dash, Satya Ranjan, Dawud, Musa Abdullahi, Parida, Shantipriya, Muhammad, Shamsuddeen Hassan, Ahmad, Ibrahim Sa'id, Panda, Subhadarshi, Bojar, Ondřej, Galadanci, Bashir Shehu, and Bello, Bello Shehu
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- image and corpus
- Subject:
- multi-modal, machine translation, image captioning, image annotation, and neural machine translation
- Language:
- Hausa and English
- Description:
- Data
  ----
  Hausa Visual Genome 1.0, a multimodal dataset consisting of text and images suitable for English-to-Hausa multimodal machine translation tasks and multimodal research. We follow the same selection of short English segments (captions) and the associated images from Visual Genome as the dataset Hindi Visual Genome 1.1 has. We automatically translated the English captions to Hausa and manually post-edited, taking the associated images into account.

  The training set contains 29K segments. Further 1K and 1.6K segments are provided in development and test sets, respectively, which follow the same (random) sampling from the original Hindi Visual Genome.

  Additionally, a challenge test set of 1400 segments is available for the multi-modal task. This challenge test set was created in Hindi Visual Genome by searching for (particularly) ambiguous English words based on the embedding similarity and manually selecting those where the image helps to resolve the ambiguity.

  Dataset Formats
  ---------------
  The multimodal dataset contains both text and images. The text parts of the dataset (train and test sets) are in simple tab-delimited plain text files. All the text files have seven columns as follows:

  Column1 - image_id
  Column2 - X
  Column3 - Y
  Column4 - Width
  Column5 - Height
  Column6 - English Text
  Column7 - Hausa Text

  The image part contains the full images with the corresponding image_id as the file name. The X, Y, Width, and Height columns indicate the rectangular region in the image described by the caption.

  Data Statistics
  ---------------
  The statistics of the current release are given below.

  Parallel Corpus Statistics
  --------------------------
  Dataset          Segments  English Words  Hausa Words
  ---------------  --------  -------------  -----------
  Train               28930         143106       140981
  Dev                   998           4922         4857
  Test                 1595           7853         7736
  Challenge Test       1400           8186         8752
  ---------------  --------  -------------  -----------
  Total               32923         164067       162326

  The word counts are approximate, prior to tokenization.

  Citation
  --------
  If you use this corpus, please cite the following paper:

  @InProceedings{abdulmumin-EtAl:2022:LREC,
    author    = {Abdulmumin, Idris and Dash, Satya Ranjan and Dawud, Musa Abdullahi and Parida, Shantipriya and Muhammad, Shamsuddeen and Ahmad, Ibrahim Sa'id and Panda, Subhadarshi and Bojar, Ond{\v{r}}ej and Galadanci, Bashir Shehu and Bello, Bello Shehu},
    title     = "{Hausa Visual Genome: A Dataset for Multi-Modal English to Hausa Machine Translation}",
    booktitle = {Proceedings of the Language Resources and Evaluation Conference},
    month     = {June},
    year      = {2022},
    address   = {Marseille, France},
    publisher = {European Language Resources Association},
    pages     = {6471--6479},
    url       = {https://aclanthology.org/2022.lrec-1.694}
  }
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
20. Hindi Visual Genome 1.0
- Creator:
- Parida, Shantipriya and Bojar, Ondřej
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- image and corpus
- Subject:
- parallel corpus, corpus, multilingual, machine translation, shared task, English-Hindi parallel corpus, image captioning, and multi-modal
- Language:
- English and Hindi
- Description:
- Data
  ----
  Hindi Visual Genome 1.0, a multimodal dataset consisting of text and images suitable for the English-to-Hindi multimodal machine translation task and multimodal research. We have selected short English segments (captions) from Visual Genome along with associated images and automatically translated them to Hindi with manual post-editing, taking the associated images into account.

  The training set contains 29K segments. Further 1K and 1.6K segments are provided in development and test sets, respectively, which follow the same (random) sampling from the original Hindi Visual Genome.

  Additionally, a challenge test set of 1400 segments will be released for the WAT2019 multi-modal task. This challenge test set was created by searching for (particularly) ambiguous English words based on the embedding similarity and manually selecting those where the image helps to resolve the ambiguity.

  Dataset Formats
  ---------------
  The multimodal dataset contains both text and images. The text parts of the dataset (train and test sets) are in simple tab-delimited plain text files. All the text files have seven columns as follows:

  Column1 - image_id
  Column2 - X
  Column3 - Y
  Column4 - Width
  Column5 - Height
  Column6 - English Text
  Column7 - Hindi Text

  The image part contains the full images with the corresponding image_id as the file name. The X, Y, Width and Height columns indicate the rectangular region in the image described by the caption.

  Data Statistics
  ---------------
  The statistics of the current release are given below.

  Parallel Corpus Statistics
  --------------------------
  Dataset          Segments  English Words  Hindi Words
  ---------------  --------  -------------  -----------
  Train               28932         143178       136722
  Dev                   998           4922         4695
  Test                 1595           7852         7535
  Challenge Test       1400           8185         8665  (Released separately)
  ---------------  --------  -------------  -----------
  Total               32925         164137       157617

  The word counts are approximate, prior to tokenization.

  Citation
  --------
  If you use this corpus, please cite the following paper:

  @article{hindi-visual-genome:2019,
    title={{Hindi Visual Genome: A Dataset for Multimodal English-to-Hindi Machine Translation}},
    author={Parida, Shantipriya and Bojar, Ond{\v{r}}ej and Dash, Satya Ranjan},
    journal={Computaci{\'o}n y Sistemas},
    note={In print. Presented at CICLing 2019, La Rochelle, France},
    year={2019},
  }
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
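The seven-column, tab-delimited text layout described for the Hausa and Hindi Visual Genome records above can be read with a few lines of Python. The sketch below is illustrative only: the names `Segment` and `read_segments` are hypothetical, and it assumes that the files have no header row, that the region columns are integers, and that captions contain no embedded tab characters. The sample row in the usage note is made up, not taken from the data.

```python
from typing import Iterable, Iterator, NamedTuple


class Segment(NamedTuple):
    """One caption row: image id, region rectangle, and the two captions."""
    image_id: str
    x: int
    y: int
    width: int
    height: int
    english: str
    target: str  # Hausa or Hindi text, depending on the dataset


def read_segments(lines: Iterable[str]) -> Iterator[Segment]:
    """Yield a Segment per well-formed line; skip lines without 7 columns."""
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) != 7:
            continue  # assumption: malformed lines are silently ignored
        image_id, x, y, width, height, english, target = fields
        yield Segment(image_id, int(x), int(y), int(width), int(height),
                      english, target)
```

A usage example, with a placeholder file name: `with open("train.txt", encoding="utf-8") as f: segments = list(read_segments(f))`. Each `Segment` then gives the rectangle (`x`, `y`, `width`, `height`) to crop from the image whose file name matches `image_id`.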