CsEnVi Pairwise Parallel Corpora consist of Vietnamese-Czech parallel corpus and Vietnamese-English parallel corpus. The corpora were assembled from the following sources:
- OPUS, the open parallel corpus is a growing multilingual corpus of translated open source documents.
The majority of Vi-En and Vi-Cs bitexts are subtitles from movies and television series.
The nature of the bitexts are paraphrasing of each other's meaning, rather than translations.
- TED talks, a collection of short talks on various topics, given primarily in English, transcribed and with transcripts translated to other languages. In our corpus, we use 1198 talks which had English and Vietnamese transcripts available and 784 talks which had Czech and Vietnamese transcripts available in January 2015.
The size of the original corpora collected from OPUS and TED talks is as follows:
CS/VI EN/VI
Sentence 1337199/1337199 2035624/2035624
Word 9128897/12073975 16638364/17565580
Unique word 224416/68237 91905/78333
We improve the quality of the corpora in two steps: normalizing and filtering.
In the normalizing step, the corpora are cleaned based on the general format of subtitles and transcripts. For instance, sequences of dots indicate explicit continuation of subtitles across multiple time frames. The sequences of dots are distributed differently in the source and the target side. Removing the sequence of dots, along with a number of other normalization rules, improves the quality of the alignment significantly.
In the filtering step, we adapt the CzEng filtering tool [1] to filter out bad sentence pairs.
The size of cleaned corpora as published is as follows:
CS/VI EN/VI
Sentence 1091058/1091058 1113177/1091058
Word 6718184/7646701 8518711/8140876
Unique word 195446/59737 69513/58286
The corpora are used as training data in [2].
References:
[1] Ondřej Bojar, Zdeněk Žabokrtský, et al. 2012. The Joy of Parallelism with CzEng 1.0. Proceedings of LREC2012. ELRA. Istanbul, Turkey.
[2] Duc Tam Hoang and Ondřej Bojar, The Prague Bulletin of Mathematical Linguistics. Volume 104, Issue 1, Pages 75–86, ISSN 1804-0462. 9/2015
ELITR Minuting Corpus consists of transcripts of meetings in Czech and English, their manually created summaries ("minutes") and manual alignments between the two.
Czech meetings are in the computer science and public administration domains and English meetings are in the computer science domain.
Each transcript has one or multiple corresponding minutes files. Alignments are only provided for a portion of the data.
This corpus contains 59 Czech and 120 English meeting transcripts, consisting of 71097 and 87322 dialogue turns respectively. For Czech meetings, we provide 147 total minutes with 55 of them aligned. For English meetings, it is 256 total minutes with 111 of them aligned.
Please find a more detailed description of the data in the included README and stats.tsv files.
If you use this corpus, please cite:
Nedoluzhko, A., Singh, M., Hledíková, M., Ghosal, T., and Bojar, O.
(2022). ELITR Minuting Corpus: A novel dataset for automatic minuting
from multi-party meetings in English and Czech. In Proceedings of the
13th International Conference on Language Resources and Evaluation
(LREC-2022), Marseille, France, June. European Language Resources
Association (ELRA). In print.
@inproceedings{elitr-minuting-corpus:2022,
author = {Anna Nedoluzhko and Muskaan Singh and Marie
Hled{\'{\i}}kov{\'{a}} and Tirthankar Ghosal and Ond{\v{r}}ej Bojar},
title = {{ELITR} {M}inuting {C}orpus: {A} Novel Dataset for
Automatic Minuting from Multi-Party Meetings in {E}nglish and {C}zech},
booktitle = {Proceedings of the 13th International Conference
on Language Resources and Evaluation (LREC-2022)},
year = 2022,
month = {June},
address = {Marseille, France},
publisher = {European Language Resources Association (ELRA)},
note = {In print.}
}
Data
----
Hindi Visual Genome 1.1 is an updated version of Hindi Visual Genome 1.0. The update concerns primarily the text part of Hindi Visual Genome, fixing translation issues reported during WAT 2019 multimodal task. In the image part, only one segment and thus one image were removed from the dataset.
Hindi Visual Genome 1.1 serves in "WAT 2020 Multi-Modal Machine Translation Task".
Hindi Visual Genome is a multimodal dataset consisting of text and images suitable for English-to-Hindi multimodal machine translation task and multimodal research. We have selected short English segments (captions) from Visual Genome along with associated images and automatically translated them to Hindi with manual post-editing, taking the associated images into account.
The training set contains 29K segments. Further 1K and 1.6K segments are provided in a development and test sets, respectively, which follow the same (random) sampling from the original Hindi Visual Genome.
A third test set is called ``challenge test set'' consists of 1.4K segments and it was released for WAT2019 multi-modal task. The challenge test set was created by searching for (particularly) ambiguous English words based on the embedding similarity and manually selecting those where the image helps to resolve the ambiguity. The surrounding words in the sentence however also often include sufficient cues to identify the correct meaning of the ambiguous word.
Dataset Formats
--------------
The multimodal dataset contains both text and images.
The text parts of the dataset (train and test sets) are in simple
tab-delimited plain text files.
All the text files have seven columns as follows:
Column1 - image_id
Column2 - X
Column3 - Y
Column4 - Width
Column5 - Height
Column6 - English Text
Column7 - Hindi Text
The image part contains the full images with the corresponding image_id as the file name. The X, Y, Width and Height columns indicate the rectangular region in the image described by the caption.
Data Statistics
----------------
The statistics of the current release is given below.
Parallel Corpus Statistics
---------------------------
Dataset Segments English Words Hindi Words
------- --------- ---------------- -------------
Train 28930 143164 145448
Dev 998 4922 4978
Test 1595 7853 7852
Challenge Test 1400 8186 8639
------- --------- ---------------- -------------
Total 32923 164125 166917
The word counts are approximate, prior to tokenization.
Citation
--------
If you use this corpus, please cite the following paper:
@article{hindi-visual-genome:2019,
title={{Hindi Visual Genome: A Dataset for Multimodal English-to-Hindi Machine Translation}},
author={Parida, Shantipriya and Bojar, Ond{\v{r}}ej and Dash, Satya Ranjan},
journal={Computaci{\'o}n y Sistemas},
volume={23},
number={4},
pages={1499--1505},
year={2019}
}
Data
----
We have collected English-Odia parallel and monolingual data from the
available public websites for NLP research in Odia.
The parallel corpus consists of English-Odia parallel Bible, Odia
digital library, and Odisha Goverment websites. It covers bible,
literature, goverment of Odisha and its policies. We have processed the
raw data collected from the websites, performed alignments (a mix of
manual and automatic alignments) and release the corpus in a form ready
for various NLP tasks.
The Odia monolingual data consists of Odia-Wikipedia and Odia e-magazine
websites. Because the major portion of data is extracted from
Odia-Wikipedia, it covers all kinds of domains. The e-magazines data
mostly cover the literature domain. We have preprocessed the monolingual
data including de-duplication, text normalization, and sentence
segmentation to make it ready for various NLP tasks.
Corpus Formats
--------------
Both corpora are in simple tab-delimited plain text files.
The parallel corpus files have three columns:
- the original book/source of the sentence pair
- the English sentence
- the corresponding Odia sentence
The monolingual corpus has a varying number of columns:
- each line corresponds to one *paragraph* (or related unit) of the
original source
- each tab-delimited unit corresponds to one *sentence* in the paragraph
Data Statistics
----------------
The statistics of the current release is given below.
Parallel Corpus Statistics
---------------------------
Dataset Sentences #English tokens #Odia tokens
------- --------- ---------------- -------------
Train 27136 706567 604147
Dev 948 21912 19513
Test 1262 28488 24365
------- --------- ---------------- -------------
Total 29346 756967 648025
Domain Level Statistics
------------------------
Domain Sentences #English tokens #Odia tokens
------------------ --------- ---------------- -------------
Bible 29069 756861 640157
Literature 424 7977 6611
Goverment policies 204 1411 1257
------------------ --------- ---------------- -------------
Total 29697 766249 648025
Monolingual Corpus Statistics
-----------------------------
Paragraphs Sentences #Odia tokens
---------- --------- ------------
71698 221546 2641308
Domain Level Statistics
-----------------------
Domain Paragraphs Sentences #Odia tokens
-------------- -------------- --------- -------------
General (wiki) 30468 (42.49%) 102085 1320367
Literature 41230 (57.50%) 119461 1320941
-------------- -------------- --------- -------------
Total 71698 221546 2641308
Citation
--------
If you use this corpus, please cite it directly (see above), but please cite also the following paper:
Title: OdiEnCorp: Odia-English and Odia-Only Corpus for Machine Translation
Author: Shantipriya Parida, Ondrej Bojar, and Satya Ranjan Dash
Proceedings of the Third International Conference on Smart Computing & Informatics (SCI) 2018
Series: Smart Innovation, Systems and Technologies (SIST)
Publisher: Springer Singapore
Data
-----
We have collected English-Odia parallel data for the purposes of NLP
research of the Odia language.
The data for the parallel corpus was extracted from existing parallel
corpora such as OdiEnCorp 1.0 and PMIndia, and books which contain both
English and Odia text such as grammar and bilingual literature books. We
also included parallel text from multiple public websites such as Odia
Wikipedia, Odia digital library, and Odisha Government websites.
The parallel corpus covers many domains: the Bible, other literature,
Wiki data relating to many topics, Government policies, and general
conversation. We have processed the raw data collected from the books,
websites, performed sentence alignments (a mix of manual and automatic
alignments) and released the corpus in a form suitable for various NLP
tasks.
Corpus Format
-------------
OdiEnCorp 2.0 is stored in simple tab-delimited plain text files, each
with three tab-delimited columns:
- a coarse indication of the domain
- the English sentence
- the corresponding Odia sentence
The corpus is shuffled at the level of sentence pairs.
The coarse domains are:
books ... prose text
dict ... dictionaries and phrasebooks
govt ... partially formal text
odiencorp10 ... OdiEnCorp 1.0 (mix of domains)
pmindia ... PMIndia (the original corpus)
wikipedia ... sentences and phrases from Wikipedia
Data Statistics
---------------
The statistics of the current release are given below.
Note that the statistics differ from those reported in the paper due to
deduplication at the level of sentence pairs. The deduplication was
performed within each of the dev set, test set and training set and
taking the coarse domain indication into account. It is still possible
that the same sentence pair appears more than once within the same set
(dev/test/train) if it came from different domains, and it is also
possible that a sentence pair appears in several sets (dev/test/train).
Parallel Corpus Statistics
--------------------------
Dev Dev Dev Test Test Test Train Train Train
Sents # EN # OD Sents # EN # OD Sents # EN # OD
books 3523 42011 36723 3895 52808 45383 3129 40461 35300
dict 3342 14580 13838 3437 14807 14110 5900 21591 20246
govt - - - - - - 761 15227 13132
odiencorp10 947 21905 19509 1259 28473 24350 26963 704114 602005
pmindia 3836 70282 61099 3836 68695 59876 30687 551657 486636
wikipedia 1896 9388 9385 1917 21381 20951 1930 7087 7122
Total 13544 158166 140554 14344 186164 164670 69370 1340137 1164441
"Sents" are the counts of the sentence pairs in the given set (dev/test/train)
and domain (books/dict/...).
"# EN" and "# OD" are approximate counts of words (simply space-delimited,
without tokenization) in English and Odia
The total number of sentence pairs (lines) is 13544+14344+69370=97258. Ignoring
the set and domain and deduplicating again, this number drops to 94857.
Citation
--------
If you use this corpus, please cite the following paper:
@inproceedings{parida2020odiencorp,
title={OdiEnCorp 2.0: Odia-English Parallel Corpus for Machine Translation},
author={Parida, Shantipriya and Dash, Satya Ranjan and Bojar, Ond{\v{r}}ej and Motlicek, Petr and Pattnaik, Priyanka and Mallick, Debasish Kumar},
booktitle={Proceedings of the WILDRE5--5th Workshop on Indian Language Data: Resources and Evaluation},
pages={14--19},
year={2020}
}
We define "optimal reference translation" as a translation thought to be the best possible that can be achieved by a team of human translators. Optimal reference translations can be used in assessments of excellent machine translations.
We selected 50 documents (online news articles, with 579 paragraphs in total) from the 130 English documents included in the WMT2020 news test (http://www.statmt.org/wmt20/) with the aim to preserve diversity (style, genre etc.) of the selection. In addition to the official Czech reference translation provided by the WMT organizers (P1), we hired two additional translators (P2 and P3, native Czech speakers) via a professional translation agency, resulting in three independent translations. The main contribution of this dataset are two additional translations (i.e. optimal reference translations N1 and N2), done jointly by two translators-cum-theoreticians with an extreme care for various aspects of translation quality, while taking into account the translations P1-P3. We publish also internal comments (in Czech) for some of the segments.
Translation N1 should be closer to the English original (with regards to the meaning and linguistic structure) and female surnames use the Czech feminine suffix (e.g. "Mai" is translated as "Maiová"). Translation N2 is more free, trying to be more creative, idiomatic and entertaining for the readers and following the typical style used in Czech media, while still preserving the rules of functional equivalence. Translation N2 is missing for the segments where it was not deemed necessary to provide two alternative translations. For applications/analyses needing translation of all segments, this should be interpreted as if N2 is the same as N1 for a given segment.
We provide the dataset in two formats: OpenDocument spreadsheet (odt) and plain text (one file for each translation and the English original). Some words were highlighted using different colors during the creation of optimal reference translations; this highlighting and comments are present only in the odt format (some comments refer to row numbers in the odt file). Documents are separated by empty lines and each document starts with a special line containing the document name (e.g. "# upi.205735"), which allows alignment with the original WMT2020 news test. For the segments where N2 translations are missing in the odt format, the respective N1 segments are used instead in the plain-text format.
The dataset used for the Ptakopět experiment on outbound machine translation. It consists of screenshots of web forms with user queries entered. The queries are available also in a text form. The dataset comprises two language versions: English and Czech. Whereas the English version has been fully post-processed (screenshots cropped, queries within the screenshots highlighted, dataset split based on its quality etc.), the Czech version is raw as it was collected by the annotators.
We provide the Vietnamese version of the multi-lingual test set from WMT 2013 [1] competition. The Vietnamese version was manually translated from English. For completeness, this record contains the 3000 sentences in all the WMT 2013 original languages (Czech, English, French, German, Russian and Spanish), extended with our Vietnamese version. Test set is used in [2] to evaluate translation between Czech, English and Vietnamese.
References
1. http://www.statmt.org/wmt13/evaluation-task.html
2. Duc Tam Hoang and Ondřej Bojar, The Prague Bulletin of Mathematical Linguistics. Volume 104, Issue 1, Pages 75--86, ISSN 1804-0462. 9/2015
The item contains models to tune for the WMT16 Tuning shared task for Czech-to-English.
CzEng 1.6pre (http://ufal.mff.cuni.cz/czeng/czeng16pre) corpus is used for the training of the translation models. The data is tokenized (using Moses tokenizer), lowercased and sentences longer than 60 words and shorter than 4 words are removed before training. Alignment is done using fast_align (https://github.com/clab/fast_align) and the standard Moses pipeline is used for training.
Two 5-gram language models are trained using KenLM: one only using the CzEng English data and the other is trained using all available English mono data for WMT except Common Crawl.
Also included are two lexicalized bidirectional reordering models, word based and hierarchical, with msd conditioned on both source and target of processed CzEng.
This item contains models to tune for the WMT16 Tuning shared task for English-to-Czech.
CzEng 1.6pre (http://ufal.mff.cuni.cz/czeng/czeng16pre) corpus is used for the training of the translation models. The data is tokenized (using Moses tokenizer), lowercased and sentences longer than 60 words and shorter than 4 words are removed before training. Alignment is done using fast_align (https://github.com/clab/fast_align) and the standard Moses pipeline is used for training.
Two 5-gram language models are trained using KenLM: one only using the CzEng Czech data and the other is trained using all available Czech mono data for WMT except Common Crawl.
Also included are two lexicalized bidirectional reordering models, word based and hierarchical, with msd conditioned on both source and target of processed CzEng.