En-Ru translation models, exported via TensorFlow Serving, available in the Lindat translation service (https://lindat.mff.cuni.cz/services/translation/).
The models were trained using the MCSQ social surveys dataset (available at https://repo.clarino.uib.no/xmlui/bitstream/handle/11509/142/mcsq_v3.zip).
They are intended primarily for in-domain translation of social surveys.
Models are compatible with Tensor2tensor version 1.6.6.
For details about the model training (data, model hyper-parameters), please contact the archive maintainer.
Evaluation on MCSQ test set (BLEU):
en->ru: 64.3 (train: genuine in-domain MCSQ data)
ru->en: 74.7 (train: additional backtranslated in-domain MCSQ data)
(Evaluated using multeval: https://github.com/jhclark/multeval)
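For orientation, a minimal Python sketch (standard library only) of querying a model hosted by TensorFlow Serving over its REST API; the host, port, and model name below are placeholders, and the exact structure of the request payload depends on the serving signature of the exported Tensor2tensor model:

    import json
    import urllib.request

    HOST = "localhost:8501"  # placeholder: wherever TensorFlow Serving runs
    MODEL = "en-ru"          # placeholder: the name the model was exported under

    # Check that the model is loaded and servable.
    with urllib.request.urlopen(f"http://{HOST}/v1/models/{MODEL}") as r:
        print(json.load(r))

    # Send a prediction request; the content of "instances" must match the
    # serving signature of the exported model, so treat this as a template.
    payload = json.dumps({"instances": ["Hello, world."]}).encode("utf-8")
    request = urllib.request.Request(
        f"http://{HOST}/v1/models/{MODEL}:predict",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as r:
        print(json.load(r))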
Data
-----
We have collected English-Odia parallel data for the purposes of NLP
research on the Odia language.
The data for the parallel corpus was extracted from existing parallel
corpora such as OdiEnCorp 1.0 and PMIndia, and books which contain both
English and Odia text such as grammar and bilingual literature books. We
also included parallel text from multiple public websites such as Odia
Wikipedia, Odia digital library, and Odisha Government websites.
The parallel corpus covers many domains: the Bible, other literature,
Wiki data relating to many topics, Government policies, and general
conversation. We processed the raw data collected from the books and
websites, performed sentence alignment (a mix of manual and automatic
alignment), and released the corpus in a form suitable for various NLP
tasks.
Corpus Format
-------------
OdiEnCorp 2.0 is stored in simple plain text files, each with three
tab-delimited columns:
- a coarse indication of the domain
- the English sentence
- the corresponding Odia sentence
The corpus is shuffled at the level of sentence pairs.
The coarse domains are:
books       ... prose text
dict        ... dictionaries and phrasebooks
govt        ... partially formal text
odiencorp10 ... OdiEnCorp 1.0 (mix of domains)
pmindia     ... PMIndia (the original corpus)
wikipedia   ... sentences and phrases from Wikipedia
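A minimal Python sketch of reading these files (the file name is a placeholder; each line follows the three-column format described above):

    import csv
    from collections import Counter

    # Each line has three tab-delimited columns: domain, English, Odia.
    pairs_per_domain = Counter()
    with open("train.tsv", encoding="utf-8") as f:  # placeholder file name
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        for domain, english, odia in reader:
            pairs_per_domain[domain] += 1
    print(pairs_per_domain)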
Data Statistics
---------------
The statistics of the current release are given below.
Note that the statistics differ from those reported in the paper due to
deduplication at the level of sentence pairs. The deduplication was
performed within each of the dev, test, and training sets, taking the
coarse domain indication into account. It is thus still possible that
the same sentence pair appears more than once within the same set
(dev/test/train) if it comes from different domains, and it is also
possible that a sentence pair appears in several sets (dev/test/train).
Parallel Corpus Statistics
--------------------------
                      Dev                     Test                    Train
               Sents    # EN    # OD   Sents    # EN    # OD   Sents     # EN     # OD
books           3523   42011   36723    3895   52808   45383    3129    40461    35300
dict            3342   14580   13838    3437   14807   14110    5900    21591    20246
govt               -       -       -       -       -       -     761    15227    13132
odiencorp10      947   21905   19509    1259   28473   24350   26963   704114   602005
pmindia         3836   70282   61099    3836   68695   59876   30687   551657   486636
wikipedia       1896    9388    9385    1917   21381   20951    1930     7087     7122
Total          13544  158166  140554   14344  186164  164670   69370  1340137  1164441
"Sents" are the counts of the sentence pairs in the given set (dev/test/train)
and domain (books/dict/...).
"# EN" and "# OD" are approximate counts of words (simply space-delimited,
without tokenization) in English and Odia
The total number of sentence pairs (lines) is 13544+14344+69370=97258. Ignoring
the set and domain and deduplicating again, this number drops to 94857.
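The 94857 figure can be reproduced along these lines (a Python sketch; the file names are placeholders):

    # Count unique (English, Odia) pairs across all three sets, ignoring
    # both the set and the domain column.
    unique_pairs = set()
    for name in ("dev.tsv", "test.tsv", "train.tsv"):  # placeholder names
        with open(name, encoding="utf-8") as f:
            for line in f:
                _domain, english, odia = line.rstrip("\n").split("\t")
                unique_pairs.add((english, odia))
    print(len(unique_pairs))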
Citation
--------
If you use this corpus, please cite the following paper:
@inproceedings{parida2020odiencorp,
  title={OdiEnCorp 2.0: Odia-English Parallel Corpus for Machine Translation},
  author={Parida, Shantipriya and Dash, Satya Ranjan and Bojar, Ond{\v{r}}ej and Motlicek, Petr and Pattnaik, Priyanka and Mallick, Debasish Kumar},
  booktitle={Proceedings of the WILDRE5--5th Workshop on Indian Language Data: Resources and Evaluation},
  pages={14--19},
  year={2020}
}
The dataset used for the Ptakopět experiment on outbound machine translation. It consists of screenshots of web forms with user queries entered; the queries are also available in text form. The dataset comprises two language versions: English and Czech. Whereas the English version has been fully post-processed (screenshots cropped, queries within the screenshots highlighted, dataset split based on its quality, etc.), the Czech version is raw, as it was collected by the annotators.
AMALACH project component TMODS:ENG-CZE; machine translation of queries from Czech to English. This archive contains models for the Moses decoder (binarized and pruned to allow real-time translation) and configuration files for the MTMonkey toolkit. The aim of this package is to provide a complete Czech->English translation service that can be easily utilized as a component in a larger software solution. (The required tools are freely available, and an installation guide is included in the package.)
The translation models were trained on the CzEng 1.0 corpus and Europarl. The monolingual data for language model estimation additionally contains WMT news crawls up to 2013.
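A minimal Python sketch of calling the resulting service, assuming the JSON interface of the MTMonkey application server; the URL is a placeholder, and the field names should be verified against the documentation included in the package:

    import json
    import urllib.request

    URL = "http://localhost:8080/"  # placeholder: MTMonkey application server

    # Field names follow the MTMonkey JSON API as we understand it; verify
    # them against the bundled documentation before use.
    payload = json.dumps({
        "action": "translate",
        "sourceLang": "cs",
        "targetLang": "en",
        "text": "Dobrý den.",
    }).encode("utf-8")
    request = urllib.request.Request(
        URL, data=payload, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request) as r:
        print(json.load(r))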
En-De translation models, exported via TensorFlow Serving, available in the Lindat translation service (https://lindat.mff.cuni.cz/services/translation/).
Models are compatible with Tensor2tensor version 1.6.6.
For details about the model training (data, model hyper-parameters), please contact the archive maintainer.
Evaluation on newstest2020 (BLEU):
en->de: 25.9
de->en: 33.4
(Evaluated using multeval: https://github.com/jhclark/multeval)
En-Ru translation models, exported via TensorFlow Serving, available in the Lindat translation service (https://lindat.mff.cuni.cz/services/translation/).
Models are compatible with Tensor2tensor version 1.6.6.
For details about the model training (data, model hyper-parameters), please contact the archive maintainer.
Evaluation on newstest2020 (BLEU):
en->ru: 18.0
ru->en: 30.4
(Evaluated using multeval: https://github.com/jhclark/multeval)
This is the first release of the UFAL Parallel Corpus of North Levantine, compiled by the Institute of Formal and Applied Linguistics (ÚFAL) at Charles University within the Welcome project (https://welcome-h2020.eu/). The corpus consists of 120,600 multiparallel sentences in English, French, German, Greek, Spanish, and Standard Arabic, selected from the OpenSubtitles2018 corpus [1] and manually translated into North Levantine Arabic. The corpus was created for the purpose of training machine translation between North Levantine and the other languages.
The corpus contains recordings by native speakers of North Levantine Arabic (apc) acquired during 2020, 2021, and 2023 in Prague, Paris, Kabardia, and St. Petersburg. Altogether, there were 13 speakers (9 male and 4 female; by age group, 1x 15-20, 7x 20-30, 4x 30-40, and 1x 40-50).
The recordings contain both monologues and dialogues on the topics of everyday life (health, education, family life, sports, culture) as well as information on both the host countries (living abroad) and the country of origin (Syrian traditions, the education system, etc.). Both types are spontaneous; the participants were given only the general subject and talked on the topic or discussed it freely. The transcription and translation team consisted of students of Arabic at Charles University, with an additional quality check provided by native speakers of the dialect.
The textual data is split between the (parallel) transcriptions (.apc) and translations (.eng), with one segment per line. The additional .yaml file provides a mapping to the corresponding audio file (with the duration and offset in the "%S.%03d" format, i.e., seconds and milliseconds) and a unique speaker ID.
The audio data is shared in the 48kHz .wav format, with dialogues and monologues in separate folders. All of the recordings are single-channel (mono). For dialogues, there is a separate file for each speaker, e.g., "Tar_13052022_Czechia-01.wav" and "Tar_13052022_Czechia-02.wav".
The data provided in this repository corresponds to the validation split of the dialectal Arabic to English shared task hosted at the 21st edition of the International Conference on Spoken Language Translation, i.e., IWSLT 2024.
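A minimal Python sketch of pairing the text segments with the audio metadata; the file names and, in particular, the keys inside the .yaml entries ("wav", "offset", "duration", "speaker_id") are illustrative assumptions, not a documented schema:

    import yaml  # PyYAML

    # One segment per line in the parallel .apc and .eng files.
    with open("dev.apc", encoding="utf-8") as f:  # placeholder file names
        apc = [line.rstrip("\n") for line in f]
    with open("dev.eng", encoding="utf-8") as f:
        eng = [line.rstrip("\n") for line in f]

    # One mapping entry per segment; the key names here are guesses.
    with open("dev.yaml", encoding="utf-8") as f:
        mapping = yaml.safe_load(f)

    for source, translation, meta in zip(apc, eng, mapping):
        print(meta.get("wav"), meta.get("offset"), source, "->", translation)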
The corpus contains recordings by native speakers of North Levantine Arabic (apc) acquired during 2020, 2021, and 2023 in Prague, Paris, Kabardia, and St. Petersburg. Altogether, there were 13 speakers (9 male and 4 female; by age group, 1x 15-20, 7x 20-30, 4x 30-40, and 1x 40-50).
The recordings contain both monologues and dialogues on the topics of everyday life (health, education, family life, sports, culture) as well as information on both the host countries (living abroad) and the country of origin (Syrian traditions, the education system, etc.). Both types are spontaneous; the participants were given only the general subject and talked on the topic or discussed it freely. The transcription and translation team consisted of students of Arabic at Charles University, with an additional quality check provided by native speakers of the dialect.
The textual data is split between the (parallel) transcriptions (.apc) and translations (.eng), with one segment per line. The additional .yaml file provides a mapping to the corresponding audio file (with the duration and offset in the "%S.%03d" format, i.e., seconds and milliseconds) and a unique speaker ID.
The audio data is shared in the 48kHz .wav format, with dialogues and monologues in separate folders. All of the recordings are single-channel (mono). For dialogues, there is a separate file for each speaker, e.g., "16072022_Family-01.wav" and "16072022_Family-02.wav".
The data provided in this repository corresponds to the test split of the dialectal Arabic to English shared task hosted at the 21st edition of the International Conference on Spoken Language Translation, i.e., IWSLT 2024.
This item contains models to be tuned for the WMT16 Tuning shared task for Czech-to-English.
The CzEng 1.6pre corpus (http://ufal.mff.cuni.cz/czeng/czeng16pre) is used for training the translation models. The data is tokenized (using the Moses tokenizer) and lowercased, and sentences longer than 60 words or shorter than 4 words are removed before training. Alignment is done using fast_align (https://github.com/clab/fast_align), and the standard Moses pipeline is used for training.
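The length filtering amounts to something like the following Python sketch (file names are placeholders; the input is assumed to be already tokenized and lowercased, and the filter is assumed to apply to both sides of a pair):

    # Keep only pairs where both sides have between 4 and 60 tokens,
    # dropping sentences longer than 60 or shorter than 4 words.
    with open("czeng.cs", encoding="utf-8") as src, \
         open("czeng.en", encoding="utf-8") as tgt, \
         open("train.cs", "w", encoding="utf-8") as out_src, \
         open("train.en", "w", encoding="utf-8") as out_tgt:
        for cs, en in zip(src, tgt):
            if 4 <= len(cs.split()) <= 60 and 4 <= len(en.split()) <= 60:
                out_src.write(cs)
                out_tgt.write(en)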
Two 5-gram language models are trained using KenLM: one using only the CzEng English data, the other using all available English monolingual data for WMT except Common Crawl.
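Once trained, such models can be loaded through KenLM's Python bindings; a minimal sketch (the ARPA file name is a placeholder):

    import kenlm  # Python bindings of KenLM

    model = kenlm.Model("czeng.en.arpa")  # placeholder file name
    print(model.order)  # 5 for these models
    # score() returns the log10 probability of a tokenized sentence,
    # including begin/end-of-sentence markers by default.
    print(model.score("this is a small test ."))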
Also included are two bidirectional lexicalized reordering models, word-based and hierarchical, with MSD (monotone, swap, discontinuous) orientations conditioned on both the source and target sides of the processed CzEng.