AMALACH project component TMODS:ENG-CZE; machine translation of queries from Czech to English. This archive contains models for the Moses decoder (binarized, pruned to allow for real-time translation) and configuration files for the MTMonkey toolkit. The aim of this package is to provide a full service for Czech->English translation which can be easily utilized as a component in a larger software solution. (The required tools are freely available and an installation guide is included in the package.)
The translation models were trained on CzEng 1.0 corpus and Europarl. Monolingual data for LM estimation additionally contains WMT news crawls until 2013.
En-De translation models, exported via TensorFlow Serving, available in the Lindat translation service (https://lindat.mff.cuni.cz/services/translation/).
Models are compatible with Tensor2tensor version 1.6.6.
For details about the model training (data, model hyper-parameters), please contact the archive maintainer.
Evaluation on newstest2020 (BLEU):
en->de: 25.9
de->en: 33.4
(Evaluated using multeval: https://github.com/jhclark/multeval)
En-Ru translation models, exported via TensorFlow Serving, available in the Lindat translation service (https://lindat.mff.cuni.cz/services/translation/).
Models are compatible with Tensor2tensor version 1.6.6.
For details about the model training (data, model hyper-parameters), please contact the archive maintainer.
Evaluation on newstest2020 (BLEU):
en->ru: 18.0
ru->en: 30.4
(Evaluated using multeval: https://github.com/jhclark/multeval)
Treex::Web is a web frontend for running Treex applications from your browser.
Treex (formerly TectoMT) is a highly modular NLP framework implemented in Perl programming language. It is primarily aimed at Machine Translation, making use of the ideas and technology created during the Prague Dependency Treebank project.
This is the first release of the UFAL Parallel Corpus of North Levantine, compiled by the Institute of Formal and Applied Linguistics (ÚFAL) at Charles University within the Welcome project (https://welcome-h2020.eu/). The corpus consists of 120,600 multiparallel sentences in English, French, German, Greek, Spanish, and Standard Arabic selected from the OpenSubtitles2018 corpus [1] and manually translated into the North Levantine Arabic language. The corpus was created for the purpose of training machine translation for North Levantine and the other languages.
The corpus contains recordings by the native speakers of the North Levantine Arabic (apc) acquired during 2020, 2021, and 2023 in Prague, Paris, Kabardia, and St. Petersburg. Altogether, there were 13 speakers (9 male and 4 female, aged 1x 15-20, 7x 20-30, 4x 30-40, and 1x 40-50).
The recordings contain both monologues and dialogues on the topics of everyday life (health, education, family life, sports, culture) as well as information on both host countries (living abroad) and country of origin (Syria traditions, education system, etc.). Both types are spontaneous, the participants were given only the general subject and talked on the topic or discussed it freely. The transcription and translation team consisted of students of Arabic at Charles University, with an additional quality check provided by the native speakers of the dialect.
The textual data is split between the (parallel) transcriptions (.apc) and translations (.eng), with one segment per line. The additional .yaml file provides mapping to the corresponding audio file (with the duration and offset in the "%S.%03d" format, i.e., seconds and milliseconds) and a unique speaker ID.
The audio data is shared in the 48kHz .wav format, with dialogues and monologues in separate folders. All of the recordings are mono, with a single channel. For dialogues, there is a separate file for each speaker, e.g., "Tar_13052022_Czechia-01.wav" and "Tar_13052022_Czechia-02.wav".
The data provided in this repository corresponds to the validation split of the dialectal Arabic to English shared task hosted at the 21st edition of the International Conference on Spoken Language Translation, i.e., IWSLT 2024.
The corpus contains recordings by the native speakers of the North Levantine Arabic (apc) acquired during 2020, 2021, and 2023 in Prague, Paris, Kabardia, and St. Petersburg. Altogether, there were 13 speakers (9 male and 4 female, aged 1x 15-20, 7x 20-30, 4x 30-40, and 1x 40-50).
The recordings contain both monologues and dialogues on the topics of everyday life (health, education, family life, sports, culture) as well as information on both host countries (living abroad) and country of origin (Syria traditions, education system, etc.). Both types are spontaneous, the participants were given only the general subject and talked on the topic or discussed it freely. The transcription and translation team consisted of students of Arabic at Charles University, with an additional quality check provided by the native speakers of the dialect.
The textual data is split between the (parallel) transcriptions (.apc) and translations (.eng), with one segment per line. The additional .yaml file provides mapping to the corresponding audio file (with the duration and offset in the "%S.%03d" format, i.e., seconds and milliseconds) and a unique speaker ID.
The audio data is shared in the 48kHz .wav format, with dialogues and monologues in separate folders. All of the recordings are mono, with a single channel. For dialogues, there is a separate file for each speaker, e.g., "16072022_Family-01.wav" and "16072022_Family-02.wav".
The data provided in this repository corresponds to the test split of the dialectal Arabic to English shared task hosted at the 21st edition of the International Conference on Spoken Language Translation, i.e., IWSLT 2024.
Data from a questionnaire survey conducted from 2022-08-25 to 2022-11-15 and exploring the use of machine translation by Ukrainian refugees in the Czech Republic. The presented spreadsheet contains minimally processed data exported from the two questionnaires that were created in Google Forms in the Ukrainian and the Russian language. The links to these questionnaires were distributed by three methods: direct email to particular refugees whose contact details the authors obtained while volunteering; through a non-profit organisation helping refugees (Vesna women’s education institution) and on social networks by posting links to the survey in groups associating the Ukrainian community across Czech regions and towns.
Since we asked potential respondents to spread the questionnaire further, we could not prevent it from reaching Ukrainians who had arrived in Czechia previously, or received temporary protection in other countries. Due to this fact, the textual answers to the question 1.5 "Which country are you in right now?" were replaced in the dataset by numbers (1 for Czech Republic, 2 for other countries) in order for us to be able to separate the data of respondents not located in the Czech Republic, which were irrelevant for our survey.
Training, development and text data (the same used for the Sentence-level Quality Estimation task) consist in English-German triplets (source, target and post-edit) belonging to the IT domain and already tokenized.
Training and development respectively contain 12,000 and 1,000 triplets, while the test set 2,000 instances. All data is provided by the EU project QT21 (http://www.qt21.eu/).
Training, development and test data consist in German sentences belonging to the IT domain and already tokenized. These sentences are the references of the data released for the 2016 edition of the WMT APE shared task. Differently from the data previously released, these sentences are obtained by manually translating the source sentence without leveraging the raw mt outputs. Training and development respectively contain 12,000 and 1,000 segments, while the test set 2,000 items. All data is provided by the EU project QT21 (http://www.qt21.eu/).