This small dataset contains 3 speech corpora collected using the Alex Translate telephone service (https://ufal.mff.cuni.cz/alex#alex-translate).
The "part1" and "part2" corpora contain English speech with transcriptions and Czech translations. These recordings were collected from users of the service. Part 1 contains earlier recordings, filtered to include only clean speech; Part 2 contains later recordings with no filtering applied.
The "cstest" corpus contains recordings of artificially created sentences, each containing one or more Czech names of places in the Czech Republic. These were recorded by a multinational group of students studying in Prague.
Human post-edited test sentences for the WMT 2017 Automatic post-editing task. This consists in 2,000 English sentences belonging to the IT domain and already tokenized. Source and target segments can be downloaded from: https://lindat.mff.cuni.cz/repository/xmlui/handle/11372/LRT-2132. All data is provided by the EU project QT21 (http://www.qt21.eu/).
Human post-edited test sentences for the WMT 2017 Automatic post-editing task. This consists in 2,000 German sentences belonging to the IT domain and already tokenized. Source and target segments can be downloaded from: https://lindat.mff.cuni.cz/repository/xmlui/handle/11372/LRT-2133. All data is provided by the EU project QT21 (http://www.qt21.eu/).
This dataset contains automatic paraphrases of Czech official reference translations for the Workshop on Statistical Machine Translation shared task. The data covers the years 2011, 2013 and 2014.
For each sentence, at most 10000 paraphrases were included (randomly selected from the full set).
The goal of using this dataset is to improve automatic evaluation of machine translation outputs.
If you use this work, please cite the following paper:
Tamchyna Aleš, Barančíková Petra: Automatic and Manual Paraphrases for MT Evaluation. In proceedings of LREC, 2016.
CUBBITT En-Cs translation models, exported via TensorFlow Serving, available in the Lindat translation service (https://lindat.mff.cuni.cz/services/translation/).
Models are compatible with Tensor2tensor version 1.6.6.
For details about the model training (data, model hyper-parameters), please contact the archive maintainer.
Evaluation on newstest2014 (BLEU):
en->cs: 27.6
cs->en: 34.4
(Evaluated using multeval: https://github.com/jhclark/multeval)
CUBBITT En-Fr translation models, exported via TensorFlow Serving, available in the Lindat translation service (https://lindat.mff.cuni.cz/services/translation/).
Models are compatible with Tensor2tensor version 1.6.6.
For details about the model training (data, model hyper-parameters), please contact the archive maintainer.
Evaluation on newstest2014 (BLEU):
en->fr: 38.2
fr->en: 36.7
(Evaluated using multeval: https://github.com/jhclark/multeval)
CUBBITT En-Pl translation models, exported via TensorFlow Serving, available in the Lindat translation service (https://lindat.mff.cuni.cz/services/translation/).
Models are compatible with Tensor2tensor version 1.6.6.
For details about the model training (data, model hyper-parameters), please contact the archive maintainer.
Evaluation on newstest2020 (BLEU):
en->pl: 12.3
pl->en: 20.0
(Evaluated using multeval: https://github.com/jhclark/multeval)
This submission contains trained end-to-end models for the Neural Monkey toolkit for Czech and English, solving three NLP tasks: machine translation, image captioning, and sentiment analysis.
The models are trained on standard datasets and achieve state-of-the-art or near state-of-the-art performance in the tasks.
The models are described in the accompanying paper.
The same models can also be invoked via the online demo: https://ufal.mff.cuni.cz/grants/lsd
There are several separate ZIP archives here, each containing one model solving one of the tasks for one language.
To use a model, you first need to install Neural Monkey: https://github.com/ufal/neuralmonkey
To ensure correct functioning of the model, please use the exact version of Neural Monkey specified by the commit hash stored in the 'git_commit' file in the model directory.
Each model directory contains a 'run.ini' Neural Monkey configuration file, to be used to run the model. See the Neural Monkey documentation to learn how to do that (you may need to update some paths to correspond to your filesystem organization).
The 'experiment.ini' file, which was used to train the model, is also included.
Then there are files containing the model itself, files containing the input and output vocabularies, etc.
For the sentiment analyzers, you should tokenize your input data using the Moses tokenizer: https://pypi.org/project/mosestokenizer/
For the machine translation, you do not need to tokenize the data, as this is done by the model.
For image captioning, you need to:
- download a trained ResNet: http://download.tensorflow.org/models/resnet_v2_50_2017_04_14.tar.gz
- clone the git repository with TensorFlow models: https://github.com/tensorflow/models
- preprocess the input images with the Neural Monkey 'scripts/imagenet_features.py' script (https://github.com/ufal/neuralmonkey/blob/master/scripts/imagenet_features.py) -- you need to specify the path to ResNet and to the TensorFlow models to this script
Feel free to contact the authors of this submission in case you run into problems!
This submission contains trained end-to-end models for the Neural Monkey toolkit for Czech and English, solving four NLP tasks: machine translation, image captioning, sentiment analysis, and summarization.
The models are trained on standard datasets and achieve state-of-the-art or near state-of-the-art performance in the tasks.
The models are described in the accompanying paper.
The same models can also be invoked via the online demo: https://ufal.mff.cuni.cz/grants/lsd
In addition to the models presented in the referenced paper (developed and published in 2018), we include models for automatic news summarization for Czech and English developed in 2019. The Czech models were trained using the SumeCzech dataset (https://www.aclweb.org/anthology/L18-1551.pdf), the English models were trained using the CNN-Daily Mail corpus (https://arxiv.org/pdf/1704.04368.pdf) using the standard recurrent sequence-to-sequence architecture.
There are several separate ZIP archives here, each containing one model solving one of the tasks for one language.
To use a model, you first need to install Neural Monkey: https://github.com/ufal/neuralmonkey
To ensure correct functioning of the model, please use the exact version of Neural Monkey specified by the commit hash stored in the 'git_commit' file in the model directory.
Each model directory contains a 'run.ini' Neural Monkey configuration file, to be used to run the model. See the Neural Monkey documentation to learn how to do that (you may need to update some paths to correspond to your filesystem organization).
The 'experiment.ini' file, which was used to train the model, is also included.
Then there are files containing the model itself, files containing the input and output vocabularies, etc.
For the sentiment analyzers, you should tokenize your input data using the Moses tokenizer: https://pypi.org/project/mosestokenizer/
For the machine translation, you do not need to tokenize the data, as this is done by the model.
For image captioning, you need to:
- download a trained ResNet: http://download.tensorflow.org/models/resnet_v2_50_2017_04_14.tar.gz
- clone the git repository with TensorFlow models: https://github.com/tensorflow/models
- preprocess the input images with the Neural Monkey 'scripts/imagenet_features.py' script (https://github.com/ufal/neuralmonkey/blob/master/scripts/imagenet_features.py) -- you need to specify the path to ResNet and to the TensorFlow models to this script
The summarization models require input that is tokenized with Moses Tokenizer (https://github.com/alvations/sacremoses) and lower-cased.
Feel free to contact the authors of this submission in case you run into problems!