The corpus contains speech data of 2 Czech native speakers, male and female. The speech is very precisely articulated up to hyper-articulated, and the speech rate is low. The speech data with a highlighted articulation is suitable for teaching foreigners the Czech language, and it can also be used for people with hearing or speech impairment. The recorded sentences can be used either directly, e.g., as a part of educational material, or as source data for building complex educational systems incorporating speech synthesis technology. All recorded sentences were precisely orthographically annotated and phonetically segmented, i.e., split into phones, using modern neural network-based methods.
CsEnVi Pairwise Parallel Corpora consist of Vietnamese-Czech parallel corpus and Vietnamese-English parallel corpus. The corpora were assembled from the following sources:
- OPUS, the open parallel corpus is a growing multilingual corpus of translated open source documents.
The majority of Vi-En and Vi-Cs bitexts are subtitles from movies and television series.
The nature of the bitexts are paraphrasing of each other's meaning, rather than translations.
- TED talks, a collection of short talks on various topics, given primarily in English, transcribed and with transcripts translated to other languages. In our corpus, we use 1198 talks which had English and Vietnamese transcripts available and 784 talks which had Czech and Vietnamese transcripts available in January 2015.
The size of the original corpora collected from OPUS and TED talks is as follows:
CS/VI EN/VI
Sentence 1337199/1337199 2035624/2035624
Word 9128897/12073975 16638364/17565580
Unique word 224416/68237 91905/78333
We improve the quality of the corpora in two steps: normalizing and filtering.
In the normalizing step, the corpora are cleaned based on the general format of subtitles and transcripts. For instance, sequences of dots indicate explicit continuation of subtitles across multiple time frames. The sequences of dots are distributed differently in the source and the target side. Removing the sequence of dots, along with a number of other normalization rules, improves the quality of the alignment significantly.
In the filtering step, we adapt the CzEng filtering tool [1] to filter out bad sentence pairs.
The size of cleaned corpora as published is as follows:
CS/VI EN/VI
Sentence 1091058/1091058 1113177/1091058
Word 6718184/7646701 8518711/8140876
Unique word 195446/59737 69513/58286
The corpora are used as training data in [2].
References:
[1] Ondřej Bojar, Zdeněk Žabokrtský, et al. 2012. The Joy of Parallelism with CzEng 1.0. Proceedings of LREC2012. ELRA. Istanbul, Turkey.
[2] Duc Tam Hoang and Ondřej Bojar, The Prague Bulletin of Mathematical Linguistics. Volume 104, Issue 1, Pages 75–86, ISSN 1804-0462. 9/2015
CUBBITT En-Cs translation models, exported via TensorFlow Serving, available in the Lindat translation service (https://lindat.mff.cuni.cz/services/translation/).
Models are compatible with Tensor2tensor version 1.6.6.
For details about the model training (data, model hyper-parameters), please contact the archive maintainer.
Evaluation on newstest2014 (BLEU):
en->cs: 27.6
cs->en: 34.4
(Evaluated using multeval: https://github.com/jhclark/multeval)
CUBBITT En-Fr translation models, exported via TensorFlow Serving, available in the Lindat translation service (https://lindat.mff.cuni.cz/services/translation/).
Models are compatible with Tensor2tensor version 1.6.6.
For details about the model training (data, model hyper-parameters), please contact the archive maintainer.
Evaluation on newstest2014 (BLEU):
en->fr: 38.2
fr->en: 36.7
(Evaluated using multeval: https://github.com/jhclark/multeval)
CUBBITT En-Pl translation models, exported via TensorFlow Serving, available in the Lindat translation service (https://lindat.mff.cuni.cz/services/translation/).
Models are compatible with Tensor2tensor version 1.6.6.
For details about the model training (data, model hyper-parameters), please contact the archive maintainer.
Evaluation on newstest2020 (BLEU):
en->pl: 12.3
pl->en: 20.0
(Evaluated using multeval: https://github.com/jhclark/multeval)
We present the Czech Court Decisions Dataset (CCDD) -- a dataset of 300 manually annotated court decisions published by The Supreme Court of the Czech Republic and the Constitutional Court of the Czech Republic.
Czech Contracts dataset was created as a part of the thesis Low-resource Text Classification (2021), A. Szabó, MFF UK.
Contracts are obtained from the Hlídač Státu web portal. Labels in the development and training set are automatically classified on the basis of the keyword method according to the thesis Automatická klasifikace smluv pro portál HlidacSmluv.cz, J. Maroušek (2020), MFF UK. For this reason, the goal in the classification is not to achieve 100% on the development set, as the classification contains a certain amount of noise. The test set is manually annotated. The dataset contains a total of 97493 contracts.
This submission contains trained end-to-end models for the Neural Monkey toolkit for Czech and English, solving three NLP tasks: machine translation, image captioning, and sentiment analysis.
The models are trained on standard datasets and achieve state-of-the-art or near state-of-the-art performance in the tasks.
The models are described in the accompanying paper.
The same models can also be invoked via the online demo: https://ufal.mff.cuni.cz/grants/lsd
There are several separate ZIP archives here, each containing one model solving one of the tasks for one language.
To use a model, you first need to install Neural Monkey: https://github.com/ufal/neuralmonkey
To ensure correct functioning of the model, please use the exact version of Neural Monkey specified by the commit hash stored in the 'git_commit' file in the model directory.
Each model directory contains a 'run.ini' Neural Monkey configuration file, to be used to run the model. See the Neural Monkey documentation to learn how to do that (you may need to update some paths to correspond to your filesystem organization).
The 'experiment.ini' file, which was used to train the model, is also included.
Then there are files containing the model itself, files containing the input and output vocabularies, etc.
For the sentiment analyzers, you should tokenize your input data using the Moses tokenizer: https://pypi.org/project/mosestokenizer/
For the machine translation, you do not need to tokenize the data, as this is done by the model.
For image captioning, you need to:
- download a trained ResNet: http://download.tensorflow.org/models/resnet_v2_50_2017_04_14.tar.gz
- clone the git repository with TensorFlow models: https://github.com/tensorflow/models
- preprocess the input images with the Neural Monkey 'scripts/imagenet_features.py' script (https://github.com/ufal/neuralmonkey/blob/master/scripts/imagenet_features.py) -- you need to specify the path to ResNet and to the TensorFlow models to this script
Feel free to contact the authors of this submission in case you run into problems!
This submission contains trained end-to-end models for the Neural Monkey toolkit for Czech and English, solving four NLP tasks: machine translation, image captioning, sentiment analysis, and summarization.
The models are trained on standard datasets and achieve state-of-the-art or near state-of-the-art performance in the tasks.
The models are described in the accompanying paper.
The same models can also be invoked via the online demo: https://ufal.mff.cuni.cz/grants/lsd
In addition to the models presented in the referenced paper (developed and published in 2018), we include models for automatic news summarization for Czech and English developed in 2019. The Czech models were trained using the SumeCzech dataset (https://www.aclweb.org/anthology/L18-1551.pdf), the English models were trained using the CNN-Daily Mail corpus (https://arxiv.org/pdf/1704.04368.pdf) using the standard recurrent sequence-to-sequence architecture.
There are several separate ZIP archives here, each containing one model solving one of the tasks for one language.
To use a model, you first need to install Neural Monkey: https://github.com/ufal/neuralmonkey
To ensure correct functioning of the model, please use the exact version of Neural Monkey specified by the commit hash stored in the 'git_commit' file in the model directory.
Each model directory contains a 'run.ini' Neural Monkey configuration file, to be used to run the model. See the Neural Monkey documentation to learn how to do that (you may need to update some paths to correspond to your filesystem organization).
The 'experiment.ini' file, which was used to train the model, is also included.
Then there are files containing the model itself, files containing the input and output vocabularies, etc.
For the sentiment analyzers, you should tokenize your input data using the Moses tokenizer: https://pypi.org/project/mosestokenizer/
For the machine translation, you do not need to tokenize the data, as this is done by the model.
For image captioning, you need to:
- download a trained ResNet: http://download.tensorflow.org/models/resnet_v2_50_2017_04_14.tar.gz
- clone the git repository with TensorFlow models: https://github.com/tensorflow/models
- preprocess the input images with the Neural Monkey 'scripts/imagenet_features.py' script (https://github.com/ufal/neuralmonkey/blob/master/scripts/imagenet_features.py) -- you need to specify the path to ResNet and to the TensorFlow models to this script
The summarization models require input that is tokenized with Moses Tokenizer (https://github.com/alvations/sacremoses) and lower-cased.
Feel free to contact the authors of this submission in case you run into problems!
The Czech Legal Text Treebank (CLTT) is a collection of 1133 manually annotated dependency trees. CLTT consists of two legal documents: The Accounting Act (563/1991 Coll., as amended) and Decree on Double-entry Accounting for undertakers (500/2002 Coll., as amended).