Segment from Český zvukový týdeník Aktualita (Czech Aktualita Sound Newsreel) issue no. 9B from 1943 captures an ice sports course for mandatory youth service instructors, which was organised by the Board of Trustees for the Education of Youth as part of the Ice Sports Week event held at Štvanice Ice Arena in Prague from 1 to 6 February. Training in speed skating and ice hockey was led by hockey players Josef Maleček, Vladimír Zábrodský and Jiří Tožička. The event was attended by General Secretary of the Board František Teuner.
CsEnVi Pairwise Parallel Corpora consist of a Vietnamese-Czech parallel corpus and a Vietnamese-English parallel corpus. The corpora were assembled from the following sources:
- OPUS, the open parallel corpus, is a growing multilingual corpus of translated open-source documents.
The majority of Vi-En and Vi-Cs bitexts are subtitles from movies and television series.
The two sides of these bitexts tend to paraphrase each other's meaning rather than being strict translations.
- TED talks, a collection of short talks on various topics, given primarily in English, transcribed, and with the transcripts translated into other languages. In our corpus, we use 1198 talks that had English and Vietnamese transcripts available and 784 talks that had Czech and Vietnamese transcripts available in January 2015.
The size of the original corpora collected from OPUS and TED talks is as follows:
              CS/VI               EN/VI
Sentence      1337199/1337199     2035624/2035624
Word          9128897/12073975    16638364/17565580
Unique word   224416/68237        91905/78333
We improve the quality of the corpora in two steps: normalizing and filtering.
In the normalizing step, the corpora are cleaned based on the general format of subtitles and transcripts. For instance, sequences of dots indicate explicit continuation of subtitles across multiple time frames. These sequences are distributed differently on the source and target sides. Removing them, along with applying a number of other normalization rules, improves the quality of the alignment significantly.
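The dot-removal rule described above can be illustrated with a minimal Python sketch. This is not the authors' normalization script: the exact rules are not given here, so the pattern (two or more consecutive dots, or the single-character ellipsis) and the function names `strip_dot_runs` and `normalize_pair` are assumptions made for illustration.

```python
import re
from typing import Tuple

# Assumption: a "sequence of dots" is two or more consecutive dots,
# or the single-character ellipsis (U+2026). A lone full stop is kept.
DOT_RUN = re.compile(r"(?:\.{2,}|\u2026)")


def strip_dot_runs(line: str) -> str:
    """Remove runs of dots and collapse the whitespace left behind."""
    without_dots = DOT_RUN.sub(" ", line)
    return " ".join(without_dots.split())


def normalize_pair(source: str, target: str) -> Tuple[str, str]:
    """Apply the same rule to both sides of a sentence pair."""
    return strip_dot_runs(source), strip_dot_runs(target)
```

Applying the same rule to both sides means the continuation markers no longer differ between source and target, which is the property the alignment step benefits from.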
In the filtering step, we adapt the CzEng filtering tool [1] to filter out bad sentence pairs.
The size of cleaned corpora as published is as follows:
              CS/VI               EN/VI
Sentence      1091058/1091058     1113177/1091058
Word          6718184/7646701     8518711/8140876
Unique word   195446/59737        69513/58286
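For readers who want to compute comparable figures on one side of a corpus, the following Python sketch counts sentences, words and unique words. How the published numbers were obtained is not stated, so the whitespace tokenization, the case-sensitive vocabulary and the name `corpus_stats` are all assumptions; results may differ from the table above.

```python
from typing import Dict, Iterable


def corpus_stats(sentences: Iterable[str]) -> Dict[str, int]:
    """Count sentences, whitespace-separated words, and distinct word forms."""
    n_sentences = 0
    n_words = 0
    vocabulary = set()
    for sentence in sentences:
        tokens = sentence.split()  # assumption: tokens are whitespace-separated
        n_sentences += 1
        n_words += len(tokens)
        vocabulary.update(tokens)  # case-sensitive: "The" and "the" differ
    return {
        "sentences": n_sentences,
        "words": n_words,
        "unique_words": len(vocabulary),
    }
```

The function accepts any iterable of strings, so an open file with one sentence per line can be passed in directly without loading it into memory.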
The corpora are used as training data in [2].
References:
[1] Ondřej Bojar, Zdeněk Žabokrtský, et al. 2012. The Joy of Parallelism with CzEng 1.0. Proceedings of LREC2012. ELRA. Istanbul, Turkey.
[2] Duc Tam Hoang and Ondřej Bojar. 2015. The Prague Bulletin of Mathematical Linguistics, Volume 104, Issue 1, Pages 75–86. ISSN 1804-0462. September 2015.
CUBBITT En-Cs translation models, exported via TensorFlow Serving, available in the Lindat translation service (https://lindat.mff.cuni.cz/services/translation/).
Models are compatible with Tensor2tensor version 1.6.6.
For details about the model training (data, model hyper-parameters), please contact the archive maintainer.
Evaluation on newstest2014 (BLEU):
en->cs: 27.6
cs->en: 34.4
(Evaluated using multeval: https://github.com/jhclark/multeval)
Web corpus of Czech, created in 2011. Contains newspapers and magazines, discussions, and blogs. See http://www.lrec-conf.org/proceedings/lrec2012/summaries/120.html for details. Grant reference: GA405/09/0278.
This is a document-aligned parallel corpus of English and Czech abstracts of scientific papers published by authors from the Institute of Formal and Applied Linguistics, Charles University in Prague, as reported in the institute's system Biblio. For each publication, the authors are obliged to provide both the original abstract in Czech or English, and its translation into English or Czech, respectively. No filtering was performed, except for removing entries missing the Czech or English abstract and replacing newline and tab characters with spaces.
This is a parallel corpus of Czech and mostly English abstracts of scientific papers and presentations published by authors from the Institute of Formal and Applied Linguistics, Charles University in Prague. For each publication record, the authors are obliged to provide both the original abstract (in Czech or English) and its translation (into English or Czech) in the internal Biblio system. The data was filtered for duplicates and missing entries, ensuring that every record is bilingual. Additionally, records of published papers that are indexed by Semantic Scholar contain the respective link. The dataset was created from a September 2022 image of the Biblio database and is stored in JSONL format, with each line corresponding to one record.
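Because the data is stored as JSONL with one record per line, it can be read in a streaming fashion. The sketch below is a minimal Python reader; the field names of the records are not given in this description, so the code does not assume any, and the helper names `iter_records` and `load_file` are hypothetical.

```python
import json
from typing import Any, Dict, Iterable, Iterator, List


def iter_records(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Yield one parsed JSON object per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:  # tolerate blank lines, e.g. a trailing newline
            yield json.loads(line)


def load_file(path: str) -> List[Dict[str, Any]]:
    """Read a whole JSONL file; `path` is supplied by the caller."""
    with open(path, encoding="utf-8") as handle:
        return list(iter_records(handle))
```

A sensible first step after loading is to inspect `records[0].keys()` to see which fields the dataset actually provides.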
The database contains annotated reflective sentences, which fall into the categories of reflective writing according to Ullmann's (2019) model. The dataset can be used to replicate the prediction of these categories using machine learning. Available from: https://anonymous.4open.science/repository/c856595c-dfc2-48d7-aa3d-0ccc2648c4dc/data
This is the Czech Court Decisions Corpus (CzCDC 1.0). This corpus contains the full texts of decisions from three top-tier courts (the Supreme Court, the Supreme Administrative Court and the Constitutional Court) in the Czech Republic. The decisions were published between 1 January 1993 and 30 September 2018.
The language of the decisions is Czech. The content of the decisions is unedited and was obtained directly from the competent court.
The decisions are stored as .txt files in three folders, one per court.
The corpus contains three .csv files listing all decisions, with four columns:
- name of the file: exact file name of a decision with extension .txt;
- decision identifier (docket number): official identification of the decision as issued by the court;
- date of decision: in ISO 8601 (YYYY-MM-DD);
- court abbreviation: SupCo for Supreme Court, SupAdmCo for Supreme Administrative Court, ConCo for Constitutional Court.
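The four-column layout above could be read with a short Python sketch like the following. The description does not state the delimiter, whether the files have a header row, or the column order on disk, so a comma delimiter and the order listed above are assumptions; the names `DecisionEntry` and `read_index` are hypothetical.

```python
import csv
from dataclasses import dataclass
from datetime import date
from typing import Iterable, Iterator

KNOWN_COURTS = {"SupCo", "SupAdmCo", "ConCo"}


@dataclass
class DecisionEntry:
    file_name: str      # exact .txt file name of the decision
    docket_number: str  # identifier issued by the court
    decided_on: date    # parsed from ISO 8601 (YYYY-MM-DD)
    court: str          # SupCo, SupAdmCo or ConCo


def read_index(rows: Iterable[str], delimiter: str = ",") -> Iterator[DecisionEntry]:
    """Parse index rows; the delimiter and column order are assumptions."""
    for fields in csv.reader(rows, delimiter=delimiter):
        if len(fields) != 4:
            continue  # skip blank or malformed rows
        file_name, docket, iso_date, court = (f.strip() for f in fields)
        if court not in KNOWN_COURTS:
            continue  # also skips a header row, if the file has one
        yield DecisionEntry(file_name, docket, date.fromisoformat(iso_date), court)
```

An open file can be passed as `rows` (opened with `newline=""`, as the `csv` module recommends). If the files turn out to use another separator, only the `delimiter` argument needs to change.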
Statistics:
- SupCo: 111 977 decisions, 23 699 639 lines, 224 061 129 words, 1 462 948 200 bytes;
- SupAdmCo: 52 660 decisions, 18 069 993 lines, 137 839 985 words, 1 067 826 507 bytes;
- ConCo: 73 086 decisions, 6 178 371 lines, 98 623 753 words, 664 657 755 bytes;
- all courts combined: 237 723 decisions, 47 948 003 lines, 460 524 867 words, 3 195 432 462 bytes.
We present the Czech Court Decisions Dataset (CCDD) -- a dataset of 300 manually annotated court decisions published by the Supreme Court of the Czech Republic and the Constitutional Court of the Czech Republic.
AGREE is a dataset and task for evaluating language models based on grammatical agreement in Czech. The dataset consists of sentences with marked suffixes of past tense verbs. The task is to choose the correct verb suffix, which depends on the gender, number and animacy of the subject. It is challenging for language models because 1) Czech is morphologically rich, 2) it has relatively free word order, 3) it has a high out-of-vocabulary (OOV) ratio, 4) the predicate and subject can be far from each other, 5) subjects can be unexpressed and 6) various semantic rules may apply. The task provides a straightforward and easily reproducible way of evaluating language models on a morphologically rich language.
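One way to evaluate a language model on a suffix-choice task of this kind is to fill each candidate suffix into the sentence, score the resulting sentences, and pick the best-scoring one. The Python sketch below illustrates that idea only; it is not the official AGREE evaluation protocol. The `*` slot marker, the item layout (template, candidates, gold suffix) and the `score` callable standing in for a language model are all assumptions.

```python
from typing import Callable, Sequence, Tuple

SLOT = "*"  # assumed placeholder marking where the verb suffix goes

Item = Tuple[str, Sequence[str], str]  # (template, candidate suffixes, gold suffix)


def choose_suffix(template: str, candidates: Sequence[str],
                  score: Callable[[str], float]) -> str:
    """Return the candidate whose filled-in sentence scores highest."""
    return max(candidates, key=lambda suffix: score(template.replace(SLOT, suffix, 1)))


def accuracy(items: Sequence[Item], score: Callable[[str], float]) -> float:
    """Fraction of items for which the chosen suffix equals the gold suffix."""
    if not items:
        return 0.0
    correct = sum(choose_suffix(t, c, score) == gold for t, c, gold in items)
    return correct / len(items)
```

Here `score` would typically return a sentence log-probability from the model under test; any callable that maps a sentence to a number can be plugged in, which keeps the evaluation independent of a particular model library.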