The dataset used for the Ptakopět experiment on outbound machine translation. It consists of screenshots of web forms with user queries entered. The queries are also available in text form. The dataset comprises two language versions: English and Czech. Whereas the English version has been fully post-processed (screenshots cropped, queries within the screenshots highlighted, dataset split by quality, etc.), the Czech version is raw, as collected by the annotators.
Post-editing and MQM annotations produced by the QT21 project, as described in:
@InProceedings{specia-etal_MTSummit:2017,
author = {Specia, Lucia and Kim Harris and Frédéric Blain and Aljoscha Burchardt and Vivien Macketanz and Inguna Skadiņa and Matteo Negri and Marco Turchi},
title = {Translation Quality and Productivity: A Study on Rich Morphology Languages},
booktitle = {Proceedings of Machine Translation Summit XVI},
year = {2017},
pages = {55--71},
address = {Nagoya, Japan},
}
Dictionary of sayings, idioms, idiomatic expressions, and fixed word combinations; search results are displayed along four dimensions: saying, explanation, examples, and supplementary notes.
The C4 corpus is a joint effort of the project Digitales Wörterbuch der deutschen Sprache (DWDS), the Austrian Academy Corpus (AAC), the Korpus Südtirol and the Schweizer Textkorpus (CHTK). The corpus is composed of corpora from all four partner institutions.
1) Finds repeated sequences of words in documents (repetitiveness checker).
2) Finds common sequences of words in several documents (version comparison).
A sequence consists of at least two words. There is no upper limit on the number of words in a sequence, but sequences do not cross sentence boundaries. There are several weight functions to choose from, each defining "good" sequences in a different way, based on word frequency, sequence length and number of repetitions.
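The idea of the repetitiveness checker can be illustrated with a minimal Python sketch. It is not the tool's actual implementation: the function names, the naive sentence splitter, and the single weight function (sequence length times number of repetitions) are all assumptions made for illustration, and word frequency is not taken into account here.

```python
import re
from collections import defaultdict


def split_sentences(text):
    # Assumption: sentences end at '.', '!' or '?'. A real tool would
    # likely use a proper sentence segmenter.
    return [s for s in re.split(r"[.!?]+", text) if s.strip()]


def tokenize(sentence):
    # Assumption: words are runs of word characters, compared case-insensitively.
    return re.findall(r"\w+", sentence.lower())


def repeated_sequences(text, min_len=2):
    """Count every word sequence of at least min_len words.

    Sequences are collected sentence by sentence, so none of them
    crosses a sentence boundary. Only sequences seen at least twice
    are returned.
    """
    counts = defaultdict(int)
    for sentence in split_sentences(text):
        words = tokenize(sentence)
        for start in range(len(words)):
            for end in range(start + min_len, len(words) + 1):
                counts[tuple(words[start:end])] += 1
    return {seq: n for seq, n in counts.items() if n >= 2}


def length_times_repetitions(seq, n):
    # Hypothetical weight function: longer and more frequent is "better".
    return len(seq) * n


def rank(repeated, weight=length_times_repetitions):
    # Sort sequences from highest to lowest weight.
    return sorted(repeated.items(), key=lambda item: weight(*item), reverse=True)


if __name__ == "__main__":
    sample = "The quick brown fox runs. I saw the quick brown fox. Brown fox."
    for seq, n in rank(repeated_sequences(sample)):
        print(" ".join(seq), n)
```

For version comparison, the same counting could be run over several documents while recording which documents each sequence occurs in; other weight functions can be passed to `rank` through the `weight` parameter.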
The data contains the morphemic dictionary, scanned into PDF format. It is divided into three parts:
introductions.pdf - pp. 11-102
main_dictionary.pdf - pp. 113-506
appendices.pdf - pp. 509-645
The file contains all Czech verbs included in the Retrograde Morphemic Dictionary of Czech Language (Slavíčková Eleonora, Academia 1975).
The data was obtained by scanning the portion of the dictionary that contains words ending in -ci and -ti. Among them were 18 non-verbs, which were removed. The data was converted into plain text using OCR, and the result was checked by two independent readers. If you nevertheless encounter an overlooked error, please report it.