At the NLP Centre, text is currently divided into sentences with
a rule-based tool. To create enough training data for machine learning,
annotators manually split the corpus of contemporary text
CBB.blog (1 million tokens) into sentences.
Each file contains one hundredth of the whole corpus, and all data were
processed in parallel by two annotators.
The corpus was created from ten contemporary blogs:
hintzu.otaku.cz
modnipeklo.cz
bloc.cz
aleneprokopova.blogspot.com
blog.aktualne.cz
fuchsova.blog.onaidnes.cz
havlik.blog.idnes.cz
blog.aktualne.centrum.cz
klusak.blogspot.cz
myego.cz/welldone
AGREE is a dataset and task for evaluating language models based on grammatical agreement in Czech. The dataset consists of sentences with marked suffixes of past tense verbs. The task is to choose the right verb suffix, which depends on the gender, number and animacy of the subject. It is challenging for language models because 1) Czech is morphologically rich, 2) it has relatively free word order, 3) it has a high out-of-vocabulary (OOV) ratio, 4) the predicate and subject can be far from each other, 5) subjects can be unexpressed, and 6) various semantic rules may apply. The task provides a straightforward and easily reproducible way of evaluating language models on a morphologically rich language.
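The evaluation idea described above can be sketched as follows. This is only an illustration under stated assumptions: the `{}` slot marker, the candidate ending list, the `score` callback and the toy scorer are all hypothetical and do not reflect the actual AGREE file format or any particular language model.

```python
from typing import Callable, Iterable, Sequence, Tuple

# Endings that follow the "-l" of the Czech past participle.
# Assumed candidate set for this sketch, not taken from the dataset.
CANDIDATES: Sequence[str] = ("", "a", "o", "i", "y")

SLOT = "{}"  # hypothetical marker for the position of the verb ending


def choose_suffix(template: str,
                  score: Callable[[str], float],
                  candidates: Sequence[str] = CANDIDATES) -> str:
    """Return the ending whose completed sentence the model scores highest."""
    return max(candidates, key=lambda s: score(template.replace(SLOT, s)))


def accuracy(examples: Iterable[Tuple[str, str]],
             score: Callable[[str], float]) -> float:
    """Fraction of (template, gold ending) pairs resolved correctly."""
    examples = list(examples)
    if not examples:
        raise ValueError("no examples to evaluate")
    correct = sum(choose_suffix(t, score) == gold for t, gold in examples)
    return correct / len(examples)


def toy_score(sentence: str) -> float:
    """Stand-in for a language model: prefers two memorised sentences."""
    known = {"Ženy šly domů.", "Muži šli domů."}
    return 1.0 if sentence in known else 0.0


# Invented examples: feminine plural takes -y, masculine animate plural takes -i.
examples = [("Ženy šl{} domů.", "y"), ("Muži šl{} domů.", "i")]
result = accuracy(examples, toy_score)
```

In practice `score` would be replaced by the log-probability that the evaluated language model assigns to the completed sentence; the accuracy over the whole dataset then serves as the evaluation figure.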
A sentence-parallel corpus built from the English and Czech Wikipedias, based on articles translated from English into Czech.
The work is described in the paper: ŠTROMAJEROVÁ, Adéla, Vít BAISA and Marek BLAHUŠ. Between Comparable and Parallel: English-Czech Corpus from Wikipedia. In RASLAN 2016 Recent Advances in Slavonic Natural Language Processing. Brno: Tribun EU, 2016. pp. 3-8. ISBN 978-80-263-1095-2.
The SQAD database consists of 3301 records obtained from Czech Wikipedia articles. Each record has the following structure:
- the original sentence(s) from Wikipedia
- a question that is directly answered in the text
- the expected answer to the question as it appears in the original text
- the URL of the Wikipedia web page from which the original text was extracted
- the name of the author of this SQAD record
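The record structure listed above could be modelled roughly as follows. This is a minimal sketch: the class name, field names and the sample record are assumptions made for illustration and are not taken from the SQAD distribution.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SqadRecord:
    """One question-answer record; field names are hypothetical."""
    text: str      # original sentence(s) from Wikipedia
    question: str  # question directly answered in the text
    answer: str    # expected answer as it appears in the text
    url: str       # Wikipedia page the text was extracted from
    author: str    # author of this record

    def answer_in_text(self) -> bool:
        """Check that the answer occurs verbatim in the source text."""
        return self.answer in self.text


# Invented sample record, not part of the database.
record = SqadRecord(
    text="Brno je statutární město v České republice.",
    question="Ve které zemi leží Brno?",
    answer="v České republice",
    url="https://cs.wikipedia.org/wiki/Brno",
    author="example annotator",
)
```

Because the answer is described as appearing in the original text, a check like `answer_in_text` is one simple way to validate records after loading them.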
Simple Question Answering Database version 3 (SQAD v3), created from Czech Wikipedia. The new version consists of 13477 records. Each SQAD record consists of multiple files: question, answer extraction, answer selection, URL, question metadata and, in some cases, answer context.