Corpus of informal spoken Czech sized 1 MW. It contains transcriptions of 221 recordings made in 2002–2006 in the whole of Bohemia. All the recordings were made in informal situations to ensure prototypically spontaneous spoken language. This means private environment, physical presence of speakers who know each other, unscripted speech and topic not given in advance. The total number of speakers is 754, the metadata include sociolinguistic information about them.
The corpus is provided in a (semi-XML) vertical format used as an input to the Manatee query engine. The data thus exactly correspond to the corpus available via query interface to registered users of the CNC. and Výzkumný záměr MSM0021620823 – Český národní korpus a korpusy dalších jazyků
Corpus of contemporary Czech newspapers and magazines sized 300 MW. It contains various titles published between the end of 1989 and 2004. The corpus is lemmatized and morphologically tagged by a combination of stochastic and rule-based methods.
The corpus is provided in a (semi-XML) vertical format used as an input to the Manatee query engine. The data thus correspond to the corpus available via query interface to registered users of the CNC with one important exception: they are shuffled, i.e. divided into blocks sized max. 100 words (respecting the sentence boundaries) whose ordering was randomized within the given document. and MSM0021620823 – Český národní korpus a korpusy dalších jazyků