This is a linguistically unannotated corpus of various historical texts written between 1543 and 1809.
The corpus consists of 3,428,618 words and is available for online browsing.
The corpus contains speech data from two native Czech speakers, one male and one female. The speech is articulated very precisely, in places hyper-articulated, and the speech rate is low. Speech data with such emphasized articulation is suitable for teaching Czech to foreigners, and it can also be used for people with hearing or speech impairments. The recorded sentences can be used either directly, e.g., as part of educational material, or as source data for building complex educational systems that incorporate speech synthesis technology. All recorded sentences were precisely orthographically annotated and phonetically segmented, i.e., split into phones, using modern neural network-based methods.
140 million words. The Corpus of the Contemporary Lithuanian Language, which comprises 160 million words, is a collection of texts designed to represent current Lithuanian. The corpus is compiled from material printed during Lithuania's independence period (since 1990) and is designed to represent as wide a range of contemporary written Lithuanian as possible. The largest part of the corpus consists of General Press (texts from regional and national newspapers), Popular Press, and Special Press (specialized newspapers and magazines). These texts are intended for general readers as well as specialists. The rest of the corpus consists of Fiction, Memoirs, other literature (scientific and popular), and various official texts. The larger part of the corpus is freely accessible for online search at http://donelaitis.vdu.lt.
The electronic version of the book “Corpus PAAU 1992: Descriptive Studies, Texts and Vocabulary” includes the texts that were the object of analysis in this project, as well as the vocabulary lists that make up the Corpus 92.