The database consists of three sets:
- Many Talker Set: 30 males, 30 females; each to read 50 numbers, 1-2 connected passages, 1 block of "filler" sentences, and 1 block of syllables.
- Few Talker Set: 4 males, 4 females; each to read 50 numbers, 10 connected passages, 1 block of "filler" sentences, and 2-3 blocks of syllables.
- Very Few Talker Set: 1 male, 1 female; each to read 2 blocks of 50 numbers, 40 connected passages, 4 blocks of "filler" sentences, and 9 blocks of syllables.
Total amount: ca. 12 hours of speech.
One million words of written and spoken English from Great Britain. Transcriptions aligned with digitised speech recordings. POS-tagged and parsed. Part of the International Corpus of English project. Custom-made search software: ICE-CUP.
1 million words of spoken and written English from the UK. POS-tagged and parsed. Digitised speech recordings aligned with text. Part of the International Corpus of English (ICE).
Parallel corpus, 3,297,283 words.
The idea was to create a small parallel corpus that would make it possible to work with entire texts in translation analysis rather than short extracts. At the same time, it aimed at acquiring experience that could be used in creating a larger parallel corpus of English and Czech in the future.
Although the main part of the work has been completed, and the aims of the KACENKA grant met, we keep improving and enlarging KACENKA gradually. It currently contains 3,297,283 words, of which 1,689,513 were acquired by scanning.
Most of the English texts for KACENKA were retrieved from Internet resources. The rest, and nearly all the Czech texts, had to be scanned using an OCR programme.
KACENKA is stored on a single CD-ROM; its use is limited by copyright restrictions.
Latvian fairytales and legends collected by Latvian folklorist Pēteris Šmits, published 1927-1938 (15 volumes). It is the largest published collection of Latvian folktales and legends.
A corpus of approximately 260,000 words of modern British narrative texts representing three text types (fiction, newspapers, biography), with detailed annotation for all forms of speech, thought and writing presentation that occur in the corpus. Available via OTA.