The corpus contains a pronunciation lexicon and n-gram counts (unigrams, bigrams and trigrams) that can be used to construct a language model for the air traffic control communication domain. It can be used together with the Air Traffic Control Communication corpus (http://hdl.handle.net/11858/00-097C-0000-0001-CCA1-0). Acknowledged funding includes the Technology Agency of the Czech Republic, project No. TA01030476.
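As a rough illustration of how n-gram counts feed a language model, the sketch below computes an unsmoothed maximum-likelihood trigram probability, P(w3 | w1, w2) = c(w1, w2, w3) / c(w1, w2). The on-disk format of the counts is not described here, so the example assumes they have already been loaded into plain dictionaries keyed by word tuples. The function name `trigram_mle` and the toy counts are invented for illustration and are not taken from the corpus. A practical model would also add smoothing or back-off.

```python
def trigram_mle(trigram_counts, bigram_counts, w1, w2, w3):
    """Unsmoothed estimate of P(w3 | w1, w2) from raw counts.

    Returns 0.0 when the history (w1, w2) was never observed.
    """
    history = bigram_counts.get((w1, w2), 0)
    if history == 0:
        return 0.0
    return trigram_counts.get((w1, w2, w3), 0) / history


# Toy counts, invented for this example (not corpus data).
bigram_counts = {("climb", "to"): 4}
trigram_counts = {
    ("climb", "to", "flight"): 3,
    ("climb", "to", "level"): 1,
}

p = trigram_mle(trigram_counts, bigram_counts, "climb", "to", "flight")
print(p)  # 3 / 4 = 0.75
```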
The corpus consists of transcribed recordings from the Czech political discussion broadcast “Otázky Václava Moravce”. It contains 35 hours of speech and corresponding word-by-word transcriptions, including the transcription of some non-speech events. Speakers’ names are also assigned to the corresponding segments. The resulting corpus is suitable both for acoustic model training for ASR and for training speaker identification and/or verification systems. The archive contains 16 sound files (WAV PCM, 16-bit, 48 kHz, mono) and transcriptions in the XML-based Transcriber format (http://trans.sourceforge.net).
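A minimal sketch of reading speaker-labelled segments from a Transcriber file, using only the Python standard library. It assumes the usual Transcriber layout: `Speaker` elements with `id` and `name` attributes, and `Turn` elements with `speaker`, `startTime` and `endTime` attributes that contain `Sync` and `Event` markers mixed with the transcribed text. The embedded sample is invented and much simpler than a real file. The corpus files may also differ in details, for example overlapping speech, where a `Turn` can list several speaker ids.

```python
import xml.etree.ElementTree as ET

# Invented sample in the assumed Transcriber layout (not corpus data).
SAMPLE_TRS = """<Trans>
  <Speakers>
    <Speaker id="spk1" name="Speaker A"/>
    <Speaker id="spk2" name="Speaker B"/>
  </Speakers>
  <Episode>
    <Section type="report" startTime="0" endTime="5.0">
      <Turn speaker="spk1" startTime="0" endTime="2.5">
        <Sync time="0"/>
        dobrý den
      </Turn>
      <Turn speaker="spk2" startTime="2.5" endTime="5.0">
        <Sync time="2.5"/>
        dobrý den
        <Event desc="noise" type="noise" extent="instantaneous"/>
        děkuji
      </Turn>
    </Section>
  </Episode>
</Trans>
"""


def read_turns(xml_text):
    """Return a list of (speaker_name, start, end, text) tuples, one per Turn."""
    root = ET.fromstring(xml_text)
    # Map speaker ids to the names declared in the <Speakers> block.
    names = {s.get("id"): s.get("name") for s in root.iter("Speaker")}
    turns = []
    for turn in root.iter("Turn"):
        # itertext() collects the text around Sync/Event markers;
        # split/join collapses the surrounding whitespace.
        text = " ".join("".join(turn.itertext()).split())
        speaker_id = turn.get("speaker")
        speaker = names.get(speaker_id, speaker_id)
        turns.append(
            (speaker, float(turn.get("startTime")), float(turn.get("endTime")), text)
        )
    return turns


turns = read_turns(SAMPLE_TRS)
for speaker, start, end, text in turns:
    print(f"{start:.1f}-{end:.1f} {speaker}: {text}")
```

For a file on disk, `ET.parse(path).getroot()` can replace `ET.fromstring`, which lets the parser honour the encoding declared in the file. Note that this sketch drops non-speech events. They could be kept by inspecting the `Event` elements instead.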