This small dataset contains 3 speech corpora collected using the Alex Translate telephone service (https://ufal.mff.cuni.cz/alex#alex-translate).
The "part1" and "part2" corpora contain English speech with transcriptions and Czech translations. These recordings were collected from users of the service. Part 1 contains earlier recordings, filtered to include only clean speech; Part 2 contains later recordings with no filtering applied.
The "cstest" corpus contains recordings of artificially created sentences, each containing one or more Czech names of places in the Czech Republic. These were recorded by a multinational group of students studying in Prague.
We present a test corpus of audio recordings and transcriptions of presentations of students' enterprises, together with their slides and web pages. The corpus is intended for the evaluation of automatic speech recognition (ASR) systems, especially in conditions where prior availability of in-domain vocabulary and named entities is beneficial.
The corpus consists of 39 presentations in English, each up to 90 seconds long, and slides and web-pages in Czech, Slovak, English, German, Romanian, Italian or Spanish.
The speakers are high school students from European countries with English as their second language.
We benchmark three baseline ASR systems on the corpus and show that none of them performs flawlessly.
The corpus contains recordings of communication between air traffic controllers and pilots. The speech is manually transcribed and labeled with information about the speaker (pilot or controller, not the full identity of the person). The corpus is currently small (20 hours), but we plan to search for additional data next year. The audio data format is 8 kHz, 16-bit PCM, mono. Funding acknowledgement: Technology Agency of the Czech Republic, project No. TA01030476.
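The stated audio format of the air traffic control recordings (8 kHz, 16-bit PCM, mono) can be checked before processing. The following is a minimal sketch using Python's standard `wave` module. It assumes the recordings are stored in WAV containers, which the description does not state explicitly; the function name `matches_format` and its parameters are hypothetical and not part of the corpus.

```python
import wave


def matches_format(path, rate=8000, sample_width_bytes=2, channels=1):
    """Return True if the WAV file at `path` has the expected PCM parameters.

    Defaults correspond to 8 kHz, 16-bit (2 bytes per sample), mono.
    Assumption: the file is a PCM WAV file; the `wave` module raises
    wave.Error for files it cannot read.
    """
    with wave.open(path, "rb") as wav:
        return (
            wav.getframerate() == rate
            and wav.getsampwidth() == sample_width_bytes
            and wav.getnchannels() == channels
        )
```

Passing `rate=44100` instead would check for 44.1 kHz, 16-bit, mono files.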
Balaxan is the first speech corpus of Kurmanji Kurdish, with 58 utterances by speakers of Kurmanji. The utterances are divided into 4 categories based on their sentence structure: Declarative, Imperative, Interrogative, and Exclamatory. The corpus has subtitles in both Kurmanji (Latin alphabet) and English.
The corpus contains speech data from two native Czech speakers, one male and one female. The speech ranges from very precisely articulated to hyper-articulated, and the speech rate is low. Speech with such emphasized articulation is suitable for teaching Czech to foreigners, and it can also be used for people with hearing or speech impairments. The recorded sentences can be used either directly, e.g., as part of educational material, or as source data for building complex educational systems incorporating speech synthesis technology. All recorded sentences were precisely orthographically annotated and phonetically segmented, i.e., split into phones, using modern neural network-based methods.
The package contains Czech recordings from the Visual History Archive, which consists of interviews with Holocaust survivors. The archive consists of audio recordings, four types of automatic transcripts, manual annotations of selected topics, and interview metadata. In total, the archive contains 353 recordings and 592 hours of interviews.
The corpus consists of recordings from the Chamber of Deputies of the Parliament of the Czech Republic. It currently consists of 88 hours of speech data, which corresponds to roughly 0.5 million tokens. The annotation process is semi-automatic: we are able to perform speech recognition on the data with high accuracy (over 90%) and then align the resulting automatic transcripts with the speech. The annotator’s task is then to check the transcripts, correct errors, add proper punctuation, and label speech sections with information about the speaker. The resulting corpus is therefore suitable both for acoustic model training for ASR and for training speaker identification and/or verification systems. The archive contains 18 sound files (WAV PCM, 16-bit, 44.1 kHz, mono) and corresponding transcriptions in the XML-based standard Transcriber format (http://trans.sourceforge.net).
The airing date of a particular recording is encoded in the filename in the form SOUND_YYMMDD_*. Note that the recordings are usually aired in the early morning on the day following the actual Parliament session. If a recording is too long to fit in the broadcasting scheme, it is divided into several parts that are aired on consecutive days.
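The SOUND_YYMMDD_* naming scheme can be decoded programmatically. The sketch below is illustrative only: the function names and the example filename are hypothetical, and the two-digit year is interpreted by Python's `%y` rule (00-68 map to 2000-2068, 69-99 to 1969-1999). Because recordings are only "usually" aired the day after the session, and long recordings are split over several days, the session date returned here is an estimate rather than a certainty.

```python
import re
from datetime import date, datetime, timedelta
from typing import Optional

# Matches the documented prefix: SOUND_, six digits (YYMMDD), underscore.
_PATTERN = re.compile(r"^SOUND_(\d{6})_")


def airing_date(filename: str) -> Optional[date]:
    """Extract the airing date from a SOUND_YYMMDD_* filename, or None."""
    match = _PATTERN.match(filename)
    if match is None:
        return None
    try:
        return datetime.strptime(match.group(1), "%y%m%d").date()
    except ValueError:
        # Six digits that do not form a valid calendar date.
        return None


def likely_session_date(filename: str) -> Optional[date]:
    """Estimate the session date as the day before airing (an assumption)."""
    aired = airing_date(filename)
    return None if aired is None else aired - timedelta(days=1)
```

For a hypothetical file named `SOUND_120315_001.wav`, `airing_date` yields 15 March 2012 and `likely_session_date` yields 14 March 2012.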
The corpus contains Czech expressive speech recorded by a professional female speaker using a scenario-based approach. The scenario was created on the basis of previously recorded natural dialogues between a computer and seniors.
Funding acknowledgement: European Commission Sixth Framework Programme, Information Society Technologies Integrated Project IST-34434.
The corpus contains Czech speech of laryngectomy patients, recorded before the surgery that causes the loss of their voice, in order to preserve the voice for later use in a personalized text-to-speech system. Individual utterances were selected from the language by a special algorithm to cover as many phonetic and prosodic features as possible.
The corpus contains recordings of a male native speaker of Czech speaking English. The sentences read by the speaker come from the domain of air traffic control (ATC), specifically the messages used by plane pilots during routine flights. The text in the corpus originates from transcripts of real recordings, part of which has been released in LINDAT/CLARIN (http://hdl.handle.net/11858/00-097C-0000-0001-CCA1-0), and individual phrases were selected by a special algorithm described in Jůzová, M. and Tihelka, D.: Minimum Text Corpus Selection for Limited Domain Speech Synthesis (DOI 10.1007/978-3-319-10816-2_48). The corpus was used to create a limited-domain speech synthesis system capable of simulating a pilot's communication with an ATC officer.