The corpus contains pronunciation lexicon and n-gram counts (unigrams, bigrams and trigrams) that can be used for constructing the language model for air traffic control communication domain. It could be used together with the Air Traffic Control Communication corpus (http://hdl.handle.net/11858/00-097C-0000-0001-CCA1-0). and Technology Agency of the Czech Republic, project No. TA01030476
The EBUContentGenre is a thesaurus containing the hierarchical description of various genres utilized in the TV broadcasting industry. This thesaurus is a part of a complex metadata specification called EBUCore intended for multifaceted description of audiovisual content. EBUCore (http://tech.ebu.ch/docs/tech/tech3293v1_3.pdf) is a set of descriptive and technical metadata based on the Dublin Core and adapted to media. EBUCore is the flagship metadata specification of European Broadcasting Union, the largest professional association of broadcasters around the world. It is developed and maintained by EBU's Technical Department (http://tech.ebu.ch). The translated thesaurus can be used for effective cataloguing of (mostly TV) audiovisual content and consequent development of systems for automatic cataloguing (topic/genre detection). and Technology Agency of the Czech Republic, project No. TA01011264
The FERNET-C5 is a monolingual BERT language representation model trained from scratch on the Czech Colossal Clean Crawled Corpus (C5) data - a Czech mutation of the English C4 dataset. The training data contained almost 13 billion words (93 GB of text data). The model has the same architecture as the original BERT model, i.e. 12 transformation blocks, 12 attention heads and the hidden size of 768 neurons. In contrast to Google’s BERT models, we used SentencePiece tokenization instead of the Google’s internal WordPiece tokenization.
More details can be found in README.txt. Yet more detailed description is available in https://arxiv.org/abs/2107.10042
The same models are also released at https://huggingface.co/fav-kky/FERNET-C5
This text corpus contains a carefully optimized set of sentences that could be used in the process of preparing a speech corpus for the development of personalized text-to-speech system. It was designed primarily for the voice conservation procedure that must be performed in a relatively short period before a person loses his/her own voice, typically because of the total laryngectomy.
Total laryngectomy is a radical treatment procedure which is often unavoidable to save life of patients who were diagnosed with severe laryngeal cancer. In spite of being very effective with respect to the primary treatment, it significantly handicaps the patients due to the permanent loss of their ability to use voice and produce speech. Luckily, the modern methods of computer text-to-speech (TTS) synthesis offer a possibility for "digital conservation" of patient's original voice for his/her future speech communication -- a procedure called voice banking or voice conservation. Moreover, the banking procedure can be undertaken by any person facing voice degradation or loss in farther future, or who is simply is willing to keep his/her voice-print.