FERNET-C5 is a monolingual BERT language representation model trained from scratch on the Czech Colossal Clean Crawled Corpus (C5), a Czech counterpart of the English C4 dataset. The training data contained almost 13 billion words (93 GB of text). The model has the same architecture as the original BERT model, i.e. 12 transformer blocks, 12 attention heads and a hidden size of 768 neurons. In contrast to Google's BERT models, we used SentencePiece tokenization instead of Google's internal WordPiece tokenization.
More details can be found in README.txt. A more detailed description is available at https://arxiv.org/abs/2107.10042
The same models are also released at https://huggingface.co/fav-kky/FERNET-C5
FicTree is a dependency treebank of Czech fiction manually annotated in the format of the analytical layer of the Prague Dependency Treebank. The treebank consists of 12,760 sentences (166,432 tokens). The texts come from eight literary works published in the Czech Republic between 1991 and 2007. The syntactic annotation of the treebank was first performed by two distinct parsers (MSTParser and MaltParser) trained on the PDT training data, then manually corrected. Any differences between the two versions were resolved manually (by another annotator).
The corpus is provided in a vertical format, where sentence boundaries are marked with a blank line. Every word form is written on a separate line, followed by five tab-separated attributes: lemma, tag, ID (word index in the sentence), head and deprel (analytical function, afun in the PDT formalism). The texts are shuffled in random chunks of at most 100 words (respecting sentence boundaries). Each chunk is provided as a separate file, with the suggested division into train, dev and test sets indicated by the file prefix.
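The vertical format described above can be read with a few lines of Python. The following is a minimal sketch, not part of the FicTree distribution: the `Token` class and the `read_vertical` function are hypothetical names, the sample sentences and the `TAG` placeholders are made up for illustration, and the code assumes that ID and head are integers and that every token line has exactly six tab-separated fields.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class Token:
    form: str    # word form
    lemma: str
    tag: str     # morphological tag
    id: int      # word index in the sentence
    head: int    # index of the governing word
    deprel: str  # analytical function (afun)


def read_vertical(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield one sentence (a list of Token) per blank-line-separated block."""
    sentence: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        # Assumption: exactly six tab-separated fields per token line.
        form, lemma, tag, idx, head, deprel = line.split("\t")
        sentence.append(Token(form, lemma, tag, int(idx), int(head), deprel))
    if sentence:
        # The last sentence may not be followed by a blank line.
        yield sentence


# Made-up sample input; "TAG" stands in for a real morphological tag.
sample = [
    "Pes\tpes\tTAG\t1\t2\tSb\n",
    "spí\tspát\tTAG\t2\t0\tPred\n",
    ".\t.\tTAG\t3\t0\tAuxK\n",
    "\n",
    "Prší\tpršet\tTAG\t1\t0\tPred\n",
    ".\t.\tTAG\t2\t0\tAuxK\n",
]
sentences = list(read_vertical(sample))
for sent in sentences:
    print([(t.form, t.head, t.deprel) for t in sent])
```

To read an actual chunk file, an open file object can be passed in place of `sample`, for example `read_vertical(open(path, encoding="utf-8"))`; the UTF-8 encoding is likewise an assumption.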
This reference corpus of written Slovenian is a precursor to the Gigafida corpora (see http://hdl.handle.net/11356/1320 for version 2.0).
It contains 600 million words and 738.5 million tokens. In terms of annotation, it is tagged for morphosyntactic descriptors (MSD tags) and lemmatised.
Segment from Český zvukový týdeník Aktualita (Czech Aktualita Sound Newsreel) issue no. 16B from 1945 captures an event organised by the Board of Trustees for the Education of Youth aimed against the mite infestation of bees. The Veterinary Laboratory of the City of Prague, where beekeepers had sent thirty bees from each beehive, sought different ways to stop the mite epidemic. Trained female instructors from the Board of Trustees for the Education of Youth helped with the research.
Filip Hauptmann, physical education promoter and an official of the Czechoslovak Sokol society, on a bench in the garden of his house. Hauptmann portrayed with four unidentified people: one woman and three men. One of the men gives Hauptmann Jiří Ota Parma's book Mořeplavci (Seafarers).
Segment from Czechoslovak Aktualita Sound Newsreel 1943, issue no. 24, depicts a memorial act in V Holešovičkách, the Prague street that was the scene of the assassination of Acting Reich Protector Heydrich. The event was held to commemorate the first anniversary of the assassination in June 1943. Footage from a ceremonial meeting held on 4 June 1943 in the Prague City Council's reception hall in the Municipal Library building. Deputy Mayor of Prague, J. Pfitzner, presents historian Josef Kliment with the City of Prague Foundation's Heydrich Memorial Award for spreading the ideas of the Reich. The event is attended by Prime Minister of the Protectorate Government J. Krejčí and Minister of Education and People's Enlightenment E. Moravec. Illustrative footage from a workers' holiday organized by the Heydrich Foundation for Workers' Recuperation, where the Protector's legacy is being commemorated. The speech is followed by holidaymakers performing the Nazi salute. A ceremony to honour Heydrich is held in the Spanish Hall of Prague Castle. Acting Reich Protector K. Daluege enters the hall with Heydrich's widow Lina and her children. The event is attended by K. H. Frank, E. Hácha, J. Krejčí, E. Moravec and workers' and peasants' representatives. The German Philharmonic performs Ludwig van Beethoven's symphonic work Coriolanus. The piece is followed by a speech by Daluege, which he symbolically concludes with "a greeting to my friend" Heydrich (authentic sound).