Segment from Československý zvukový týdeník Aktualita (Czechoslovak Aktualita Sound Newsreel) 1943, issue no. 11B, reports on a workers' holiday organized by the Reinhard Heydrich Foundation for Workers' Recuperation at the Gymnasion Health Resort in Jarov u Dolních Břežan. Workers do a warm-up exercise and practise the shot put. Everyone gets an apple as a snack. Minister of Agriculture and Forestry Adolf Hrubý comes for a visit.
Syntactic (including deep-syntactic - tectogrammatical) annotation of user-generated noisy sentences. The annotation was made on Czech-English and English-Czech Faust Dev/Test sets.
The English data includes manual annotations of English reference translations of Czech source texts. These texts were translated independently by two translators. After some necessary cleaning, 1000 segments were randomly selected for manual annotation. Both reference translations were annotated, which gives 2000 annotated segments in total.
The Czech data includes manual annotations of Czech reference translations of English source texts. These texts were translated independently by three translators. After some necessary cleaning, 1000 segments were randomly selected for manual annotation. All three reference translations were annotated, which gives 3000 annotated segments in total.
Faust is part of PDT-C 1.0 (http://hdl.handle.net/11234/1-3185).
This machine translation test set contains 2223 Czech sentences collected within the FAUST project (https://ufal.mff.cuni.cz/grants/faust, http://hdl.handle.net/11234/1-3308).
Each original (noisy) sentence was normalized (clean1 and clean2) and translated to English independently by two translators.
Dramaturgist Ferdinand Pujman at the funeral of writer Marie Pujmanová in Vyšehrad Cemetery in Prague in May 1958 in a fragmented segment from Československý filmový týdeník (Czechoslovak Film Weekly Newsreel) 1958, issue no. 22. Pujman with his son, translator Petr Pujman. Pujmanová celebrating her 60th birthday in a fragmented segment from Československý filmový týdeník (Czechoslovak Film Weekly Newsreel) 1953, issue no. 25. Poet Vítězslav Nezval is seen on her left behind the platform.
FERNET-C5 is a monolingual BERT language representation model trained from scratch on the Czech Colossal Clean Crawled Corpus (C5), a Czech counterpart of the English C4 dataset. The training data contained almost 13 billion words (93 GB of text). The model has the same architecture as the original BERT model, i.e. 12 transformer blocks, 12 attention heads and a hidden size of 768. In contrast to Google’s BERT models, we used SentencePiece tokenization instead of Google’s internal WordPiece tokenization.
More details can be found in README.txt. A more detailed description is available at https://arxiv.org/abs/2107.10042
The same models are also released at https://huggingface.co/fav-kky/FERNET-C5
FicTree is a dependency treebank of Czech fiction manually annotated in the format of the analytical layer of the Prague Dependency Treebank. The treebank consists of 12,760 sentences (166,432 tokens). The texts come from eight literary works published in the Czech Republic between 1991 and 2007. The syntactic annotation of the treebank was first performed by two distinct parsers (MSTParser and MaltParser) trained on the PDT training data, then manually corrected. Any differences between the two versions were resolved manually (by another annotator).
The corpus is provided in a vertical format, where sentence boundaries are marked with a blank line. Every word form is written on a separate line, followed by five tab-separated attributes: lemma, tag, ID (word index in the sentence), head and deprel (analytical function, afun in the PDT formalism). The texts are shuffled into random chunks of at most 100 words (respecting sentence boundaries). Each chunk is provided as a separate file, with the suggested division into train, dev and test sets given as a file-name prefix.
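As an illustration of the vertical format described above, the following is a minimal Python sketch of a reader. It assumes that the columns appear exactly in the order given (word form, lemma, tag, ID, head, deprel), that ID and head are integers, and that a head of 0 marks the sentence root; the function name parse_vertical, the sample sentences and the placeholder tag value TAG are hypothetical and not taken from the corpus.

```python
from typing import Dict, List

# Column order as described in the text: the word form plus five attributes.
FIELDS = ("form", "lemma", "tag", "id", "head", "deprel")


def parse_vertical(text: str) -> List[List[Dict[str, object]]]:
    """Split vertical-format text into sentences, each a list of token dicts."""
    sentences: List[List[Dict[str, object]]] = []
    current: List[Dict[str, object]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.split("\t")
        if len(parts) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} columns, got {len(parts)}: {line!r}")
        token: Dict[str, object] = dict(zip(FIELDS, parts))
        # Assumption: ID and head are integer word indices.
        token["id"] = int(parts[3])
        token["head"] = int(parts[4])
        current.append(token)
    if current:
        # The last sentence may not be followed by a blank line.
        sentences.append(current)
    return sentences


# Hypothetical sample; "TAG" stands in for a real morphological tag.
SAMPLE = (
    "Pes\tpes\tTAG\t1\t2\tSb\n"
    "spí\tspát\tTAG\t2\t0\tPred\n"
    "\n"
    "Prší\tpršet\tTAG\t1\t0\tPred\n"
)

sentences = parse_vertical(SAMPLE)
```

Each sentence comes back as a list of dictionaries, so the dependency tree can be rebuilt by following each token's head value to the token with the matching id in the same sentence.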
Segment from Český zvukový týdeník Aktualita (Czech Aktualita Sound Newsreel) issue no. 16B from 1945 captures an event organised by the Board of Trustees for the Education of Youth aimed against the mite infestation of bees. The Veterinary Laboratory of the City of Prague, where beekeepers had sent thirty bees from each beehive, sought different ways to stop the mite epidemic. Trained female instructors from the Board of Trustees for the Education of Youth helped with the research.
Segment from Czechoslovak Aktualita Sound Newsreel 1943, issue no. 24, depicts a memorial act in V Holešovičkách, the Prague street that was the scene of the assassination of Acting Reich Protector Heydrich. The event was held to commemorate the first anniversary of the assassination in June 1943. Footage from a ceremonial meeting held on 4 June 1943 in the Prague City Council's reception hall in the Municipal Library building. Deputy Mayor of Prague, J. Pfitzner, presents historian Josef Kliment with the City of Prague Foundation's Heydrich Memorial Award for spreading the ideas of the Reich. The event is attended by Prime Minister of the Protectorate Government J. Krejčí and Minister of Education and People's Enlightenment E. Moravec. Illustrative footage from a workers' holiday organized by the Heydrich Foundation for Workers' Recuperation, where the Protector's legacy is being commemorated. The speech is followed by holidaymakers performing the Nazi salute. A ceremony to honour Heydrich is held in the Spanish Hall of Prague Castle. Acting Reich Protector K. Daluege enters the hall with Heydrich's widow Lina and her children. The event is attended by K. H. Frank, E. Hácha, J. Krejčí, E. Moravec and workers' and peasants' representatives. The German Philharmonic performs Ludwig van Beethoven's symphonic work Coriolanus. The piece is followed by a speech by Daluege, which he symbolically concludes with "a greeting to my friend" Heydrich (authentic sound).
ForFun is a database of linguistic forms and their syntactic functions built from the multi-layer annotated corpora of Czech, the Prague Dependency Treebanks. The purpose of the Prague Database of Forms and Functions (ForFun) is to help linguists study the form-function relation, which we assume to be one of the principal tasks of both theoretical linguistics and natural language processing.
A prototypical question is "What purposes does the preposition 'po' serve?" or "What linguistic means in a sentence can express the meaning 'a destination of an action'?". There are almost 1500 distinct forms (besides the preposition 'po') and 65 distinct functions (besides 'destination').
The obituary of poet Fráňa Šrámek in a fragmented segment from Československé filmové noviny (Czechoslovak Film News) 1952, issue no. 30. Šrámek in archival footage.