Segment of the Československý zvukový týdeník Aktualita (Czechoslovak Aktualita Sound Newsreel) 1939 No. 1 captures the funeral of writer Karel Čapek at Vyšehrad Cemetery in Prague on 29 December 1938. The coffin with the deceased is carried out of the Church of St. Peter and St. Paul and across the cemetery to the grave. Theatre director Vojta Novák delivers a speech at the grave. The coffin is lowered into the grave. The mourners include Karel Čapek's widow, actress and writer Olga Scheinpflugová, his brother-in-law, journalist Karel Scheinpflug, writer Ferdinand Peroutka, Karel Čapek's brother, painter and writer Josef Čapek, actor Hugo Haas, poet and theatre critic Hanuš Jelínek, poet Josef Hora, sociologist Miloslav Disman and others. The segment concludes with the Czech anthem.
Grammar Error Correction Corpus for Czech (GECCC) consists of 83 058 sentences and covers four diverse domains, including essays written by native students, informal website texts, essays written by Romani ethnic minority children and teenagers and essays written by nonnative speakers. All domains are professionally annotated for GEC errors in a unified manner, and errors were automatically categorized with a Czech-specific version of ERRANT released at https://github.com/ufal/errant_czech
The dataset was introduced in the paper Czech Grammar Error Correction with a Large and Diverse Corpus that was accepted to TACL. Until published in TACL, see the arXiv version: https://arxiv.org/pdf/2201.05590.pdf
This version fixes double annotation errors in train and dev M2 files, and also contains more metadata information.
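The M2 files mentioned above can be read with a few lines of code. The following is a minimal sketch, assuming the files follow the standard M2 layout used by ERRANT (an `S` line with the tokenized sentence, followed by `A` lines of the form `start end|||type|||correction|||REQUIRED|||-NONE-|||annotator`). The names `Edit`, `Sentence`, `parse_m2` and `apply_edits` are illustrative, and the sample sentence and its `SPELL` label are invented for the example, not taken from the corpus.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Edit:
    start: int        # index of the first token covered by the edit
    end: int          # index one past the last covered token
    error_type: str   # error category label
    correction: str   # replacement text (empty for a deletion)
    annotator: int    # annotator id from the last field


@dataclass
class Sentence:
    tokens: List[str]
    edits: List[Edit] = field(default_factory=list)


def parse_m2(text: str) -> List[Sentence]:
    """Parse M2-formatted text into sentences with their edits."""
    sentences: List[Sentence] = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("S "):
            current = Sentence(tokens=line[2:].split())
            sentences.append(current)
        elif line.startswith("A ") and current is not None:
            fields = line[2:].split("|||")
            start, end = (int(x) for x in fields[0].split())
            current.edits.append(
                Edit(start, end, fields[1], fields[2], int(fields[-1]))
            )
    return sentences


def apply_edits(sentence: Sentence, annotator: int = 0) -> str:
    """Return the corrected sentence according to one annotator."""
    tokens = list(sentence.tokens)
    edits = [
        e for e in sentence.edits
        if e.annotator == annotator and e.error_type != "noop"
    ]
    # Apply from right to left so earlier token offsets stay valid.
    for e in sorted(edits, key=lambda e: e.start, reverse=True):
        tokens[e.start:e.end] = e.correction.split()
    return " ".join(tokens)


# Invented sample, not from GECCC.
SAMPLE = (
    "S Dnes jsem šla do školi .\n"
    "A 4 5|||SPELL|||školy|||REQUIRED|||-NONE-|||0\n"
)

if __name__ == "__main__":
    for sent in parse_m2(SAMPLE):
        print(apply_edits(sent))
```

Edits are applied right to left so that token offsets of earlier edits are not shifted by later ones; sentences with several annotators can be corrected per annotator through the `annotator` argument.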
The segment of Československý zvukový týdeník Aktualita (Czechoslovak Aktualita Sound Newsreel) from late September 1938 captures the recording of a radio speech given by General Jan Syrový to accept his appointment to the office of Prime Minister on 22 September 1938, in which he responds to the national demonstration for the unity of Czechoslovakia held in front of the Parliament building in Prague. He urges the demonstrators, as well as all citizens, to remain calm and sensible and to return to work.
Czech TinyLlama (https://huggingface.co/BUT-FIT/CSTinyLlama-1.2B) and Czech GPT2 small (https://huggingface.co/lchaloupsky/czech-gpt2-oscar) models, fine-tuned to generate lyrics of song sections based on the provided syllable counts, keywords and rhyme scheme. The TinyLlama-based model yields better results; the GPT2-based model, however, can run locally.
Both models are discussed in the bachelor's thesis Generation of Czech Lyrics to Cover Songs.
The segment of Československý zvukový týdeník Aktualita (Czechoslovak Aktualita Sound Newsreel), 1938, issue no. 28 reports on the visit of Giuseppe Dalla Torre, the editor-in-chief of the Vatican City State's daily newspaper L'Osservatore Romano, to Czechoslovakia.
Annotated list of dependency bigrams occurring in the PDT more than five times and having part-of-speech patterns that can possibly form a collocation. Each bigram is assigned to one of the six MWE categories by three annotators.
A dataset of handwritten Czech text lines sourced from two chronicles (municipal chronicles 1931-1944, school chronicles 1913-1933).
The dataset comprises 25k lines machine-extracted from scanned pages and provides manually annotated text contents for a subset of 2k lines.
HamleDT (HArmonized Multi-LanguagE Dependency Treebank) is a compilation of existing dependency treebanks (or dependency conversions of other treebanks), transformed so that they all conform to the same annotation style. This version uses Universal Dependencies as the common annotation style.
Update (November 2017): for a current collection of harmonized dependency treebanks, we recommend using the Universal Dependencies (UD). All of the corpora that are distributed in HamleDT in full are also part of the UD project; only some corpora from the Patch group (where HamleDT provides only the harmonizing scripts but not the full corpus data) are available in HamleDT but not in UD.
Actress Hana Vítová in an unidentified German film (sound). Vítová with actor Oldřich Nový in Valentin Dobrotivý (Valentin the Good, dir. Martin Frič, 1942). Vítová with her husband, critic Bedřich Rádl, in a segment from Československý zvukový týdeník Aktualita (Czechoslovak Aktualita Sound Newsreel) 1942, issue no. 49.