Dancer Štěpánka Klimešová-Poláková dances in a Cupid costume. The artist on her wedding day on 7 November 1926 in front of the Church of St. Wenceslas in Prague-Smíchov.
Unedited film footage from a visit of a sixteen-member delegation from the Kingdom of Serbs, Croats and Slovenes to Czechoslovakia on the occasion of the first anniversary of the dissolution of the Austro-Hungarian Empire and the establishment of the Czechoslovak Republic. The delegation was led by Serbian General Stevan Hadžić. Footage from the railway station in Tábor. Welcome by Mayor Josef Šáda, his deputies, Sokol representatives and the district governor. Arrival of the train in Benešov. Welcome by Mayor František Novotný, his deputies and Sokol representatives. The train driving through the Královské Vinohrady (Royal Vineyards) railway station in Prague below Nuselské schody and the Vinohrady tunnel. Welcome at the Wilson Railway Station in Prague. General Hadžić departs in a car driving along Wilson Street towards Wenceslaus Square. His car is followed by the Kornilovs, legionnaires and Sokols on horseback. Welcoming crowds on Wenceslaus Square. Arrival at the first courtyard of Prague Castle. Footage from military manoeuvres between Milovice and Lipnice forests near Milovice that took place on 29 October 1918 under the command of General Bossi. The manoeuvres are attended by Colonel Kušakovic. The delegation in the courtyard of the Škoda factory in Pilsen. General Stevan Hadžić decorates a military battalion during the renaming ceremony of the 102nd Infantry Regiment to Czechoslovak Infantry Regiment No. 48 Yugoslavia in Benešov.
Segment from Český zvukový týdeník Aktualita (Czech Aktualita Sound Newsreel) issue no. 46B from 1943 presents footage of the voluntary work and help with harvesting organised by the Board of Trustees for the Education of Youth as part of mandatory service. Older teenagers worked at railway stations, unloading potatoes.
SFST is a finite state transducer toolkit for the implementation of morphologies and other applications of finite state transducers. SFST comprises a compiler and several tools for transforming, printing and applying transducers.
STYX 1.0 is a corpus of Czech sentences selected from the Prague Dependency Treebank (PDT). The criterion for including a sentence in STYX was its suitability for practicing Czech morphology and syntax in elementary schools. The sentences contain both the PDT annotations and the school sentence analyses. The school sentence analyses were created by transforming the PDT annotations using handcrafted rules. Altogether the STYX 1.0 corpus contains 11 655 sentences.
Originally, the STYX 1.0 corpus was an inseparable part of the Styx system (http://hdl.handle.net/11858/00-097C-0000-0001-48FB-F).
This entry contains the SumeCzech dataset and the metric RougeRAW used for evaluation. Both the dataset and the metric are described in the paper "SumeCzech: Large Czech News-Based Summarization Dataset" by Milan Straka et al.
The dataset is distributed as a set of Python scripts which download the raw HTML pages from CommonCrawl and then process them into the required format.
The MPL 2.0 license applies to the scripts downloading the dataset and to the RougeRAW implementation.
Note: sumeczech-1.0-update-230225.zip is the updated release of the SumeCzech download script, including the original RougeRAW evaluation metric. The download script was modified to use the updated CommonCrawl download URL and to support Python 3.10 and Python 3.11. The downloaded dataset itself is still exactly the same. The original archive sumeczech-1.0.zip was renamed to sumeczech-1.0-obsolete-180213.zip and is kept for reference.
SumeCzech-NER
SumeCzech-NER contains named entity annotations of SumeCzech 1.0 (Straka et al. 2018, SumeCzech: Large Czech News-Based Summarization Dataset).
Format
The dataset is split into four files in JSON Lines (jsonl) format, with one JSON object on each line. The most important fields of each JSON object are:
- dataset: train, dev, test, oodtest
- ne_abstract: list of named entity annotations of the article's abstract
- ne_headline: list of named entity annotations of the article's headline
- ne_text: list of named entity annotations of the article's text
- url: the article's URL, which can be used to match articles across SumeCzech and SumeCzech-NER
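The fields above suggest a simple way to load a file and look up annotations by URL. The following is a minimal sketch, not the authors' tooling: the helper names `read_jsonl` and `index_by_url`, the example URLs, and the assumption that each `ne_*` field is a flat list of tag strings are all hypothetical and used only for illustration.

```python
import json
from typing import Dict, Iterable, Iterator


def read_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def index_by_url(records: Iterable[dict]) -> Dict[str, dict]:
    """Map each record's `url` field to the record itself."""
    return {record["url"]: record for record in records}


# Hypothetical sample lines standing in for a real file; the tag strings
# are placeholders, not values copied from the dataset.
sample_lines = [
    '{"dataset": "train", "url": "http://example.com/a", "ne_headline": ["B-P", "I-P", "O"]}',
    '{"dataset": "test", "url": "http://example.com/b", "ne_headline": ["O", "B-G"]}',
]

ner_by_url = index_by_url(read_jsonl(sample_lines))
headline_tags = ner_by_url["http://example.com/a"]["ne_headline"]
```

With a real file, the same generator can be fed an open file handle, for example `read_jsonl(open(path, encoding="utf-8"))`, and the resulting dictionary can be joined with SumeCzech records on the shared `url` field.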
Annotations
We used a spaCy NER model trained on the CoNLL-based extended CNEC 2.0. The model achieved an F-score of 78.45 on the test set of that dataset. The annotations are in IOB2 format. The entity types are: Numbers in addresses, Geographical names, Institutions, Media names, Artifact names, Personal names, and Time expressions.
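In IOB2, the first token of every entity is tagged `B-<type>`, each following token of the same entity is tagged `I-<type>`, and tokens outside any entity are tagged `O`. The sketch below groups such a tag sequence into entity spans. It is an illustration only: the function name `iob2_to_spans` is hypothetical, and the type labels `P` and `G` are placeholders, since the exact label strings used in the files are not stated here.

```python
from typing import List, Optional, Tuple


def iob2_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Convert IOB2 tags to (label, start, end) spans; `end` is exclusive."""
    spans: List[Tuple[str, int, int]] = []
    start: Optional[int] = None
    label: Optional[str] = None
    for i, tag in enumerate(tags):
        # An open span continues only on an I- tag with the same label.
        continues = tag.startswith("I-") and label == tag[2:]
        if label is not None and not continues:
            spans.append((label, start, i))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        elif tag.startswith("I-") and label is None:
            # Lenient handling (an assumption): a stray I- tag opens a span.
            start, label = i, tag[2:]
    if label is not None:
        spans.append((label, start, len(tags)))
    return spans


# Placeholder labels: "P" and "G" are illustrative, not taken from the data.
example_tags = ["B-P", "I-P", "O", "B-G"]
example_spans = iob2_to_spans(example_tags)
```

The spans index into the token list, so they are only meaningful when the tokens come from the same tokenization that was used for annotation.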
Tokenization
We used the following Python code for tokenization:
from typing import List

from nltk.tokenize import word_tokenize


def tokenize(text: str) -> List[str]:
    for mark in ('.', ',', '?', '!', '-', '–', '/'):
        text = text.replace(mark, f' {mark} ')
    tokens = word_tokenize(text)
    return tokens