The database contains two sets of recordings, both made in moving or stationary vehicles (passenger cars or trucks). All data were recorded within the project “Intelligent Electronic Record of the Operation and Vehicle Performance”, whose aim is to develop voice-operated software for registering vehicle operation data.
The first set (full_noises.zip) consists of relatively long recordings from the vehicle cabin, containing spontaneous speech from the vehicle crew. The recordings are accompanied by detailed transcripts in the Transcriber XML-based format (.trs). Due to the recording settings, the audio contains many different noises, only sparsely interspersed with speech. As such, the set is suitable for robust estimation of voice activity detector parameters.
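As a rough illustration of how the .trs transcripts could supply speech intervals for tuning a voice activity detector, the sketch below reads `<Turn>` elements with their `startTime` and `endTime` attributes, as defined by the Transcriber format. It assumes that turns without a `speaker` attribute mark non-speech stretches, which is common Transcriber practice but has not been verified for this set; the function name and the sample transcript are hypothetical.

```python
import xml.etree.ElementTree as ET


def speech_turns(trs_xml):
    """Return (start, end, speaker) for every <Turn> that names a speaker.

    Assumption: turns without a speaker attribute contain no speech.
    """
    root = ET.fromstring(trs_xml)
    turns = []
    for turn in root.iter("Turn"):
        speaker = turn.get("speaker")
        if not speaker:
            continue  # treated as a noise-only stretch (assumption)
        start = float(turn.get("startTime"))
        end = float(turn.get("endTime"))
        turns.append((start, end, speaker))
    return turns


# Hypothetical, heavily shortened transcript used only for illustration.
SAMPLE = """<Trans><Episode>
<Section type="report" startTime="0" endTime="12.5">
<Turn startTime="0" endTime="7.2"><Sync time="0"/></Turn>
<Turn speaker="spk1" startTime="7.2" endTime="12.5"><Sync time="7.2"/>hypothetical utterance</Turn>
</Section>
</Episode></Trans>"""

if __name__ == "__main__":
    for start, end, speaker in speech_turns(SAMPLE):
        print(f"{speaker}: {start:.1f}-{end:.1f} s")
```

The complement of the returned intervals within each recording can then serve as noise-only material, provided the assumption about speakerless turns holds for the actual files.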
The second set (prompts.zip) consists of short prompts recorded in a controlled setting: the speakers either answered simple questions or repeated commands and short phrases. The prompts were recorded by 26 different speakers. Each speaker recorded at least two sessions with an identical set of prompts: the first in a stationary vehicle with a low level of noise (these recordings are marked by –A_ in the file name) and the second while actually driving the car (marked by –B_ or, since several speakers recorded three sessions, by –C_). The recordings from this set are suitable mainly for training a robust domain-specific speech recognizer and for ASR testing.
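The session markers make it straightforward to separate low-noise recordings from driving recordings, for example when building matched training and test conditions. The following is a minimal sketch under stated assumptions: the marker is a dash (hyphen or en dash) followed by A, B or C and an underscore, as described above, and the example file names are hypothetical because the actual naming scheme is not given here.

```python
import re
from collections import defaultdict

# Accept both an ASCII hyphen and an en dash before the session letter,
# since the exact dash character in the real file names is not certain.
SESSION_RE = re.compile(r"[-\u2013]([ABC])_")


def session_of(file_name):
    """Return the session letter 'A', 'B' or 'C', or None if no marker is found."""
    match = SESSION_RE.search(file_name)
    return match.group(1) if match else None


def group_by_condition(file_names):
    """Group file names into 'stationary' (A), 'driving' (B, C) and 'unknown'."""
    groups = defaultdict(list)
    for name in file_names:
        session = session_of(name)
        if session == "A":
            groups["stationary"].append(name)
        elif session in ("B", "C"):
            groups["driving"].append(name)
        else:
            groups["unknown"].append(name)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    names = ["spk01-A_001.wav", "spk01-B_001.wav", "spk02-C_014.wav", "readme.txt"]
    print(group_by_condition(names))
```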
Dancer Štěpánka Klimešová-Poláková dances in a Cupid costume. The artist on her wedding day on 7 November 1926 in front of the Church of St. Wenceslas in Prague-Smíchov.
Unedited film footage from a visit of a sixteen-member delegation from the Kingdom of Serbs, Croats and Slovenes to Czechoslovakia on the occasion of the first anniversary of the dissolution of the Austro-Hungarian Empire and the establishment of the Czechoslovak Republic. The delegation was led by Serbian General Stevan Hadžić. Footage from the railway station in Tábor. Welcome by Mayor Josef Šáda, his deputies, Sokol representatives and the district governor. Arrival of the train in Benešov. Welcome by Mayor František Novotný, his deputies and Sokol representatives. The train driving through the Královské Vinohrady (Royal Vineyards) railway station in Prague below Nuselské schody and the Vinohrady tunnel. Welcome at the Wilson Railway Station in Prague. General Hadžić departs in a car driving along Wilson Street towards Wenceslaus Square. His car is followed by the Kornilovs, legionnaires and Sokols on horseback. Welcoming crowds on Wenceslaus Square. Arrival at the first courtyard of Prague Castle. Footage from military manoeuvres between Milovice and Lipnice forests near Milovice that took place on 29 October 1918 under the command of General Bossi. The manoeuvres are attended by Colonel Kušakovic. The delegation in the courtyard of the Škoda factory in Pilsen. General Stevan Hadžić decorates a military battalion during the renaming ceremony of the 102nd Infantry Regiment to Czechoslovak Infantry Regiment No. 48 Yugoslavia in Benešov.
Segment from Český zvukový týdeník Aktualita (Czech Aktualita Sound Newsreel) issue no. 46B from 1943 presents footage of the voluntary work and help with harvesting organised by the Board of Trustees for the Education of Youth as part of mandatory service. Older teenagers worked at railway stations, unloading potatoes.
STYX 1.0 is a corpus of Czech sentences selected from the Prague Dependency Treebank (PDT). The criterion for including sentences in STYX was their suitability for practicing Czech morphology and syntax in elementary schools. The sentences contain both the PDT annotations and the school sentence analyses. The school sentence analyses were created by transforming the PDT annotations using handcrafted rules. Altogether, the STYX 1.0 corpus contains 11 655 sentences.
Originally, the STYX 1.0 corpus was an inseparable part of the Styx system (http://hdl.handle.net/11858/00-097C-0000-0001-48FB-F).
This entry contains the SumeCzech dataset and the metric RougeRAW used for evaluation. Both the dataset and the metric are described in the paper "SumeCzech: Large Czech News-Based Summarization Dataset" by Milan Straka et al.
The dataset is distributed as a set of Python scripts which download the raw HTML pages from CommonCrawl and then process them into the required format.
The MPL 2.0 license applies to the scripts downloading the dataset and to the RougeRAW implementation.
Note: sumeczech-1.0-update-230225.zip is the updated release of the SumeCzech download script, including the original RougeRAW evaluation metric. The download script was modified to use the updated CommonCrawl download URL and to support Python 3.10 and Python 3.11. The downloaded dataset, however, is still exactly the same. The original archive sumeczech-1.0.zip was renamed to sumeczech-1.0-obsolete-180213.zip and is kept for reference.