Annotated corpus of 350 decisions of Czech top-tier courts (Supreme Court, Supreme Administrative Court, Constitutional Court).
280 decisions were annotated by one trained annotator and then manually adjudicated by one trained curator. 70 decisions were annotated by two trained annotators and then manually adjudicated by one trained curator. Adjudication was conducted destructively, so the dataset contains only the correct annotations and does not contain all original annotations.
The corpus was developed as training and testing material for text segmentation tasks. The dataset contains decisions segmented into Header, Procedural History, Submission/Rejoinder, Court Argumentation, Footer, Footnotes, and Dissenting Opinion. Segmentation allows different parts of a text to be treated differently even if they contain similar linguistic or other features.
We defined 58 dramatic situations and annotated them in 19 play scripts. Then we selected only 5 well-recognized dramatic situations and annotated them in a further 33 play scripts. In this version of the data, we release only the play scripts that can be freely distributed, which amounts to 9 play scripts. One play is annotated independently by three annotators.
The segment from the 1942 Český zvukový týdeník Aktualita (Czech Aktualita Sound Newsreel) Issue No. 18 features the event Týden národního zdraví (A Week for National Health) organised by the Ministry of the Interior and the Health Institute of the Protectorate of Bohemia and Moravia from 3 to 10 May 1942. The official goal of the event was to advocate for the importance of healthcare. The report covers the establishment of anti-tuberculosis stations in a number of places around the Protectorate. Footage of patients having their height and weight measured. A showcase of how an X-ray station in Moravská Ostrava operates. Footage of doctors working with X-ray machines. A close-up of an X-ray image of the lungs. The segment includes footage of mobile X-ray cars set up for the treatment of child patients. Footage from a solarium intended for irradiating children with sunlamps.
Antonín Martin Brousil, the vice-chancellor of Prague's Academy of Performing Arts, and Mexican actress Rosaura Revueltas at the 1954 Karlovy Vary International Film Festival in a fragmented segment from the weekly film newsreel.
Painter Antonín Pelc with his wife Jarmila Záhořová in the studio in a segment from Československé filmové noviny (Czechoslovak Film News) 1952, issue no. 43. The painter in his studio on the day of his 60th birthday in a segment from Československý filmový týdeník (Czechoslovak Film Weekly Newsreel) 1955, issue no. 4.
Artificially created treebank of elliptical constructions (gapping), in the annotation style of Universal Dependencies. The data are taken from the UD 2.1 release and from large web corpora parsed by two parsers. The input data are filtered and sentences where gapping could be applied are identified; those sentences are then transformed by omitting one or more words, resulting in sentences with gapping. Details in Droganova et al.: Parse Me if You Can: Artificial Treebanks for Parsing Experiments on Elliptical Constructions, LREC 2018, Miyazaki, Japan.
This dataset contains a number of user product reviews which are publicly available on the website of an established Czech online shop selling electronic devices. Each review consists of negative and positive aspects of the product. This setting pushes the customer to rate important characteristics.
We have selected 2000 positive and negative segments from these reviews and manually tagged their targets. Additionally, we selected 200 of the longest reviews and annotated them in the same way. The targets were either aspects of the evaluated product or some general attributes (e.g. price, ease of use).
This record contains audio recordings of proceedings of the Chamber of Deputies of the Parliament of the Czech Republic. The recordings have been provided by the official websites of the Chamber of Deputies, and the set contains them in their original format with no further processing.
Recordings cover all available audio files from 2013-11-25 to 2023-07-26. Audio files are packed by year (2013-2023) and quarter (Q1-Q4) in tar archives audioPSP-YYYY-QN.tar.
Furthermore, there are two TSV files: audioPSP-meta.quarterArchive.tsv contains metadata about archives, and audioPSP-meta.audioFile.tsv contains metadata about individual audio files.
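As an illustration of the archive naming scheme described above, the following minimal sketch enumerates the expected archive file names for the covered period. It assumes that one archive exists for every quarter between the first and the last recording date, which the description does not state explicitly; the helper names quarter_of and archive_names are hypothetical and are not part of the dataset.

```python
from datetime import date


def quarter_of(d: date) -> int:
    """Return the calendar quarter (1-4) that a date falls into."""
    return (d.month - 1) // 3 + 1


def archive_names(start: date, end: date) -> list[str]:
    """List names following the audioPSP-YYYY-QN.tar pattern.

    Assumption: every quarter from the one containing `start` to the
    one containing `end` has an archive.
    """
    names = []
    year, quarter = start.year, quarter_of(start)
    last = (end.year, quarter_of(end))
    while (year, quarter) <= last:
        names.append(f"audioPSP-{year}-Q{quarter}.tar")
        quarter += 1
        if quarter > 4:
            year, quarter = year + 1, 1
    return names


# Date range taken from the record description.
names = archive_names(date(2013, 11, 25), date(2023, 7, 26))
print(names[0], names[-1], len(names))
```

The actual list of archives should be taken from audioPSP-meta.quarterArchive.tsv; the sketch only shows how the naming pattern maps dates to file names.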
This dataset contains automatic paraphrases of Czech official reference translations for the Workshop on Statistical Machine Translation shared task. The data covers the years 2011, 2013 and 2014.
For each sentence, at most 10000 paraphrases were included (randomly selected from the full set).
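The per-sentence cap can be pictured with the following minimal sketch. It is not the authors' actual selection procedure, which the description does not specify beyond "randomly selected"; the function name cap_paraphrases and the seed parameter are hypothetical, and uniform sampling without replacement is an assumption.

```python
import random

MAX_PARAPHRASES = 10_000  # cap stated in the dataset description


def cap_paraphrases(paraphrases, limit=MAX_PARAPHRASES, seed=None):
    """Keep all paraphrases if there are at most `limit` of them,
    otherwise draw `limit` of them uniformly without replacement
    (assumed sampling scheme)."""
    paraphrases = list(paraphrases)
    if len(paraphrases) <= limit:
        return paraphrases
    rng = random.Random(seed)
    return rng.sample(paraphrases, limit)


# Toy example: 25,000 placeholder paraphrases for one sentence.
full_set = [f"paraphrase {i}" for i in range(25_000)]
selected = cap_paraphrases(full_set, seed=0)
print(len(selected))
```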
The goal of using this dataset is to improve automatic evaluation of machine translation outputs.
If you use this work, please cite the following paper:
Tamchyna Aleš, Barančíková Petra: Automatic and Manual Paraphrases for MT Evaluation. In Proceedings of LREC, 2016.