Segment from Československý zvukový týdeník Aktualita (Czechoslovak Aktualita Sound Newsreel) 1942, issue no. 17, captures the presentation of a gift Ï Ambulance Train no. 751 Ï from the Protectorate of Bohemia and Moravia to Adolf Hitler and the German army. The train handover took place at Prague Main Railway Station on 20 April 1942, the birthday of Adolf Hitler. Cars arrive in front of Prague Main Railway Station. Acting Reich Protector Reinhard Heydrich enters the train station. State President Emil Hácha gives a speech in the festively decorated railway hall. In response, Heydrich shakes his hand. The event is witnessed by a delegation of railway workers. The train crew lines up on the station platform. Heydrich enters the train with his entourage and inspects the sleeping cars, the operating carriage, the kitchen, and the sick bay. The inspection of the ambulance train is attended by Protectorate Prime Minister Jaroslav Krejčí and Minister of Education and People´s Enlightenment Emanuel Moravec. According to the voiceover, the train was made in a railway workshop in Prague-Bubny in record time. It consisted of 28 carriages and 20 hospital carriages, was 410 metres long, weighed 545 tons and had capacity for 280 wounded.
This is an open dataset of sentences from 19th and 20th century letterpress reprints of documents from the Hussite era. The dataset contains a corpus for language modeling and human annotations for named entity recognition (NER).
This is an open dataset of sentences from 19th and 20th century letterpress reprints of documents from the Hussite era. The dataset contains a corpus for language modeling and human annotations for named entity recognition (NER).
This is an open dataset of scanned images and OCR texts from 19th and 20th century letterpress reprints of documents from the Hussite era. The dataset contains human annotations for layout analysis, OCR evaluation, and language identification.
These are supplementary materials for an open dataset of scanned images and OCR texts from 19th and 20th century letterpress reprints of documents from the Hussite era. The dataset contains human annotations for layout analysis, OCR evaluation, and language identification and is available at http://hdl.handle.net/11234/1-4615. These supplementary materials contain OCR texts from different OCR engines for book pages for which we have both high-resolution scanned images and annotations for OCR evaluation.
Segment from Český zvukový týdeník Aktualita (Czech Aktualita Sound Newsreel) 1942, issue no. 27A, captures the Pledge of Czech Theatre Professionals´ Allegiance to the Reich, a manifestation held at the National Theatre in Prague on 25 June 1942, which was to unequivocally condemn the assassination of Acting Reich Protector Reinhard Heydrich. Speeches are delivered by actor Rudolf Deyl Jr. and Minister of Education and People´s Enlightenment Emanuel Moravec (silent). Actress Růžena Nasková and actors Karel Höger, Ferenc Futurista, and Stanislav Neumann are seen among the participants. The segment concludes with everyone performing the Nazi salute.
This small dataset contains 3 speech corpora collected using the Alex Translate telephone service (https://ufal.mff.cuni.cz/alex#alex-translate).
The "part1" and "part2" corpora contain English speech with transcriptions and Czech translations. These recordings were collected from users of the service. Part 1 contains earlier recordings, filtered to include only clean speech; Part 2 contains later recordings with no filtering applied.
The "cstest" corpus contains recordings of artificially created sentences, each containing one or more Czech names of places in the Czech Republic. These were recorded by a multinational group of students studying in Prague.
Segment from Československý zvukový týdeník Aktualita (Czechoslovak Aktualita Sound Newsreel) 1941, issue no. 40, captures events linked to the accession of SS-Obergruppenführer Reinhard Heydrich to the office of Deputy Reich Protector of the Protectorate of Bohemia and Moravia on 27 September 1941. Heydrich attends an SS military parade on Hradčanské Square in Prague. Military dignitaries and state officials welcome him in the first quadrangle of Prague Castle. The Nazi flag flies over Prague Castle. Reich Commissioner for the Sudetenland Konrad Henlein and Reich Secretary Karl Hermann Frank are present at the occasion. State President Emil Hácha receives Reinhard Heydrich at Prague Castle.
Additional three Czech reference translations of the whole WMT 2011 data set (http://www.statmt.org/wmt11/test.tgz), translated from the German originals. Original segmentation of the WMT 2011 data is preserved. and This project has been sponsored by the grants GAČR P406/11/1499 and EuroMatrixPlus (FP7-ICT-2007-3-231720 of the EU and 7E09003+7E11051 of the Ministry of Education, Youth and Sports of the Czech Republic)
Lexical network AdjDeriNet consists of pairs of base adjectives and their derivatives. It contains nearly 18 thousand base adjectives that are base words for more than 26 thousand lexemes of several parts of speech.
Painter Adolf Hoffmeister on Bohumil Veselý's balcony. Hoffmeister in a fragmented segment from Československé filmové noviny (Czechoslovak Film News) 1952, issue no. 26.
Corpus AKCES 2 consists of trancripts of recordings of classes at Czech elementary and secondary schools (AKCES/CLAC - Czech Language Acquisition Corpora). It is the same data as the corpus "Schola 2010" (see the link for search), but all the proper names have been removed in order to protect the privacy of participants. and MŠMT (MSM0021620825), UK (PRVOUK P 10)
Corpus AKCES 2 ver. 2 consists of full, unabridged trancripts of recordings of classes at Czech elementary and secondary schools (AKCES/CLAC - Czech Language Acquisition Corpora). It is the same data as the corpus "Schola 2010" (see the link for search), but all the proper names have been removed in order to protect the privacy of participants. and UK, PRVOUK P10
Corpus AKCES 3 includes texts written in czech by non-native speakers (AKCES/CLAC - Czech Language Acquisition Corpora) and ESF (OPVK CZ.1.07/2.2.00/07.0259), MŠMT (MSM0021620825), UK (P10)
Corpus AKCES 4 includes texts written in czech by youth growing up in locations at risk of social exclusion (AKCES/CLAC - Czech Language Acquisition Corpora) and ESF (OPVK CZ.1.07/2.2.00/07.0259), MŠMT (MSM0021620825), UK (P10)
Essays written by non-native learners of Czech, a part of AKCES/CLAC – Czech Language Acquisition Corpora. CzeSL-SGT stands for Czech as a Second Language with Spelling, Grammar and Tags. Extends the “foreign” (ciz) part of AKCES 3 (CzeSL-plain) by texts collected in 2013. Original forms and automatic corrections are tagged, lemmatized and assigned erros labels. Most texts have metadata attributes (30 items) about the author and the text.
Essays written by non-native learners of Czech, a part of AKCES/CLAC – Czech Language Acquisition Corpora. CzeSL-SGT stands for Czech as a Second Language with Spelling, Grammar and Tags. Extends the “foreign” (ciz) part of AKCES 3 (CzeSL-plain) by texts collected in 2013. Original forms and automatic corrections are tagged, lemmatized and assigned erros labels. Most texts have metadata attributes (30 items) about the author and the text.
In addition to a few minor bugs, fixes a critical issue in Release 1: the native speakers of Ukrainian (s_L1:"uk") were wrongly labelled as speakers of "other European languages" (s_L1_group="IE"), instead of speakers of a Slavic language (s_L1_group="S"). The file is now a regular XML document, with all annotation represented as XML attributes.
AKCES-GEC is a grammar error correction corpus for Czech generated from a subset of AKCES. It contains train, dev and test files annotated in M2 format.
Note that in comparison to CZESL-GEC dataset, this dataset contains separated edits together with their type annotations in M2 format and also has two times more sentences.
If you use this dataset, please use following citation:
@article{naplava2019wnut,
title={Grammatical Error Correction in Low-Resource Scenarios},
author={N{\'a}plava, Jakub and Straka, Milan},
journal={arXiv preprint arXiv:1910.00353},
year={2019}
}
Ornithologist Alfréd Hořice with his collection of stuffed birds in a fragmented segment from Československý zvukový týdeník Aktualita (Czechoslovak Aktualita Sound Newsreel) 1945, issue no. 19.
Cold water swimmer Alfréd Nikodém as the oldest participant in a swimming race in the Vltava River in a segment from Československý zvukový týdeník Aktualita (Czechoslovak Aktualita Sound Newsreel) 1942, issue no. 36. Nikodém with a bouquet of flowers by Svatopluk Čech Bridge.
Conductor Alois Klíma conducts the Prague Radio Symphony Orchestra at the Prague Spring International Music Festival in a segment from Československé filmové noviny (Czechoslovak Film News) 1952, issue no. 24. The orchestra performs Bedřich Smetana´s symphony Tábor.
The segment of Československý zvukový týdeník Aktualita (Czechoslovak Aktualita Sound Newsreel), 1938, issue no. 23 captures Peter Hletko´s speech on the importance of the Pittsburgh Agreement and continues with a report on the meeting between the five-member delegation of the American Slovak League, led by Peter Hletko, and President Edvard Beneš, which was held in Prague on 30 May 1938.
The PARSEME shared task aims at identifying verbal MWEs in running texts. Verbal MWEs include idioms (let the cat out of the bag), light verb constructions (make a decision), verb-particle constructions (give up), and inherently reflexive verbs (se suicider 'to suicide' in French). VMWEs were annotated according to the universal guidelines in 18 languages. The corpora are provided in the parsemetsv format, inspired by the CONLL-U format.
For most languages, paired files in the CONLL-U format - not necessarily using UD tagsets - containing parts of speech, lemmas, morphological features and/or syntactic dependencies are also provided. Depending on the language, the information comes from treebanks (e.g., Universal Dependencies) or from automatic parsers trained on treebanks (e.g., UDPipe).
This item contains training and test data, tools and the universal guidelines file.
Annotated corpus of 350 decision of Czech top-tier courts (Supreme Court, Supreme Administrative Court, Constitutional Court).
Every decision is annotated by two trained annotators and then manually adjudicated by one trained curator to solve possible disagreements between annotators. Adjudication was conducted non-destructively, therefore dataset contains all original annotations.
Corpus was developed as training and testing material for reference recognition tasks. Dataset contains references to other court decisions and literature. All references consist of basic units (identifier of court decision, identification of court issuing referred decision, author of book or article, title of book or article, point of interest in referred document etc.), values (polarity, depth of discussion etc.).
Annotated corpus of 350 decision of Czech top-tier courts (Supreme Court, Supreme Administrative Court, Constitutional Court).
Every decision is annotated by two trained annotators and then manually adjudicated by one trained curator to solve possible disagreements between annotators. Adjudication was conducted non-destructively, therefore corpus (raw) contains all original annotations.
Corpus was developed as training and testing material for reference recognition tasks. Dataset contains references to other court decisions and literature. All references consist of basic units (identifier of court decision, identification of court issuing referred decision, author of book or article, title of book or article, point of interest in referred document etc.), values (polarity, depth of discussion etc.).
Annotated corpus of 350 decision of Czech top-tier courts (Supreme Court, Supreme Administrative Court, Constitutional Court).
280 decisions were annotated by one trained annotator and then manually adjudicated by one trained curator. 70 decisions were annotated by two trained annotators and then manually adjudicated by one trained curator. Adjudication was conducted destructively, therefore dataset contains only the correct annotations and does not contain all original annotations.
Corpus was developed as training and testing material for text segmentation tasks. Dataset contains decision segmented into Header, Procedural History, Submission/Rejoinder, Court Argumentation, Footer, Footnotes, and Dissenting Opinion. Segmentation allows to treat different parts of text differently even if it contains similar linguistic or other features.
We defined 58 dramatic situations and annotated them in 19 play scripts. Then we selected only 5 well-recognized dramatic situations and annotated further 33 play scripts. In this version of the data, we release only play scripts that can be freely distributed, which is 9 play scripts. One play is annotated independently by three annotators.
Antonín Martin Brousil, the vice-chancellor of Prague's Academy of Performing Arts, and Mexican actress Rosaura Revueltas at the 1954 Karlovy Vary International Film Festival in a fragmented segment from the weekly film newsreel.
Painter Antonín Pelc with his wife Jarmila Záhořová in the studio in a segment from Československé filmové noviny (Czechoslovak Film News) 1952, issue no. 43. The painter in his studio on the day of his 60th birthday in a segment from Československý filmový týdeník (Czechoslovak Film Weekly Newsreel) 1955, issue no. 4.
Artificially created treebank of elliptical constructions (gapping), in the annotation style of Universal Dependencies. Data taken from UD 2.1 release, and from large web corpora parsed by two parsers. Input data are filtered, sentences are identified where gapping could be applied, then those sentences are transformed, one or more words are omitted, resulting in a sentence with gapping. Details in Droganova et al.: Parse Me if You Can: Artificial Treebanks for Parsing Experiments on Elliptical Constructions, LREC 2018, Miyazaki, Japan.
This dataset contains a number of user product reviews which are publicly available on the website of an established Czech online shop with electronic devices. Each review consists of negative and positive aspects of the product. This setting pushes the customer to rate important characteristics.
We have selected 2000 positive and negative segments from these reviews and manually tagged their targets. Additionally, we selected 200 of the longest reviews and annotated them in the same way. The targets were either aspects of the evaluated product or some general attributes (e.g. price, ease of use).
This record contains audio recordings of proceedings of the Chamber of Deputies of the Parliament of the Czech Republic. The recordings have been provided by the official websites of the Chamber of Deputies, and the set contains them in their original format with no further processing.
Recordings cover all available audio files from 2013-11-25 to 2023-07-26. Audio files are packed by year (2013-2023) and quarter (Q1-Q4) in tar archives audioPSP-YYYY-QN.tar.
Furthermore, there are two TSV files: audioPSP-meta.quarterArchive.tsv contains metadata about archives, and audioPSP-meta.audioFile.tsv contains metadata about individual audio files.
This dataset contains automatic paraphrases of Czech official reference translations for the Workshop on Statistical Machine Translation shared task. The data covers the years 2011, 2013 and 2014.
For each sentence, at most 10000 paraphrases were included (randomly selected from the full set).
The goal of using this dataset is to improve automatic evaluation of machine translation outputs.
If you use this work, please cite the following paper:
Tamchyna Aleš, Barančíková Petra: Automatic and Manual Paraphrases for MT Evaluation. In proceedings of LREC, 2016.
Automatically generated spelling correction corpus for Czech (Czesl-SEC-AG) is a corpus containg text with automatically generated spelling errors. To create spelling errors, a character error model containing probabilities of character substitution, insertion, deletion and probabilities of swaping two adjacent characters is used. Besides these probabilities, also the probabilities of changing character casing are considered. The original clean text on which the spelling errors were generated is PDT3.0 (http://hdl.handle.net/11858/00-097C-0000-0023-1AAF-3). The original train/dev/test sentence split of PDT3.0 corpus is preserved in this dataset.
Besides the data with artificial spelling errors, we also publish texts from which the character error model was created. These are the original manual transcript of an audiobook Švejk and its corrected version performed by authors of Korektor (http://ufal.mff.cuni.cz/korektor). These data are similarly to CzeSL Grammatical Error Correction Dataset (CzeSL-GEC: http://hdl.handle.net/11234/1-2143) processed into four sets based on error difficulty present.
Sculptor Bohumil Kafka works on a statue of Josef Mánes in a fragmented segment from the Ufa žurnál (Ufa Journal) 1939, issue no. 200. The unveiling of the monument by the Rudolfinum, including a speech by Professor Vratislav Nechleba in a fragmented segment from Československé filmové noviny (Czechoslovak Film News) 1951, issue no. 52. Kafka at Prague Zoo working on a study of a lion for the Milan Rastislav Štefánik's monument in a segment from Československý filmový týdeník (Czechoslovak Film Weekly Newsreel) 1937, issue no. 5. Kafka with politician Milan Hodža in the artist´s studio in Prague-Dejvice.
A large web corpus (over 10 billion tokens) licensed under CreativeCommons license family in 50+ languages that has been extracted from CommonCrawl, the largest publicly available general Web crawl to date with about 2 billion crawled URLs.
A large web corpus (over 10 billion tokens) licensed under CreativeCommons license family in 50+ languages that has been extracted from CommonCrawl, the largest publicly available general Web crawl to date with about 2 billion crawled URLs.
A large web corpus (over 10 billion tokens) licensed under CreativeCommons license family in 50+ languages that has been extracted from CommonCrawl, the largest publicly available general Web crawl to date with about 2 billion crawled URLs.
A large web corpus (over 10 billion tokens) licensed under CreativeCommons license family in 50+ languages that has been extracted from CommonCrawl, the largest publicly available general Web crawl to date with about 2 billion crawled URLs.
A large web corpus (over 10 billion tokens) licensed under CreativeCommons license family in 50+ languages that has been extracted from CommonCrawl, the largest publicly available general Web crawl to date with about 2 billion crawled URLs.
A large web corpus (over 10 billion tokens) licensed under CreativeCommons license family in 50+ languages that has been extracted from CommonCrawl, the largest publicly available general Web crawl to date with about 2 billion crawled URLs.
A large web corpus (over 10 billion tokens) licensed under CreativeCommons license family in 50+ languages that has been extracted from CommonCrawl, the largest publicly available general Web crawl to date with about 2 billion crawled URLs.
Relationship extraction models for the Czech language. Models are trained on CERED (dataset created by distant supervision on Czech Wikipedia and Wikidata) and recognize a subset of Wikidata relations (listed in CEREDx.LABELS).
We supply a demo.py that performs inference on user-defined input and requirements.txt file for pip. Adapt the demo code to use the model.
Both the dataset and the models are presented in Relationship Extraction thesis.