This package provides an evaluation framework together with training and test data for semi-automatic recognition of sections of historical diplomatic manuscripts. The data collection consists of 57 Latin charters of 7 different types issued by the Royal Chancellery. The documents were created in the era of John the Blind, King of Bohemia (1310–1346) and Count of Luxembourg. The manuscripts were digitized and transcribed, and the typical sections of medieval charters ('corroboratio', 'datatio', 'dispositio', 'inscriptio', 'intitulatio', 'narratio', and 'publicatio') were manually tagged. The manuscripts also contain additional metadata, such as manually marked named entities and short Czech abstracts.
Recognition models are first trained on the manually marked sections of the training documents; the trained model can then be used to recognize sections in the test data. The parsing script supports methods based on cosine distance, TF-IDF weighting, and an adapted Viterbi algorithm.
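To illustrate how TF-IDF weighting and cosine distance can be combined for section recognition, the following is a minimal sketch, not the package's parsing script. It assigns a tokenized segment the label of the closest training segment. The function names, the smoothed IDF formula, and the toy Latin snippets are all assumptions made for illustration; they are not taken from the charters or from the package.

```python
import math
from collections import Counter

# Toy training segments, invented for illustration only (not from the corpus).
TOY_TRAIN = [
    ("intitulatio", "nos johannes dei gracia boemie rex".split()),
    ("datatio", "datum prage anno domini millesimo".split()),
    ("publicatio", "notum facimus tenore presencium universis".split()),
]

def fit_idf(token_lists):
    """Compute a smoothed inverse document frequency for every training term."""
    n = len(token_lists)
    df = Counter(t for tokens in token_lists for t in set(tokens))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}

def vectorize(tokens, idf):
    """Build a sparse TF-IDF vector; terms unseen in training are ignored."""
    return {t: tf * idf[t] for t, tf in Counter(tokens).items() if t in idf}

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two sparse vectors (1.0 if either is empty)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(
        sum(w * w for w in b.values())
    )
    return 1.0 - dot / norm if norm else 1.0

def predict_section(tokens, train, idf):
    """Return the label of the training segment closest to the query segment."""
    query = vectorize(tokens, idf)
    return min(
        train, key=lambda item: cosine_distance(query, vectorize(item[1], idf))
    )[0]

IDF = fit_idf([tokens for _, tokens in TOY_TRAIN])
print(predict_section("datum anno domini".split(), TOY_TRAIN, IDF))
```

A nearest-neighbour rule is used here only because it is the simplest way to show the two ingredients; it does not model the order of sections within a charter, which is presumably what a Viterbi-style method would account for.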
Segment from Český zvukový týdeník Aktualita (Czech Aktualita Sound Newsreel) issue no. 17A, B from 1944 was shot during a meeting of the leaders of the Board of Trustees for the Education of Youth with State Minister Karl Hermann Frank, which was held in the Great Hall of Czernin Palace on 17 April and attended by Minister of Education and People's Enlightenment and Chairman of the Board Emanuel Moravec and General Secretary of the Board František Teuner. State Minister Frank addressed the participants and presented František Teuner with a sword of honour. The official event concluded with the participants paying homage to Adolf Hitler. The leaders of the Board of Trustees marched through the streets of Prague.
Segment from Český zvukový týdeník Aktualita (Czech Aktualita Sound Newsreel) issue no. 52B from 1943 was shot during a meeting of the Board of Trustees for the Education of Youth, which was held in Lucerna Palace in Prague on 15 December 1943. The event was organised as part of the struggle against bolshevism. The event was attended by Prime Minister of the Protectorate Government Jaroslav Krejčí, Minister of Education and People's Enlightenment and Chairman of the Board Emanuel Moravec, Minister of Finance Josef Kalfus and other guests, who watched the proceedings from the boxes overlooking the packed Great Hall. Journalist Karel Werner and General Secretary of the Board František Teuner addressed speeches to the audience. The ceremony was concluded with an oath "to the Führer and to the Fatherland".
Segment from Český zvukový týdeník Aktualita (Czech Aktualita Sound Newsreel) issue no. 51A, B from 1944 was shot during an anti-bolshevist gathering of the Board of Trustees for the Education of Youth held in the Great Hall of Lucerna Palace on 7 December. The participants included Minister of Education and People's Enlightenment and Chairman of the Board Emanuel Moravec. Speeches were given by General Secretary of the Board František Teuner and Eriks Rullis, the leader of Latvian Youth. The gathering was concluded with an oath "to the Führer and to the Fatherland".
Segment from Český zvukový týdeník Aktualita (Czech Aktualita Sound Newsreel) issue no. 21A, B from 1944 captures a solemn ceremony organised by the Board of Trustees for the Education of Youth to mark the 60th anniversary of Bedřich Smetana's death, which took place at the composer's grave on 11 May. The fanfare from the opera Dalibor was followed by a speech by General Secretary of the Board František Teuner, who also laid a wreath on Smetana's grave. The choir of Prague teachers under the baton of Metod Doležal sang the chorus called "Dowry". The ceremony was concluded with the Nazi salute.
Segment from Československý zvukový týdeník Aktualita (Czechoslovak Aktualita Sound Newsreel) 1942, issue no. 12, captures the Memorial Day of Heroes events held as part of the celebration of the 3rd anniversary of the Protectorate of Bohemia and Moravia at the German Opera in Prague on 15 March 1942. The gathering was to commemorate the alleged affiliation of the Czech Crown lands to the German Empire. In his speech, Acting Reich Protector Reinhard Heydrich highlights the political significance of 15 March (silent). The event is attended by Reich Secretary Karl Hermann Frank, Reich Commissioner for the Sudetenland Konrad Henlein, Reich Gauleiter Hugo Jury, and the chief of the Wehrmacht troops in Prague, General Rudolf Toussaint.
Footage of actress Míla Spazierová-Hezká at the Secondary School of Decorative Arts shown with her own portrait in a segment from Československý zvukový týdeník Aktualita (Czechoslovak Aktualita Sound Newsreel) 1942, issue no. 28.
Music historian Mirko Očadlík on Bohumil Veselý's balcony. Očadlík in footage shot inside the Smetana Museum in Prague in a segment from Československý filmový týdeník (Czechoslovak Film Weekly Newsreel) 1952, issue no. 17.
The MLASK corpus consists of 41,243 multi-modal documents – video-based news articles in the Czech language – collected from Novinky.cz (https://www.novinky.cz/) and Seznam Zprávy (https://www.seznamzpravy.cz/). It was introduced in "MLASK: Multimodal Summarization of Video-based News Articles" (Krubiński & Pecina, EACL 2023). The articles' publication dates range from September 2016 to February 2022.
The intended use case of the dataset is to model the task of multimodal summarization with multimodal output: based on a pair of a textual article and a short video, a textual summary is generated, and a single frame from the video is chosen as a pictorial summary.
Each document consists of the following:
- a .mp4 video
- a single image (cover picture)
- the article's text
- the article's summary
- the article's title
- the article's publication date
All of the videos are re-sampled to 25 fps and resized to the same resolution of 1280x720. The longest video is 5 minutes and the shortest is 7 seconds; the average video duration is 86 seconds.
The quantitative statistics of the lengths of titles, abstracts, and full texts (measured in the number of tokens) are below. Q1 and Q3 denote the first and third quartiles, respectively.
|          | Mean            | Q1  | Median | Q3  |
|----------|-----------------|-----|--------|-----|
| Title    | 11.16 ± 2.78    | 9   | 11     | 13  |
| Abstract | 33.40 ± 13.86   | 22  | 32     | 43  |
| Article  | 276.96 ± 191.74 | 154 | 231    | 343 |
The proposed training/dev/test split follows the chronological ordering based on publication date. We use the articles published in the first half (Jan-Jun) of 2021 for validation (2,482 instances) and the ones published in the second half (Jul-Dec) of 2021 and the beginning (Jan-Feb) of 2022 for testing (2,652 instances). The remaining data is used for training (36,109 instances).
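The chronological rule above can be sketched as a small function that maps a publication date to a split. This is an illustrative sketch only: the function name `assign_split` and the constant names are hypothetical, and the exact day boundaries (for example, treating 28 February 2022 as the last test day) are an assumption derived from the month ranges stated above.

```python
from datetime import date

# Boundaries derived from the month ranges in the description (assumed inclusive).
DEV_START = date(2021, 1, 1)    # validation: Jan-Jun 2021
TEST_START = date(2021, 7, 1)   # test: Jul 2021 - Feb 2022
TEST_END = date(2022, 2, 28)

# Split sizes as reported in the description.
SPLIT_SIZES = {"train": 36_109, "dev": 2_482, "test": 2_652}

def assign_split(published: date) -> str:
    """Map a publication date to 'train', 'dev', or 'test'."""
    if published < DEV_START:
        return "train"
    if published < TEST_START:
        return "dev"
    if published <= TEST_END:
        return "test"
    raise ValueError(f"{published} is later than the corpus date range")

# The three reported split sizes add up to the reported corpus size.
print(sum(SPLIT_SIZES.values()))  # 41243
print(assign_split(date(2021, 3, 15)))  # dev
```

Splitting by date rather than at random keeps every test article later than every training article, so a model is never evaluated on news that precedes its training data.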
The textual data is shared as a single .tsv file. The visual data (video+image) for the validation and test splits is shared as a single archive each, while the visual data for the training split is partitioned into several archives based on the publication date.