The corpus contains video files of Czech Television News Broadcasts and JSON files with annotations of faces that appear in the broadcasts. The annotations are composed of frames in which a face is seen, name of the person whose face is seen, gender of the person (male/female), and the image region containing the face. The intended use of the corpus is to train models of faces for face detection, face identification, face verification, and face tracking. For convinience two different JSON files are provided. They contain the same data, but in different arrangements. One file has the identity of the person on the top, the other has the object ID on the top, where the object is a facetrack. A demo python skript is available for showing how to access the data.
The MLASK corpus consists of 41,243 multi-modal documents – video-based news articles in the Czech language – collected from Novinky.cz (https://www.novinky.cz/) and Seznam Zprávy (https://www.seznamzpravy.cz/). It was introduced in "MLASK: Multimodal Summarization of Video-based News Articles" (Krubiński & Pecina, EACL 2023). The articles' publication dates range from September 2016 to February 2022.
The intended use case of the dataset is to model the task of multimodal summarization with multimodal output: based on a pair of a textual article and a short video, a textual summary is generated, and a single frame from the video is chosen as a pictorial summary.
Each document consists of the following:
- a .mp4 video
- a single image (cover picture)
- the article's text
- the article's summary
- the article's title
- the article's publication date
All of the videos are re-sampled to 25 fps and resized to the same resolution of 1280x720p. The maximum length of the video is 5 minutes, and the shortest one is 7 seconds. The average video duration is 86 seconds.
The quantitative statistics of the lengths of titles, abstracts, and full texts (measured in the number of tokens) are below. Q1 and Q3 denote the first and third quartiles, respectively.
/ - / mean / Q1 / Median / Q3 /
/ Title / 11.16 ± 2.78 / 9 / 11 / 13 /
/ Abstract / 33.40 ± 13.86 / 22 / 32 / 43 /
/ Article / 276.96 ± 191.74 / 154 / 231 / 343 /
The proposed training/dev/test split follows the chronological ordering based on publication data. We use the articles published in the first half (Jan-Jun) of 2021 for validation (2,482 instances) and the ones published in the second half (Jul-Dec) of 2021 and the beginning (Jan-Feb) of 2022 for testing (2,652 instances). The remaining data is used for training (36,109 instances).
The textual data is shared as a single .tsv file. The visual data (video+image) is shared as a single archive for validation and test splits, and the one from the training split is partitioned based on the publication date.
This is an XML dataset of 17 lecture recordings randomly sampled from the lectures recorded at the Faculty of Informatics, Brno, Czechia during 2010–2016. We drew a stratified sample of up to 25 video frames from each recording. In each video frame, we annotated lit projection screens and their condition. For each lit projection screen, we annotated lecture materials shown in the screen. The dataset contains 699 projection screen annotations, and 925 lecture materials.