dc.contributor.author Krubiński, Mateusz
dc.contributor.author Pecina, Pavel
dc.date.accessioned 2023-11-02T15:01:55Z
dc.date.available 2023-11-02T15:01:55Z
dc.date.issued 2023
dc.identifier.uri http://hdl.handle.net/11234/1-5135
dc.description The MLASK corpus consists of 41,243 multi-modal documents (video-based news articles in the Czech language) collected from Novinky.cz (https://www.novinky.cz/) and Seznam Zprávy (https://www.seznamzpravy.cz/). It was introduced in "MLASK: Multimodal Summarization of Video-based News Articles" (Krubiński & Pecina, EACL 2023). The articles' publication dates range from September 2016 to February 2022.

The intended use case of the dataset is to model the task of multimodal summarization with multimodal output: given a pair of a textual article and a short video, a textual summary is generated, and a single frame from the video is chosen as a pictorial summary.

Each document consists of the following:
- a .mp4 video
- a single image (cover picture)
- the article's text
- the article's summary
- the article's title
- the article's publication date

All of the videos are re-sampled to 25 fps and resized to a common resolution of 1280x720. The longest video is 5 minutes and the shortest is 7 seconds. The average video duration is 86 seconds.

The quantitative statistics of the lengths of titles, abstracts, and full texts (measured in the number of tokens) are below. Q1 and Q3 denote the first and third quartiles, respectively.

            Mean               Q1     Median   Q3
Title       11.16 ± 2.78       9      11       13
Abstract    33.40 ± 13.86      22     32       43
Article     276.96 ± 191.74    154    231      343

The proposed training/dev/test split follows the chronological ordering based on publication date. The articles published in the first half (Jan-Jun) of 2021 are used for validation (2,482 instances), and the ones published in the second half (Jul-Dec) of 2021 and the beginning (Jan-Feb) of 2022 are used for testing (2,652 instances). The remaining data is used for training (36,109 instances).

The textual data is shared as a single .tsv file. The visual data (video + image) is shared as a single archive each for the validation and test splits; the visual data for the training split is partitioned into several archives by publication date.
dc.language.iso ces
dc.publisher Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
dc.relation.isreferencedby https://aclanthology.org/2023.findings-eacl.67.pdf
dc.rights Seznam Dataset Licence
dc.rights.uri https://lindat.mff.cuni.cz/repository/xmlui/page/szn-dataset-licence
dc.source.uri https://github.com/ufal/MLASK
dc.subject Multimodal Summarization
dc.subject Summarization
dc.subject Video
dc.subject Image
dc.title MLASK: Multimodal Summarization of Video-based News Articles
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType video
dc.rights.label RES
has.files yes
branding LINDAT / CLARIAH-CZ
contact.person Mateusz Krubiński krubinski@ufal.mff.cuni.cz ÚFAL MFF UK
sponsor Czech Science Foundation 19-26934X Neural Representations in Multi-modal and Multi-lingual Modelling nationalFunds
sponsor CELSA 19/018 CELL - ContExtual machine Learning of Language translations Other
files.size 788465693932
files.count 8
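
The chronological split described in dc.description can be sketched as a small date rule. This is only an illustration under stated assumptions, not the authors' code: the function name `assign_split` and the split labels are hypothetical, and any date outside the validation and test windows is simply treated as training data. The instance counts are copied from the description and checked against the corpus total.

```python
from datetime import date

# Instance counts as stated in the record's description.
SPLIT_SIZES = {"train": 36109, "dev": 2482, "test": 2652}
CORPUS_SIZE = 41243


def assign_split(published: date) -> str:
    """Map a publication date to a split label (hypothetical helper).

    Validation: January-June 2021.
    Test: July 2021-February 2022.
    Everything else is assumed to belong to the training split.
    """
    if date(2021, 1, 1) <= published <= date(2021, 6, 30):
        return "dev"
    if date(2021, 7, 1) <= published <= date(2022, 2, 28):
        return "test"
    return "train"


# The three split sizes add up to the stated corpus size.
assert sum(SPLIT_SIZES.values()) == CORPUS_SIZE
```

The date boundaries follow the month ranges in the description; how the publication date is stored in the .tsv file is not stated in this record, so parsing it into a `date` is left to the reader.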


 Files in this item

This item is Restricted Use and licensed under: Seznam Dataset Licence
Name: dev_MLASK_visual_01-2021_06-2021.tar.gz
Size: 48.91 GB
Format: application/x-gzip
Description: dev_MLASK_visual_01-2021_06-2021
MD5: a149967314ed0943287e057fefcf8008
Name: test_MLASK_visual_07-2021_02-2022.tar.gz
Size: 53.22 GB
Format: application/x-gzip
Description: test_MLASK_visual_07-2021_02-2022
MD5: 70ec8735da4e9e2562cfb32a24abab76
Name: train_MLASK_visual_02-2019_09-2019.tar.gz
Size: 126.79 GB
Format: application/x-gzip
Description: train_MLASK_visual_02-2019_09-2019
MD5: 63b618313b64ff0326911761732eabc2
Name: train_MLASK_visual_08-2018_01-2019.tar.gz
Size: 108.28 GB
Format: application/x-gzip
Description: train_MLASK_visual_08-2018_01-2019
MD5: b1e00d27fc51f1fbbc01a5942a270ec4
Name: train_MLASK_visual_09-2016_09-2017.tar.gz
Size: 126.49 GB
Format: application/x-gzip
Description: train_MLASK_visual_09-2016_09-2017
MD5: c097fd27390b963aa2d65187daf446a5
Name: train_MLASK_visual_10-2017_07-2018.tar.gz
Size: 150.27 GB
Format: application/x-gzip
Description: train_MLASK_visual_10-2017_07-2018
MD5: f133dfb13d7dae8514b4505eb8db94e3
Name: train_MLASK_visual_10-2019_12-2020.tar.gz
Size: 120.33 GB
Format: application/x-gzip
Description: train_MLASK_visual_10-2019_12-2020
MD5: f60719543c1eb663c41cd63c36295c86
Name: MLASK_text_09-2016_02-2022.tsv.gz
Size: 37.31 MB
Format: application/x-gzip
Description: MLASK_text_09-2016_02-2022
MD5: be17314c4e62f9bae4f1c95bf8407cd2
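
Because the archives are large, it may be worth checking each download against the MD5 values listed above before unpacking. The following is a minimal sketch using Python's standard `hashlib`; the helper names `md5_of_file` and `verify`, and the choice to list only two of the eight files, are assumptions made for illustration. The remaining file names and checksums from the listing can be added to the dictionary in the same way.

```python
import hashlib
from pathlib import Path

# File names and MD5 checksums copied from the listing above
# (only two of the eight files are shown here for brevity).
EXPECTED_MD5 = {
    "MLASK_text_09-2016_02-2022.tsv.gz": "be17314c4e62f9bae4f1c95bf8407cd2",
    "dev_MLASK_visual_01-2021_06-2021.tar.gz": "a149967314ed0943287e057fefcf8008",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory):
    """Return {file name: True/False}; a missing file counts as a failed check."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5_of_file(path) == expected
    return results
```

Calling `verify("downloads")`, where `downloads` is a hypothetical directory holding the files, returns a dictionary that marks each listed file as matching or not. Reading in 1 MiB chunks keeps memory use flat even for the archives above 100 GB.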
