- Title:
- SumeCzech
- Creator:
- Straka, Milan; Mediankin, Nikita; Kocmi, Tom; Žabokrtský, Zdeněk; Hudeček, Vojtěch; Hajič, Jan
- Contributor:
- Ministerstvo školství, mládeže a tělovýchovy České republiky, grant LM2015071, "LINDAT/CLARIN: Institut pro analýzu, zpracování a distribuci lingvistických dat" (funding type: nationalFunds); Ministerstvo školství, mládeže a tělovýchovy České republiky, grant CZ.02.1.01/0.0/0.0/16_013/0001781, "LINDAT/CLARIN - Výzkumná infrastruktura pro jazykové technologie - rozšíření repozitáře a výpočetní kapacity" (funding type: nationalFunds); Charles University, grant 8502/2016, "GAUK 8502/2016" (funding type: Other); Charles University, grant 1114217/2017, "GAUK 1114217/2017" (funding type: Other); Univerzita Karlova (mimo GAUK), grant SVV 260 453, "Specifický vysokoškolský výzkum" (funding type: nationalFunds)
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Identifier:
- Subject:
- summarization, SumeCzech, and Rouge
- Type:
- text and corpus
- Description:
- This entry contains the SumeCzech dataset and the metric RougeRAW used for evaluation. Both the dataset and the metric are described in the paper "SumeCzech: Large Czech News-Based Summarization Dataset" by Milan Straka et al. The dataset is distributed as a set of Python scripts which download the raw HTML pages from CommonCrawl and then process them into the required format. The MPL 2.0 license applies to the scripts downloading the dataset and to the RougeRAW implementation. Note: is the updated release of the SumeCzech download script, including the original RougeRAW evaluation metric. The download script was modified to use the updated CommonCrawl download URL and to support Python 3.10 and Python 3.11. However, the downloaded dataset is still exactly the same. The original archive was renamed to and is kept for reference.
- Language:
- Czech
- Rights:
- Mozilla Public License 2.0
- Relation:
- Harvested from:
- LINDAT/CLARIAH-CZ repository
- Metadata only:
- false
- Date:
- 2018-02-13