Search Results
2. Prague Dependency Treebank - Consolidated 1.0 (PDT-C 1.0)
- Creator:
- Hajič, Jan, Bejček, Eduard, Bémová, Alevtina, Buráňová, Eva, Fučíková, Eva, Hajičová, Eva, Havelka, Jiří, Hlaváčová, Jaroslava, Homola, Petr, Ircing, Pavel, Kárník, Jiří, Kettnerová, Václava, Klyueva, Natalia, Kolářová, Veronika, Kučová, Lucie, Lopatková, Markéta, Mareček, David, Mikulová, Marie, Mírovský, Jiří, Nedoluzhko, Anna, Novák, Michal, Pajas, Petr, Panevová, Jarmila, Peterek, Nino, Poláková, Lucie, Popel, Martin, Popelka, Jan, Romportl, Jan, Rysová, Magdaléna, Semecký, Jiří, Sgall, Petr, Spoustová, Johanka, Straka, Milan, Straňák, Pavel, Synková, Pavlína, Ševčíková, Magda, Šindlerová, Jana, Štěpánek, Jan, Štěpánková, Barbora, Toman, Josef, Urešová, Zdeňka, Vidová Hladká, Barbora, Zeman, Daniel, Zikánová, Šárka, and Žabokrtský, Zdeněk
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- treebank, dependency, tectogrammatics, topic-focus articulation, multiword expressions, coreference, bridging relations, discourse, morphology, syntax, tokenization, lemmatization, semantic relations, lexical semantics, lexicon, valency, speech reconstruction, clauses, speech recognition, and spoken corpus
- Language:
- Czech
- Description:
- A richly annotated and genre-diversified language resource, the Prague Dependency Treebank – Consolidated 1.0 (PDT-C 1.0, hereafter PDT-C) is a consolidated release of the existing PDT corpora of Czech data, uniformly annotated using the standard PDT scheme. The PDT corpora included in PDT-C are: the Prague Dependency Treebank (the original PDT contents: written newspaper and journal texts from three genres); the Czech part of the Prague Czech-English Dependency Treebank (financial texts translated from English); the Prague Dependency Treebank of Spoken Czech (spoken data, including audio, transcripts, and multiple speech reconstruction annotation); and PDT-Faust (user-generated texts). PDT-C differs from the separately published original treebanks as follows: it is published as one package, which makes handling all the datasets easier; the data is enhanced with manual linguistic annotation at the morphological layer, and a new version of the morphological dictionary is enclosed; and a common valency lexicon for all four original parts is enclosed. The documentation provides two browsing and editing desktop tools (TrEd and MEd), and the corpus is also available online for searching using PML-TQ.
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
3. SumeCzech
- Creator:
- Straka, Milan, Mediankin, Nikita, Kocmi, Tom, Žabokrtský, Zdeněk, Hudeček, Vojtěch, and Hajič, Jan
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- summarization, SumeCzech, and Rouge
- Language:
- Czech
- Description:
- This entry contains the SumeCzech dataset and the RougeRAW metric used for evaluation. Both the dataset and the metric are described in the paper "SumeCzech: Large Czech News-Based Summarization Dataset" by Milan Straka et al. The dataset is distributed as a set of Python scripts that download the raw HTML pages from CommonCrawl and then process them into the required format. The MPL 2.0 license applies to the scripts downloading the dataset and to the RougeRAW implementation. Note: sumeczech-1.0-update-230225.zip is the updated release of the SumeCzech download script, including the original RougeRAW evaluation metric. The download script was modified to use the updated CommonCrawl download URL and to support Python 3.10 and Python 3.11. The downloaded dataset, however, is still exactly the same. The original archive sumeczech-1.0.zip was renamed to sumeczech-1.0-obsolete-180213.zip and is kept for reference.
- Rights:
- Mozilla Public License 2.0, http://opensource.org/licenses/MPL-2.0, and PUB