Harvested from: LINDAT/CLARIAH-CZ repository / Original context has metadata only: false

1331. STYX 1.0

Creator:: Hladká, Barbora, Kučera, Ondřej, and Kuchyňová, Karolína
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: annotated corpus, syntax, and sentence diagramming
Language:: Czech
Description:: STYX 1.0 is a corpus of Czech sentences selected from the Prague Dependency Treebank (PDT). Sentences were included in STYX if they were suitable for practicing Czech morphology and syntax in elementary schools. The sentences contain both the PDT annotations and the school sentence analyses. The school sentence analyses were created by transforming the PDT annotations using handcrafted rules. Altogether the STYX 1.0 corpus contains 11 655 sentences. Originally, the STYX 1.0 corpus was an inseparable part of the Styx system (http://hdl.handle.net/11858/00-097C-0000-0001-48FB-F).
Rights:: Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0), http://creativecommons.org/licenses/by-nc-nd/4.0/, and PUB

1332. STYX 1.0 (2017-10-03)

Creator:: Hladká, Barbora, Kučera, Ondřej, and Kuchyňová, Karolína
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: annotated corpus, syntax, and sentence diagramming
Language:: Czech
Description:: STYX 1.0 is a corpus of Czech sentences selected from the Prague Dependency Treebank (PDT). Sentences were included in STYX if they were suitable for practicing Czech morphology and syntax in elementary schools. The sentences contain both the PDT annotations and the school sentence analyses. The school sentence analyses were created by transforming the PDT annotations using handcrafted rules. Altogether the STYX 1.0 corpus contains 11 655 sentences. Originally, the STYX 1.0 corpus was an inseparable part of the Styx system (http://hdl.handle.net/11858/00-097C-0000-0001-48FB-F).
Rights:: Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0), http://creativecommons.org/licenses/by-nc-nd/4.0/, and PUB

1333. SumeCzech

Creator:: Straka, Milan, Mediankin, Nikita, Kocmi, Tom, Žabokrtský, Zdeněk, Hudeček, Vojtěch, and Hajič, Jan
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: summarization, SumeCzech, and Rouge
Language:: Czech
Description:: This entry contains the SumeCzech dataset and the RougeRAW metric used for evaluation. Both the dataset and the metric are described in the paper "SumeCzech: Large Czech News-Based Summarization Dataset" by Milan Straka et al. The dataset is distributed as a set of Python scripts which download the raw HTML pages from CommonCrawl and then process them into the required format. The MPL 2.0 license applies to the scripts downloading the dataset and to the RougeRAW implementation. Note: sumeczech-1.0-update-230225.zip is the updated release of the SumeCzech download script, including the original RougeRAW evaluation metric. The download script was modified to use the updated CommonCrawl download URL and to support Python 3.10 and Python 3.11. The downloaded dataset itself is unchanged. The original archive sumeczech-1.0.zip was renamed to sumeczech-1.0-obsolete-180213.zip and is kept for reference.
Rights:: Mozilla Public License 2.0, http://opensource.org/licenses/MPL-2.0, and PUB

1334. SumeCzech-NER

Creator:: Marek, Petr and Müller, Štěpán
Publisher:: Czech Technical University in Prague
Type:: text and corpus
Subject:: SumeCzech, named entity recognition, named entity corpus, and summarization
Language:: Czech
Description:: SumeCzech-NER contains named entity annotations of SumeCzech 1.0 (Straka et al. 2018, SumeCzech: Large Czech News-Based Summarization Dataset).

Format
The dataset is split into four files in JSONL format, with one JSON object on each line. The most important fields of the JSON objects are:
- dataset: train, dev, test, oodtest
- ne_abstract: list of named entity annotations of the article's abstract
- ne_headline: list of named entity annotations of the article's headline
- ne_text: list of named entity annotations of the article's text
- url: the article's URL, which can be used to match articles across SumeCzech and SumeCzech-NER

Annotations
We used spaCy's NER model trained on CoNLL-based extended CNEC 2.0. The model achieved a 78.45 F-score on the dataset's testing set. The annotations are in IOB2 format. The entity types are: Numbers in addresses, Geographical names, Institutions, Media names, Artifact names, Personal names, and Time expressions.

Tokenization
We used the following Python code for tokenization:

    from typing import List
    from nltk.tokenize import word_tokenize

    def tokenize(text: str) -> List[str]:
        # Pad selected punctuation marks with spaces so they become separate tokens.
        for mark in ('.', ',', '?', '!', '-', '–', '/'):
            text = text.replace(mark, f' {mark} ')
        tokens = word_tokenize(text)
        return tokens
Rights:: Mozilla Public License 2.0, http://opensource.org/licenses/MPL-2.0, and PUB
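
The JSONL layout and IOB2 annotations described for SumeCzech-NER can be consumed roughly as sketched below. This is a minimal illustration, not the authors' code. It assumes that each ne_* field is a list of IOB2 tag strings aligned one-to-one with the tokens; the entry does not spell this out. The helper names read_jsonl and iob2_spans, the tag label "P", and the example URL are hypothetical.

```python
import json
from typing import Iterable, Iterator, List, Tuple


def read_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one JSON object per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def iob2_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Convert IOB2 tags into (start, end_exclusive, entity_type) spans."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        # An I- tag of the same type continues the currently open entity.
        if tag.startswith("I-") and label == tag[2:]:
            continue
        # Any other tag closes the open entity, if there is one.
        if label is not None:
            spans.append((start, i, label))
            start, label = None, None
        # B- opens a new entity; a stray I- is leniently treated the same way.
        if tag.startswith("B-") or tag.startswith("I-"):
            start, label = i, tag[2:]
    if label is not None:
        spans.append((start, len(tags), label))
    return spans


# Hypothetical record; real records come from the four dataset files.
sample = '{"dataset": "train", "url": "http://example.com/a", "ne_headline": ["B-P", "I-P", "O"]}'
for record in read_jsonl([sample]):
    print(record["url"], iob2_spans(record["ne_headline"]))
```

To process a real file, an open file handle can be passed to read_jsonl in place of the list, since iterating over a file yields its lines.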

1335. Suzanne Marwille (actress)

Creator:: Veselý, Bohumil
Publisher:: Národní filmový archiv
Type:: video and clip
Subject:: film Varhaník u sv. Víta ukázka, film Černí myslivci ukázka, Galerie osobností, People::Marwille Suzanne (1895-1962), and Varhaník u sv. Víta
Language:: No linguistic content
Description:: Actress Suzanne Marwille in Varhaník u sv. Víta (The Organist at St. Vitus' Cathedral, dir. Martin Frič, 1929). Marwille in Černí myslivci (The Black Rangers, dir. Václav Binovec, 1921).
Rights:: http://creativecommons.org/licenses/by-nc-nd/4.0/, PUB, and Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)

1336. Svatopluk Innemann (director)

Creator:: Veselý, Bohumil
Publisher:: Národní filmový archiv
Type:: video and clip
Subject:: film Josef Kajetán Tyl ukázka, film Byl první máj ukázka, Galerie osobností, Places::Praha::Nové Město::Školská::pavlač domu, People::Innemann Svatopluk (1869-1945), People::Kavková Zdena (1896-1965), and Josef Kajetán Tyl
Language:: No linguistic content
Description:: Director Svatopluk Innemann with his wife, actress Zdena Kavková, on Bohumil Veselý's balcony. A clip from Josef Kajetán Tyl (dir. Svatopluk Innemann, 1925). Innemann in Byl první máj (It Was the First of May, dir. Thea Červenková, 1919).
Rights:: http://creativecommons.org/licenses/by-nc-nd/4.0/, PUB, and Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)

1337. SYN v4: large corpus of written Czech

Creator:: Křen, Michal, Cvrček, Václav, Čapka, Tomáš, Čermáková, Anna, Hnátková, Milena, Chlumská, Lucie, Jelínek, Tomáš, Kováříková, Dominika, Petkevič, Vladimír, Procházka, Pavel, Skoumalová, Hana, Škrabal, Michal, Truneček, Petr, Vondřička, Pavel, and Zasina, Adrian
Publisher:: Charles University, Faculty of Arts, Institute of the Czech National Corpus
Type:: text and corpus
Subject:: corpus and written language
Language:: Czech
Description:: Corpus of contemporary written (printed) Czech sized 3.6 GW (i.e. 4.3 billion tokens). It covers mostly the period of 1990–2014 and it is a traditional corpus (as opposed to web-crawled corpora) with rich metadata containing bibliographical information etc. Although it contains a wide range of text types (fiction, non-fiction, newspapers), newspapers prevail noticeably. The corpus is lemmatized and morphologically annotated by a combination of stochastic and rule-based methods. The corpus is provided in a (semi-XML) vertical format used as an input to the Manatee query engine. The data thus correspond to the corpus available via the KonText query interface to registered users of the CNC at http://www.korpus.cz with one important exception: the corpus is shuffled, i.e. divided into blocks sized max. 100 words (respecting the sentence boundaries) with ordering randomized within the given document.
Rights:: Czech National Corpus (Shuffled Corpus Data), https://lindat.mff.cuni.cz/repository/xmlui/page/license-cnc, and ACA

1338. SYN v9: large corpus of written Czech

Creator:: Křen, Michal, Cvrček, Václav, Henyš, Jan, Hnátková, Milena, Jelínek, Tomáš, Kocek, Jan, Kováříková, Dominika, Křivan, Jan, Milička, Jiří, Petkevič, Vladimír, Procházka, Pavel, Skoumalová, Hana, Šindlerová, Jana, and Škrabal, Michal
Publisher:: Charles University, Faculty of Arts, Institute of the Czech National Corpus
Type:: text and corpus
Subject:: corpus and written language
Language:: Czech
Description:: Corpus of contemporary written (printed) Czech sized 4.7 GW (i.e. 5.7 billion tokens). It covers mostly the 1990-2019 period and features rich metadata including detailed bibliographical information, text-type classification etc. SYN v9 contains a wide variety of text types (fiction, non-fiction, newspapers), but the newspapers prevail noticeably. The corpus is lemmatized and morphologically tagged by the new CNC tagset first utilized for the annotation of the SYN2020 corpus. SYN v9 is provided in a CoNLL-U-like vertical format used as an input to the Manatee query engine. The data thus correspond to the corpus available via the KonText query interface to the registered users of CNC at http://www.korpus.cz with one important exception: the corpus is shuffled, i.e. divided into blocks sized max. 100 words (respecting the sentence boundaries) with ordering randomized within the given document.
Rights:: Czech National Corpus (Shuffled Corpus Data), https://lindat.mff.cuni.cz/repository/xmlui/page/license-cnc, and ACA

1339. SYN2005: balanced corpus of written Czech

Creator:: Čermák, František, Hlaváčová, Jaroslava, Hnátková, Milena, Jelínek, Tomáš, Kocek, Jan, Kopřivová, Marie, Křen, Michal, Novotná, Renata, Petkevič, Vladimír, Schmiedtová, Věra, Skoumalová, Hana, Spoustová, Johanka, Šulc, Michal, and Velíšek, Zdeněk
Publisher:: Faculty of Arts, Institute of the Czech National Corpus, Charles University in Prague
Type:: text and corpus
Subject:: balanced corpus and written language
Language:: Czech
Description:: Balanced corpus of contemporary written Czech sized 100 MW. It was created as a representation of written language from 2000–2004 and thus it contains a wide range of text types and genres (fiction, professional literature, newspapers etc.) in balanced proportions. The corpus is lemmatized and morphologically tagged by a combination of stochastic and rule-based methods. The corpus is provided in a (semi-XML) vertical format used as an input to the Manatee query engine. The data thus correspond to the corpus available via query interface to registered users of the CNC with one important exception: they are shuffled, i.e. divided into blocks sized max. 100 words (respecting the sentence boundaries) whose ordering was randomized within the given document. Project: MSM0021620823 – Český národní korpus a korpusy dalších jazyků (Czech National Corpus and Corpora of Other Languages)
Rights:: Czech National Corpus (Shuffled Corpus Data), https://lindat.mff.cuni.cz/repository/xmlui/page/license-cnc, and ACA

1340. SYN2006PUB: corpus of Czech newspapers

Creator:: Čermák, František, Hlaváčová, Jaroslava, Hnátková, Milena, Jelínek, Tomáš, Kocek, Jan, Kopřivová, Marie, Křen, Michal, Novotná, Renata, Petkevič, Vladimír, Schmiedtová, Věra, Skoumalová, Hana, Spoustová, Johanka, Šulc, Michal, and Velíšek, Zdeněk
Publisher:: Faculty of Arts, Institute of the Czech National Corpus, Charles University in Prague
Type:: text and corpus
Subject:: corpus and written language
Language:: Czech
Description:: Corpus of contemporary Czech newspapers and magazines sized 300 MW. It contains various titles published between the end of 1989 and 2004. The corpus is lemmatized and morphologically tagged by a combination of stochastic and rule-based methods. The corpus is provided in a (semi-XML) vertical format used as an input to the Manatee query engine. The data thus correspond to the corpus available via query interface to registered users of the CNC with one important exception: they are shuffled, i.e. divided into blocks sized max. 100 words (respecting the sentence boundaries) whose ordering was randomized within the given document. Project: MSM0021620823 – Český národní korpus a korpusy dalších jazyků (Czech National Corpus and Corpora of Other Languages)
Rights:: Czech National Corpus (Shuffled Corpus Data), https://lindat.mff.cuni.cz/repository/xmlui/page/license-cnc, and ACA
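
The shuffling described for the SYN corpora above (blocks of at most 100 words that respect sentence boundaries, with block order randomized within each document) can be illustrated with the sketch below. It is not the CNC's actual procedure. The greedy packing, the handling of a single sentence longer than the limit (it forms its own oversized block), and the names make_blocks and shuffle_document are all assumptions made for illustration.

```python
import random
from typing import List, Optional

Sentence = List[str]  # a sentence is a list of word tokens


def make_blocks(sentences: List[Sentence], max_words: int = 100) -> List[List[Sentence]]:
    """Greedily pack whole sentences into blocks of at most max_words words."""
    blocks: List[List[Sentence]] = []
    current: List[Sentence] = []
    size = 0
    for sent in sentences:
        # Close the current block if adding this sentence would exceed the limit.
        if current and size + len(sent) > max_words:
            blocks.append(current)
            current, size = [], 0
        # Sentences are never split; an overlong one ends up alone in a block.
        current.append(sent)
        size += len(sent)
    if current:
        blocks.append(current)
    return blocks


def shuffle_document(sentences: List[Sentence], max_words: int = 100,
                     seed: Optional[int] = None) -> List[List[Sentence]]:
    """Return the document's blocks in randomized order."""
    blocks = make_blocks(sentences, max_words)
    random.Random(seed).shuffle(blocks)
    return blocks


# Toy document: four sentences of 40, 50, 30 and 120 placeholder words.
doc = [["w"] * n for n in (40, 50, 30, 120)]
print([sum(len(s) for s in block) for block in make_blocks(doc)])
```

Because only whole blocks are reordered, sentences stay intact and short-range context is preserved, while the original running order of the document is lost.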
