Subject: corpus - LINDAT/CLARIAH-CZ Catalog Search Results


51. Preamble 1.0

Creator:: Hladká, Barbora and Mírovský, Jiří
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: corpus, multilingual, and subjects
Language:: Czech, English, French, and Polish
Description:: Preamble 1.0 is a multilingual annotated corpus of the preamble of the EU REGULATION 2020/2092 OF THE EUROPEAN PARLIAMENT AND OF THE COUNCIL. The corpus consists of four language versions of the preamble (Czech, English, French, Polish), each of them annotated with sentence subjects. The data were annotated in the Brat tool (https://brat.nlplab.org/) and are distributed in the Brat native format, i.e. each annotated preamble is represented by the original plain text and a stand-off annotation file.
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

52. Proměny prózy v letech 1992 až 2018

Creator:: Poukarová, Petra and Cvrček, Václav
Format:: bez média and svazek
Type:: model:article and TEXT
Subject:: beletrie, próza, registr, multidimenzionální analýza, korpus, překlad, fiction, prose, register, multidimension analysis, corpus, and translation
Language:: Czech
Description:: This study summarizes a corpus-based analysis of tendencies in register variation of Czech-written fiction texts in the period from 1992 to 2018. The analysis is based on a projection of the results from a large sample of Czech prose texts (1070 texts, 12.7 million words) onto a general register model (established by previous research using multidimensional analysis). The major tendencies found in the material are a decrease in cohesion level, addressee coding and retrospective narration, and an increase in polythematicity/lexical richness. These findings are supplemented by additional analyses of how translation, the position of a text excerpt in the original text (beginning, middle or end) and the type of text affect the results.
Rights:: http://creativecommons.org/licenses/by-nc-sa/4.0/ and policy:public

53. Regionenkorpus (C4-Korpus)

Publisher:: Berlin-Brandenburg Academy of Sciences and Humanities
Format:: application/tei+xml
Type:: corpus
Subject:: corpus
Language:: German
Description:: The C4 corpus is a joint effort of the project Digitales Wörterbuch der deutschen Sprache (DWDS), the Austrian Academy Corpus (AAC), the Korpus Südtirol and the Schweizer Textkorpus (CHTK). The corpus is composed of corpora from all four partner institutions.
Rights:: Not specified

54. Some current problems of corpus and computational linguistics, or Fifteen commandments and general truths

Creator:: Čermák, František
Format:: bez média and svazek
Type:: model:article and TEXT
Subject:: corpus, corpus linguistics, computational linguistics, methodology, type of data, type of information, representativeness of corpora, systems of tagging, lemmatizers, ir/regularity in language, collocations, meaning, aligners, korpus, korpusová lingvistika, komputační lingvistika, metodologie, typy dat, typy informace, reprezentativnost korpusu, systémy taggování, lemmatizátory, ne/pravidelnost v jazyce, kolokace, význam, and alignery
Language:: Czech
Description:: This contribution critically presents, in a brief, succinct and almost aphoristic way, a number of problems of today’s corpus and computational linguistics as well as their unsatisfactory solutions, and at the same time tries to do away with a number of myths and simplified opinions in the field.
Rights:: http://creativecommons.org/publicdomain/mark/1.0/ and policy:public

55. Srovnání žánrů v korpusu na základě syntaktických funkcí substantiv

Creator:: Jelínek, Tomáš
Format:: bez média and svazek
Type:: model:article and TEXT
Subject:: syntax, syntaktická funkce, korpus, žánr, reprezentativnost, syntactic function, corpus, genre, and representativeness
Language:: Czech
Description:: Large synchronic textual corpora of the Czech National Corpus are built to be representative: they contain a balanced quantity of texts of various styles, divided into three genre subcorpora: fiction, technical/scientific literature and journalism. Comparisons of these genres have been performed at the phonological and morphological levels; in this paper, I deal with differences between genres at the surface-syntactic level. I use an automatic syntactic annotation of the SYN2005 corpus in the formalism of the analytical layer of the Prague Dependency Treebank. I compare the frequencies of syntactic functions of nouns in the three genres represented by the corresponding subcorpora of SYN2005. I also present a more detailed analysis of four syntactic phenomena: subtypes of the function of attribute in non-prepositional genitive; frequencies of groups of the type pan Novák (Mr. Novák); frequencies of the function of agent in passive constructions expressed by nouns in non-prepositional instrumental; and the ratio of nominative to instrumental in expressing the nominal part of a verbal-nominal predicate. Significant differences found between genres in all the syntactic phenomena analyzed show that in comparing corpora one should carefully monitor their genre composition.
Rights:: http://creativecommons.org/publicdomain/mark/1.0/ and policy:public

56. SYN v4: large corpus of written Czech

Creator:: Křen, Michal, Cvrček, Václav, Čapka, Tomáš, Čermáková, Anna, Hnátková, Milena, Chlumská, Lucie, Jelínek, Tomáš, Kováříková, Dominika, Petkevič, Vladimír, Procházka, Pavel, Skoumalová, Hana, Škrabal, Michal, Truneček, Petr, Vondřička, Pavel, and Zasina, Adrian
Publisher:: Charles University, Faculty of Arts, Institute of the Czech National Corpus
Type:: text and corpus
Subject:: corpus and written language
Language:: Czech
Description:: Corpus of contemporary written (printed) Czech sized 3.6 GW (i.e. 4.3 billion tokens). It covers mostly the period of 1990–2014 and it is a traditional corpus (as opposed to the web-crawled corpora) with rich metadata containing bibliographical information etc. Although it contains a wide range of text types (fiction, non-fiction, newspapers), the newspapers prevail noticeably. The corpus is lemmatized and morphologically annotated by a combination of stochastic and rule-based methods. The corpus is provided in a (semi-XML) vertical format used as an input to the Manatee query engine. The data thus correspond to the corpus available via the KonText query interface to registered users of the CNC at http://www.korpus.cz with one important exception: the corpus is shuffled, i.e. divided into blocks sized max. 100 words (respecting the sentence boundaries) with ordering randomized within the given document.
Rights:: Czech National Corpus (Shuffled Corpus Data), https://lindat.mff.cuni.cz/repository/xmlui/page/license-cnc, and ACA

57. SYN v9: large corpus of written Czech

Creator:: Křen, Michal, Cvrček, Václav, Henyš, Jan, Hnátková, Milena, Jelínek, Tomáš, Kocek, Jan, Kováříková, Dominika, Křivan, Jan, Milička, Jiří, Petkevič, Vladimír, Procházka, Pavel, Skoumalová, Hana, Šindlerová, Jana, and Škrabal, Michal
Publisher:: Charles University, Faculty of Arts, Institute of the Czech National Corpus
Type:: text and corpus
Subject:: corpus and written language
Language:: Czech
Description:: Corpus of contemporary written (printed) Czech sized 4.7 GW (i.e. 5.7 billion tokens). It covers mostly the 1990-2019 period and features rich metadata including detailed bibliographical information, text-type classification etc. SYN v9 contains a wide variety of text types (fiction, non-fiction, newspapers), but the newspapers prevail noticeably. The corpus is lemmatized and morphologically tagged by the new CNC tagset first utilized for the annotation of the SYN2020 corpus. SYN v9 is provided in a CoNLL-U-like vertical format used as an input to the Manatee query engine. The data thus correspond to the corpus available via the KonText query interface to the registered users of CNC at http://www.korpus.cz with one important exception: the corpus is shuffled, i.e. divided into blocks sized max. 100 words (respecting the sentence boundaries) with ordering randomized within the given document.
Rights:: Czech National Corpus (Shuffled Corpus Data), https://lindat.mff.cuni.cz/repository/xmlui/page/license-cnc, and ACA

58. SYN2006PUB: corpus of Czech newspapers

Creator:: Čermák, František, Hlaváčová, Jaroslava, Hnátková, Milena, Jelínek, Tomáš, Kocek, Jan, Kopřivová, Marie, Křen, Michal, Novotná, Renata, Petkevič, Vladimír, Schmiedtová, Věra, Skoumalová, Hana, Spoustová, Johanka, Šulc, Michal, and Velíšek, Zdeněk
Publisher:: Faculty of Arts, Institute of the Czech National Corpus, Charles University in Prague
Type:: text and corpus
Subject:: corpus and written language
Language:: Czech
Description:: Corpus of contemporary Czech newspapers and magazines sized 300 MW. It contains various titles published between the end of 1989 and 2004. The corpus is lemmatized and morphologically tagged by a combination of stochastic and rule-based methods. The corpus is provided in a (semi-XML) vertical format used as an input to the Manatee query engine. The data thus correspond to the corpus available via query interface to registered users of the CNC with one important exception: they are shuffled, i.e. divided into blocks sized max. 100 words (respecting the sentence boundaries) whose ordering was randomized within the given document. and MSM0021620823 – Český národní korpus a korpusy dalších jazyků
Rights:: Czech National Corpus (Shuffled Corpus Data), https://lindat.mff.cuni.cz/repository/xmlui/page/license-cnc, and ACA

59. SYN2009PUB: corpus of Czech newspapers

Creator:: Křen, Michal, Bartoň, Tomáš, Hnátková, Milena, Jelínek, Tomáš, Petkevič, Vladimír, Procházka, Pavel, and Skoumalová, Hana
Publisher:: Faculty of Arts, Institute of the Czech National Corpus, Charles University in Prague
Type:: text and corpus
Subject:: corpus and written language
Language:: Czech
Description:: Corpus of contemporary Czech newspapers and magazines sized 700 MW. It contains various titles published between 1995 and 2007. The corpus is lemmatized and morphologically tagged by a combination of stochastic and rule-based methods. The corpus is provided in a (semi-XML) vertical format used as an input to the Manatee query engine. The data thus correspond to the corpus available via query interface to registered users of the CNC with one important exception: they are shuffled, i.e. divided into blocks sized max. 100 words (respecting the sentence boundaries) whose ordering was randomized within the given document. and MSM0021620823 – Český národní korpus a korpusy dalších jazyků
Rights:: Czech National Corpus (Shuffled Corpus Data), https://lindat.mff.cuni.cz/repository/xmlui/page/license-cnc, and ACA

60. SYN2013PUB: corpus of written Czech newspapers

Creator:: Křen, Michal, Hnátková, Milena, Jelínek, Tomáš, Petkevič, Vladimír, Procházka, Pavel, and Skoumalová, Hana
Publisher:: Faculty of Arts, Institute of the Czech National Corpus, Charles University in Prague
Type:: text and corpus
Subject:: corpus and written language
Language:: Czech
Description:: Corpus of contemporary Czech newspapers and magazines sized 935 MW. It contains various titles published between 2005 and 2009. The corpus is lemmatized and morphologically tagged by a combination of stochastic and rule-based methods. The corpus is provided in a (semi-XML) vertical format used as an input to the Manatee query engine. The data thus correspond to the corpus available via query interface to registered users of the CNC with one important exception: they are shuffled, i.e. divided into blocks sized max. 100 words (respecting the sentence boundaries) whose ordering was randomized within the given document., LM2011023 – Český národní korpus, and http://wiki.korpus.cz/doku.php/en:cnk:syn2013pub
Rights:: Czech National Corpus (Shuffled Corpus Data), https://lindat.mff.cuni.cz/repository/xmlui/page/license-cnc, and ACA
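
Note on shuffled data: the SYN records above describe the shuffling only in outline (blocks of at most 100 words, sentence boundaries respected, block order randomized within each document). The following Python sketch illustrates that idea under stated assumptions; it is not the CNC's actual procedure. The greedy packing strategy, the treatment of a sentence longer than the limit as a block of its own, and the names `make_blocks` and `shuffle_document` are all hypothetical.

```python
import random
from typing import List, Optional

Sentence = List[str]   # one sentence = list of word tokens
Block = List[Sentence]  # one block = list of whole sentences


def make_blocks(sentences: List[Sentence], max_words: int = 100) -> List[Block]:
    """Greedily pack whole sentences into blocks of at most max_words words.

    Assumption: a single sentence longer than max_words is kept intact
    as a block of its own rather than being split.
    """
    blocks: List[Block] = []
    current: Block = []
    count = 0
    for sent in sentences:
        # Close the current block if adding this sentence would exceed the limit.
        if current and count + len(sent) > max_words:
            blocks.append(current)
            current, count = [], 0
        current.append(sent)
        count += len(sent)
    if current:
        blocks.append(current)
    return blocks


def shuffle_document(sentences: List[Sentence],
                     max_words: int = 100,
                     seed: Optional[int] = None) -> List[Block]:
    """Split one document into blocks and randomize the block order."""
    blocks = make_blocks(sentences, max_words)
    random.Random(seed).shuffle(blocks)  # order is randomized within this document only
    return blocks
```

Because only the order of blocks changes, sentence-level context survives inside each block, while anything that depends on the sequence of blocks across a document does not.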
