In the existential sentences of Slavonic languages we can find some interesting deviations from the basic type of Indo-European sentence, i.e. "Nominative + concordant Verb", for instance the Genitive of negation; in some languages, especially South Slavonic ones, there are examples of the main nominal part of a positive existential sentence (i.e. the name of the existing entity) in the Genitive or even (as in Slovenian povsod jo je) in the Accusative. These deviations can be of interest for the study of the development of Indo-European syntax, as Miklosich and Potebnya observed already in the 19th century. Also relevant in this respect is the opposition between autosemantic (existential or possessive) esse and the (zero or non-zero) copula. These phenomena are studied here from the standpoint of the general opposition between polymorphic and monomorphic structures of the syntactic system.
CoNLL 2017 and 2018 shared tasks:
Multilingual Parsing from Raw Text to Universal Dependencies
This package contains the test data in the form in which they were presented
to the participating systems: raw text files and files preprocessed by UDPipe.
The metadata.json files contain lists of files to process and to output;
README files in the respective folders describe the syntax of metadata.json.
For full training, development and gold standard test data, see
Universal Dependencies 2.0 (CoNLL 2017)
Universal Dependencies 2.2 (CoNLL 2018)
See the download links at http://universaldependencies.org/.
For more information on the shared tasks, see
http://universaldependencies.org/conll17/
http://universaldependencies.org/conll18/
Contents:
conll17-ud-test-2017-05-09 ... CoNLL 2017 test data
conll18-ud-test-2018-05-06 ... CoNLL 2018 test data
conll18-ud-test-2018-05-06-for-conll17 ... CoNLL 2018 test data with metadata
and filenames modified so that it is digestible by the 2017 systems.
The authors present their respective views on the development of Czech post-war syntactic studies. Their approach is influenced by the fact that they were educated in different syntactic schools; the paper thus combines the Prague and Brno views. V. Šmilauer's Novočeská skladba (Syntax of Modern Czech, 1947) is understood as a source of contemporary research on Czech syntax. The paper describes the results reached by individual investigators as well as those of research teams. In the authors' opinion, Two-Level Valency Syntax (represented by F. Daneš and his close collaborators and reflected in the Czech Academic Grammar) and Functional Generative Grammar (developed by P. Sgall and his colleagues) have formed the main paradigms of Czech syntax since 1960. Both theories incorporate the results of the classical Praguian functional approach as well as results of the generative paradigm. The authors conclude that the Prague and Brno views on the development of Czech syntactic studies are not incompatible but rather complementary, and that the methods of formal and corpus linguistics are attractive and useful for young researchers.
ForFun is a database of linguistic forms and their syntactic functions, built using the multi-layer annotated corpora of Czech, the Prague Dependency Treebanks. The purpose of the Prague Database of Forms and Functions (ForFun) is to help linguists study the form-function relation, which we assume to be one of the principal tasks of both theoretical linguistics and natural language processing.
A prototypical question to be asked is "What purposes does the preposition 'po' serve?" or "What linguistic means in a sentence can express the meaning 'destination of an action'?". There are almost 1500 distinct forms (besides the preposition 'po') and 65 distinct functions (besides 'destination').
HamleDT (HArmonized Multi-LanguagE Dependency Treebank) is a compilation of existing dependency treebanks (or dependency conversions of other treebanks), transformed so that they all conform to the same annotation style. This version uses Universal Dependencies as the common annotation style.
Update (November 2017): for a current collection of harmonized dependency treebanks, we recommend using Universal Dependencies (UD). All of the corpora that are distributed in HamleDT in full are also part of the UD project; only some corpora from the Patch group (where HamleDT provides only the harmonizing scripts but not the full corpus data) are available in HamleDT but not in UD.
This paper presents and discusses the results of an experiment testing the validity of the Trace Deletion Hypothesis (Grodzinsky 1989, 2000) in Czech. The Trace Deletion Hypothesis (TDH) was proposed to account for a receptive syntactic deficit in Broca's aphasics that involves structures containing transformational operations such as the passive. According to the TDH, in passive constructions Broca's aphasics fail to assign a semantic θ-role to the derived subject syntactically, so they assign it the Agent θ-role on the basis of linear order (Default Principle), which results in a structure with two potential Agents. This strategy is supposed to lead to chance performance of Broca's aphasics in these structures, as they are forced to guess the distribution of the Agent and Patient θ-roles. The results of our experiment, however, do not support the TDH: out of the six tested subjects, only one performed at chance. The error rate for reversible passive structures in Czech was 33.34%, which corresponds to above-chance performance. Given these results, the validity of the TDH is called into question, also with respect to the development of generative theory itself.
Andrea Hudousková, Eva Flanderková, Barbara Mertins, Kristýna Tomšů. Includes a list of references.
This package contains data used in the IWPT 2020 shared task. It contains training, development and test (evaluation) datasets. The data is based on a subset of Universal Dependencies release 2.5 (http://hdl.handle.net/11234/1-3105) but some treebanks contain additional enhanced annotations. Moreover, not all of these additions became part of Universal Dependencies release 2.6 (http://hdl.handle.net/11234/1-3226), which makes the shared task data unique and worth a separate release to enable later comparison with new parsing algorithms. The package also contains a number of Perl and Python scripts that have been used to process the data during preparation and during the shared task. Finally, the package includes the official primary submission of each team participating in the shared task.
This package contains data used in the IWPT 2021 shared task. It contains training, development and test (evaluation) datasets. The data is based on a subset of Universal Dependencies release 2.7 (http://hdl.handle.net/11234/1-3424) but some treebanks contain additional enhanced annotations. Moreover, not all of these additions became part of Universal Dependencies release 2.8 (http://hdl.handle.net/11234/1-3687), which makes the shared task data unique and worth a separate release to enable later comparison with new parsing algorithms. The package also contains a number of Perl and Python scripts that have been used to process the data during preparation and during the shared task. Finally, the package includes the official primary submission of each team participating in the shared task.
Mapping table for the article Hajič et al., 2024: Mapping Czech Verbal Valency to PropBank Argument Labels, in LREC-COLING 2024, as preprocessed by the algorithm described in the paper. This dataset is meant for verification (replication) purposes only. It will be manually processed further to arrive at a workable Czech PropBank, to be used in Czech UMR annotation and to be further updated during the annotation. The resulting PropBank frame files for Czech are expected to be available with some future release of UMR containing Czech UMR annotation, or separately.
NomVallex 2.0 is a manually annotated valency lexicon of Czech nouns and adjectives, created in the theoretical framework of the Functional Generative Description and based on corpus data (the SYN series of corpora from the Czech National Corpus and the Araneum Bohemicum Maximum corpus). In total, NomVallex is comprised of 1027 lexical units contained in 570 lexemes, covering the following parts-of-speech and derivational categories: deverbal or deadjectival nouns, and deverbal, denominal, deadjectival or primary adjectives. Valency properties of a lexical unit are captured in a valency frame (modeled as a sequence of valency slots, each supplemented with a list of morphemic forms) and documented by corpus examples. In order to make it possible to study the relationship between valency behavior of base words and their derivatives, lexical units of nouns and adjectives in NomVallex are linked to their respective base lexical units (contained either in NomVallex itself or, in case of verbs, in the VALLEX lexicon), linking up to three parts-of-speech (i.e., noun – verb, adjective – verb, noun – adjective, and noun – adjective – verb).
In order to facilitate comparison, this submission also contains abbreviated entries of the base verbs of these nouns and adjectives from the VALLEX lexicon and simplified entries of the covered nouns and adjectives from the PDT-Vallex lexicon.
The NomVallex I. lexicon describes valency of Czech deverbal nouns belonging to three semantic classes, i.e. Communication (dotaz 'question'), Mental Action (plán 'plan') and Psych State (nenávist 'hatred'). It covers both stem-nominals and root-nominals (dotazování se 'asking' and dotaz 'question'). In total, the lexicon includes 505 lexical units in 248 lexemes. Valency properties are captured in the form of valency frames, specifying valency slots and their morphemic forms, and are exemplified by corpus examples.
In order to facilitate comparison, this submission also contains abbreviated entries of the source verbs of these nouns from the Vallex lexicon and simplified entries of the covered nouns from the PDT-Vallex lexicon.
A richly annotated and genre-diversified language resource, The Prague Dependency Treebank – Consolidated 1.0 (PDT-C 1.0, or PDT-C for short in what follows) is a consolidated release of the existing PDT corpora of Czech data, uniformly annotated using the standard PDT scheme. PDT corpora included in PDT-C: Prague Dependency Treebank (the original PDT contents, written newspaper and journal texts from three genres); the Czech part of the Prague Czech-English Dependency Treebank (financial texts translated from English); Prague Dependency Treebank of Spoken Czech (spoken data, including audio and transcripts and multiple speech reconstruction annotation); PDT-Faust (user-generated texts). The difference from the separately published original treebanks can be briefly described as follows: it is published in one package, to allow easier data handling for all the datasets; the data is enhanced with manual linguistic annotation at the morphological layer, and a new version of the morphological dictionary is enclosed; a common valency lexicon for all four original parts is enclosed. The documentation covers two desktop tools for browsing and editing (TrEd and MEd), and the corpus is also available online for searching using PML-TQ.
The Prague Dependency Treebank 3.5 is the 2018 edition of the core Prague Dependency Treebank (PDT). It contains all PDT annotation made at the Institute of Formal and Applied Linguistics under various projects between 1996 and 2018 on the original texts, i.e., all annotation from PDT 1.0, PDT 2.0, PDT 2.5, PDT 3.0, PDiT 1.0 and PDiT 2.0, plus corrections, new structure of basic documentation and new list of authors covering all previous editions. The Prague Dependency Treebank 3.5 (PDT 3.5) contains the same texts as the previous versions since 2.0; there are 49,431 annotated sentences (832,823 words) on all layers, from tectogrammatical annotation to syntax to morphology. There are additional annotated sentences for syntax and morphology; the totals for the lower layers of annotation are: 87,913 sentences with 1,502,976 words at the analytical layer (surface dependency syntax) and 115,844 sentences with 1,956,693 words at the morphological layer of annotation (these totals include the annotation with the higher layers annotated as well). Closely linked to the tectogrammatical layer is the annotation of sentence information structure, multiword expressions, coreference, bridging relations and discourse relations.
The Prague Dependency Treebank of Spoken Czech 2.0 (PDTSC 2.0) is a corpus of spoken language, consisting of 742,316 tokens and 73,835 sentences, representing 7,324 minutes (over 120 hours) of spontaneous dialogs. The dialogs have been recorded, transcribed and edited in several interlinked layers: audio recordings, automatic and manual transcripts and manually reconstructed text. These layers were part of the first version of the corpus (PDTSC 1.0). Version 2.0 adds automatic dependency parsing at the analytical layer and manual annotation of “deep” syntax at the tectogrammatical layer, which contains semantic roles and relations as well as annotation of coreference.
Among the results of Russian influence on Czech in the 19th century was the emergence of an active past participle in -(v)ší in Czech. Although not welcomed by all grammarians, this participle has survived in Czech until today, becoming mainly a device of archaic and bookish style. The present work studies the occurrence of the active past participle in -(v)ší in the largest subcorpus of the Czech National Corpus containing journalistic texts. A main result of the study is that, apart from a large number of verbs that show the active past participle in -(v)ší only once or twice in the studied corpus, where it is indeed a device of archaic and bookish style and sometimes even of irony and humor, there is a small group of (mainly intransitive) verbs where this participle functions with considerable frequency in stylistically more neutral contexts of written Standard Czech, either as the only participle or, sometimes, as a stylistically more marked variant of a more frequent active past participle in -l. In these cases, it remains overwhelmingly a syntactically unextended direct attribute of a noun. Such an active past participle in -(v)ší is found most often in sports coverage, where it is formed from a set of verbs with terminological function.
Experimental materials, data and R scripts used in the paper "Garden-path sentences and the diversity of their (mis)representations" (Ceháková - Chromý, 2023).
Input data, individual experimental annotations, and a complete and detailed overview of the measured results related to the experiment described in the referenced paper.
Slovak Dependency Treebank (Slovenský závislostný korpus) was created as part of the Slovak National Corpus at the Ľ. Štúr Institute of the Slovak Academy of Sciences. The annotation follows the guidelines of the Prague Dependency Treebank (Czech), slightly modified in the spirit of Slovak grammatical tradition. Morphological tags, lemmas and dependency relations have been assigned manually to every word.
The present dataset is a subset of the original treebank. We automatically selected the sentences where the two human annotators 100% agreed on the analysis. This increases the quality and trustworthiness of the data but it also results in selecting short sentences most of the time. An extended version may be published in the future when manually merged and checked annotation is available.
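The selection step described above can be illustrated with a minimal Python sketch. It is not the script actually used for the treebank; the token representation (form, lemma, tag, head index, dependency relation) and the function names are assumptions made for illustration only.

```python
from typing import List, Tuple

# Hypothetical token analysis: (form, lemma, tag, head index, dependency relation).
Token = Tuple[str, str, str, int, str]
Sentence = List[Token]


def fully_agreed(a: Sentence, b: Sentence) -> bool:
    """True when both annotators produced identical analyses for every token."""
    return len(a) == len(b) and all(x == y for x, y in zip(a, b))


def select_agreed(first: List[Sentence], second: List[Sentence]) -> List[Sentence]:
    """Keep only the sentences on which the two annotations match completely.

    Assumes both lists hold the same sentences in the same order.
    """
    return [a for a, b in zip(first, second) if fully_agreed(a, b)]
```

Because a single differing tag, lemma or attachment disqualifies the whole sentence, longer sentences offer more opportunities for disagreement, which is consistent with the bias toward short sentences noted above.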
The selected sentences have been converted to the CoNLL-X file format (original token IDs are preserved in the FEATS column). This PDT-style annotation will serve as the source for the first Slovak dataset in the Universal Dependencies (to be published separately).
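As a rough illustration of how such a file might be read, the following Python sketch splits CoNLL-X text into sentences and unpacks the FEATS column. The ten column names follow the CoNLL-X shared task format; how exactly the original token ID is stored inside FEATS is not stated here, so the `key=value` form handled by `feats_to_dict` is an assumption, not a description of the released data.

```python
from typing import Dict, Iterator, List

# The ten tab-separated columns of the CoNLL-X format.
CONLLX_COLUMNS = [
    "id", "form", "lemma", "cpostag", "postag",
    "feats", "head", "deprel", "phead", "pdeprel",
]


def read_conllx(text: str) -> Iterator[List[Dict[str, str]]]:
    """Yield sentences as lists of token dicts; blank lines separate sentences."""
    sentence: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            if sentence:
                yield sentence
                sentence = []
            continue
        fields = line.split("\t")
        if len(fields) != len(CONLLX_COLUMNS):
            raise ValueError(f"expected 10 columns, got {len(fields)}: {line!r}")
        sentence.append(dict(zip(CONLLX_COLUMNS, fields)))
    if sentence:
        yield sentence


def feats_to_dict(feats: str) -> Dict[str, str]:
    """Split a pipe-separated FEATS value; '_' means no features.

    Assumption: items look like key=value; items without '=' map to ''.
    """
    if feats == "_":
        return {}
    result: Dict[str, str] = {}
    for item in feats.split("|"):
        key, sep, value = item.partition("=")
        result[key] = value if sep else ""
    return result
```

A caller would iterate over `read_conllx(open(path, encoding="utf-8").read())` and apply `feats_to_dict(token["feats"])` to each token to look for the preserved ID, after checking the actual key name in the data.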
The large synchronic textual corpora of the Czech National Corpus are built to be representative: they contain a balanced quantity of texts of various styles, divided into three genre subcorpora: fiction, technical/scientific literature and journalism. Comparisons of these genres have been performed at the phonological and morphological levels; in this paper, I deal with differences between genres at the surface-syntactic level. I use an automatic syntactic annotation of the SYN2005 corpus in the formalism of the analytical layer of the Prague Dependency Treebank. I compare the frequencies of syntactic functions of nouns in the three genres represented by the corresponding subcorpora of SYN2005. I also present a more detailed analysis of four syntactic phenomena: subtypes of the function of attribute in the non-prepositional genitive; frequencies of groups of the type pan Novák (Mr. Novák); frequencies of the function of agent in passive constructions expressed by nouns in the non-prepositional instrumental; and the ratio of nominative to instrumental in expressing the nominal part of a verbal-nominal predicate. The significant differences found between genres in all the syntactic phenomena analyzed show that, when comparing corpora, one should carefully monitor their genre composition.