« Previous |
1 - 10 of 11
|
Next »
Number of results to display per page
Search Results
2. Lexico-Semantic Annotation of PDT using Czech WordNet
- Creator:
- Bejček, Eduard, Hoffmannová, Petra, Holub, Martin, Hučínová, Marie, Pecina, Pavel, Straňák, Pavel, Šidák, Pavel, and Hajič, Jan
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- PDT and Czech WordNet
- Language:
- Czech
- Description:
- This dataset contains annotation of PDT using Czech WordNet ontology: http://hdl.handle.net/11858/00-097C-0000-0001-4880-3 Data is stored in PML format. This is a stand-off annotation and for most use cases it requires PDT 2.0 and the Czech WordNet 1.9 PDT that we have used for annotation. and 1ET100300517, 1ET201120505
- Rights:
- Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0), http://creativecommons.org/licenses/by-nc-sa/3.0/, and PUB
3. PDT-Vallex: Czech Valency lexicon linked to treebanks
- Creator:
- Urešová, Zdeňka, Štěpánek, Jan, Hajič, Jan, Panevova, Jarmila, and Mikulová, Marie
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text, lexicon, and lexicalConceptualResource
- Subject:
- annotation, corpora, data, lexicon, semantics, valency, and PDT
- Language:
- Czech
- Description:
- The valency lexicon PDT-Vallex has been built in close connection with the annotation of the Prague Dependency Treebank project (PDT) and its successors (mainly the Prague Czech-English Dependency Treebank project, PCEDT). It contains over 11000 valency frames for more than 7000 verbs which occurred in the PDT or PCEDT. It is available in electronically processable format (XML) together with the aforementioned treebanks (to be viewed and edited by TrEd, the PDT/PCEDT main annotation tool), and also in more human readable form including corpus examples (see the WEBSITE link below). The main feature of the lexicon is its linking to the annotated corpora - each occurrence of each verb is linked to the appropriate valency frame with additional (generalized) information about its usage and surface morphosyntactic form alternatives.
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
4. PDT-Vallex: Czech Valency lexicon linked to treebanks 4.0 (PDT-Vallex 4.0)
- Creator:
- Urešová, Zdeňka, Bémová, Alevtina, Fučíková, Eva, Hajič, Jan, Kolářová, Veronika, Mikulová, Marie, Pajas, Petr, Panevová, Jarmila, and Štěpánek, Jan
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text, computationalLexicon, and lexicalConceptualResource
- Subject:
- verbal valency, valency, annotation, linguistic data, lexicon, lexical semantics, and PDT
- Language:
- Czech
- Description:
- The valency lexicon PDT-Vallex 4.0 has been built in close connection with the annotation of the Prague Dependency Treebank project (PDT) and its successors (mainly the Prague Czech-English Dependency Treebank project, PCEDT, the spoken language corpus (PDTSC) and corpus of user-generated texts in the project Faust). It contains over 14500 valency frames for almost 8500 verbs which occurred in the PDT, PCEDT, PDTSC and Faust corpora. In addition, there are nouns, adjectives and adverbs, linked from the PDT part only, increasing the total to over 17000 valency frames for 13000 words. All the corpora have been published in 2020 as the PDT-C 1.0 corpus with the PDT-Vallex 4.0 dictionary included; this is a copy of the dictionary published as a separate item for those not interested in the corpora themselves. It is available in electronically processable format (XML), and also in more human readable form including corpus examples (see the WEBSITE link below, and the links to its main publications elsewhere in this metadata). The main feature of the lexicon is its linking to the annotated corpora - each occurrence of each verb is linked to the appropriate valency frame with additional (generalized) information about its usage and surface morphosyntactic form alternatives. It replaces the previously published unversioned edition of PDT-Vallex from 2014.
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
5. Prague Czech-English Dependency Treebank 2.0
- Creator:
- Hajič, Jan, Hajičová, Eva, Panevová, Jarmila, Sgall, Petr, Cinková, Silvie, Fučíková, Eva, Mikulová, Marie, Pajas, Petr, Popelka, Jan, Semecký, Jiří, Šindlerová, Jana, Štěpánek, Jan, Toman, Josef, Urešová, Zdeňka, and Žabokrtský, Zdeněk
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- parallel treebank, PCEDT, parallel corpus, Wall Street Journal, WSJ, Penn Treebank, dependency annotation, and PDT
- Language:
- Czech and English
- Description:
- Texts The Prague Czech-English Dependency Treebank 2.0 (PCEDT 2.0) is a major update of the Prague Czech-English Dependency Treebank 1.0 (LDC2004T25). It is a manually parsed Czech-English parallel corpus sized over 1.2 million running words in almost 50,000 sentences for each part. Data The English part contains the entire Penn Treebank - Wall Street Journal Section (LDC99T42). The Czech part consists of Czech translations of all of the Penn Treebank-WSJ texts. The corpus is 1:1 sentence-aligned. An additional automatic alignment on the node level (different for each annotation layer) is part of this release, too. The original Penn Treebank-like file structure (25 sections, each containing up to one hundred files) has been preserved. Only those PTB documents which have both POS and structural annotation (total of 2312 documents) have been translated to Czech and made part of this release. Each language part is enhanced with a comprehensive manual linguistic annotation in the PDT 2.0 style (LDC2006T01, Prague Dependency Treebank 2.0). The main features of this annotation style are: dependency structure of the content words and coordinating and similar structures (function words are attached as their attribute values) semantic labeling of content words and types of coordinating structures argument structure, including an argument structure ("valency") lexicon for both languages ellipsis and anaphora resolution. This annotation style is called tectogrammatical annotation and it constitutes the tectogrammatical layer in the corpus. For more details see below and documentation. Annotation of the Czech part Sentences of the Czech translation were automatically morphologically annotated and parsed into surface-syntax dependency trees in the PDT 2.0 annotation style. This annotation style is sometimes called analytical annotation; it constitutes the analytical layer of the corpus. The manual tectogrammatical (deep-syntax) annotation was built as a separate layer above the automatic analytical (surface-syntax) parse. A sample of 2,000 sentences was manually annotated on the analytical layer. Annotation of the English part The resulting manual tectogrammatical annotation was built above an automatic transformation of the original phrase-structure annotation of the Penn Treebank into surface dependency (analytical) representations, using the following additional linguistic information from other sources: PropBank (LDC2004T14) VerbNet NomBank (LDC2008T23) flat noun phrase structures (by courtesy of D. Vadas and J.R. Curran) For each sentence, the original Penn Treebank phrase structure trees are preserved in this corpus together with their links to the analytical and tectogrammatical annotation. and Ministry of Education of the Czech Republic projects No.: MSM0021620838 LC536 ME09008 LM2010013 7E09003+7E11051 7E11041 Czech Science Foundation, grants No.: GAP406/10/0875 GPP406/10/P193 GA405/09/0729 Research funds of the Faculty of Mathematics and Physics, Charles University, Czech Republic, Grant Agency of the Academy of Sciences of the Czech Republic: No. 1ET101120503 Students participating in this project have been running their own student grants from the Grant Agency of the Charles University, which were connected to this project. Only ongoing projects are mentioned: 116310, 158010, 3537/2011 Also, this work was funded in part by the following projects sponsored by the European Commission: Companions, No. 034434 EuroMatrix, No. 034291 EuroMatrixPlus, No. 231720 Faust, No. 247762
- Rights:
- CC-BY-NC-SA + LDC99T42, https://lindat.mff.cuni.cz/repository/xmlui/page/license-pcedt2, and RES
6. Prague Dependency Treebank 2.0 (PDT 2.0)
- Creator:
- Hajič, Jan, Panevová, Jarmila, Hajičová, Eva, Sgall, Petr, Pajas, Petr, Štěpánek, Jan, Havelka, Jiří, Mikulová, Marie, Žabokrtský, Zdeněk, Ševčíková-Razímová, Magda, and Urešová, Zdeňka
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- corpus, Czech, treebank, and PDT
- Language:
- Czech
- Description:
- The Prague Dependency Treebank 2.0 (PDT 2.0) contains a large amount of Czech texts with complex and interlinked morphological (two million words), syntactic (1.5 MW) and complex semantic annotation (0.8 MW); in addition, certain properties of sentence information structure and coreference relations are annotated at the semantic level. PDT 2.0 is based on the long-standing Praguian linguistic tradition, adapted for the current Computational Linguistics research needs. The corpus itself uses the latest annotation technology. Software tools for corpus search, annotation and language analysis are included. Extensive documentation (in English) is provided as well. and 1ET101120413 (Data a nástroje pro informační systémy) MSM 0021620838 (Moderní metody, struktury a systémy informatiky) 1ET101120503 (Integrace jazykových zdrojů za účelem extrakce informací z přirozených textů) 1P05ME752 (Vícejazyčný valenční a predikátový slovník přirozeného jazyka) LC536 (Centrum komputační lingvistiky)
- Rights:
- PDT 2.0 License, https://lindat.mff.cuni.cz/repository/xmlui/page/license-pdt2, and ACA
7. Prague Dependency Treebank 2.0 - sample data
- Creator:
- Hajič, Jan, Panevová, Jarmila, Sgall, Petr, Pajas, Petr, Štěpánek, Jan, Havelka, Jiří, Mikulová, Marie, Žabokrtský, Zdeněk, and Ševčíková-Razímová, Magda
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- treebank, dependency, and PDT
- Language:
- Czech
- Description:
- A small subset of PDT 2.0 made available under a permissive license. Prague Dependency Treebank 2.0 (PDT 2.0) contains a large amount of Czech texts with complex and interlinked morphological (2 million words), syntactic (1.5 MW) and complex semantic annotation (0.8 MW); in addition, certain properties of sentence information structure and coreference relations are annotated at the semantic level. PDT 2.0 is based on the long-standing Praguian linguistic tradition, adapted for the current Computational Linguistics research needs. The corpus itself uses the latest annotation technology. Software tools for corpus search, annotation and language analysis are included. Extensive documentation (in English) is provided as well. and * Ministry of Education of the Czech Republic projects No. VS96151, LN00A063, 1P05ME752, MSM0021620838 and LC536, * Grant Agency of the Czech Republic grants Nos. 405/96/0198, 405/96/K214 and 405/03/0913, * research funds of the Faculty of Mathematics and Physics, * Charles University, Prague, Czech Republic, * Grant Agency of the Czech Academy of Science, Prague, Czech Republic projects No. 1ET101120503, 1ET101120413, and 1ET201120505 * Grant Agency of the Charles University No. 489/04, 350/05, 352/05 and 375/05 * the U.S. NSF Grant #IIS9732388.
- Rights:
- Creative Commons - Attribution 3.0 Unported (CC BY 3.0), http://creativecommons.org/licenses/by/3.0/, and PUB
8. Prague Dependency Treebank 2.5
- Creator:
- Bejček, Eduard, Hajič, Jan, Panevová, Jarmila, Mírovský, Jiří, Spoustová, Johanka, Štěpánek, Jan, Straňák, Pavel, Šidák, Pavel, Vimmrová, Pavlína, Šťastná, Eva, Ševčíková, Magda, Smejkalová, Lenka, Homola, Petr, Popelka, Jan, Lopatková, Markéta, Hrabalová, Lucie, Klyueva, Natalia, and Žabokrtský, Zdeněk
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- treebank, multiword expressions, clauses, tectogrammatics, dependency, and PDT
- Language:
- Czech
- Description:
- The Prague Dependency Treebank 2.5 annotates the same texts as the PDT 2.0. The annotation on the original four layers was fixed or improved in various aspects (see Documentation). Moreover, new information was added to the data: Annotation of multiword expressions Pair/group meaning Clause segmentation and Ministry of Education of the Czech Republic projects No.: LM2010013 LC536 MSM0021620838 Grant Agency of the Czech Republic grants No.: P406/2010/0875 P202/10/1333 P406/10/P193
- Rights:
- Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0), http://creativecommons.org/licenses/by-nc-sa/3.0/, and PUB
9. Prague Dependency Treebank 3.0
- Creator:
- Bejček, Eduard, Hajičová, Eva, Hajič, Jan, Jínová, Pavlína, Kettnerová, Václava, Kolářová, Veronika, Mikulová, Marie, Mírovský, Jiří, Nedoluzhko, Anna, Panevová, Jarmila, Poláková, Lucie, Ševčíková, Magda, Štěpánek, Jan, and Zikánová, Šárka
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- treebank, dependency, tectogrammatics, topic-focus articulation, multiword expressions, coreference, bridging relations, discourse, and PDT
- Language:
- Czech
- Description:
- PDT 3.0 is a new version of Prague Dependency Treebank. It contains a large amount of Czech texts with complex and interlinked morphological (2 million words), syntactic (1.5 MW) and semantic annotation (0.8 MW); in addition, certain properties of sentence information structure, multiword expressions, coreference, bridging relations and discourse relations are annotated at the semantic level. and the Grant Agency of the Czech Republic: grants P406/12/0658 "Coreference, discourse relations and information structure in a contrastive perspective", P406/2010/0875 "Computational Linguistics: Explicit description of language and annotated data focused on Czech", 405/09/0729 "From the structure of a sentence to textual relationships", and GPP406/12/P175 (Selected derivational relations for automatic processing of Czech); the Ministry of Education, Youth and Sports of the Czech Republic: the KONTAKT project ME10018 "Towards a computational analysis of text structure" and the LINDAT-Clarin project LM2010013; the Grant Agency of Charles University in Prague: GAUK 103609 "Textual (Inter-sentential) Relations and their Representation in a Language Corpus" and GAUK 4383/2009 "Methods of coreference resolution".
- Rights:
- Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0), http://creativecommons.org/licenses/by-nc-sa/3.0/, and PUB
10. Prague Discourse Treebank 2.0
- Creator:
- Rysová, Magdaléna, Synková, Pavlína, Mírovský, Jiří, Hajičová, Eva, Nedoluzhko, Anna, Ocelák, Radek, Pergler, Jiří, Poláková, Lucie, Scheller, Veronika, Zdeňková, Jana, and Zikánová, Šárka
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- discourse, bridging relations, coreference, topic-focus articulation, treebank, dependency, tectogrammatics, and PDT
- Language:
- Czech
- Description:
- PDiT 2.0 is a new version of the Prague Discourse Treebank. It contains a complex annotation of discourse phenomena enriched by the annotation of secondary connectives.
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB