Subject: Czech / Type: text - LINDAT/CLARIAH-CZ Catalog Search Results

31. ORTOFON v1: balanced corpus of informal spoken Czech with multi-tier transcription (transcriptions)

Creator:: Kopřivová, Marie, Komrsková, Zuzana, Lukeš, David, Poukarová, Petra, and Škarpová, Marie
Publisher:: Charles University, Faculty of Arts, Institute of the Czech National Corpus
Type:: text and corpus
Subject:: balanced corpus, spoken language, informal language, and Czech
Language:: Czech
Description:: ORTOFON v1 is designed as a representation of authentic spoken Czech used in informal situations (private environment, spontaneity, unpreparedness etc.) across the whole Czech Republic. The corpus is composed of 332 recordings from 2012–2017 and contains 1 014 786 orthographic words (i.e. a total of 1 236 508 tokens including punctuation); a total of 624 different speakers appear in the probes. ORTOFON v1 is fully balanced regarding the basic sociolinguistic speaker categories (gender, age group, level of education and region of childhood residence). The transcription is linked to the corresponding audio track. Unlike the ORAL-series corpora, the transcription was carried out on two main tiers, orthographic and phonetic, supplemented by an additional metalanguage tier. ORTOFON v1 is lemmatized and morphologically tagged. The (anonymized) corpus is provided in a (semi-XML) vertical format used as an input to the Manatee query engine. The data thus correspond to the corpus available via the KonText query engine to registered users of the CNC at http://www.korpus.cz. Please note: this item includes only the transcriptions; the audio (and the transcripts in their original format) is available under a more restrictive non-CC license at http://hdl.handle.net/11234/1-2579
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
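
Note on the vertical format: the description above says the corpus is distributed in a (semi-XML) vertical format. As a rough illustration only, the sketch below reads such a file under the common convention that structure markers (XML-like tags) sit on their own lines and every other line is one token with tab-separated attributes. The sample lines, the tag names (doc, sp) and the column order (word form, lemma, tag) are assumptions made for this example and are not taken from the ORTOFON documentation.

```python
from typing import Iterable, Iterator, List, Tuple

Event = Tuple[str, List[str]]


def parse_vertical(lines: Iterable[str]) -> Iterator[Event]:
    """Yield ("structure", [tag]) or ("token", [attr, ...]) for each non-blank line.

    Assumption: a line that starts with "<" and ends with ">" is a structure
    marker; anything else is a token line with tab-separated attributes.
    """
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        if line.startswith("<") and line.endswith(">"):
            yield ("structure", [line])
        else:
            yield ("token", line.split("\t"))


# Hypothetical sample; real files may use different tags and columns.
SAMPLE = [
    '<doc id="hypothetical-01">',
    '<sp num="1">',
    "to\tten\tP",
    "je\tbýt\tV",
    "dobrý\tdobrý\tA",
    "</sp>",
    "</doc>",
]

events = list(parse_vertical(SAMPLE))
tokens = [fields for kind, fields in events if kind == "token"]
word_forms = [fields[0] for fields in tokens]
print(len(tokens), word_forms)
```

To read a real file, the same function could be given an open file handle instead of SAMPLE; the actual attribute columns should be checked against the corpus documentation before relying on any column index.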

32. Prague Dependency Treebank 2.0 (PDT 2.0)

Creator:: Hajič, Jan, Panevová, Jarmila, Hajičová, Eva, Sgall, Petr, Pajas, Petr, Štěpánek, Jan, Havelka, Jiří, Mikulová, Marie, Žabokrtský, Zdeněk, Ševčíková-Razímová, Magda, and Urešová, Zdeňka
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: corpus, Czech, treebank, and PDT
Language:: Czech
Description:: The Prague Dependency Treebank 2.0 (PDT 2.0) contains a large amount of Czech texts with complex and interlinked morphological (two million words), syntactic (1.5 MW) and complex semantic annotation (0.8 MW); in addition, certain properties of sentence information structure and coreference relations are annotated at the semantic level. PDT 2.0 is based on the long-standing Praguian linguistic tradition, adapted for the current Computational Linguistics research needs. The corpus itself uses the latest annotation technology. Software tools for corpus search, annotation and language analysis are included. Extensive documentation (in English) is provided as well. Projects: 1ET101120413 (Data a nástroje pro informační systémy), MSM 0021620838 (Moderní metody, struktury a systémy informatiky), 1ET101120503 (Integrace jazykových zdrojů za účelem extrakce informací z přirozených textů), 1P05ME752 (Vícejazyčný valenční a predikátový slovník přirozeného jazyka), LC536 (Centrum komputační lingvistiky)
Rights:: PDT 2.0 License, https://lindat.mff.cuni.cz/repository/xmlui/page/license-pdt2, and ACA

33. Processing of intraclausal garden-path structures in Czech

Creator:: Ceháková, Markéta and Chromý, Jan
Publisher:: Charles University, Faculty of Arts, Institute of Czech Language and Theory of Communication
Type:: text, other, and languageDescription
Subject:: psycholinguistic experiments, sentence processing, Czech, garden-path, reading comprehension, and syntax
Language:: Czech
Description:: Experimental materials, data and R scripts used in the paper "Garden-path sentences and the diversity of their (mis)representations" (Ceháková - Chromý, 2023).
Rights:: Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB

34. RobeCzech Base

Creator:: Straka, Milan, Náplava, Jakub, Straková, Jana, and Samuel, David
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text, mlmodel, and languageDescription
Subject:: Czech, BERT, and RoBERTa
Language:: Czech
Description:: RobeCzech is a monolingual RoBERTa language representation model trained on Czech data. RoBERTa is a robustly optimized Transformer-based pretraining approach. We show that RobeCzech considerably outperforms equally-sized multilingual and Czech-trained contextualized language representation models, surpasses current state of the art in all five evaluated NLP tasks and reaches state-of-the-art results in four of them. The RobeCzech model is released publicly at https://hdl.handle.net/11234/1-3691 and https://huggingface.co/ufal/robeczech-base, both for PyTorch and TensorFlow.
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

35. Semantic annotation of noun/verb conversion in Czech

Creator:: Ševčíková, Magda, Kyjánek, Lukáš, Hledíková, Hana, and Staňková, Anna
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: other, text, and lexicalConceptualResource
Subject:: conversion, semantic, noun, verb, word formation, and Czech
Language:: Czech
Description:: The item contains a list of 2,058 noun/verb conversion pairs along with related formations (word-formation paradigms) provided with linguistic features, including semantic categories that characterize semantic relations between the noun and the verb in each conversion pair. Semantic categories were assigned manually by two human annotators based on a set of sentences containing the noun and the verb from individual conversion pairs. In addition to the list of paradigms, the item contains a set of 739 files (a separate file for each conversion pair) annotated by the annotators in parallel and a set of 2,058 files containing the final annotation, which is included in the list of paradigms.
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), PUB, and http://creativecommons.org/licenses/by-nc-sa/4.0/

36. sqad 2.1

Creator:: Medveď, Marek, Horák, Aleš, and Kušniráková, Dáša
Publisher:: Natural Language Processing Centre, Faculty of Informatics, Masaryk University
Type:: text and corpus
Subject:: Czech, Simple Question Answering Database, and question answering
Language:: Czech
Description:: Simple question answering database version 2.1 (SQAD_v2.1) created from Czech Wikipedia. Each record of SQAD consists of four files (in vertical form provided with lemmatization and POS tagging) and two metadata files.
Rights:: GNU Library or "Lesser" General Public License 3.0 (LGPL-3.0), http://opensource.org/licenses/LGPL-3.0, and PUB

37. sqad 3.0

Creator:: Medveď, Marek and Horák, Aleš
Publisher:: Masaryk University, NLP Centre
Type:: text and corpus
Subject:: Simple Question Answering Database, Czech, and question answering
Language:: Czech
Description:: Simple question answering database version 3 (SQAD v3) created from Czech Wikipedia. The new version consists of 13477 records. Each record of SQAD consists of multiple files: question, answer extraction, answer selection, URL, question metadata and, in some cases, answer context.
Rights:: GNU Library or "Lesser" General Public License 3.0 (LGPL-3.0), http://opensource.org/licenses/LGPL-3.0, and PUB

38. SQAD v2

Creator:: Medveď, Marek, Horák, Aleš, and Šulganová, Terézia
Publisher:: Natural Language Processing Centre, Faculty of Informatics, Masaryk University
Type:: text and corpus
Subject:: question answering, Czech, and Simple Question Answering Database
Language:: Czech
Description:: Simple question answering database (SQAD) created from Czech Wikipedia. Each record of SQAD consists of four files (in vertical form provided with lemmatization and POS tagging) and two metadata files.
Rights:: GNU General Public Licence, version 3, http://opensource.org/licenses/GPL-3.0, and PUB

39. VALLEX 2.5

Creator:: Lopatková, Markéta, Žabokrtský, Zdeněk, and Kettnerová, Václava
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text, lexicon, and lexicalConceptualResource
Subject:: valency and Czech
Language:: Czech
Description:: The Valency Lexicon of Czech Verbs, Version 2.5 (VALLEX 2.5), is a collection of linguistically annotated data and documentation, resulting from an attempt at formal description of valency frames of Czech verbs. VALLEX 2.5 has been developed at the Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University, Prague. VALLEX 2.5 provides information on the valency structure (combinatorial potential) of verbs in their particular senses; there are roughly 2,730 lexeme entries containing together around 6,460 lexical units ("senses"). Projects: LC 536 (Center for Computational Linguistics), 1ET100300517 and 1ET101120503.
Rights:: Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0), http://creativecommons.org/licenses/by-nc-sa/3.0/, and PUB

40. VALLEX 3.0

Creator:: Lopatková, Markéta, Kettnerová, Václava, Bejček, Eduard, Vernerová, Anna, and Žabokrtský, Zdeněk
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text, machineReadableDictionary, and lexicalConceptualResource
Subject:: valency, diatheses, alternations, grammar rules, Czech, lexicon, semantics, and syntax
Language:: Czech
Description:: VALLEX 3.0 provides information on the valency structure (combinatorial potential) of verbs in their particular senses, which are characterized by glosses and examples. VALLEX 3.0 describes almost 4 600 Czech verbs in more than 10 800 lexical units, i.e., given verbs in the given senses. VALLEX 3.0 is a collection of linguistically annotated data and documentation, resulting from an attempt at formal description of valency frames of Czech verbs. In order to satisfy different needs of different potential users, the lexicon is distributed (i) in an HTML version (the data allows for an easy and fast navigation through the lexicon) and (ii) in a machine-tractable form as a single XML file, so that the VALLEX data can be used in NLP applications.
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
