Subject: Czech - LINDAT/CLARIAH-CZ Catalog Search Results


11. Czech Models (MorfFlex CZ 2.0 + PDT-C 1.0) for MorphoDiTa 220710

Creator:: Straka, Milan and Straková, Jana
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text, languageDescription, and mlmodel
Subject:: MorphoDiTa, Czech, morphological analysis, morphological generation, and PoS tagging
Language:: Czech
Description:: Czech models for MorphoDiTa, providing morphological analysis, morphological generation and part-of-speech tagging. The morphological dictionary is created from MorfFlex CZ 2.0 and DeriNet 2.1, and the PoS tagger is trained on Prague Dependency Treebank - Consolidated 1.0. This work uses language resources developed and/or stored and/or distributed by the LINDAT/CLARIN project of the Ministry of Education of the Czech Republic (project LM2010013). The Czech morphological system was devised by Jan Hajič. The MorfFlex CZ dictionary was created by Jan Hajič and Jaroslava Hlaváčová. The morphological guesser research was supported by the projects 1ET101120503 and 1ET101120413 of the Academy of Sciences of the Czech Republic and 100008/2008 of the Charles University Grant Agency. The research was performed by Jan Hajič, Jaroslava Hlaváčová and David Kolovratník. The tagger algorithm and feature set research was supported by the projects MSM0021620838 and LC536 of the Ministry of Education, Youth and Sports of the Czech Republic, GA405/09/0278 of the Grant Agency of the Czech Republic and 1ET101120503 of the Academy of Sciences of the Czech Republic. The research was performed by Drahomíra "johanka" Spoustová, Jan Hajič, Jan Raab and Miroslav Spousta. The tagger is trained on the morphological layer of Prague Dependency Treebank PDT 2.5, which was supported by the projects LM2010013, LC536, LN00A063 and MSM0021620838 of the Ministry of Education, Youth and Sports of the Czech Republic, and developed by Martin Buben, Jan Hajič, Jiří Hana, Hana Hanová, Barbora Hladká, Emil Jeřábek, Lenka Kebortová, Kristýna Kupková, Pavel Květoň, Jiří Mírovský, Andrea Pfimpfrová, Jan Štěpánek and Daniel Zeman.
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

12. Czech Models for Korektor 2

Creator:: Richter, Michal
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text, languageDescription, and mlmodel
Subject:: Korektor, Czech, spellchecker, spellchecking, and diacritical marks generation
Language:: Czech
Description:: Czech models for Korektor 2, created by Michal Richter on 2 February 2013. The models can either perform spellchecking and grammar checking, or only generate diacritical marks. This work was created by Michal Richter as an extension of his diploma thesis Advanced Czech Spellchecker. The models utilize the MorfFlex CZ dictionary (http://hdl.handle.net/11858/00-097C-0000-0015-A780-9) created by Jan Hajič and Jaroslava Hlaváčová.
Rights:: Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0), http://creativecommons.org/licenses/by-nc-sa/3.0/, and PUB

13. Czech Text Document Corpus v 2.0

Creator:: Král, Pavel and Lenc, Ladislav
Publisher:: European Language Resources Association (ELRA)
Type:: text and corpus
Subject:: corpus, Czech, document classification, multi-label, and text
Language:: Czech
Description:: Basic information: Czech Text Document Corpus v 2.0 is a collection of text documents for automatic document classification in the Czech language. It is composed of text documents provided by the Czech News Agency and is freely available for research purposes. The corpus was created to facilitate straightforward comparison of document classification approaches on Czech data. It is particularly suited to evaluating multi-label document classification approaches, because a document is usually labelled with more than one label. Besides the information about document classes, the corpus is also annotated at the morphological layer. The main part (for training and testing) is composed of 11,955 real newspaper articles. A development set of 2,735 additional articles is also provided, intended for tuning the hyper-parameters of the created models. There are 60 categories in total, of which the 37 most frequent are used for classification; this reduction keeps only the classes with enough occurrences to train the models.
Technical details: Text documents are stored in individual text files using UTF-8 encoding. Each filename is composed of a serial number and a list of category abbreviations separated by underscores, followed by the .txt suffix. Serial numbers have five digits and the series starts from one. For instance, the file 00046_kul_nab_mag.txt is document number 46, annotated with the categories kul (culture), nab (religion) and mag (magazine selection). The content of the document, i.e. the word tokens, is stored on one line, with tokens separated by spaces. Every text document was also automatically morphologically analyzed; this analysis includes lemmatization, POS tagging and syntactic parsing. The fully annotated files are stored in .conll files. The lemmatized form is provided in files with the .lemma suffix, and the corresponding POS tags in .pos files. The tokenized version of the documents is available in .tok files. This corpus is available free of charge for research purposes only. Commercial use in any form is strictly excluded.
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
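
Note on item 13: the file-naming scheme described above (five-digit serial number, underscore-separated category abbreviations, .txt suffix) can be decoded with a few lines of Python. This is a minimal sketch based only on that description; the names CorpusFileName and parse_corpus_filename are hypothetical and do not come from the corpus distribution.

```python
import re
from typing import NamedTuple, Tuple


class CorpusFileName(NamedTuple):
    """Serial number and category labels decoded from one file name."""
    serial: int
    categories: Tuple[str, ...]


# Five digits, then one or more "_<abbreviation>" groups, then ".txt".
_NAME_RE = re.compile(r"^(\d{5})((?:_[^_.]+)+)\.txt$")


def parse_corpus_filename(name: str) -> CorpusFileName:
    """Split a name such as '00046_kul_nab_mag.txt' into its parts."""
    match = _NAME_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name!r}")
    serial = int(match.group(1))
    # group(2) looks like "_kul_nab_mag"; drop the leading underscore first.
    categories = tuple(match.group(2).strip("_").split("_"))
    return CorpusFileName(serial, categories)


parsed = parse_corpus_filename("00046_kul_nab_mag.txt")
print(parsed.serial, parsed.categories)
```

Returning the categories as a tuple keeps the multi-label nature of the annotation explicit, so the result can be fed directly into a multi-label binarizer or a similar encoding step.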

14. Diakorp v6: diachronic corpus of Czech

Creator:: Kučera, Karel, Řehořková, Anna, and Stluka, Martin
Publisher:: Charles University, Faculty of Arts, Institute of the Czech National Corpus
Type:: text and corpus
Subject:: corpus, diachronic, and Czech
Language:: Czech
Description:: Diachronic corpus of Czech sized 3.45 million words (i.e. 4.1 million tokens). It contains 116 texts from the 14th-20th century period. The texts are transcribed, not transliterated. Diakorp v6 is provided in a CoNLL-U-like vertical format used as an input to the Manatee query engine. The data thus correspond to the corpus available via the KonText query interface to the registered users of CNC at http://www.korpus.cz
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

15. Diffusion of phonetic updates within phonological neighborhoods, ELOPE, Data

Creator:: Luef, Eva Maria, Resnik, Pia, and Gráf, Tomáš
Publisher:: Charles University and University of Vienna
Type:: other, text, and languageDescription
Subject:: Austrian German, Czech, phonological neighborhood, English as a second language, aspiration, and minimal pair
Language:: English
Description:: Phonological neighborhood density is known to influence lexical access, speech production as well as perception processes. Lexical competition is thought to be the central concept from which the neighborhood effect emanates: highly competitive neighborhoods are characterized by large degrees of phonemic co-activation, which can delay speech recognition and facilitate speech production. The present study investigates phonetic learning in English as a foreign language in relation to phonological neighborhood density and onset density to see whether dense or sparse neighborhoods are more conducive to the incorporation of novel phonetic detail. In addition, the effect of voice-contrasted minimal pairs (bat-pat) is explored. Results indicate that sparser neighborhoods with weaker lexical competition provide the optimal phonological environment for phonetic learning. Moreover, novel phonetic details are incorporated faster in neighborhoods without minimal pairs. Overall, the results indicate that lexical competition plays a role in the dissemination of phonetic updates in the lexicon of foreign language learners.
Rights:: Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0), http://creativecommons.org/licenses/by-nc-nd/4.0/, and PUB

16. FERNET-C5

Creator:: Lehečka, Jan and Švec, Jan
Publisher:: University of West Bohemia, Department of Cybernetics
Type:: text, mlmodel, and languageDescription
Subject:: Czech and BERT
Language:: Czech
Description:: FERNET-C5 is a monolingual BERT language representation model trained from scratch on the Czech Colossal Clean Crawled Corpus (C5) data, a Czech mutation of the English C4 dataset. The training data contained almost 13 billion words (93 GB of text data). The model has the same architecture as the original BERT model, i.e. 12 transformer blocks, 12 attention heads and a hidden size of 768 neurons. In contrast to Google’s BERT models, we used SentencePiece tokenization instead of Google’s internal WordPiece tokenization. More details can be found in README.txt. A more detailed description is available at https://arxiv.org/abs/2107.10042. The same models are also released at https://huggingface.co/fav-kky/FERNET-C5
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
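
Note on item 16: the architecture figures quoted above (12 blocks, 12 attention heads, hidden size 768) imply a per-head dimension, because standard multi-head attention splits the hidden size evenly across heads. The sketch below only records those three numbers and derives that value; the EncoderShape class is a hypothetical helper, not part of the released model or of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncoderShape:
    """The three architecture numbers quoted in the description."""
    num_layers: int
    num_heads: int
    hidden_size: int

    @property
    def head_dim(self) -> int:
        # Multi-head attention divides the hidden size evenly across heads.
        if self.hidden_size % self.num_heads != 0:
            raise ValueError("hidden_size must be divisible by num_heads")
        return self.hidden_size // self.num_heads


FERNET_C5_SHAPE = EncoderShape(num_layers=12, num_heads=12, hidden_size=768)
print(FERNET_C5_SHAPE.head_dim)  # 768 // 12 = 64
```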

17. Imperative Benefit Evaluation

Creator:: Chromá, Anna and Perevozchikova, Tatiana
Publisher:: Charles University, Faculty of Arts and Eberhard-Karls-Universität Tübingen
Type:: text, other, and lexicalConceptualResource
Subject:: imperative, cost-benefit, illocutionary force, politeness, speech act, directives, speaker-oriented, and Czech
Language:: Czech
Description:: The contribution includes the data frame and the R script (Markdown file) belonging to the paper "Who Benefits from an Imperative? Assessment of Directives on a Benefit-Scale", submitted to the journal Pragmatics in September 2024.
Rights:: Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB

18. Introducing a phonotactic probability calculator for Czech

Creator:: Čechová, Petra, Cilibrasi, Luca, Henyš, Jan, and Čecho, Jaroslav
Format:: unmediated and volume
Type:: model:article and TEXT
Subject:: Czech, language processing, phonotactic probability, phonotactics, and pseudoword
Language:: Czech
Description:: Phonotactic probability refers to the frequency with which phonological segments and sequences of phonological segments occur in words in a given language (Vitevitch – Luce, 2004). It has been shown that phonotactic probabilities of words are important in language processing and language acquisition (Jusczyk et al., 1994; Mattys – Jusczyk, 2001; Pitt – McQueen, 1998). For example, words with high phonotactic probability are processed faster by native speakers in same-different tasks (Luce – Large, 2001), and pseudowords with high phonotactic probability are judged as more word-like by adults (Vitevitch et al., 1997). In this paper we present a phonotactic calculator for Czech implemented as a Python script. The script relies on frequency data from three freely available corpora of Czech: SYN2015 and SYN2020, corpora of written Czech (Křen et al., 2015; 2020), and ORAL v1, a corpus of spoken Czech (Kopřivová et al., 2017). The steps of the calculation mirror those developed by Vitevitch and Luce (2004) for English, and the script can provide phonotactic (and additionally orthotactic) probability for any Czech word or pseudoword. The script can be downloaded at <https://phonocalc.github.io>.
Rights:: http://creativecommons.org/licenses/by-nc-sa/4.0/ and policy:public
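
Note on item 18: the idea of a positional segment probability can be illustrated with a short Python sketch. It is not the phonocalc script. It assumes a toy lexicon of already-segmented words, each with a caller-supplied weight (for example a log frequency), and estimates how likely each segment is at each position as its share of the total weight at that position. The function names, the toy lexicon and its weights are all hypothetical, and sequence (biphone) probabilities are left out for brevity.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Segments = Tuple[str, ...]          # a word, already split into segments
PositionalKey = Tuple[int, str]     # (position in word, segment)


def positional_segment_probs(
    lexicon: Dict[Segments, float]
) -> Dict[PositionalKey, float]:
    """Weighted share of each segment at each word position."""
    seg_weight: Dict[PositionalKey, float] = defaultdict(float)
    pos_weight: Dict[int, float] = defaultdict(float)
    for segments, weight in lexicon.items():
        for pos, seg in enumerate(segments):
            seg_weight[(pos, seg)] += weight
            pos_weight[pos] += weight
    return {
        key: w / pos_weight[key[0]]
        for key, w in seg_weight.items()
        if pos_weight[key[0]] > 0
    }


def word_profile(
    segments: Segments, probs: Dict[PositionalKey, float]
) -> List[float]:
    """Per-position probabilities for a word or pseudoword (0.0 if unseen)."""
    return [probs.get((pos, seg), 0.0) for pos, seg in enumerate(segments)]


# Toy lexicon with made-up weights, for illustration only.
toy_lexicon = {
    ("p", "e", "s"): 2.0,
    ("l", "e", "s"): 1.0,
    ("p", "a", "s"): 1.0,
}
toy_probs = positional_segment_probs(toy_lexicon)
print(word_profile(("l", "a", "s"), toy_probs))  # a pseudoword
```

Taking pre-segmented tuples as input keeps the sketch independent of any particular transcription scheme, which matters for Czech because one phoneme may be spelled with more than one letter.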

19. Jak zkoumat překladovou češtinu: výzkum simplifikace na korpusu Jerome [How to study translated Czech: research on simplification in the Jerome corpus]

Creator:: Chlumská, Lucie and Richterová, Olga
Format:: unmediated and volume
Type:: model:article and TEXT
Subject:: Czech, language of translation, translation universals, simplification, and comparable corpus
Language:: Czech
Description:: This paper discusses the possibilities of the research of translated Czech as well as so-called translation universals in Czech. It introduces a monolingual comparable corpus Jerome specifically designed at the Institute of the Czech National Corpus to meet the requirements of translation studies researchers. The case study of simplification presents the results of examining this translation universal in translated Czech and shows the advantages as well as disadvantages of the quantitative approach.
Rights:: http://creativecommons.org/publicdomain/mark/1.0/ and policy:public

20. K jazykové situaci v nadnárodních podnicích působících v České republice [On the language situation in multinational companies operating in the Czech Republic]

Creator:: Nekvapil, Jiří and Nekula, Marek
Format:: unmediated and volume
Type:: model:article and TEXT
Subject:: language planning, multinational companies, Czech Republic, corporate language, Czech, English, and German
Language:: Czech
Description:: This article deals with intercultural contact in branches of multinational companies or corporations founded in the Czech Republic by German, Austrian or Swiss owners. Multinational businesses (large ones in particular) are trying to regulate the communication within the company. This is achieved predominantly by introducing an official corporate language in the company, employing people fluent in the language, and promoting language courses. Our research, based on the analysis of questionnaires and semi-structured interview data, has shown that the foreign employees seldom adapt to the language of the local employees, while the adaptation of the local employees to the language of the foreign ones is not only usual but also expected. The regulation of the communication therefore results in the promotion of primarily asymmetrical language adaptation, which benefits the German, Austrian and Swiss owners and the German-speaking foreign employees delegated by them (the so-called expatriates). However, the companies examined also promote the use of English to a considerable extent, which provides a basis for symmetrical communication between local and expatriate employees. The German-language abstract adds the following details. The companies in question were founded after 1989. Czech is the sole corporate language in 9% of the companies, German in 55%, English in 16%, German and English together in 15%, and German and Czech together in 5%. German courses are supported in 64% of the companies, Czech courses in 19% and English courses in 48%. The adaptation affects mainly the management level, while production remains largely Czech in character. Non-adaptation, which leads to the use of interpreters and translators, is also widespread: alongside asymmetrical adaptation and recourse to English, it occurs in 80% of the companies and in 95% of the large ones. A detailed description of communication in one of the Siemens group companies operating in the Czech Republic shows how positions in a manufacturing company are filled and which language qualifications they are associated with; it also shows how the corporate language changes, how intercultural communication actually proceeds with the help of linguistically qualified employees, and how these employees are prepared for their tasks, for example in language courses.
Rights:: http://creativecommons.org/licenses/by-nc-sa/4.0/ and policy:public
