Czech models for MorphoDiTa, providing morphological analysis, morphological generation and part-of-speech tagging.
The morphological dictionary is created from MorfFlex CZ 2.0 and DeriNet 2.1, and the PoS tagger is trained on the Prague Dependency Treebank - Consolidated 1.0. This work has been using language resources developed and/or stored and/or distributed by the LINDAT/CLARIN project of the Ministry of Education of the Czech Republic (project LM2010013).
The Czech morphologic system was devised by Jan Hajič.
The MorfFlex CZ dictionary was created by Jan Hajič and Jaroslava Hlaváčová.
The morphologic guesser research was supported by the projects 1ET101120503 and 1ET101120413 of Academy of Sciences of the Czech Republic and 100008/2008 of Charles University Grant Agency. The research was performed by Jan Hajič, Jaroslava Hlaváčová and David Kolovratník.
The tagger algorithm and feature set research was supported by the projects MSM0021620838 and LC536 of Ministry of Education, Youth and Sports of the Czech Republic, GA405/09/0278 of the Grant Agency of the Czech Republic and 1ET101120503 of Academy of Sciences of the Czech Republic. The research was performed by Drahomíra "johanka" Spoustová, Jan Hajič, Jan Raab and Miroslav Spousta.
The tagger is trained on the morphological layer of the Prague Dependency Treebank (PDT 2.5), which was supported by the projects LM2010013, LC536, LN00A063 and MSM0021620838 of the Ministry of Education, Youth and Sports of the Czech Republic, and developed by Martin Buben, Jan Hajič, Jiří Hana, Hana Hanová, Barbora Hladká, Emil Jeřábek, Lenka Kebortová, Kristýna Kupková, Pavel Květoň, Jiří Mírovský, Andrea Pfimpfrová, Jan Štěpánek and Daniel Zeman.
Czech models for Korektor 2, created by Michal Richter (02 Feb 2013). The models can either perform spellchecking and grammar checking, or only generate diacritical marks. This work was created by Michal Richter as an extension of his diploma thesis Advanced Czech Spellchecker. The models use the MorfFlex CZ dictionary (http://hdl.handle.net/11858/00-097C-0000-0015-A780-9) created by Jan Hajič and Jaroslava Hlaváčová.
BASIC INFORMATION
--------------------
Czech Text Document Corpus v 2.0 is a collection of text documents for automatic document classification in the Czech language. It is composed of text documents provided by the Czech News Agency and is freely available for research purposes. The corpus was created to facilitate a straightforward comparison of document classification approaches on Czech data. It is particularly suited to the evaluation of multi-label document classification approaches, because one document is usually labelled with more than one label. Besides the information about the document classes, the corpus is also annotated at the morphological layer.
The main part (for training and testing) is composed of 11,955 real newspaper articles. We also provide a development set intended for tuning the hyper-parameters of the created models. This set contains 2,735 additional articles.
There are 60 categories in total, of which the 37 most frequent are used for classification. This reduction keeps only the classes with enough occurrences to train the models.
Technical Details
------------------------
Text documents are stored in individual text files using UTF-8 encoding. Each filename consists of a serial number followed by the list of category abbreviations, all separated by underscores, and the .txt suffix. Serial numbers have five digits, and the numbering starts at one.
For instance, the file 00046_kul_nab_mag.txt represents document number 46, annotated with the categories kul (culture), nab (religion) and mag (magazine selection). The content of the document, i.e. the word tokens, is stored on one line. The tokens are separated by spaces.
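The naming scheme described above can be decoded mechanically. The following is a minimal sketch in Python, not part of the corpus distribution; the function name parse_document_name is hypothetical and the validation rule (exactly five digits) is taken from the description above.

```python
from pathlib import Path


def parse_document_name(filename):
    """Split a corpus filename into (serial number, list of category codes).

    Hypothetical helper: assumes the pattern
    <five-digit serial>_<cat>_<cat>..._<cat>.txt described in the text.
    """
    stem = Path(filename).stem            # drop the .txt suffix
    serial, *categories = stem.split("_")
    if len(serial) != 5 or not serial.isdigit():
        raise ValueError(f"unexpected serial number in {filename!r}")
    return int(serial), categories


serial, categories = parse_document_name("00046_kul_nab_mag.txt")
print(serial, categories)
```

Because a document may carry several category codes, the returned list can be used directly as the label set in a multi-label classification setup.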
Every text document was further automatically morphologically analyzed. This analysis includes lemmatization, POS tagging and syntactic parsing. The fully annotated files are stored in .conll files. We also provide the lemmatized form (files with the suffix .lemma) and the corresponding POS tags (.pos files). The tokenized version of the documents is also available in .tok files.
This corpus is available free of charge for research purposes only. Commercial use in any form is strictly excluded.
Diachronic corpus of Czech sized 3.45 million words (i.e. 4.1 million tokens). It contains 116 texts from the 14th-20th century period. The texts are transcribed, not transliterated. Diakorp v6 is provided in a CoNLL-U-like vertical format used as an input to the Manatee query engine. The data thus correspond to the corpus available via the KonText query interface to the registered users of CNC at http://www.korpus.cz
Phonological neighborhood density is known to influence lexical access, speech production and perception processes. Lexical competition is thought to be the central concept from which the neighborhood effect emanates: highly competitive neighborhoods are characterized by large degrees of phonemic co-activation, which can delay speech recognition and facilitate speech production. The present study investigates phonetic learning in English as a foreign language in relation to phonological neighborhood density and onset density to see whether dense or sparse neighborhoods are more conducive to the incorporation of novel phonetic detail. In addition, the effect of voice-contrasted minimal pairs (bat-pat) is explored. Results indicate that sparser neighborhoods with weaker lexical competition provide the optimal phonological environment for phonetic learning. Moreover, novel phonetic details are incorporated faster in neighborhoods without minimal pairs. Taken together, the results indicate that lexical competition plays a role in the dissemination of phonetic updates in the lexicon of foreign language learners.
FERNET-C5 is a monolingual BERT language representation model trained from scratch on the Czech Colossal Clean Crawled Corpus (C5), a Czech counterpart of the English C4 dataset. The training data contained almost 13 billion words (93 GB of text). The model has the same architecture as the original BERT model, i.e. 12 transformer blocks, 12 attention heads and a hidden size of 768 neurons. In contrast to Google’s BERT models, we used SentencePiece tokenization instead of Google’s internal WordPiece tokenization.
More details can be found in README.txt. A more detailed description is available at https://arxiv.org/abs/2107.10042
The same models are also released at https://huggingface.co/fav-kky/FERNET-C5
The contribution includes the data frame and the R script (Markdown file) belonging to the paper "Who Benefits from an Imperative? Assessment of Directives on a Benefit-Scale", submitted to the journal Pragmatics in September 2024.
Phonotactic probability refers to the frequency with which phonological segments and sequences of phonological segments occur in words in a given language (Vitevitch – Luce, 2004). It has been shown that phonotactic probabilities of words are important in language processing and language acquisition (Jusczyk et al., 1994; Mattys – Jusczyk, 2001; Pitt – McQueen, 1998). For example, words with high phonotactic probability are processed faster by native speakers in same-different tasks (Luce – Large, 2001), and pseudowords with high phonotactic probability are judged as more word-like by adults (Vitevitch et al., 1997). In this paper we present a phonotactic calculator for Czech, freely available as a Python script. The script relies on word frequency data from three freely available corpora of Czech: SYN2015 and SYN2020, corpora of written Czech (Křen et al., 2015; 2020), and ORAL v1, a corpus of spoken Czech (Kopřivová et al., 2017). The steps of the calculation mirror those developed by Vitevitch and Luce (2004) for English, and the script can provide an estimate of phonotactic (and additionally orthotactic) probability for any Czech word or pseudoword. The script can be downloaded at <https://phonocalc.github.io>.
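To illustrate the kind of calculation involved, the following is a simplified sketch of a positional segment probability with log-frequency weighting, in the spirit of Vitevitch and Luce (2004). It is not the published script: the toy lexicon, its frequencies and the function names are invented for illustration, letters stand in for phonemes, and biphone probabilities are omitted.

```python
import math
from collections import defaultdict

# Hypothetical toy lexicon: word -> corpus frequency (invented numbers).
# Letters stand in for phonemes; a real tool would use phonemic transcriptions.
TOY_LEXICON = {"pes": 100, "les": 10, "pas": 1000, "pak": 10}


def positional_tables(lexicon):
    """Sum log10 frequencies per (position, segment) and per position."""
    segment_weight = defaultdict(float)   # (position, segment) -> weight
    position_weight = defaultdict(float)  # position -> total weight
    for word, freq in lexicon.items():
        weight = math.log10(freq)
        for position, segment in enumerate(word):
            segment_weight[(position, segment)] += weight
            position_weight[position] += weight
    return segment_weight, position_weight


def positional_probability(word, lexicon):
    """Sum, over positions, the share of weight held by the word's segment."""
    segment_weight, position_weight = positional_tables(lexicon)
    total = 0.0
    for position, segment in enumerate(word):
        if position_weight[position] > 0:
            total += segment_weight[(position, segment)] / position_weight[position]
    return total


for item in ("pes", "lek"):
    print(item, round(positional_probability(item, TOY_LEXICON), 3))
```

In this toy setting the existing word "pes" scores higher than the pseudoword "lek", because its segments are shared with more, and more frequent, words in the same positions.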
This paper discusses the possibilities for research on translated Czech and on so-called translation universals in Czech. It introduces Jerome, a monolingual comparable corpus designed at the Institute of the Czech National Corpus specifically to meet the requirements of translation studies researchers. A case study of simplification presents the results of examining this translation universal in translated Czech and shows the advantages as well as the disadvantages of the quantitative approach.
This article deals with intercultural contact in branches of multinational companies or corporations founded in the Czech Republic after 1989 by German, Austrian or Swiss owners. Multinational businesses (large ones in particular) try to regulate communication within the company. This is achieved predominantly by introducing an official corporate language in the company, employing people fluent in the language, and promoting language courses.
In 9% of the companies Czech is the sole corporate language; in 55% German fulfils this role, in 16% English, in 15% German and English, and in 5% German and Czech. As for language courses, 64% of the companies support German courses, 19% Czech courses and 48% English courses.
Our research, based on the analysis of questionnaires and semi-structured interview data, has shown that the foreign employees sent to the Czech Republic seldom adapt to the language of the local employees, while the adaptation of the local employees to the language of the foreign ones is not only usual but also expected. The regulation of communication therefore results primarily in asymmetrical language adaptation, which benefits the German, Austrian and Swiss owners and the German-speaking foreign employees delegated by them (the so-called expatriates). However, the companies examined also promote the use of English to a considerable extent, which provides a basis for symmetrical communication between local and expatriate employees.
This adaptation, however, concerns mainly the management level, whereas production remains largely Czech in character. Non-adaptation, which leads to the use of interpreters and translators, is also widespread. This is the case, alongside asymmetrical adaptation and recourse to English, in 80% of the companies and in 95% of the large ones.
A detailed description of communication in one of the Siemens group companies operating in the Czech Republic shows how positions in a manufacturing company are filled and which language qualifications are attached to them. It also shows how the corporate language changes, how intercultural communication actually proceeds with the help of linguistically qualified employees, and how these employees are prepared for their tasks, for example in language courses.