KUK 0.0 is a pilot version of a corpus of Czech legal and administrative texts designated as data for manual and automatic assessment of accessibility (comprehensibility or clarity) of Czech legal texts.
LANGUAGES IN MIGRATION is designed as a representation of authentic spoken Czech and German that is used in informal speech (private environment, spontaneity, unpreparedness etc.) by Czech-German bilingual speakers born in Czechoslovakia around 1955 and who departed for Germany after becoming 12 years old. The corpus is composed of interviews conducted from 2018–2020 with 20 speakers on language biographies and narrated in Czech and German respectively. 10 interviews were recorded with late (German) repatriates and 10 with Czech migrants. The corpus includes transcripts of ca. 14 hours of Czech recordings and ca. 13,5 hours of German recordings. It contains 217 650 orthographic words (i.e. a total of 286 533 tokens including punctuation). Metadata of LANGUAGES IN MIGRATION include basic sociolinguistically relevant speaker categories (gender, year of birth and of migration, level of education and region of childhood and present residence).
The transcription of LANGUAGES IN MIGRATION is linked to the corresponding audio track. The transcription was carried out on the orthographic tier and supplemented by an additional metalanguage tier. The corpus LANGUAGES IN MIGRATION is lemmatized and morphologically tagged in different formats for Czech and German (Stuttgart-Tübingen-Tagset). Deviations from the norm of the spoken Czech and German of the homeland, which are understood as the result of language contact and language isolation, are tagged in a further tier both in the Czech and in the German sub-corpuses of LANGUAGES IN MIGRATION. The (anonymized) corpus is provided in form of transcripts in EAF format, which can be viewed via the freely available ELAN program, and a (semi-XML) vertical format used as an input to the Manatee query engine. The data thus correspond to the corpus available via the KonText query engine to registered users of the CNC at http://www.korpus.cz
We present a large corpus of Czech parliament plenary sessions. The corpus
consists of approximately 444 hours of speech data and corresponding text
transcriptions. The whole corpus has been segmented to short audio snippets
making it suitable for both training and evaluation of automatic speech
recognition (ASR) systems. The source language of the corpus is Czech, which
makes it a valuable resource for future research as only a few public datasets
are available for the Czech language.
"Large Scale Colloquial Persian Dataset" (LSCP) is hierarchically organized in asemantic taxonomy that focuses on multi-task informal Persian language understanding as a comprehensive problem. LSCP includes 120M sentences from 27M casual Persian tweets with its dependency relations in syntactic annotation, Part-of-speech tags, sentiment polarity and automatic translation of original Persian sentences in five different languages (EN, CS, DE, IT, HI).
This dataset contains annotation of PDT using Czech WordNet ontology: http://hdl.handle.net/11858/00-097C-0000-0001-4880-3
Data is stored in PML format. This is a stand-off annotation and for most use cases it requires PDT 2.0 and the Czech WordNet 1.9 PDT that we have used for annotation. and 1ET100300517, 1ET201120505
GeCzLex 1.0 is an online electronic resource for translation equivalents of Czech and German discourse connectives. It contains anaphoric connectives for both languages and their possible translations documented in bilingual parallel corpora (not necessarily anaphoric). The entries have been interlinked via semantic annotation of the connectives (taken from monolingual lexicons of connectives CzeDLex and DiMLex) according to the PDTB 3 sense taxonomy and translation possibilities aquired from the Czech and German parallel data of the Intercorp project. The lexicon is the first bilingual inventory of connectives with linkage on the level of individual pairs (connective + discourse sense).
LiFR-Law is a corpus of Czech legal and administrative texts with measured reading comprehension and a subjective expert annotation of diverse textual properties based on the Hamburg Comprehensibility Concept (Langer, Schulz von Thun, Tausch, 1974). It has been built as a pilot data set to explore the Linguistic Factors of Readability (hence the LiFR acronym) in Czech administrative and legal texts, modeling their correlation with actually observed reading comprehension. The corpus is comprised of 18 documents in total; that is, six different texts from the legal/administration domain, each in three versions: the original and two paraphrases. Each such document triple shares one reading-comprehension test administered to at least thirty readers of random gender, educational background, and age. The data set also captures basic demographic information about each reader, their familiarity with the topic, and their subjective assessment of the stylistic properties of the given document, roughly corresponding to the key text properties identified by the Hamburg Comprehensibility Concept.
LiFR-Law is a corpus of Czech legal and administrative texts with measured reading comprehension and a subjective expert annotation of diverse textual properties based on the Hamburg Comprehensibility Concept (Langer, Schulz von Thun, Tausch, 1974). It has been built as a pilot data set to explore the Linguistic Factors of Readability (hence the LiFR acronym) in Czech administrative and legal texts, modeling their correlation with actually observed reading comprehension. The corpus is comprised of 18 documents in total; that is, six different texts from the legal/administration domain, each in three versions: the original and two paraphrases. Each such document triple shares one reading-comprehension test administered to at least thirty readers of random gender, educational background, and age. The data set also captures basic demographic information about each reader, their familiarity with the topic, and their subjective assessment of the stylistic properties of the given document, roughly corresponding to the key text properties identified by the Hamburg Comprehensibility Concept.
Changes to the previous version and helpful comments
• File names of the comprehension test results (self-explanatory)
• Corrected one erroneous automatic evaluation rule in the multiple-choice evaluation (zahradnici_3,
TRUE and FALSE had been swapped)
• Evaluation protocols for both question types added into Folder lifr_formr_study_design
• Data has been cleaned: empty responses to multiple-choice questions were re-inserted. Now, all surveys
are considered complete that have reader’s subjective text evaluation complete (these were placed at
the very end of each survey).
• Only complete surveys (all 7 content questions answered) are represented. We dropped the replies of
six users who did not complete their surveys.
• A few missing responses to open questions have been detected and re-inserted.
• The demographic data contain all respondents who filled in the informed consent and the demographic
details, with respondents who did not complete any test survey (but provided their demographic
details) in a separate file. All other data have been cleaned to contain only responses by the regular
respondents (at least one completed survey).