Search Results
22. LiFR-Law. Corpus of Paraphrased Czech Administrative Texts with Reading Comprehension for Readability Studies
- Creator:
- Cinková, Silvie, Chromý, Jan, Šamánková, Jana, Hořeňovská, Karolína, Kettnerová, Václava, Kolářová, Veronika, Kubištová, Hana, and Panevová, Jarmila
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- readability, legal texts, legal domain, reading comprehension, corpus, and survey
- Language:
- Czech
- Description:
- LiFR-Law is a corpus of Czech legal and administrative texts with measured reading comprehension and a subjective expert annotation of diverse textual properties based on the Hamburg Comprehensibility Concept (Langer, Schulz von Thun, Tausch, 1974). It has been built as a pilot data set to explore the Linguistic Factors of Readability (hence the LiFR acronym) in Czech administrative and legal texts, modeling their correlation with actually observed reading comprehension. The corpus is comprised of 18 documents in total; that is, six different texts from the legal/administration domain, each in three versions: the original and two paraphrases. Each such document triple shares one reading-comprehension test administered to at least thirty readers of random gender, educational background, and age. The data set also captures basic demographic information about each reader, their familiarity with the topic, and their subjective assessment of the stylistic properties of the given document, roughly corresponding to the key text properties identified by the Hamburg Comprehensibility Concept.
- Rights:
- Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB
23. LiFR-Law. Corpus of Paraphrased Czech Administrative Texts with Reading Comprehension for Readability Studies (2023-10-08)
- Creator:
- Cinková, Silvie, Chromý, Jan, Šamánková, Jana, Hořeňovská, Karolína, Kettnerová, Václava, Kolářová, Veronika, Kubištová, Hana, and Panevová, Jarmila
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- readability, legal texts, legal domain, reading comprehension, corpus, and survey
- Language:
- Czech
- Description:
- LiFR-Law is a corpus of Czech legal and administrative texts with measured reading comprehension and a subjective expert annotation of diverse textual properties based on the Hamburg Comprehensibility Concept (Langer, Schulz von Thun, Tausch, 1974). It has been built as a pilot data set to explore the Linguistic Factors of Readability (hence the LiFR acronym) in Czech administrative and legal texts, modeling their correlation with actually observed reading comprehension. The corpus is comprised of 18 documents in total; that is, six different texts from the legal/administration domain, each in three versions: the original and two paraphrases. Each such document triple shares one reading-comprehension test administered to at least thirty readers of random gender, educational background, and age. The data set also captures basic demographic information about each reader, their familiarity with the topic, and their subjective assessment of the stylistic properties of the given document, roughly corresponding to the key text properties identified by the Hamburg Comprehensibility Concept.
  Changes to the previous version and helpful comments:
  • File names of the comprehension test results (self-explanatory)
  • Corrected one erroneous automatic evaluation rule in the multiple-choice evaluation (zahradnici_3, TRUE and FALSE had been swapped)
  • Evaluation protocols for both question types added into Folder lifr_formr_study_design
  • Data has been cleaned: empty responses to multiple-choice questions were re-inserted. Now, all surveys are considered complete that have reader’s subjective text evaluation complete (these were placed at the very end of each survey).
  • Only complete surveys (all 7 content questions answered) are represented. We dropped the replies of six users who did not complete their surveys.
  • A few missing responses to open questions have been detected and re-inserted.
  • The demographic data contain all respondents who filled in the informed consent and the demographic details, with respondents who did not complete any test survey (but provided their demographic details) in a separate file. All other data have been cleaned to contain only responses by the regular respondents (at least one completed survey).
- Rights:
- Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB
24. LiFR-Lite
- Creator:
- Cinková, Silvie, Chromý, Jan, Hořeňovská, Karolína, Kettnerová, Václava, Kolářová, Veronika, Panevová, Jarmila, and Ševčíková, Magda
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- education, readability, reading comprehension, and text corpora
- Language:
- Czech
- Description:
- Corpus of Czech educational texts for readability studies, with paraphrases, measured reading comprehension, and a multi-annotator subjective rating of selected text features based on the Hamburg Comprehensibility Concept
- Rights:
- Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB
25. LiFR-Lite (2021-11-05)
- Creator:
- Cinková, Silvie, Chromý, Jan, Hořeňovská, Karolína, Kettnerová, Václava, Kolářová, Veronika, Kubištová, Hana, Panevová, Jarmila, and Ševčíková, Magda
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- education, readability, reading comprehension, and text corpora
- Language:
- Czech
- Description:
- Corpus of Czech educational texts for readability studies, with paraphrases, measured reading comprehension, and a multi-annotator subjective rating of selected text features based on the Hamburg Comprehensibility Concept
- Rights:
- Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB
26. Manual Re-evaluation of Translation Quality of WMT 2018 English-Czech systems
- Creator:
- Popel, Martin, Tomková, Markéta, and Tomek, Jakub
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- machine translation, manual evaluation, fluency, adequacy, and Translation Turing test
- Language:
- Czech and English
- Description:
- This data set contains four types of manual annotation of translation quality, focusing on the comparison of human and machine translation quality (aka human-parity). The machine translation system used is English-Czech CUNI Transformer (CUBBITT). The annotations distinguish adequacy, fluency and overall quality. One of the types is a Translation Turing test, which checks whether the annotators can distinguish human from machine translation. All the sentences are taken from the English-Czech test set newstest2018 (WMT2018 News translation shared task, www.statmt.org/wmt18/translation-task.html), but only from the half with originally English sentences translated to Czech by a professional agency.
- Rights:
- Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB
27. MorfoCzech
- Creator:
- Pelegrinová, Kateřina, Elšík, Viktor, Čech, Radek, and Mačutek, Ján
- Publisher:
- University of Ostrava
- Type:
- text, lexicon, and lexicalConceptualResource
- Subject:
- word segmentation, morphology, and morphological dictionary
- Language:
- Czech
- Description:
- A dictionary of morphologically segmented word forms in Czech. Rules of manual segmentation are described in Pelegrinová, K., Mačutek, J., Čech, R. (2021). The Menzerath-Altmann law as the relation between lengths of words and morphemes in Czech. Jazykovedný časopis, 72, 405-414. The dictionary is based on short stories, fairy tales, letters and studies written by Karel Čapek.
- Rights:
- Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB
28. MorfoCzech 1.1
- Creator:
- Pelegrinová, Kateřina, Elšík, Viktor, Čech, Radek, and Mačutek, Ján
- Publisher:
- University of Ostrava
- Type:
- text, lexicon, and lexicalConceptualResource
- Subject:
- word segmentation, morphology, and morphological dictionary
- Language:
- Czech
- Description:
- A dictionary of morphologically segmented word forms in Czech. Rules of manual segmentation are described in Pelegrinová, K., Mačutek, J., Čech, R. (2021). The Menzerath-Altmann law as the relation between lengths of words and morphemes in Czech. Jazykovedný časopis, 72, 405-414. The dictionary is based on short stories, fairy tales, letters and studies written by Karel Čapek.
- Rights:
- Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB
29. Motion Encoding Lexicalization Patterns: Portuguese and English Learners
- Creator:
- Costa-Silva, Jean
- Publisher:
- University of Georgia
- Type:
- text, other, and lexicalConceptualResource
- Subject:
- motion encoding, language acquisition, Portuguese, English, and lexicalization patterns
- Language:
- English
- Description:
- General Information: Data collector: Jean Costa Silva (University of Georgia); Date of collection: September-December 2022; Manner of collection: Online questionnaire via Qualtrics; Funding: No
- Rights:
- Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB
30. OAGK Keyword Generation Dataset
- Creator:
- Çano, Erion
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- keyword extraction and supervised keyword generation
- Language:
- English
- Description:
- OAGK is a keyword extraction/generation dataset consisting of 2.2 million abstracts, titles and keyword strings from scientific articles. Texts were lowercased and tokenized with the Stanford CoreNLP tokenizer. No other preprocessing steps were applied in this release version. Dataset records (samples) are stored as JSON lines in each text file. This data is derived from the OAG data collection (https://aminer.org/open-academic-graph), which was released under the ODC-BY licence. This data (OAGK Keyword Generation Dataset) is released under the CC-BY licence (https://creativecommons.org/licenses/by/4.0/). If using it, please cite the following paper: Çano, Erion and Bojar, Ondřej, 2019, Keyphrase Generation: A Text Summarization Struggle, 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics, June 2019, Minneapolis, USA
- Rights:
- Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB
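
The OAGK description above says that records are stored as JSON lines, one sample per line of each text file. A minimal Python sketch of reading such a file follows. The field names `title`, `abstract` and `keywords` and the in-memory sample lines are assumptions made for illustration only; the record does not list the actual keys or file names, so check them against the downloaded data.

```python
import json
from typing import Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty line of a JSON-lines source."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        yield json.loads(line)


# Hypothetical sample lines standing in for an OAGK text file.
# The keys below are assumed, not taken from the dataset documentation.
sample_lines = [
    '{"title": "a sample title", "abstract": "a sample abstract .", "keywords": "sample , keyword"}',
    "",
    '{"title": "second title", "abstract": "another abstract .", "keywords": "second"}',
]

records = list(iter_records(sample_lines))
for record in records:
    print(record["title"])
```

To read a real file, pass an open file handle instead of the sample list, for example `with open(path, encoding="utf-8") as f: records = list(iter_records(f))`, where `path` is whichever file the download provides. Iterating line by line keeps memory use flat, which matters for a collection of about 2.2 million records.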