We defined 58 dramatic situations and annotated them in 19 play scripts. Then we selected only 5 well-recognized dramatic situations and annotated further 33 play scripts. In this version of the data, we release only play scripts that can be freely distributed, which is 9 play scripts. One play is annotated independently by three annotators.
COSTRA 1.0 is a dataset of Czech complex sentence transformations. The dataset is intended for the study of sentence-level embeddings beyond simple word alternations or standard paraphrasing.
The dataset consist of 4,262 unique sentences with average length of 10 words, illustrating 15 types of modifications such as simplification, generalization, or formal and informal language variation.
The hope is that with this dataset, we should be able to test semantic properties of sentence embeddings and perhaps even to find some topologically interesting “skeleton” in the sentence embedding space.
Costra 1.1 is a new dataset for testing geometric properties of sentence embeddings spaces. In particular, it concentrates on examining how well sentence embeddings capture complex phenomena such paraphrases, tense or generalization. The dataset is a direct expansion of Costra 1.0, which was extended with more sentences and sentence comparisons.
This is a document-aligned parallel corpus of English and Czech abstracts of scientific papers published by authors from the Institute of Formal and Applied Linguistics, Charles University in Prague, as reported in the institute's system Biblio. For each publication, the authors are obliged to provide both the original abstract in Czech or English, and its translation into English or Czech, respectively. No filtering was performed, except for removing entries missing the Czech or English abstract, and replacing newline and tabulator characters by spaces.
This is a parallel corpus of Czech and mostly English abstracts of scientific papers and presentations published by authors from the Institute of Formal and Applied Linguistics, Charles University in Prague. For each publication record, the authors are obliged to provide both the original abstract (in Czech or English), and its translation (English or Czech) in the internal Biblio system. The data was filtered for duplicates and missing entries, ensuring that every record is bilingual. Additionally, records of published papers which are indexed by SemanticScholar contain the respective link. The dataset was created from September 2022 image of the Biblio database and is stored in JSONL format, with each line corresponding to one record.
This corpora is part of Deliverable 5.5 of the European Commission project QTLeap FP7-ICT-2013.4.1-610516 (http://qtleap.eu).
The texts are sentences from the Europarl parallel corpus (Koehn, 2005). We selected the monolingual sentences from parallel corpora for the following pairs: Bulgarian-English, Czech-English, Portuguese-English and Spanish-English. The English corpus is comprised by the English side of the Spanish-English corpus.
Basque is not in Europarl. In addition, it contains the Basque and English sides of the GNOME corpus.
The texts have been automatically annotated with NLP tools, including Word Sense Disambiguation, Named Entity Disambiguation and Coreference resolution. Please check deliverable D5.6 in http://qtleap.eu/deliverables for more information.
LiFR-Law is a corpus of Czech legal and administrative texts with measured reading comprehension and a subjective expert annotation of diverse textual properties based on the Hamburg Comprehensibility Concept (Langer, Schulz von Thun, Tausch, 1974). It has been built as a pilot data set to explore the Linguistic Factors of Readability (hence the LiFR acronym) in Czech administrative and legal texts, modeling their correlation with actually observed reading comprehension. The corpus is comprised of 18 documents in total; that is, six different texts from the legal/administration domain, each in three versions: the original and two paraphrases. Each such document triple shares one reading-comprehension test administered to at least thirty readers of random gender, educational background, and age. The data set also captures basic demographic information about each reader, their familiarity with the topic, and their subjective assessment of the stylistic properties of the given document, roughly corresponding to the key text properties identified by the Hamburg Comprehensibility Concept.
LiFR-Law is a corpus of Czech legal and administrative texts with measured reading comprehension and a subjective expert annotation of diverse textual properties based on the Hamburg Comprehensibility Concept (Langer, Schulz von Thun, Tausch, 1974). It has been built as a pilot data set to explore the Linguistic Factors of Readability (hence the LiFR acronym) in Czech administrative and legal texts, modeling their correlation with actually observed reading comprehension. The corpus is comprised of 18 documents in total; that is, six different texts from the legal/administration domain, each in three versions: the original and two paraphrases. Each such document triple shares one reading-comprehension test administered to at least thirty readers of random gender, educational background, and age. The data set also captures basic demographic information about each reader, their familiarity with the topic, and their subjective assessment of the stylistic properties of the given document, roughly corresponding to the key text properties identified by the Hamburg Comprehensibility Concept.
Changes to the previous version and helpful comments
• File names of the comprehension test results (self-explanatory)
• Corrected one erroneous automatic evaluation rule in the multiple-choice evaluation (zahradnici_3,
TRUE and FALSE had been swapped)
• Evaluation protocols for both question types added into Folder lifr_formr_study_design
• Data has been cleaned: empty responses to multiple-choice questions were re-inserted. Now, all surveys
are considered complete that have reader’s subjective text evaluation complete (these were placed at
the very end of each survey).
• Only complete surveys (all 7 content questions answered) are represented. We dropped the replies of
six users who did not complete their surveys.
• A few missing responses to open questions have been detected and re-inserted.
• The demographic data contain all respondents who filled in the informed consent and the demographic
details, with respondents who did not complete any test survey (but provided their demographic
details) in a separate file. All other data have been cleaned to contain only responses by the regular
respondents (at least one completed survey).
Corpus of Czech educational texts for readability studies, with paraphrases, measured reading comprehension, and a multi-annotator subjective rating of selected text features based on the Hamburg Comprehensibility Concept