Rights: Creative Commons - Attribution 4.0 International (CC BY 4.0)

31. LiFR-Lite (2021-11-05)

Creator:: Cinková, Silvie, Chromý, Jan, Hořeňovská, Karolína, Kettnerová, Václava, Kolářová, Veronika, Kubištová, Hana, Panevová, Jarmila, and Ševčíková, Magda
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: education, readability, reading comprehension, and text corpora
Language:: Czech
Description:: Corpus of Czech educational texts for readability studies, with paraphrases, measured reading comprehension, and a multi-annotator subjective rating of selected text features based on the Hamburg Comprehensibility Concept
Rights:: Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB

32. Manual Re-evaluation of Translation Quality of WMT 2018 English-Czech systems

Creator:: Popel, Martin, Tomková, Markéta, and Tomek, Jakub
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: machine translation, manual evaluation, fluency, adequacy, and Translation Turing test
Language:: Czech and English
Description:: This data set contains four types of manual annotation of translation quality, focusing on the comparison of human and machine translation quality (aka human-parity). The machine translation system used is English-Czech CUNI Transformer (CUBBITT). The annotations distinguish adequacy, fluency and overall quality. One of the types is Translation Turing test - detecting whether the annotators can distinguish human from machine translation. All the sentences are taken from the English-Czech test set newstest2018 (WMT2018 News translation shared task www.statmt.org/wmt18/translation-task.html), but only from the half with originally English sentences translated to Czech by a professional agency.
Rights:: Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB

33. MorfoCzech

Creator:: Pelegrinová, Kateřina, Elšík, Viktor, Čech, Radek, and Mačutek, Ján
Publisher:: University of Ostrava
Type:: text, lexicon, and lexicalConceptualResource
Subject:: word segmentation, morphology, and morphological dictionary
Language:: Czech
Description:: A dictionary of morphologically segmented word forms in Czech. Rules of manual segmentation are described in Pelegrinová, K., Mačutek, J., Čech, R. (2021). The Menzerath-Altmann law as the relation between lengths of words and morphemes in Czech. Jazykovedný časopis, 72, 405-414. The dictionary is based on short stories, fairy tales, letters and studies written by Karel Čapek.
Rights:: Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB

34. MorfoCzech 1.1

Creator:: Pelegrinová, Kateřina, Elšík, Viktor, Čech, Radek, and Mačutek, Ján
Publisher:: University of Ostrava
Type:: text, lexicon, and lexicalConceptualResource
Subject:: word segmentation, morphology, and morphological dictionary
Language:: Czech
Description:: A dictionary of morphologically segmented word forms in Czech. Rules of manual segmentation are described in Pelegrinová, K., Mačutek, J., Čech, R. (2021). The Menzerath-Altmann law as the relation between lengths of words and morphemes in Czech. Jazykovedný časopis, 72, 405-414. The dictionary is based on short stories, fairy tales, letters and studies written by Karel Čapek.
Rights:: Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB

35. Motion Encoding Lexicalization Patterns: Portuguese and English Learners

Creator:: Costa-Silva, Jean
Publisher:: University of Georgia
Type:: text, other, and lexicalConceptualResource
Subject:: motion encoding, language acquisition, Portuguese, English, and lexicalization patterns
Language:: English
Description:: General Information: Data collector: Jean Costa Silva (University of Georgia) Date of collection: September-December 2022 Manner of collection: Online questionnaire via Qualtrics Funding: No
Rights:: Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB

36. OAGK Keyword Generation Dataset

Creator:: Çano, Erion
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: keyword extraction and supervised keyword generation
Language:: English
Description:: OAGK is a keyword extraction/generation dataset consisting of 2.2 million abstracts, titles and keyword strings from cientific articles. Texts were lowercased and tokenized with Stanford CoreNLP tokenizer. No other preprocessing steps were applied in this release version. Dataset records (samples) are stored as JSON lines in each text file. This data is derived from OAG data collection (https://aminer.org/open-academic-graph) which was released under ODC-BY licence. This data (OAGK Keyword Generation Dataset) is released under CC-BY licence (https://creativecommons.org/licenses/by/4.0/). If using it, please cite the following paper: Çano, Erion and Bojar, Ondřej, 2019, Keyphrase Generation: A Text Summarization Struggle, 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics, June 2019, Minneapolis, USA
Rights:: Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB

37. OAGKX Keyword Generation Dataset

Creator:: Çano, Erion
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: keyword extraction, supervised keyword generation, and abstractive keyphrasing
Language:: English
Description:: OAGKX is a keyword extraction/generation dataset consisting of 22674436 abstracts, titles and keyword strings from scientific articles. The texts were lowercased and tokenized with Stanford CoreNLP tokenizer. No other preprocessing steps were applied in this release version. Dataset records (samples) are stored as JSON lines in each text file. The data is derived from OAG data collection (https://aminer.org/open-academic-graph) which was released under ODC-BY license. This data (OAGKX Keyword Generation Dataset) is released under CC-BY license (https://creativecommons.org/licenses/by/4.0/). If using it, please cite the following paper: Çano Erion, Bojar Ondřej. Keyphrase Generation: A Multi-Aspect Survey. FRUCT 2019, Proceedings of the 25th Conference of the Open Innovations Association FRUCT, Helsinki, Finland, Nov. 2019 To reproduce the experiments in the above paper, you can use the first 100000 lines of part_0_0.txt file.
Rights:: Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB

38. OAGL Paper Metadata Dataset

Creator:: Çano, Erion
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: Paper Length Prediction, Scientific Papers Corpus, and Scientific Publication Metadata
Language:: English
Description:: OAGL is a paper metadata dataset consisting of 17528680 records which comprise various scientific publication attributes like abstracts, titles, keywords, publication years, venues, etc. The last field of each record is the page length of the corresponding publication. Dataset records (samples) are stored as JSON lines in each text file. The data is derived from OAG data collection (https://aminer.org/open-academic-graph) which was released under ODC-BY license. This data (OAGL Paper Metadata Dataset) is released under CC-BY license (https://creativecommons.org/licenses/by/4.0/). If using it, please cite the following paper: Çano Erion, Bojar Ondřej: How Many Pages? Paper Length Prediction from the Metadata. NLPIR 2020, Proceedings of the the 4th International Conference on Natural Language Processing and Information Retrieval, Seoul, Korea, December 2020.
Rights:: Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB

39. OAGS Title Generation Dataset

Creator:: Çano, Erion
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: Title Generation Dataset, Abstractive Text Summarization, and Scientific Papers Corpus
Language:: English
Description:: OAGS is a title generation dataset consisting of 34993700 abstracts and titles from scientific articles. Texts were lowercased and tokenized with Stanford CoreNLP tokenizer. No other preprocessing steps were applied in this release version. Dataset records (samples) are stored as JSON lines in each text file. The data is derived from OAG data collection (https://aminer.org/open-academic-graph) which was released under ODC-BY licence. This data (OAGS Title Generation Dataset) is released under CC-BY licence (https://creativecommons.org/licenses/by/4.0/). If using it, please cite the following paper: Çano, Erion and Bojar, Ondřej, 2019, "Efficiency Metrics for Data-Driven Models: A Text Summarization Case Study", INLG 2019, The 12th International Conference on Natural Language Generation, November 2019, Tokyo, Japan. To reproduce the experiments in the above paper, you can use oags_train1.txt, oags_train2.txt, oags_train3.txt, oags_test.txt and oags_val.txt files. If you need more data samples you can get them from oags_train_backup.txt and oags_val-test_backup.txt.
Rights:: Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB

40. OAGSX Title Generation Dataset

Creator:: Çano, Erion
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: Title Generation Dataset, Abstractive Text Summarization, and Scientific Papers Corpus
Language:: English
Description:: OAGSX is a title generation dataset consisting of 34408509 abstracts and titles from scientific articles. The texts were lowercased and tokenized with Stanford CoreNLP tokenizer. No other preprocessing steps were applied in this release version. Dataset records (samples) are stored as JSON lines in each text file. The data is derived from OAG data collection (https://aminer.org/open-academic-graph) which was released under ODC-BY license. This data (OAGSX Title Generation Dataset) is released under CC-BY license (https://creativecommons.org/licenses/by/4.0/). If using it, please consider citing also the following paper: Çano Erion, Bojar Ondřej. Two Huge Title and Keyword Generation Corpora of Research Articles. LREC 2020, Proceedings of the the 12th International Conference on Language Resources and Evaluation, Marseille, France, May 2020.
Rights:: Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB

31. LiFR-Lite (2021-11-05)

32. Manual Re-evaluation of Translation Quality of WMT 2018 English-Czech systems

33. MorfoCzech

34. MorfoCzech 1.1

35. Motion Encoding Lexicalization Patterns: Portuguese and English Learners

36. OAGK Keyword Generation Dataset

37. OAGKX Keyword Generation Dataset

38. OAGL Paper Metadata Dataset

39. OAGS Title Generation Dataset

40. OAGSX Title Generation Dataset

Limit your search

Show values starting with

Show values starting with

Show values starting with

Show values starting with

Show values starting with

Show values starting with

Search

Search Constraints

Search Results

Limit your search

Contributor

Show values starting with

Creator

Show values starting with

Language

Show values starting with

Publisher

Show values starting with

Rights

Subject

Show values starting with

Type

Show values starting with

Date

Original context has metadata only

Harvested from