Search Results
1542. Nynorskordboka
- Publisher:
- University of Oslo
- Type:
- lexicalConceptualResource
- Language:
- Norwegian Nynorsk
- Description:
- 90 000 entries with definitions, etymology, examples
- Rights:
- Not specified
1543. OAGK Keyword Generation Dataset
- Creator:
- Çano, Erion
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- keyword extraction and supervised keyword generation
- Language:
- English
- Description:
- OAGK is a keyword extraction/generation dataset consisting of 2.2 million abstracts, titles and keyword strings from scientific articles. Texts were lowercased and tokenized with Stanford CoreNLP tokenizer. No other preprocessing steps were applied in this release version. Dataset records (samples) are stored as JSON lines in each text file. This data is derived from OAG data collection (https://aminer.org/open-academic-graph) which was released under ODC-BY licence. This data (OAGK Keyword Generation Dataset) is released under CC-BY licence (https://creativecommons.org/licenses/by/4.0/). If using it, please cite the following paper: Çano, Erion and Bojar, Ondřej, 2019, Keyphrase Generation: A Text Summarization Struggle, 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics, June 2019, Minneapolis, USA
- Rights:
- Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB
1544. OAGKX Keyword Generation Dataset
- Creator:
- Çano, Erion
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- keyword extraction, supervised keyword generation, and abstractive keyphrasing
- Language:
- English
- Description:
- OAGKX is a keyword extraction/generation dataset consisting of 22674436 abstracts, titles and keyword strings from scientific articles. The texts were lowercased and tokenized with Stanford CoreNLP tokenizer. No other preprocessing steps were applied in this release version. Dataset records (samples) are stored as JSON lines in each text file. The data is derived from OAG data collection (https://aminer.org/open-academic-graph) which was released under ODC-BY license. This data (OAGKX Keyword Generation Dataset) is released under CC-BY license (https://creativecommons.org/licenses/by/4.0/). If using it, please cite the following paper: Çano Erion, Bojar Ondřej. Keyphrase Generation: A Multi-Aspect Survey. FRUCT 2019, Proceedings of the 25th Conference of the Open Innovations Association FRUCT, Helsinki, Finland, Nov. 2019. To reproduce the experiments in the above paper, you can use the first 100000 lines of the part_0_0.txt file.
- Rights:
- Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB
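Since the records above are described as JSON lines, a subset such as the first 100000 lines of part_0_0.txt could be loaded roughly as follows. This is only a minimal sketch: the helper name `read_json_lines` is hypothetical, the file is assumed to be UTF-8 encoded, and no record field names are assumed because the listing does not state them.

```python
import itertools
import json
from typing import Iterator, Optional


def read_json_lines(path: str, limit: Optional[int] = None) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a JSON lines file.

    `limit` caps the number of lines read (not records yielded);
    None reads the whole file. UTF-8 encoding is an assumption.
    """
    with open(path, encoding="utf-8") as handle:
        # islice stops after `limit` lines without loading the full file.
        for line in itertools.islice(handle, limit):
            line = line.strip()
            if line:  # skip blank lines rather than failing on them
                yield json.loads(line)


# Hypothetical usage, assuming part_0_0.txt has been downloaded locally:
# records = list(read_json_lines("part_0_0.txt", limit=100_000))
```

Reading lazily, line by line, keeps memory use low, which matters for files from a collection of this size.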
1545. OAGL Paper Metadata Dataset
- Creator:
- Çano, Erion
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- Paper Length Prediction, Scientific Papers Corpus, and Scientific Publication Metadata
- Language:
- English
- Description:
- OAGL is a paper metadata dataset consisting of 17528680 records which comprise various scientific publication attributes like abstracts, titles, keywords, publication years, venues, etc. The last field of each record is the page length of the corresponding publication. Dataset records (samples) are stored as JSON lines in each text file. The data is derived from OAG data collection (https://aminer.org/open-academic-graph) which was released under ODC-BY license. This data (OAGL Paper Metadata Dataset) is released under CC-BY license (https://creativecommons.org/licenses/by/4.0/). If using it, please cite the following paper: Çano Erion, Bojar Ondřej: How Many Pages? Paper Length Prediction from the Metadata. NLPIR 2020, Proceedings of the 4th International Conference on Natural Language Processing and Information Retrieval, Seoul, Korea, December 2020.
- Rights:
- Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB
1546. OAGS Title Generation Dataset
- Creator:
- Çano, Erion
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- Title Generation Dataset, Abstractive Text Summarization, and Scientific Papers Corpus
- Language:
- English
- Description:
- OAGS is a title generation dataset consisting of 34993700 abstracts and titles from scientific articles. Texts were lowercased and tokenized with Stanford CoreNLP tokenizer. No other preprocessing steps were applied in this release version. Dataset records (samples) are stored as JSON lines in each text file. The data is derived from OAG data collection (https://aminer.org/open-academic-graph) which was released under ODC-BY licence. This data (OAGS Title Generation Dataset) is released under CC-BY licence (https://creativecommons.org/licenses/by/4.0/). If using it, please cite the following paper: Çano, Erion and Bojar, Ondřej, 2019, "Efficiency Metrics for Data-Driven Models: A Text Summarization Case Study", INLG 2019, The 12th International Conference on Natural Language Generation, November 2019, Tokyo, Japan. To reproduce the experiments in the above paper, you can use oags_train1.txt, oags_train2.txt, oags_train3.txt, oags_test.txt and oags_val.txt files. If you need more data samples you can get them from oags_train_backup.txt and oags_val-test_backup.txt.
- Rights:
- Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB
1547. OAGSX Title Generation Dataset
- Creator:
- Çano, Erion
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- Title Generation Dataset, Abstractive Text Summarization, and Scientific Papers Corpus
- Language:
- English
- Description:
- OAGSX is a title generation dataset consisting of 34408509 abstracts and titles from scientific articles. The texts were lowercased and tokenized with Stanford CoreNLP tokenizer. No other preprocessing steps were applied in this release version. Dataset records (samples) are stored as JSON lines in each text file. The data is derived from OAG data collection (https://aminer.org/open-academic-graph) which was released under ODC-BY license. This data (OAGSX Title Generation Dataset) is released under CC-BY license (https://creativecommons.org/licenses/by/4.0/). If using it, please also consider citing the following paper: Çano Erion, Bojar Ondřej. Two Huge Title and Keyword Generation Corpora of Research Articles. LREC 2020, Proceedings of the 12th International Conference on Language Resources and Evaluation, Marseille, France, May 2020.
- Rights:
- Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB
1548. Oasis Numbers
- Publisher:
- MTA-SZTE Research Group on Artificial Intelligence
- Type:
- corpus
- Subject:
- speech corpus
- Language:
- Hungarian
- Description:
- spoken, monolingual, manually segmented domain-specific corpus of numbers, 5857 recorded words
- Rights:
- Not specified
1549. Objects from the Scene of Reinhard Heydrich's Assassination
- Creator:
- Aktualita
- Publisher:
- Národní filmový archiv
- Type:
- video and clip
- Subject:
- heydrichiáda, kolo jízdní dámské, předměty doličné, aktovka kožená, kabát pánský, samopal, atentát Heydrich Reinhard, pátrání policejní Protektorát, and Heydrichiáda
- Language:
- German
- Description:
- Segment consisting of footage showing objects from the scene of the assassination of acting Reich Protector Reinhard Heydrich, which was screened in all cinemas throughout the Protectorate. The camera shots capture a woman's bicycle, a man's coat, a cap with a visor, two leather briefcases, and a submachine gun made in England. The subtitles urge the members of the audience to identify the owners of the items in question and to help the police catch the perpetrators. The film was part of an aggressive campaign to spread fear of the annihilation of the nation, reinforced through the daily publication of the names of the executed in the media.
- Rights:
- http://creativecommons.org/licenses/by-nc-nd/4.0/, PUB, and Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)
1550. Ocorrect Corpus
- Type:
- corpus
- Language:
- Bulgarian and German
- Description:
- Written, synchronic, general, bilingual, text and image; 1 000 000 tokens Bulgarian, 2300 image files; 150 000 tokens German, 312 image files
- Rights:
- Not specified