dc.contributor.author | Çano, Erion |
dc.date.accessioned | 2019-10-21T13:30:24Z |
dc.date.available | 2019-10-21T13:30:24Z |
dc.date.issued | 2019-10-21 |
dc.identifier.uri | http://hdl.handle.net/11234/1-3062 |
dc.description | OAGKX is a keyword extraction/generation dataset consisting of 22674436 abstracts, titles and keyword strings from scientific articles. The texts were lowercased and tokenized with the Stanford CoreNLP tokenizer. No other preprocessing steps were applied in this release version. Dataset records (samples) are stored as JSON lines in each text file. The data is derived from the OAG data collection (https://aminer.org/open-academic-graph), which was released under the ODC-BY license. This data (OAGKX Keyword Generation Dataset) is released under the CC-BY license (https://creativecommons.org/licenses/by/4.0/). If using it, please cite the following paper: Çano Erion, Bojar Ondřej. Keyphrase Generation: A Multi-Aspect Survey. FRUCT 2019, Proceedings of the 25th Conference of the Open Innovations Association FRUCT, Helsinki, Finland, Nov. 2019. To reproduce the experiments in the above paper, you can use the first 100000 lines of the part_0_0.txt file. |
dc.language.iso | eng |
dc.publisher | Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL) |
dc.relation | info:eu-repo/grantAgreement/EC/H2020/825460 |
dc.relation.isreferencedby | https://ieeexplore.ieee.org/document/8981519 |
dc.relation.replaces | http://hdl.handle.net/11234/1-2943 |
dc.rights | Creative Commons - Attribution 4.0 International (CC BY 4.0) |
dc.rights.uri | http://creativecommons.org/licenses/by/4.0/ |
dc.subject | keyword extraction |
dc.subject | supervised keyword generation |
dc.subject | abstractive keyphrasing |
dc.title | OAGKX Keyword Generation Dataset |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | text |
dc.rights.label | PUB |
has.files | yes |
branding | LINDAT / CLARIAH-CZ |
contact.person | Erion Çano cano@ufal.mff.cuni.cz Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL) |
sponsor | Ministerstvo školství, mládeže a tělovýchovy České republiky CZ.02.2.69/0.0/0.0/16_027/0008495 OP VVV Mezinárodní mobilita výzkumných pracovníků Univerzity Karlovy nationalFunds |
sponsor | European Union H2020-ICT-2018-2-825460 ELITR - European Live Translator euFunds info:eu-repo/grantAgreement/EC/H2020/825460 |
size.info | 22674436 entries |
size.info | 37 files |
size.info | 27.4 GB |
size.info | 8.5 GB |
files.size | 9139358485 |
files.count | 2 |
Files in this item
This item is Publicly Available and licensed under: Creative Commons - Attribution 4.0 International (CC BY 4.0)
- Name: oagkx.zip
- Size: 8.51 GB
- Format: application/zip
- Description: data
- MD5: 8a6475ea0d5a38c7aff97a0f5260df20
- oagkx
  - part_11_0.txt (11 MB)
  - part_3_1.txt (1 GB)
  - part_0_1.txt (900 MB)
  - part_13_0.txt (69 MB)
  - part_10_0.txt (873 MB)
  - part_2_1.txt (1 GB)
  - part_12_0.txt (11 MB)
  - part_5_1.txt (877 MB)
  - part_1_1.txt (867 MB)
  - part_14_0.txt (1 GB)
  - part_7_1.txt (120 MB)
  - part_4_1.txt (1 GB)
  - part_9_1.txt (867 MB)
  - part_6_1.txt (1 GB)
  - part_8_1.txt (541 MB)
  - part_0_0.txt (752 MB)
  - part_3_0.txt (1 GB)
  - part_5_0.txt (1 GB)
  - part_2_0.txt (1 GB)
  - part_7_0.txt (1 GB)
  - part_4_0.txt (1 GB)
  - part_1_0.txt (1 GB)
  - part_9_0.txt (709 MB)
  - part_6_0.txt (789 MB)
  - part_8_0.txt (561 MB)
  - part_11_1.txt (9 MB)
  - part_13_1.txt (108 MB)
  - part_10_1.txt (58 MB)
  - part_3_2.txt (437 MB)
  - part_0_2.txt (770 MB)
  - part_5_2.txt (880 MB)
  - part_2_2.txt (345 MB)
  - part_12_1.txt (9 MB)
  - part_4_2.txt (568 MB)
  - part_1_2.txt (759 MB)
  - part_14_1.txt (1 GB)
  - part_7_2.txt (311 MB)
- Name: README.txt
- Size: 1.93 KB
- Format: Text file
- Description: readme
- MD5: a286e714b793d3a196864122183a7fa1
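The MD5 digests listed for the files above can be used to check a download for corruption. The following is a minimal Python sketch, assuming the file has already been saved locally; the function names `md5_of_file` and `matches_listed_md5` are illustrative and not part of the dataset or the repository.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks.

    Chunked reading keeps memory use bounded for multi-gigabyte archives.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_md5(path, expected_hex):
    """Compare a local file against a digest copied from the item page."""
    return md5_of_file(path) == expected_hex.strip().lower()
```

For example, `matches_listed_md5("oagkx.zip", "8a6475ea0d5a38c7aff97a0f5260df20")` should return `True` for an intact copy of the archive, assuming it was saved under that name in the working directory.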
OAGKX Keyword Generation Dataset
================================

OAGKX is a keyword extraction/generation dataset consisting of 22674436 abstracts, titles and keyword strings from scientific articles. The texts were lowercased and tokenized with the Stanford CoreNLP tokenizer. No other preprocessing steps were applied in this release version. Dataset records (samples) are stored as JSON lines in each text file. The data is derived from the OAG data collection (https://aminer.org/open-academic-graph), which was released under the ODC-BY license. This data (OAGKX Keyword Generation Dataset) is released under the CC-BY license (https://creativecommons.org/licenses/by/4.0/).

Download
--------

This dataset can be downloaded from the LINDAT/CLARIN repository http://hdl.handle.net/11234/1-3062

Publications
------------

If using it, please cite the following paper:
Çano Erion, Bojar Ondřej. Keyphrase Generation: A Multi-Aspect Survey. FRUCT 2019, Proceedings of th . . .
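Since the records are stored as JSON lines, one record per line, a subset such as the first 100000 lines of part_0_0.txt can be loaded without reading the whole file. The sketch below is one possible way to do this in Python. It assumes the files are UTF-8 encoded (not stated in this record), and the function name `read_jsonl_head` is illustrative. No field names are assumed; each record is returned as whatever JSON value its line holds.

```python
import itertools
import json


def read_jsonl_head(path, limit=100000):
    """Parse at most the first `limit` lines of a JSON-lines file.

    Assumption: the file is UTF-8 encoded. Blank lines are skipped, so
    fewer than `limit` records may be returned.
    """
    records = []
    with open(path, encoding="utf-8") as handle:
        # islice stops after `limit` lines, so the rest of the file is never read.
        for line in itertools.islice(handle, limit):
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records
```

Calling `read_jsonl_head("oagkx/part_0_0.txt")` would then return the records from the first 100000 lines, assuming the archive was extracted into the working directory. Inspecting the first returned record is a simple way to see which fields the release actually provides.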