dc.contributor.author | Çano, Erion |
dc.date.accessioned | 2019-03-08T12:46:46Z |
dc.date.available | 2019-03-08T12:46:46Z |
dc.date.issued | 2019-04 |
dc.identifier.uri | http://hdl.handle.net/11234/1-2943 |
dc.description | OAGK is a keyword extraction/generation dataset consisting of 2.2 million abstracts, titles and keyword strings from scientific articles. Texts were lowercased and tokenized with the Stanford CoreNLP tokenizer. No other preprocessing steps were applied in this release version. Dataset records (samples) are stored as JSON lines in each text file. This data is derived from the OAG data collection (https://aminer.org/open-academic-graph), which was released under the ODC-BY licence. This data (OAGK Keyword Generation Dataset) is released under the CC-BY licence (https://creativecommons.org/licenses/by/4.0/). If using it, please cite the following paper: Çano, Erion and Bojar, Ondřej, 2019, Keyphrase Generation: A Text Summarization Struggle, 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics, June 2019, Minneapolis, USA |
dc.language.iso | eng |
dc.publisher | Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL) |
dc.relation | info:eu-repo/grantAgreement/EC/H2020/825460 |
dc.relation.isreferencedby | https://www.aclweb.org/anthology/N19-1070 |
dc.relation.isreplacedby | http://hdl.handle.net/11234/1-3062 |
dc.rights | Creative Commons - Attribution 4.0 International (CC BY 4.0) |
dc.rights.uri | http://creativecommons.org/licenses/by/4.0/ |
dc.subject | keyword extraction |
dc.subject | supervised keyword generation |
dc.title | OAGK Keyword Generation Dataset |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | text |
dc.rights.label | PUB |
has.files | yes |
branding | LINDAT / CLARIAH-CZ |
contact.person | Erion Çano cano@ufal.mff.cuni.cz Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL) |
sponsor | Ministerstvo školství, mládeže a tělovýchovy České republiky CZ.02.2.69/0.0/0.0/16_027/0008495 OP VVV Mezinárodní mobilita výzkumných pracovníků Univerzity Karlovy nationalFunds |
sponsor | European Union H2020-ICT-2018-2-825460 ELITR - European Live Translator euFunds info:eu-repo/grantAgreement/EC/H2020/825460 |
size.info | 2200000 entries |
size.info | 3 files |
size.info | 3.24 GB |
files.size | 1086288473 |
files.count | 2 |
Files in this item
This item is Publicly Available and licensed under: Creative Commons - Attribution 4.0 International (CC BY 4.0)
- Name: README.txt
- Size: 2.25 KB
- Format: Text file
- Description: readme
- MD5: dc3560f8786a522c21ea96c4fc2f5c04
OAGK Keyword Generation Dataset
===============================

OAGK is a keyword extraction/generation dataset consisting of 2.2 million abstracts, titles and keyword strings from scientific articles. Texts were lowercased and tokenized with the Stanford CoreNLP tokenizer. No other preprocessing steps were applied in this release version. Dataset records (samples) are stored as JSON lines in each text file. This data is derived from the OAG data collection (https://aminer.org/open-academic-graph), which was released under the ODC-BY licence. This data (OAGK Keyword Generation Dataset) is released under the CC-BY licence (https://creativecommons.org/licenses/by/4.0/).

Download
--------
This dataset can be downloaded from the LINDAT/CLARIN repository: http://hdl.handle.net/11234/1-2943

Publications
------------
If using it, please cite the following paper: Çano, Erion and Bojar, Ondřej, 2019, Keyphrase Generation: A Text Summarization Struggle, 2019 Annual Conf . . .
- Name: OAGK.zip
- Size: 1.01 GB
- Format: application/zip
- Description: data
- MD5: 92b0d028cde15184add0981349baccb4
- OAGK
  - oagk_train.txt (2 GB)
  - oagk_val.txt (141 MB)
  - oagk_test.txt (239 MB)
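
Since the description states that records are stored as JSON lines in each text file, the following is a minimal sketch of how one of the files might be read. The helper name `iter_records` and the path `OAGK/oagk_val.txt` (which assumes the archive was extracted into the current directory) are illustrative assumptions. The record does not list the field names, so the sketch only inspects whichever keys each record carries.

```python
import json
from typing import Any, Iterator


def iter_records(path: str) -> Iterator[Any]:
    """Yield one parsed JSON value per non-empty line of a JSON Lines file.

    Hypothetical helper: not part of the dataset release.
    """
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines rather than failing on them
                yield json.loads(line)


if __name__ == "__main__":
    # Assumed location of the extracted validation split.
    for index, record in enumerate(iter_records("OAGK/oagk_val.txt")):
        # Field names are not documented in this record, so just list them.
        print(sorted(record) if isinstance(record, dict) else type(record))
        if index == 2:  # look at the first three records only
            break
```

Reading line by line keeps memory use flat, which matters for the 2 GB training file.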