dc.contributor.author | Çano, Erion |
dc.date.accessioned | 2019-09-12T10:47:34Z |
dc.date.available | 2019-09-12T10:47:34Z |
dc.date.issued | 2019-09 |
dc.identifier.uri | http://hdl.handle.net/11234/1-3043 |
dc.description | OAGS is a title generation dataset consisting of 34993700 abstracts and titles from scientific articles. Texts were lowercased and tokenized with Stanford CoreNLP tokenizer. No other preprocessing steps were applied in this release version. Dataset records (samples) are stored as JSON lines in each text file. The data is derived from OAG data collection (https://aminer.org/open-academic-graph) which was released under ODC-BY licence. This data (OAGS Title Generation Dataset) is released under CC-BY licence (https://creativecommons.org/licenses/by/4.0/). If using it, please cite the following paper: Çano, Erion and Bojar, Ondřej, 2019, "Efficiency Metrics for Data-Driven Models: A Text Summarization Case Study", INLG 2019, The 12th International Conference on Natural Language Generation, November 2019, Tokyo, Japan. To reproduce the experiments in the above paper, you can use oags_train1.txt, oags_train2.txt, oags_train3.txt, oags_test.txt and oags_val.txt files. If you need more data samples you can get them from oags_train_backup.txt and oags_val-test_backup.txt. |
dc.language.iso | eng |
dc.publisher | Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL) |
dc.relation | info:eu-repo/grantAgreement/EC/H2020/825460 |
dc.relation.isreferencedby | https://www.aclweb.org/anthology/W19-8630/ |
dc.relation.isreplacedby | http://hdl.handle.net/11234/1-3079 |
dc.rights | Creative Commons - Attribution 4.0 International (CC BY 4.0) |
dc.rights.uri | http://creativecommons.org/licenses/by/4.0/ |
dc.subject | Title Generation Dataset |
dc.subject | Abstractive Text Summarization |
dc.subject | Scientific Papers Corpus |
dc.title | OAGS Title Generation Dataset |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | text |
dc.rights.label | PUB |
has.files | yes |
branding | LINDAT / CLARIAH-CZ |
contact.person | Erion Çano cano@ufal.mff.cuni.cz Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL) |
sponsor | Ministerstvo školství, mládeže a tělovýchovy České republiky CZ.02.2.69/0.0/0.0/16_027/0008495 OP VVV Mezinárodní mobilita výzkumných pracovníků Univerzity Karlovy nationalFunds |
sponsor | European Union H2020-ICT-2018-2-825460 ELITR - European Live Translator euFunds info:eu-repo/grantAgreement/EC/H2020/825460 |
size.info | 34993700 entries |
size.info | 7 files |
size.info | 46.8 GB |
size.info | 14.8 GB |
files.size | 15992457976 |
files.count | 2 |
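The description above says that records are stored as JSON lines in each text file. A minimal sketch of reading such a file in Python might look like the following. The helper name `iter_records` is hypothetical, and the key names inside each record are not stated on this page, so the sketch does not assume any; inspect the first record to see which fields the files actually carry.

```python
import json
from typing import Iterator


def iter_records(path: str) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a JSON Lines file.

    Hypothetical helper: the record layout (key names) is not documented
    on this page, so each record is returned as a plain dict.
    """
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines, if any
                yield json.loads(line)


# Example use (assumes oags_val.txt has been extracted from OAGS.zip):
#     first = next(iter_records("oags_val.txt"))
#     print(sorted(first.keys()))
```

Reading lazily, one line at a time, keeps memory use flat, which matters for the larger files such as oags_train_backup.txt.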
Files in this item
License category:
Licence: Creative Commons - Attribution 4.0 International (CC BY 4.0)
Publicly Available
- Name
- README.txt
- Size
- 1.82 KB
- Format
- Text file
- Description
- Readme
- MD5
- dbea4cf9d8eba2dae318a74c1a9dc3f0
OAGS Title Generation Dataset
===============================

OAGS is a title generation dataset consisting of 34993700 abstracts and titles from scientific articles. Texts were lowercased and tokenized with Stanford CoreNLP tokenizer. No other preprocessing steps were applied in this release version. Dataset records (samples) are stored as JSON lines in each text file.

The data is derived from OAG data collection (https://aminer.org/open-academic-graph) which was released under ODC-BY licence. This data (OAGS Title Generation Dataset) is released under CC-BY licence (https://creativecommons.org/licenses/by/4.0/).

Download
--------

This dataset can be downloaded from the LINDAT/CLARIN repository: http://hdl.handle.net/11234/1-3043

Publications
------------

If using it, please cite the following paper: Çano, Erion and Bojar, Ondřej, 2019, "Efficiency Metrics for Data-Driven Models: A Text Summarization Case Study", INLG 2019, The 12th Inter . . .
- Name
- OAGS.zip
- Size
- 14.89 GB
- Format
- application/zip
- Description
- Data
- MD5
- b3def7c79f11d2c109c48cc0a72b88ae
- OAGS
- oags_train3.txt (1 GB)
- oags_val.txt (14 MB)
- oags_val-test_backup.txt (657 MB)
- oags_train2.txt (1 GB)
- oags_test.txt (14 MB)
- oags_train_backup.txt (42 GB)
- oags_train1.txt (557 MB)
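Since the archive is large, it can be worth checking the download against the MD5 value listed for OAGS.zip above before extracting it. The following is a minimal sketch using Python's standard `hashlib`; the helper name `md5_of_file` and the local file path are assumptions for illustration, while the expected checksum is copied from the listing.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks of chunk_size bytes.

    Hypothetical helper: chunked reading avoids loading a multi-gigabyte
    archive into memory at once.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# MD5 listed for OAGS.zip on this page.
EXPECTED_MD5 = "b3def7c79f11d2c109c48cc0a72b88ae"

# Example use (assumes the archive was saved as OAGS.zip in the current directory):
#     if md5_of_file("OAGS.zip") != EXPECTED_MD5:
#         raise SystemExit("Checksum mismatch: the download may be incomplete.")
```

MD5 is used here only because it is the checksum the repository publishes; it detects a corrupted or truncated download but is not a safeguard against deliberate tampering.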