OAGSX Title Generation Dataset
==============================

OAGSX is a title generation dataset consisting of 34408509 abstracts and titles from scientific articles. The texts were lowercased and tokenized with the Stanford CoreNLP tokenizer. No other preprocessing steps were applied in this release version. Dataset records (samples) are stored as JSON lines in each text file.

The data is derived from the OAG data collection (https://aminer.org/open-academic-graph), which was released under the ODC-BY license.

This data (OAGSX Title Generation Dataset) is released under the CC-BY license (https://creativecommons.org/licenses/by/4.0/).

Download
--------

This dataset can be downloaded from the LINDAT/CLARIN repository:
http://hdl.handle.net/11234/1-3079

Publications
------------

If you use this dataset, please cite the following paper:

Çano Erion, Bojar Ondřej. Two Huge Title and Keyword Generation Corpora of Research Articles. LREC 2020, Proceedings of the 12th International Conference on Language Resources and Evaluation, Marseille, France, May 2020

Acknowledgements
----------------

This research work was [partially] supported by OP RDE project No. CZ.02.2.69/0.0/0.0/16_027/0008495, International Mobility of Researchers at Charles University.

Statistics of OAGSX:
--------------------

Total samples:   34408509
Title tokens     mean: 13.04    std: 5.13    min: 3    max: 25
Abstract tokens  mean: 182.19   std: 89.20   min: 50   max: 400
Abs-Tit overlap  mean: 0.7713   std: 0.1796
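
Because the records are stored as JSON lines and the texts are already tokenized, token statistics like those above can be recomputed with a few lines of Python. The following is only a minimal sketch under stated assumptions: the field names "title" and "abstract" are hypothetical (check a record from the downloaded files for the actual keys), tokens are assumed to be separated by single spaces, and the overlap function is one possible definition (the share of distinct title tokens that also occur in the abstract), not necessarily the one used to produce the figures in this README. The standard deviation here is the population form, which is also an assumption.

```python
import json
import statistics
from typing import Dict, Iterable, Iterator, List


def read_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty JSON line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def token_stats(lengths: List[int]) -> Dict[str, float]:
    """Mean, population standard deviation, min and max of token counts."""
    return {
        "mean": statistics.fmean(lengths),
        "std": statistics.pstdev(lengths),
        "min": min(lengths),
        "max": max(lengths),
    }


def field_stats(records: List[dict], field: str) -> Dict[str, float]:
    """Token statistics for one text field, splitting on whitespace."""
    return token_stats([len(r[field].split()) for r in records])


def title_overlap(record: dict) -> float:
    """Assumed definition: fraction of distinct title tokens found in the abstract."""
    title = set(record["title"].split())      # "title" is a hypothetical key
    abstract = set(record["abstract"].split())  # "abstract" is a hypothetical key
    return len(title & abstract) / len(title) if title else 0.0


if __name__ == "__main__":
    # Two made-up records standing in for lines of a dataset file.
    sample_lines = [
        '{"title": "a b c", "abstract": "a b x y"}',
        '{"title": "d e", "abstract": "d z"}',
    ]
    records = list(read_records(sample_lines))
    print(field_stats(records, "title"))
    print([title_overlap(r) for r in records])
```

To run this on the real data, pass an open file handle instead of the in-memory list, for example `read_records(open(path, encoding="utf-8"))`, where `path` points to one of the downloaded text files. With 34408509 samples, it is safer to accumulate the lengths in a streaming fashion than to build a full list of records in memory.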