This dataset includes 45300 Persian word forms that were manually segmented into sequences of morphemes. Lemmas and some extra information about the words are also included.

Words are separated by "\n", and each line (one per word) contains the following information:

word lemma form ambiguity segment_1 segment_2 ... segment_n

where "form" is one of:

V: verb
E: named entity
I: irregular plural
X: none of the above

and "ambiguity" is 0 when the word has only one meaning and 1 when it has more than one meaning. For more information about this dataset, see [1].

Methodology:

We extracted our primary word list from a collection of three corpora. The first contains sentences extracted from the Persian Wikipedia [2]. The second is a popular Persian corpus, BijanKhan [3], and the last is the Persian Named Entity corpus [4]. For all of these corpora, we used the Hazm toolkit (Persian preprocessing and tokenization tools) [5] and the stemming tool presented in [6]. We extracted and normalized all sentences and lemmatized all words using our rule-based lemmatizer, which uses a collection of Persian lemmas. Finally, all semi-spaces were automatically detected and fixed.

Words with more than 10 occurrences in our corpus collection were selected for manual annotation, which resulted in a set of around 80K word forms. We distributed them among 16 annotators so that each word was checked and annotated by two people independently. For each word, annotators decided on the lemma, the segmentation, plurality, and ambiguity (whether the word has more than one meaning). Manual annotation of segmentation was accelerated by predicting morpheme boundaries with our automatic segmenter and offering the most confident ones to the annotators. Annotators could also indicate that a word is not a proper Persian word, which led to the removal of almost 30K words from the lexicon.
The remaining 46000 words were sent for resolution of inter-annotator differences. All disagreements were reviewed and corrected by the main authors. All annotated words were then quickly reviewed by two Persian linguists. The whole process took almost six weeks, and the final dataset contains 45300 words.

To make the data more suitable for future segmentation experiments, we divided it into three sets. The training set includes almost 37K words, and the test and development sets include around 4K words each. Moreover, we divided the dataset based on the words' derivational trees, so that all words with the same root are in the same set [1].

Acknowledgment:

The research was supported by OP RDE project No. CZ.02.2.69/0.0/0.0/16_027/0008495, International Mobility of Researchers at Charles University, by GACR 19-14534S, and by LM2015071.

The authors gratefully acknowledge the contribution and help of the people listed below:

- Alireza Abdi
- Sahar Badri
- Abbas Beygi
- Aysan Chehreh
- Matin Ebrahimkhani
- Aryan Fallah
- Fatemeh Fallah
- Seyed Amirhossein Hosseini
- Amirhossein Mafi
- Zohreh Kazemi
- Mohammad Mahmoudi
- Nazanin Pakdan
- Seyed Ahmad Sharifi

References:

[1] Hamid Haghdoost, Ebrahim Ansari, and Zdeněk Žabokrtský. 2019. Building a Morphological Network for Persian on Top of a Morpheme-Segmented Lexicon. To appear in the Proceedings of the Second Workshop on Resources and Tools for Derivational Morphology.
[2] Akbar Karimi, Ebrahim Ansari, and Bahram Sadeghi Bigham. 2018. Extracting an English-Persian Parallel Corpus from Comparable Corpora. In Proceedings of LREC 2018.
[3] Mahmood Bijankhan, Javad Sheykhzadegan, Mohammad Bahrani, and Masood Ghayoomi. 2011. Lessons from building a Persian written corpus: Peykare. Language Resources and Evaluation 45(2):143–164.
[4] Hanieh Poostchi, Ehsan Zare Borzeshi, and Massimo Piccardi. 2018.
BiLSTM-CRF for Persian Named-Entity Recognition ArmanPersoNERCorpus: the First Entity-Annotated Persian Dataset. In Proceedings of LREC 2018, Miyazaki, Japan, May 7-12, 2018.
[5] https://github.com/sobhe/hazm
[6] Hossein Taghi-Zadeh, Mohammad Hadi Sadreddini, Mohammad Hasan Diyanati, and Amir Hossein Rasekh. 2015. A new hybrid stemming method for Persian language. Digital Scholarship in the Humanities 32(1):209–221.

Authors:

Ebrahim Ansari, Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University, Prague, Czech Republic, ansari@ufal.mff.cuni.cz

Zdeněk Žabokrtský, Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University, Prague, Czech Republic, zdenek.zabokrtsky@mff.cuni.cz

Hamid Haghdoost, Department of Computer Science and Information Technology, Institute for Advanced Studies in Basic Sciences (IASBS), Zanjan, Iran, hamid.h@iasbs.ac.ir

Mahshid Nikravesh, Department of Chemistry, Institute for Advanced Studies in Basic Sciences (IASBS), Zanjan, Iran, hamid.h@iasbs.ac.ir
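
Reading the data (illustrative sketch):

The line format described at the top of this file (word, lemma, form, ambiguity, then one or more segments) can be read with a short script. The sketch below is not part of the dataset distribution. It assumes that fields are separated by whitespace and that a word form never contains a plain space; check both assumptions against the actual files. The names Entry, parse_line and parse_lexicon are hypothetical, and the sample lines use made-up transliterated placeholders rather than real entries.

```python
from dataclasses import dataclass
from typing import List

# Form codes as listed in the format description above.
FORM_LABELS = {
    "V": "verb",
    "E": "named entity",
    "I": "irregular plural",
    "X": "none of the above",
}


@dataclass
class Entry:
    word: str
    lemma: str
    form: str            # one of the keys of FORM_LABELS
    ambiguous: bool      # True when the ambiguity field is "1"
    segments: List[str]  # morpheme segments, in order


def parse_line(line: str) -> Entry:
    """Parse one line: word lemma form ambiguity segment_1 ... segment_n."""
    # Assumption: fields are whitespace-separated.
    fields = line.split()
    if len(fields) < 5:
        raise ValueError(f"expected at least 5 fields, got {len(fields)}: {line!r}")
    word, lemma, form, ambiguity = fields[:4]
    if form not in FORM_LABELS:
        raise ValueError(f"unknown form code {form!r} in line: {line!r}")
    if ambiguity not in ("0", "1"):
        raise ValueError(f"ambiguity must be 0 or 1, got {ambiguity!r}")
    return Entry(word, lemma, form, ambiguity == "1", fields[4:])


def parse_lexicon(text: str) -> List[Entry]:
    """Parse a whole file's text, skipping empty lines."""
    return [parse_line(line) for line in text.split("\n") if line.strip()]


if __name__ == "__main__":
    # Hypothetical, transliterated sample lines (not taken from the dataset).
    sample = "ketabha ketab X 0 ketab ha\nraft raftan V 1 raft\n"
    for entry in parse_lexicon(sample):
        print(entry.word, FORM_LABELS[entry.form], entry.ambiguous, entry.segments)
```

For a real file, the text would typically be read with UTF-8 encoding, for example open(path, encoding="utf-8").read(), where path is the location of the downloaded data file.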