A large web corpus (over 10 billion tokens) in 50+ languages, licensed under the Creative Commons license family. It was extracted from CommonCrawl, the largest publicly available general Web crawl to date, with about 2 billion crawled URLs.
The FASpell dataset was developed for the evaluation of spell-checking algorithms. It contains pairs of misspelled Persian words and their corrected forms, similar to the ASpell dataset used for English.
The dataset consists of two parts:
a) faspell_main: a list of 5050 pairs collected from errors made by elementary school pupils and professional typists.
b) faspell_ocr: a list of 800 pairs collected from the output of a Farsi OCR system.
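
As a rough illustration of how such pairs might be used for evaluation, the sketch below computes the fraction of misspelled words that a spell checker maps to the expected correction. The function name `correction_accuracy`, the `correct` callback, and the toy English pairs are all hypothetical and are not taken from FASpell; the on-disk format of the dataset is not described here, so the pairs are assumed to be already loaded as (misspelled, corrected) tuples.

```python
from typing import Callable, Iterable, Tuple


def correction_accuracy(
    pairs: Iterable[Tuple[str, str]],
    correct: Callable[[str], str],
) -> float:
    """Return the share of pairs where correct(misspelled) equals the expected form."""
    total = 0
    hits = 0
    for misspelled, expected in pairs:
        total += 1
        if correct(misspelled) == expected:
            hits += 1
    # An empty input yields 0.0 rather than raising ZeroDivisionError.
    return hits / total if total else 0.0


# Toy data for illustration only (assumption: NOT drawn from FASpell).
toy_pairs = [("teh", "the"), ("recieve", "receive"), ("cat", "cat")]

# A stand-in "spell checker": a lookup table that leaves unknown words unchanged.
toy_lookup = {"teh": "the"}


def toy_corrector(word: str) -> str:
    return toy_lookup.get(word, word)


accuracy = correction_accuracy(toy_pairs, toy_corrector)
print(f"accuracy = {accuracy:.3f}")
```

The same function could be applied separately to the faspell_main and faspell_ocr pairs, since typing errors and OCR errors tend to stress a spell checker in different ways.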