A large web corpus (over 10 billion tokens) in 50+ languages, licensed under the Creative Commons license family and extracted from CommonCrawl, the largest publicly available general Web crawl to date, with about 2 billion crawled URLs.
The book [1] contains spelling rules classified into ten categories, each containing many rules. This XML file presents our implemented rules classified under six category tags, as in the book. We implemented 24 rules; the remaining rules require diacritical and morphological analysis, which is outside the scope of our present work.
References:
[1] Dr. Fahmy Al-Najjar, 'Spelling rules in ten easy lessons', Al Kawthar Library, 2008. Available: https://www.alukah.net/library/0/53498/%D9%82%D9%88%D8%A7%D8%B9%D8%AF-%D8%A7%D9%84%D8%A5%D9%85%D9%84%D8%A7%D8%A1-%D9%81%D9%8A-%D8%B9%D8%B4%D8%B1%D8%A9-%D8%AF%D8%B1%D9%88%D8%B3-%D8%B3%D9%87%D9%84%D8%A9-pdf/