A large web corpus (over 10 billion tokens) licensed under CreativeCommons license family in 50+ languages that has been extracted from CommonCrawl, the largest publicly available general Web crawl to date with about 2 billion crawled URLs.
The book [1] contains spelling rules classified into ten categories, each category containing many rules. This XML file presents our implemented rules classified with six category tags, as is the case in the book. We implemented 24 rules since the remaining rules require diacritical and morphological analysis that are outside the scope of our present work.
References:
[1] Dr.Fahmy Al-Najjar, 'Spelling rules in ten easy lessons', Al Kawthar Library,2008. Available: https://www.alukah.net/library/0/53498/%D9%82%D9%88%D8%A7%D8%B9%D8%AF-%D8%A7%D9%84%D8%A5%D9%85%D9%84%D8%A7%D8%A1-%D9%81%D9%8A-%D8%B9%D8%B4%D8%B1%D8%A9-%D8%AF%D8%B1%D9%88%D8%B3-%D8%B3%D9%87%D9%84%D8%A9-pdf/
Tokenizer, POS Tagger, Lemmatizer and Parser models for 123 treebanks of 69 languages of Universal Depenencies 2.10 Treebanks, created solely using UD 2.10 data (https://hdl.handle.net/11234/1-4758). The model documentation including performance can be found at https://ufal.mff.cuni.cz/udpipe/2/models#universal_dependencies_210_models .
To use these models, you need UDPipe version 2.0, which you can download from https://ufal.mff.cuni.cz/udpipe/2 .
Tokenizer, POS Tagger, Lemmatizer and Parser models for 131 treebanks of 72 languages of Universal Depenencies 2.12 Treebanks, created solely using UD 2.12 data (https://hdl.handle.net/11234/1-5150). The model documentation including performance can be found at https://ufal.mff.cuni.cz/udpipe/2/models#universal_dependencies_212_models .
To use these models, you need UDPipe version 2.0, which you can download from https://ufal.mff.cuni.cz/udpipe/2 .
Tokenizer, POS Tagger, Lemmatizer and Parser models for 90 treebanks of 60 languages of Universal Depenencies 2.4 Treebanks, created solely using UD 2.4 data (http://hdl.handle.net/11234/1-2988). The model documentation including performance can be found at http://ufal.mff.cuni.cz/udpipe/models#universal_dependencies_24_models .
To use these models, you need UDPipe binary version at least 1.2, which you can download from http://ufal.mff.cuni.cz/udpipe .
In addition to models itself, all additional data and value of hyperparameters used for training are available in the second archive, allowing reproducible training.
Tokenizer, POS Tagger, Lemmatizer and Parser models for 94 treebanks of 61 languages of Universal Depenencies 2.5 Treebanks, created solely using UD 2.5 data (http://hdl.handle.net/11234/1-3105). The model documentation including performance can be found at http://ufal.mff.cuni.cz/udpipe/models#universal_dependencies_25_models .
To use these models, you need UDPipe binary version at least 1.2, which you can download from http://ufal.mff.cuni.cz/udpipe .
In addition to models itself, all additional data and value of hyperparameters used for training are available in the second archive, allowing reproducible training.
Tokenizer, POS Tagger, Lemmatizer and Parser models for 99 treebanks of 63 languages of Universal Depenencies 2.6 Treebanks, created solely using UD 2.6 data (https://hdl.handle.net/11234/1-3226). The model documentation including performance can be found at https://ufal.mff.cuni.cz/udpipe/2/models#universal_dependencies_26_models .
To use these models, you need UDPipe version 2.0, which you can download from https://ufal.mff.cuni.cz/udpipe/2 .