It uses a morphological lexicon of Bulgarian (100,000 lemmas) compiled as a finite-state automaton in the CLaRK System. It requires the text to be tokenized first, and it is applied to each token. It also includes guessers for unknown words and named-entity gazetteers. If the corresponding resources are available for a different language, it can be adapted to that language.
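The per-token lookup described above can be sketched roughly as follows. This is a minimal illustration, not the CLaRK implementation: a plain dictionary stands in for the finite-state automaton, and the entries, the tag names (`NOUN`, `VERB`, `NE-LOC`, the `?` marker for guessed tags), the suffix rules, and the function `analyze` are all hypothetical.

```python
from typing import Dict, List, Tuple

Analysis = Tuple[str, str]  # (lemma, tag)

# Toy stand-in for the lexicon (assumption: the real one is a
# finite-state automaton with about 100,000 lemmas).
LEXICON: Dict[str, List[Analysis]] = {
    "книга": [("книга", "NOUN")],
    "чете": [("чета", "VERB")],
}

# Toy gazetteer of named entities (hypothetical entry and label).
GAZETTEER: Dict[str, str] = {"София": "LOC"}

# Toy guesser: suffix -> guessed tag (illustrative rules only).
SUFFIX_GUESSES: List[Tuple[str, str]] = [("ост", "NOUN"), ("ски", "ADJ")]


def analyze(token: str) -> List[Analysis]:
    """Return all candidate analyses for one token."""
    key = token.lower()
    if key in LEXICON:                      # 1. lexicon lookup
        return list(LEXICON[key])
    if token in GAZETTEER:                  # 2. named-entity gazetteer
        return [(token, "NE-" + GAZETTEER[token])]
    for suffix, tag in SUFFIX_GUESSES:      # 3. guesser for unknown words
        if key.endswith(suffix):
            return [(key, tag + "?")]       # "?" marks a guessed tag
    return [(key, "UNKNOWN")]


def analyze_tokens(tokens: List[str]) -> List[Tuple[str, List[Analysis]]]:
    """Apply the analyzer to each token of an already tokenized text."""
    return [(tok, analyze(tok)) for tok in tokens]
```

Under these assumptions, adapting the analyzer to another language amounts to swapping the three resources (lexicon, gazetteer, guesser rules) while the lookup order stays the same.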
This is a hybrid system with three stages: rules, a neural network,
and rules again. First, rules for the sure cases are applied; then
a neural network disambiguator is applied; finally, rules repair
the most frequent errors of the neural network. The rules are
implemented as constraints in the CLaRK System. The neural network
is an additional module implemented in Java. It is called CLaRK. It
requires morphologically annotated input.
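
The three-stage control flow described above can be sketched as follows. This is only an illustration under stated assumptions: a trivial scoring function stands in for the neural network, the "sure" rule and the repair rule are invented for the example, and the names `Token`, `naive_score`, and `disambiguate` are hypothetical. English tokens are used purely for readability.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Token:
    form: str
    candidates: List[str]        # tags proposed by morphological analysis
    tag: Optional[str] = None    # tag chosen by the disambiguator


def apply_sure_rules(tokens: List[Token]) -> None:
    # Stage 1 (toy rule): a token with a single candidate is a sure case.
    for tok in tokens:
        if len(tok.candidates) == 1:
            tok.tag = tok.candidates[0]


def apply_scorer(tokens: List[Token],
                 score: Callable[[Optional[str], str], float]) -> None:
    # Stage 2: stand-in for the neural network. Each unresolved token
    # gets the candidate with the highest score given the previous tag.
    for i, tok in enumerate(tokens):
        if tok.tag is None:
            prev_tag = tokens[i - 1].tag if i > 0 else None
            tok.tag = max(tok.candidates, key=lambda c: score(prev_tag, c))


def apply_repair_rules(tokens: List[Token]) -> None:
    # Stage 3 (toy rule): a VERB right after a DET is retagged as NOUN
    # when NOUN is among the candidates.
    for i in range(1, len(tokens)):
        tok = tokens[i]
        if (tokens[i - 1].tag == "DET" and tok.tag == "VERB"
                and "NOUN" in tok.candidates):
            tok.tag = "NOUN"


def naive_score(prev_tag: Optional[str], candidate: str) -> float:
    # Deliberately naive scorer that ignores context and prefers VERB.
    return {"VERB": 0.6, "NOUN": 0.4}.get(candidate, 0.0)


def disambiguate(tokens: List[Token]) -> List[str]:
    apply_sure_rules(tokens)
    apply_scorer(tokens, naive_score)
    apply_repair_rules(tokens)
    return [tok.tag for tok in tokens]


sentence = [Token("the", ["DET"]),
            Token("can", ["NOUN", "VERB"]),
            Token("rusts", ["VERB"])]
print(disambiguate(sentence))  # ['DET', 'NOUN', 'VERB']
```

In this toy run the scorer mistags "can" as a verb and the repair rule corrects it, which is the role the third stage plays in the description above.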
The tokenizer covers all languages that use the Latin1, Latin2, Latin3, and Cyrillic tables of Unicode. It can be extended to cover other Unicode tables if necessary. It is implemented as a cascaded regular grammar in CLaRK and recognizes over 60 token categories. It is easy to adapt to new token categories.
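The idea of a cascaded regular grammar can be illustrated with a two-level sketch: a first level assigns primitive categories with regular expressions, and a second level combines sequences of primitive tokens into larger ones. This is not the CLaRK grammar. The five categories, the character ranges, the `DECIMAL` rule, and the function names are assumptions chosen for the example, whereas the actual tokenizer recognizes over 60 categories.

```python
import re
from typing import List, Tuple

Tok = Tuple[str, str]  # (category, text)

# Level 1: primitive token categories (hypothetical subset).
LEVEL1 = re.compile(r"""
    (?P<CYRILLIC>[\u0400-\u04FF]+)
  | (?P<LATIN>[A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u024F]+)
  | (?P<NUMBER>[0-9]+)
  | (?P<SPACE>\s+)
  | (?P<PUNCT>.)
""", re.VERBOSE)


def level1(text: str) -> List[Tok]:
    """Split the text into primitive tokens, each with a category."""
    return [(m.lastgroup, m.group()) for m in LEVEL1.finditer(text)]


def level2(tokens: List[Tok]) -> List[Tok]:
    """Combine NUMBER, '.' or ',', NUMBER into one DECIMAL token."""
    out: List[Tok] = []
    i = 0
    while i < len(tokens):
        if (i + 2 < len(tokens)
                and tokens[i][0] == "NUMBER"
                and tokens[i + 1] in (("PUNCT", "."), ("PUNCT", ","))
                and tokens[i + 2][0] == "NUMBER"):
            merged = tokens[i][1] + tokens[i + 1][1] + tokens[i + 2][1]
            out.append(("DECIMAL", merged))
            i += 3
        else:
            out.append(tokens[i])
            i += 1
    return out


def tokenize(text: str) -> List[Tok]:
    """Run both levels, then drop whitespace tokens."""
    return [t for t in level2(level1(text)) if t[0] != "SPACE"]


print(tokenize("Цена: 3,14 EUR"))
# [('CYRILLIC', 'Цена'), ('PUNCT', ':'), ('DECIMAL', '3,14'), ('LATIN', 'EUR')]
```

In such a design, a new token category is added either as another alternative at the first level or as another combination rule at a later level, and support for a further Unicode table is a matter of adding its character range.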