It calculates the term frequency (TF) and the inverse document frequency (IDF) of a word in a given corpus. Combined as TF-IDF, they form a statistical measure used to evaluate how important a word is to a document in a collection or corpus.
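As a rough illustration of the measure described above, the following Python sketch computes both values for one word. It assumes one common formulation (TF as relative frequency within the document, IDF as the natural logarithm of the number of documents divided by the number of documents containing the word); the tool itself may use a different variant. The function name `tf_idf` and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def tf_idf(term, doc, corpus):
    """Return (tf, idf, tf * idf) for `term` in `doc`.

    `doc` is a list of tokens and `corpus` is a list of such documents.
    Assumed formulation (not necessarily the tool's own):
      tf  = occurrences of term in doc / number of tokens in doc
      idf = ln(number of documents / number of documents containing term)
    """
    counts = Counter(doc)
    tf = counts[term] / len(doc) if doc else 0.0
    # Document frequency: how many documents contain the term at least once.
    df = sum(1 for d in corpus if term in d)
    # A term that appears in no document gets an IDF of 0 by convention here.
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf, idf, tf * idf


# Hypothetical toy corpus, already tokenized.
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "and", "the", "dog"],
]

for word in ("the", "sat"):
    tf, idf, score = tf_idf(word, corpus[0], corpus)
    print(f"{word}: tf={tf:.3f} idf={idf:.3f} tf-idf={score:.3f}")
```

A word such as "the", which occurs in every document, gets an IDF of zero and therefore a TF-IDF of zero, while a word confined to a single document scores higher.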
A package of tools for processing Catalan and Spanish corpora. It includes a text handling module and a probabilistic POS tagger. It also allows users to look up data in the POS tagger's dictionary.
An electronic version of a vocabulary produced in collaboration with the Labour Department. Its nomenclature includes more than 1,000 terms. It also contains six thematic annexes and a Catalan-Spanish index.
Trilingual corpus (Catalan, Spanish, English) that contains large portions of Wikipedia (based on a 2006 dump) and has been automatically enriched with linguistic information. In its present version, it contains over 750 million words.