A vocabulary resulting from the cooperation of the groups of REALITER network that collects the basic terminology mostly used in texts about Genomics. It contains equivalents in English, Peninsular and Latinamerican Spanish, French, Italian, Galician, Portuguese and Catalan.
Statistical analysis service: It calculates P(cue|class): probability of seeing a linguistic cue given a lexical class. This probability is computed given the occurrences of cues in a corpus (codified in the signatures file) and the information of belonging or not belonging of these words to different classes (codified in indicators file).
The probability is computed for each studied cue in the signatures file and for each class in the indicators file.
This RESTful service allows to define a sub-corpus from different annotated corpora. The service includes a POS tag harmonisation process where original tags are converted to EAGLES/Parole format. The eventual sub-corpus is indexed using the IMS CWB tool. The user receives an ID which can be used by the CQP service to exploit the sub-corpus.
This RESTful service accesses part of the Hemeroteca Digital de l’Arxiu Municipal de Girona (digital press archive from the Girona city council), specifically Catalan press from 2003. The service uses the SRU protocol.
Search engine for the neologisms database of the NEOROM network. The network collects neologisms used in the press written in Romance languages from 2005 onwards.
Oral corpus containing 166 narratives in English elicited by means of Labovian techniques. Participants from the UK (England, Wales, Scotland), Ireland, USA, Australia and South Africa.
The electronic version of the book “Corpus PAAU 1992: Descriptive Studies, Texts and Vocabulary” includes the texts that have been object of analysis in this project as well as the vocabulary lists that make up the Corpus 92.
domain specific corpus (Law, Economy, Computing, Medicine and Environment as well as a contrastive corpus from the press); EN 3.3 M tokens, SP 33 M tokens, CAT 19 M tokens; EAGLEs pos tagset
"A scholarly edition of Aquinas's Opera omnia, with a lexical database, a dictionary, two collection of historical sources, and an extensive bibliography."
This SOAP service implements the IMS Open Corpus Workbench (CWB), a collection of open-source tools for managing and querying large text corpora (ranging from 10 million to 2 billion words) with linguistic annotations. Its central component is the flexible and efficient query processor CQP. The service makes it possible to index a new corpus and query it.
Collection of different digitized mastheads in Catalan and Spanish, covering a time span from 1808 to 2008. The collection, which is kept in the Girona City Council Archive, totals 1.599.733 digitized pages.
EDBL (Lexical DataBase for Basque) is the lexical basis needed for the automatic treatment of Basque. It is made up of about 120.000 entries divided into dictionary entries (the same you can find in a conventional dictionay), verb forms and dependent morphemes, all of them with their respective morphological information.
Open source language analysis tool suite: tokenizer, stemmer/lemmatizer, named entity recognizer, chunker/segmenter, morphosyntactic tagger, syntactic tagger, corpus processer, morphological tagger, semantic tagger, analyzer, Word Sense Disambiguator.
Text preprocess (this preprocess service requires that the input text be in plain text format (file .txt) and UTF-8).
Basically, it carries out: (i) text segmentation into minor structural units (titles, paragraphs, sentences, etc.); (ii) detection of entities not found in dictionaries (numbers, abbreviations, URLs, emails, proper nouns, etc.); and (iii) the keeping of sequences of two or more words in a single block (dates, phrases, proper nouns, etc.).
POS tagger. (The input file must be in plain text format (file.txt) and UTF-8 encoded. The disambiguation process is done by a TreeTagger instance trained by the IULA.)
A tool for statistical corpus exploitation. It offers concordances, counts ngrams, extracts collocations and gives association, distribution and similarity measures.
JIRS is a Passage Retrieval system specially suited for Question Answering. It could be adapted to others languages very easily. ask (Written Language): Information Retrieval Applications Question/Answering Environment: OS-independent Access: GPLv3
A tool for contrasting terminological vocabularies and textual corpora. It allows controlling the presence and location of reference vocabularies in textual corpora.
Multilingual lexical database that follows the model proposed by the EuroWordNet project. The MCR integrates into the same EuroWordNet framework wordnets from five different languages (together with four English WordNet versions). It also integrates WordNet Domains and new versions of the Base Concepts and Top Concept Ontology. Overall, it contains 1,642,389 semantic relations between synsets, most of them acquired by automatic means. Information contained: semantics, synonyms, antonyms, definition, equivalents, example of use, morphology.
Ted Pedersen's Ngram Statistics Package (used to identify word Ngrams that appear in large corpora using standard tests of association such as Fisher's exact test, the log likelihood ratio, Pearson's chi-squared test, the Dice Coefficient, etc.).
A package of tools for the processing of the Corpus Tècnic in Catalan and Spanish. It includes a preprocessor, a PoSTagger and a linguistic disambiguator.
It calculates the Term Frequency and the Inverse Document Frequency of a word in a given corpus (a statistical measure used to evaluate how important a word is to a document in a collection or corpus).
A package of tools for Catalan and Spanish corpus processing. It includes a text handling module and a probabilistic POS tagger. It also allows consulting POS tagger dictionary data.
An electronic version of a vocabulary that resulted from the collaboration with the Labour Department. Its nomenclature includes more than 1,000 terms; besides, it contains six thematic annexes and a Catalan-Spanish index.
Trilingual corpus (Catalan, Spanish, English) that contains large portions of the Wikipedia (based on a 2006 dump) and has been automatically enriched with linguistic information. In its present version, it contains over 750 million words.