A vocabulary produced jointly by the groups of the REALITER network. It collects the basic terminology most frequently used in texts on genomics and contains equivalents in English, Peninsular and Latin American Spanish, French, Italian, Galician, Portuguese and Catalan.
Statistical analysis service. It calculates P(cue|class), the probability of seeing a linguistic cue given a lexical class. The probability is computed from the occurrences of cues in a corpus (encoded in the signatures file) and from whether or not each word belongs to the different classes (encoded in the indicators file).
The probability is computed for each studied cue in the signatures file and for each class in the indicators file.
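As an illustration of the calculation described above, here is a minimal sketch in Python. It assumes a plain relative-frequency estimate, which may differ from the estimator the service actually uses. The in-memory dictionaries stand in for the signatures and indicators files, whose real formats are not described here. The function name `cue_given_class` and the example words, cues and classes are hypothetical.

```python
from collections import defaultdict


def cue_given_class(signatures, indicators):
    """Relative-frequency estimate of P(cue | class).

    signatures: {word: {cue: count}}        (assumed stand-in for the signatures file)
    indicators: {word: {class_name: bool}}  (assumed stand-in for the indicators file)
    """
    cue_counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    for word, cues in signatures.items():
        for cls, is_member in indicators.get(word, {}).items():
            if not is_member:
                continue
            # Pool the cue counts of every word that belongs to the class.
            for cue, count in cues.items():
                cue_counts[cls][cue] += count
                totals[cls] += count
    return {
        cls: {cue: count / totals[cls] for cue, count in cues.items()}
        for cls, cues in cue_counts.items()
        if totals[cls] > 0
    }


# Hypothetical toy data: two words, two cues, two classes.
signatures = {
    "water": {"bare_singular": 8, "plural": 2},
    "chair": {"bare_singular": 1, "plural": 9},
}
indicators = {
    "water": {"mass": True, "count": False},
    "chair": {"mass": False, "count": True},
}
probs = cue_given_class(signatures, indicators)
```

The result holds one probability for each cue and class pair, matching the "each cue, each class" output described above.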
This RESTful service lets users define a sub-corpus from different annotated corpora. The service includes a POS tag harmonisation process in which the original tags are converted to EAGLES/Parole format. The resulting sub-corpus is indexed using the IMS CWB tool. The user receives an ID which can be used by the CQP service to query the sub-corpus.
This RESTful service accesses part of the Hemeroteca Digital de l’Arxiu Municipal de Girona (digital press archive from the Girona city council), specifically Catalan press from 2003. The service uses the SRU protocol.
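For readers unfamiliar with SRU, the sketch below builds a searchRetrieve request URL in Python. The parameter names (`operation`, `version`, `query`, `startRecord`, `maximumRecords`) come from the SRU standard. The base URL is a placeholder, not the service's real endpoint, and the SRU version the service supports is assumed to be 1.2.

```python
from urllib.parse import urlencode

# Placeholder endpoint: the real service address is not given in this description.
BASE_URL = "https://example.org/sru"


def build_sru_url(cql_query, start=1, maximum=10, base_url=BASE_URL):
    """Build an SRU searchRetrieve URL for a CQL query string."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.2",  # assumed; check which version the service supports
        "query": cql_query,
        "startRecord": start,
        "maximumRecords": maximum,
    }
    return f"{base_url}?{urlencode(params)}"


url = build_sru_url("Girona")
```

Fetching the URL with any HTTP client should return an XML response holding the matching records, if the endpoint follows the standard.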
Search engine for the neologisms database of the NEOROM network. The network collects neologisms used in the press written in Romance languages from 2005 onwards.
Oral corpus containing 166 narratives in English elicited by means of Labovian techniques. Participants from the UK (England, Wales, Scotland), Ireland, USA, Australia and South Africa.
The electronic version of the book “Corpus PAAU 1992: Descriptive Studies, Texts and Vocabulary” includes the texts analysed in this project as well as the vocabulary lists that make up the Corpus 92.
Domain-specific corpus (Law, Economics, Computing, Medicine and Environment, plus a contrastive corpus from the press); EN 3.3 M tokens, SP 33 M tokens, CAT 19 M tokens; EAGLES POS tagset.
This SOAP service implements the IMS Open Corpus Workbench (CWB), a collection of open-source tools for managing and querying large text corpora (ranging from 10 million to 2 billion words) with linguistic annotations. Its central component is the flexible and efficient query processor CQP. The service makes it possible to index a new corpus and query it.
Text preprocessor. The service requires the input to be a plain-text file (.txt) encoded in UTF-8.
It carries out three tasks: (i) segmentation of the text into smaller structural units (titles, paragraphs, sentences, etc.); (ii) detection of entities not found in dictionaries (numbers, abbreviations, URLs, emails, proper nouns, etc.); and (iii) grouping of sequences of two or more words into a single block (dates, phrases, proper nouns, etc.).
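The three tasks can be illustrated with a small regex-based sketch in Python. This is not the service's implementation. The patterns, the labels and the choice of an underscore to join multiword dates are all assumptions made for the example, and only English month names are handled.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
DATE = re.compile(rf"\b\d{{1,2}} (?:{MONTHS}) \d{{4}}\b")
DATE_BLOCK = re.compile(r"\d{1,2}_[A-Z][a-z]+_\d{4}")
URL = re.compile(r"https?://\S+")
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")

LABELS = (("date", DATE_BLOCK), ("url", URL), ("email", EMAIL), ("number", NUMBER))


def classify(token):
    """(ii) Label a token as date, url, email, number or plain word."""
    core = token.rstrip(".,;:!?") or token
    for label, pattern in LABELS:
        if pattern.fullmatch(core):
            return core, label
    return core, "word"


def preprocess(text):
    """Return paragraphs -> sentences -> (token, label) pairs."""
    # (i) Paragraphs are separated by blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    output = []
    for para in paragraphs:
        # (iii) Keep a multiword date together as one block.
        para = DATE.sub(lambda m: m.group(0).replace(" ", "_"), para)
        # (i) Naive sentence split: end punctuation followed by a capital letter.
        sentences = re.split(r"(?<=[.!?])\s+(?=[A-Z])", para)
        output.append([[classify(tok) for tok in s.split()] for s in sentences])
    return output


sample = "Visit http://example.org on 3 May 2010. Write to info@example.org for 25 copies."
result = preprocess(sample)
```

A production preprocessor would also need abbreviation lists and language-specific rules, which this sketch leaves out.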
POS tagger. The input must be a plain-text file (.txt) encoded in UTF-8. Disambiguation is performed by a TreeTagger instance trained by IULA.
A tool for statistical corpus exploitation. It offers concordances, counts ngrams, extracts collocations and gives association, distribution and similarity measures.
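A concordance of the kind mentioned above can be sketched in a few lines of Python. The function name `kwic`, the window size and the toy sentence are illustrative assumptions and say nothing about how the tool itself is implemented.

```python
def kwic(tokens, keyword, window=3):
    """Return keyword-in-context lines as (left context, hit, right context)."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, token, right))
    return lines


tokens = "the cat sat on the mat and the dog slept".split()
hits = kwic(tokens, "the", window=2)
```

Each hit keeps up to `window` tokens on either side, so lines can be aligned on the keyword when printed.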
A tool for comparing terminological vocabularies with textual corpora. It lets users check the presence and location of reference vocabulary items in textual corpora.
Ted Pedersen's Ngram Statistics Package (used to identify word Ngrams that appear in large corpora using standard tests of association such as Fisher's exact test, the log likelihood ratio, Pearson's chi-squared test, the Dice Coefficient, etc.).
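To show what two of these association measures compute, the sketch below scores bigrams with the Dice coefficient and the log-likelihood ratio, using the standard 2x2 contingency table. It is a Python illustration of the measures only. It does not reproduce the package's own interface or options, and the function name `bigram_scores` is hypothetical.

```python
import math
from collections import Counter


def bigram_scores(tokens):
    """Score every bigram with the Dice coefficient and the log-likelihood ratio."""
    bigrams = list(zip(tokens, tokens[1:]))
    n = len(bigrams)
    joint = Counter(bigrams)
    first = Counter(w1 for w1, _ in bigrams)    # w1 in first position
    second = Counter(w2 for _, w2 in bigrams)   # w2 in second position
    scores = {}
    for (w1, w2), n11 in joint.items():
        n1p, np1 = first[w1], second[w2]
        # Observed and expected cells of the 2x2 contingency table.
        observed = [n11, n1p - n11, np1 - n11, n - n1p - np1 + n11]
        expected = [
            n1p * np1 / n,
            n1p * (n - np1) / n,
            (n - n1p) * np1 / n,
            (n - n1p) * (n - np1) / n,
        ]
        ll = 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)
        dice = 2 * n11 / (n1p + np1)
        scores[(w1, w2)] = {"dice": dice, "ll": ll}
    return scores


tokens = "new york is big and new york is old".split()
scores = bigram_scores(tokens)
```

In this toy text "new" is always followed by "york", so that bigram gets the maximum Dice score of 1.0.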
A package of tools for processing the Corpus Tècnic in Catalan and Spanish. It includes a preprocessor, a POS tagger and a linguistic disambiguator.
It calculates the term frequency and the inverse document frequency (TF-IDF) of a word in a given corpus. TF-IDF is a statistical measure of how important a word is to a document within a collection or corpus.
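A minimal sketch of the measure in Python follows. It assumes one common variant: term frequency as a proportion of the document length, and inverse document frequency as the natural logarithm of N/df. The service may use a different weighting or smoothing scheme. The function name and the toy corpus are hypothetical.

```python
import math


def tf_idf(term, document, corpus):
    """TF-IDF of `term` in `document`, where `corpus` is a list of token lists."""
    if not document:
        return 0.0
    tf = document.count(term) / len(document)
    df = sum(1 for doc in corpus if term in doc)  # documents containing the term
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf


corpus = [
    ["gene", "dna", "gene"],
    ["dna", "cell"],
    ["cell", "wall"],
]
gene_score = tf_idf("gene", corpus[0], corpus)
dna_score = tf_idf("dna", corpus[0], corpus)
```

Here "gene" scores higher than "dna" in the first document because it is both more frequent there and rarer across the collection.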
An electronic version of a vocabulary that resulted from a collaboration with the Labour Department. It covers more than 1,000 terms and also contains six thematic annexes and a Catalan-Spanish index.