A Gold Standard Word Alignment for English-Swedish (GES) is a resource containing 1164 manually word aligned sentences pairs from English and Swedish versions of Europarl v. 2.
The data can be found here: https://www.ida.liu.se/labs/nlplab/ges/
The application, developed in C#, automatically identifies the language of a text written in one of the 21 European Union languages. By using training texts in different languages (approx. 1.5Mb of text for each language), a training module counts the prefixes (the first 3 characters) and the suffixes (4 characters endings) for all the words in the texts, for each language. For every language two models are constructed, containing the weights (percentages) of prefixes and suffixes in the texts representing a language. In the prediction phase, for a new text, two models are built on the fly in a similar manner. These models are then compared with the stored models representing each language for which the application was trained. Using comparison functions, the best model is chose. More detailed descriptions are available in [[http://www.racai.ro/~tufis/papers|the following papers]]: -- Dan Tufiş, Radu Ion, Alexandru Ceauşu, and Dan Ştefănescu (2008). RACAI's Linguistic Web Services. In Proceedings of the 6th Language Resources and Evaluation Conference - LREC 2008, Marrakech, Morocco, May 2008. ELRA - European Language Resources Association. ISBN 2-9517408-4-0. -- Dan Tufiş and Alexandru Ceauşu (2007). Diacritics Restoration in Romanian Texts. In Elena Paskaleva and Milena Slavcheva (eds.), A Common Natural Language Processing Paradigm for Balkan Languages - RANLP 2007 Workshop Proceedings, pp. 49-56, Borovets, Bulgaria, September 2007. INCOMA Ltd., Shoumen, Bulgaria. ISBN 978-954-91743-8-0. -- Dan Tufiş and Adrian Chiţu (1999). Automatic Insertion of Diacritics in Romanian Texts. In Ferenc Kiefer, Gábor Kiss, and Júlia Pajzs (eds.), Proceedings of the 5th International Workshop on Computational Lexicography (COMPLEX 1999), pp. 185-194, Pecs, Hungary, May 1999. Linguistics Institute, Hungarian Academy of Sciences.
The database will contain an etymological lexicon of Saami languages complete with detailed source citations. The database will be open to the public in November 2006 and will be updated regularly.
ANNIS2 is an open source, versatile web browser-based search and visualization architecture for complex multilevel linguistic corpora with diverse types of annotation. ANNIS, which stands for ANNotation of Information Structure, has been designed to provide access to the data of the SFB 632 - "Information Structure: The Linguistic Means for Structuring Utterances, Sentences and Texts". Since information structure interacts with linguistic phenomena on many levels, ANNIS2 addresses the SFB's need to concurrently annotate, query and visualize data from such varied areas as syntax, semantics, morphology, prosody, referentiality, lexis and more. For project working with spoken language, support for audio / video annotations is also required.
Tool for manual on-line annotation of corpora at various linguistic levels. The levels currently implemented are: word-level and sentence-level segmentation, morphosyntax, word sense disambiguation. Anotatornia implements sophisticated mechanisms of the management of texts, annotators and conflicts.
This corpus constitutes all sentences representing the Arabic Controlled Language (ACL). It contains 551 sentences taken from four textbooks and websites dedicated to teach Arabic language to kids such as: a) First grade book, Republic of Sudan (كتاب الصف الاول جمهورية السودان), b) Al Jazeera Educational Site (موقع الجزيرة التعليمي), c) Bella Preparatory School Girls Forum (منتدى مدرسة بيلا الاعدادية بنات), and d) Albahr website (موقع انا البحر). These sentences are respecting 52 ACL rules. The average number of sentences for each rule is 10.6. All sentences in the corpus were analyzed by Farasa syntactic parser to confirm they are correctly analyzed. The validity of the parsing was done manually by linguist experts.
The structure of this corpus is made of a header and a body. The header consists of a set of metadata that describe the corpus, such as the corpus name, the authors, the sources and further meta data. While the header is made of metadata, the body contains rules. Each rule has a code, a structure and all sentences respecting that rule. For each sentence, we store an id, the vowelledand unvowelled text as well as the result of parsing using Farasa.
Araucaria is a software tool for analysing arguments. It aids a user in reconstructing and diagramming an argument using a simple point-and-click interface. The software also supports argumentation schemes, and provides a user-customisable set of schemes with which to analyse arguments. Written in Java, released under the GNU General Public License.