A Gold Standard Word Alignment for English-Swedish (GES) is a resource containing 1164 manually word aligned sentences pairs from English and Swedish versions of Europarl v. 2.
The data can be found here: https://www.ida.liu.se/labs/nlplab/ges/
This corpus constitutes all sentences representing the Arabic Controlled Language (ACL). It contains 551 sentences taken from four textbooks and websites dedicated to teach Arabic language to kids such as: a) First grade book, Republic of Sudan (كتاب الصف الاول جمهورية السودان), b) Al Jazeera Educational Site (موقع الجزيرة التعليمي), c) Bella Preparatory School Girls Forum (منتدى مدرسة بيلا الاعدادية بنات), and d) Albahr website (موقع انا البحر). These sentences are respecting 52 ACL rules. The average number of sentences for each rule is 10.6. All sentences in the corpus were analyzed by Farasa syntactic parser to confirm they are correctly analyzed. The validity of the parsing was done manually by linguist experts.
The structure of this corpus is made of a header and a body. The header consists of a set of metadata that describe the corpus, such as the corpus name, the authors, the sources and further meta data. While the header is made of metadata, the body contains rules. Each rule has a code, a structure and all sentences respecting that rule. For each sentence, we store an id, the vowelledand unvowelled text as well as the result of parsing using Farasa.
Transcribed narrative interviews with people from East and West Berlin about the events of November 9. 282,000 tokens. TEI XML, lemma and POS. Normalized version also available.
The system Česílko (language data and software tools) was first developed as an answer to a growing need of translation and localisation from one source language to many target languages. The starting system belonged to the Shallow Parse, Shallow Transfer Rule-Based Machine Translation – (RBMT) paradigm and it was designed primarily for translation of related languages. The latest implementation of the system uses a stochastic ranker; so technically it belongs to the hybrid machine translation paradigm, using stochastic methods combined with the traditional Shallow Transfer RBMT methods. The system has been stripped of the accompanying language resources due to copyright restrictions. The data that is available is just for demonstrative purposes.
The code-switching corpus consists of 5x30-minute conversations between four speakers (i.e. a total of 20 speakers). The speakers are bilingual speakers of Papiamento (a creole langauge spoken in the Dutch Antilles) and Dutch. In the course of their free conversations, they engage in code-switching, that is, they use both languages within the same utterance in systematic ways. The corpus is fully transcribed and glossed, coded for language and word class, in ELAN.