1970s "representative" corpus of German created by the research group "Linguistik und Maschinelle Sprachbearbeitung" (linguistics and machine language processing); a time-slice corpus of written German from 1970, offering a cross-section of various text types.
Full literary works (e-text, PDF, facsimile) in selected editions, accompanied by scholarly commentary and additional secondary materials; includes both older copyright-free works (still the lion's share) and new works (by licensing agreement with IPR holders' organizations); approx. 150 titles, planned to grow by 80-100 titles annually.
Searchable multilingual text collection (700+ million words) and a dictionary database covering 251 languages and dialects. The dictionary (ca. 8 million words) provides a word's translation, definition, grammar, synonyms, antonyms, image, pronunciation, etc.
As a sub-section of MATEO, MARABU (Mannheimer Reihe Altes Buch) includes illustrated books, manuscripts and rarissima, sources on the history of the Electoral Palatinate, and contributions on women of Humanism.
MEBA is a lexical aligner, implemented in C#, based on an iterative algorithm that relies on several pre-processing steps: sentence alignment ([[http://www.clarin.eu/tools/sal-sentence-aligner|SAL]]), tokenization, POS-tagging and lemmatization (through [[http://www.clarin.eu/tools/ttl-tokenizing-tagging-and-lemmatizing-free-running-texts|TTL]]), and sentence chunking. Like the YAWA aligner, MEBA generates the links step by step, beginning with the most probable ones (anchor links). The links added at any later step are supported or restricted by the links created in the previous iterations. The aligner uses different weights and different significance thresholds for each feature and iteration. Each iteration can be configured to align different categories of tokens (named entities, dates and numbers, content words, function words, punctuation) in decreasing order of statistical evidence.
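The step-by-step linking described above can be illustrated with a minimal Python sketch. It is not MEBA's implementation: the association scores, the per-iteration thresholds, and the `max_jump` locality rule (used here as a stand-in for "supported or restricted by earlier links") are all assumptions made for illustration, and the function name `iterative_align` is hypothetical.

```python
def iterative_align(scores, thresholds, max_jump=1):
    """Greedy iterative word alignment (illustrative only).

    scores:     dict mapping (src_index, tgt_index) -> association score
    thresholds: one score threshold per iteration, in decreasing order
    max_jump:   assumed locality window around already accepted links
    """
    links = []
    used_src, used_tgt = set(), set()
    for iteration, threshold in enumerate(thresholds):
        # Candidates for this iteration, strongest evidence first.
        candidates = sorted(
            ((s, i, j) for (i, j), s in scores.items()
             if s >= threshold and i not in used_src and j not in used_tgt),
            reverse=True,
        )
        for _, i, j in candidates:
            if i in used_src or j in used_tgt:
                continue  # keep the alignment one-to-one
            # After the anchor iteration, a link must lie close to an
            # existing link; this is the assumed "support" constraint.
            if iteration > 0 and links and not any(
                abs(i - a) <= max_jump and abs(j - b) <= max_jump
                for a, b in links
            ):
                continue
            links.append((i, j))
            used_src.add(i)
            used_tgt.add(j)
    return sorted(links)


# Hypothetical scores: (0, 0) is a strong anchor, (4, 4) is isolated.
example_scores = {(0, 0): 0.9, (1, 1): 0.5, (2, 2): 0.45,
                  (1, 2): 0.4, (4, 4): 0.6}
example_links = iterative_align(example_scores, thresholds=[0.8, 0.4])
```

In this toy run the first iteration accepts only the anchor, and the second iteration adds links adjacent to already accepted ones while rejecting the isolated candidate, even though its score is higher than theirs.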
MEBA has an individual F-measure of 81.71% and it is currently integrated in the platform [[http://www.clarin.eu/tools/cowal-combined-word-aligner|COWAL]].
More detailed descriptions are available in [[http://www.racai.ro/~tufis/papers|the following papers]]:
-- Dan Tufiş (2007). Exploiting Aligned Parallel Corpora in Multilingual Studies and Applications. In Toru Ishida, Susan R. Fussell, and Piek T.J.M. Vossen (eds.), Intercultural Collaboration. First International Workshop (IWIC 2007), volume 4568 of Lecture Notes in Computer Science, pp. 103-117. Springer-Verlag, August 2007. ISBN 978-3-540-73999-9.
-- Dan Tufiş, Radu Ion, Alexandru Ceauşu, and Dan Ştefănescu (2006). Improved Lexical Alignment by Combining Multiple Reified Alignments. In Toru Ishida, Susan R. Fussell, and Piek T.J.M. Vossen (eds.), Proceedings of the 11th Conference EACL2006, pp. 153-160, Trento, Italy, April 2006. Association for Computational Linguistics. ISBN 1-9324-32-61-2.
-- Dan Tufiş, Radu Ion, Alexandru Ceauşu, and Dan Ştefănescu (2005). Combined Aligners. In Proceedings of the ACL Workshop on Building and Using Parallel Texts: Data-Driven Machine Translation and Beyond, pp. 107-110, Ann Arbor, USA, June 2005. Association for Computational Linguistics. ISBN 978-973-703-208-9.
Mediaevum.de offers a collection of links to various Middle High German texts, which are available as full texts via the University of Virginia.
MBSP is a set of linguistic tools based on the TiMBL and MBT memory-based learning applications developed at CNTS and ILK. It provides tools for part-of-speech tagging, chunking, lemmatizing, relation finding, named entity recognition, and (for medical language) semantic tagging.
MBT is a memory-based tagger-generator and tagger in one. The tagger-generator part can generate a sequence tagger on the basis of a training set of tagged sequences; the tagger part can tag new sequences. MBT can, for instance, be used to generate part-of-speech taggers or chunkers for natural language processing.
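The generate-then-tag idea can be sketched in a few lines of Python. This is only an illustration of memory-based tagging in general, not MBT's code or interface: the feature template (previous tag, current word, next word), the unweighted overlap metric, and the names `features`, `train` and `tag` are assumptions made for this example.

```python
from collections import Counter


def features(words, i, prev_tag):
    """Assumed feature template: previous tag, current word, next word."""
    nxt = words[i + 1].lower() if i + 1 < len(words) else "</s>"
    return (prev_tag, words[i].lower(), nxt)


def train(tagged_sentences):
    """'Generate' a tagger by storing every training instance in memory."""
    memory = []
    for sentence in tagged_sentences:
        words = [word for word, _ in sentence]
        prev = "<s>"
        for i, (_, gold_tag) in enumerate(sentence):
            memory.append((features(words, i, prev), gold_tag))
            prev = gold_tag
    return memory


def tag(memory, words):
    """Tag left to right by majority vote among the nearest stored instances."""
    tags = []
    prev = "<s>"
    for i in range(len(words)):
        query = features(words, i, prev)
        # Overlap similarity: number of feature positions that match.
        scored = [(sum(a == b for a, b in zip(query, stored)), t)
                  for stored, t in memory]
        best = max(score for score, _ in scored)
        votes = Counter(t for score, t in scored if score == best)
        prev = votes.most_common(1)[0][0]
        tags.append(prev)
    return tags


# Tiny hypothetical training set of tagged sequences.
training = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("a", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
]
memory = train(training)
predicted = tag(memory, ["the", "cat", "barks"])
```

The same two functions would serve for chunking if the training sequences carried chunk labels instead of part-of-speech tags, which mirrors the tagger-generator idea of deriving different sequence taggers from different training sets.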