Speech corpus comprising 4608 spoken sentences recorded for speech timing research. The complete archive, available for download, includes a structured list of the sentences, the speech recordings, the label files, and full documentation.
eXist-db is an open source database management system entirely built on XML technology. It stores XML data according to the XML data model and features efficient, index-based XQuery processing.
This reference corpus of written Slovenian is a precursor to the Gigafida corpora (see http://hdl.handle.net/11356/1320 for version 2.0).
It contains 600 million words and 738.5 million tokens. In terms of annotation, it is tagged for morphosyntactic descriptors (MSD tags) and lemmatised.
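The word count and the token count differ because the two units are defined differently. One common convention, assumed here and not confirmed by the corpus description, is that punctuation marks count as tokens but not as words. A minimal sketch of that convention, using a hypothetical regex tokenizer rather than the tool actually used to build the corpus:

```python
import re

# Hypothetical tokenizer: a token is either a run of word characters
# or a single non-space punctuation character. This is an illustrative
# assumption, not the tokenizer used to build the corpus.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def count_words_and_tokens(text):
    """Return (word_count, token_count) for the given text."""
    tokens = TOKEN_RE.findall(text)
    # Treat a token as a word if it starts with a letter or digit.
    words = [t for t in tokens if t[0].isalnum()]
    return len(words), len(tokens)


sample = "Danes je lep dan, kajne?"
words, tokens = count_words_and_tokens(sample)
print(f"words={words} tokens={tokens}")
```

Under this convention the sample sentence has fewer words than tokens, because the comma and the question mark each add a token without adding a word.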
Open source language analysis tool suite: tokenizer, stemmer/lemmatizer, named entity recognizer, chunker/segmenter, morphosyntactic tagger, syntactic tagger, corpus processor, morphological tagger, semantic tagger, analyzer, word sense disambiguator.