Corpus of texts in 12 languages. For each language, we provide one training, one development, and one testing set acquired from Wikipedia articles. Moreover, each language dataset contains a substantially larger training set collected from general Web texts. All sets are disjoint, except that the Wikipedia and Web training sets can contain similar sentences. Data are segmented into sentences, which are further word-tokenized.
All data in the corpus contain diacritics. To strip diacritics from them, use the Python script diacritization_stripping.py contained in the attached stripping_diacritics.zip. The script has two modes. We generally recommend the method called uninames, which behaves better for some languages.
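To illustrate why a name-based method can behave better for some languages, here is a minimal sketch of two common ways to strip diacritics in Python. It is not the code of diacritization_stripping.py: the function names strip_by_decomposition and strip_by_uninames are hypothetical, and the assumption that uninames works from Unicode character names is ours, inferred from the method's name only.

```python
import unicodedata


def strip_by_decomposition(text):
    """Strip diacritics by canonical decomposition (NFD).

    Hypothetical helper: decomposes each character and drops the
    combining marks. Letters without a canonical decomposition,
    such as the Polish 'ł', are left unchanged.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def strip_by_uninames(text):
    """Strip diacritics using Unicode character names.

    Hypothetical helper, assumed to resemble a name-based mode:
    a name such as 'LATIN SMALL LETTER L WITH STROKE' is cut at
    ' WITH ' and the remaining base name is looked up again.
    """
    out = []
    for ch in text:
        name = unicodedata.name(ch, "")  # "" for unnamed characters
        base_name, sep, _ = name.partition(" WITH ")
        if sep:
            try:
                ch = unicodedata.lookup(base_name)
            except KeyError:
                pass  # no character with the base name; keep original
        out.append(ch)
    return "".join(out)
```

For example, strip_by_decomposition("łódź") returns "łodz" because 'ł' has no canonical decomposition, whereas strip_by_uninames("łódź") returns "lodz".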
The code for training a recurrent neural network based model for diacritics restoration is located at https://github.com/arahusky/diacritics_restoration.
The Corpus NGT is a collection of data from deaf signers using Sign Language of the Netherlands (NGT). The data consist of recordings with multiple synchronised video cameras, accompanied by gloss and translation annotations.
At the NLP Centre, text is currently divided into sentences with
a tool that uses a rule-based system. To create enough training
data for machine learning, annotators manually split the corpus of contemporary text
CBB.blog (1 million tokens) into sentences.
Each file contains one hundredth of the whole corpus and all data were
processed in parallel by two annotators.
The corpus was created from ten contemporary blogs:
hintzu.otaku.cz
modnipeklo.cz
bloc.cz
aleneprokopova.blogspot.com
blog.aktualne.cz
fuchsova.blog.onaidnes.cz
havlik.blog.idnes.cz
blog.aktualne.centrum.cz
klusak.blogspot.cz
myego.cz/welldone
This corpus contains a variety of works written in Finnish published between 1809 and 1899, such as newspapers, periodicals, almanacs, and decrees.
The corpus contains 8,976,561 words and is available for online browsing.
The classics of Finnish literature corpus contains works by established Finnish fiction writers from the 1880s to the 1930s. The corpus is part-of-speech tagged and available for online browsing via the concordancer Korp.
Italian emblem books from the Stirling Maxwell Collection (University of Glasgow). Transcribed text and photographic reproductions. Searchable and browsable online.
This is a linguistically unannotated corpus of various historical texts written between 1543 and 1809.
The corpus consists of 3,428,618 words and is available for online browsing.