Harvested from: LINDAT/CLARIAH-CZ repository - LINDAT/CLARIAH-CZ Catalog Search Results

351. Corpus for training and evaluating diacritics restoration systems

Creator:: Náplava, Jakub; Straka, Milan; Hajič, Jan; Straňák, Pavel
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: diacritical marks generation and natural language correction
Language:: Czech, Vietnamese, Romanian, Polish, Slovak, Spanish, Croatian, Irish, Latvian, Hungarian, French, and Turkish
Description:: Corpus of texts in 12 languages. For each language, we provide one training, one development, and one testing set acquired from Wikipedia articles. In addition, each language dataset contains a substantially larger training set collected from general Web texts. All sets are disjoint, except that the Wikipedia and Web training sets may contain similar sentences. The data are segmented into sentences, which are then tokenized into words. All data in the corpus contain diacritics. To strip diacritics from them, use the Python script diacritization_stripping.py contained in the attached stripping_diacritics.zip. This script has two modes. We generally recommend the mode called uninames, which behaves better for some languages. The code for training a recurrent neural network model for diacritics restoration is located at https://github.com/arahusky/diacritics_restoration.
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
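
The description of record 351 mentions stripping diacritics with a mode called uninames. The following is a minimal sketch of one way a name-based approach could work, assuming it relies on Unicode character names; it is not the code of diacritization_stripping.py, and the function name strip_diacritics_by_name is hypothetical. It looks up each letter's Unicode name and, if the name contains " WITH ", replaces the letter with the character named by the part before it. Unlike plain canonical decomposition, this also handles letters such as Polish "ł" and Vietnamese "đ", which have no decomposition in Unicode.

```python
import unicodedata


def strip_diacritics_by_name(text: str) -> str:
    """Strip diacritics using Unicode character names (illustrative sketch only)."""
    out = []
    for ch in text:
        # Only touch letters, so that symbols whose names happen to
        # contain " WITH " are left alone.
        if unicodedata.category(ch).startswith("L"):
            name = unicodedata.name(ch, "")
            base_name, sep, _ = name.partition(" WITH ")
            if sep:
                try:
                    # e.g. "LATIN SMALL LETTER A WITH ACUTE" -> "LATIN SMALL LETTER A"
                    ch = unicodedata.lookup(base_name)
                except KeyError:
                    pass  # no character carries the base name; keep the original
        out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    print(strip_diacritics_by_name("Náplava, Hajič, Straňák"))
```

Characters whose names have no " WITH " part, such as Turkish dotless "ı", pass through unchanged in this sketch.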

352. Corpus Nederlandse Gebarentaal (CNGT)

Publisher:: Radboud University Nijmegen
Type:: corpus
Subject:: Linguistics and language technology
Description:: The Corpus NGT is a collection of data from deaf signers using Sign Language of the Netherlands (NGT). The data consist of recordings with multiple synchronised video cameras, accompanied by gloss and translation annotations.
Rights:: Creative Commons BY-NC-SA 3.0 NL license and http://creativecommons.org/licenses/by-nc-sa/3.0/nl/

353. Corpus nineteenth-century Frisian

Publisher:: Frisian Academy
Type:: corpus
Description:: About a million words have been scanned and corrected. In addition, some hand-written manuscripts have been typed into the computer.
Rights:: Not specified

354. Corpus of contemporary blogs

Creator:: Grác, Marek
Publisher:: Masaryk University, NLP Centre
Type:: text and corpus
Subject:: corpus, blogs, annotation, annotators, sentences, and machine learning
Language:: Czech
Description:: At the NLP Centre, text is currently divided into sentences with a tool that uses a rule-based system. To produce enough training data for machine learning, annotators manually split the corpus of contemporary text CBB.blog (1 million tokens) into sentences. Each file contains one hundredth of the whole corpus, and all data were processed in parallel by two annotators. The corpus was created from ten contemporary blogs: hintzu.otaku.cz, modnipeklo.cz, bloc.cz, aleneprokopova.blogspot.com, blog.aktualne.cz, fuchsova.blog.onaidnes.cz, havlik.blog.idnes.cz, blog.aktualne.centrum.cz, klusak.blogspot.cz, and myego.cz/welldone.
Rights:: Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0), http://creativecommons.org/licenses/by-nc-nd/3.0/, and PUB

355. Corpus of Early English Correspondence Sampler (CEECS)

Publisher:: University of Helsinki
Format:: text/plain
Type:: corpus
Language:: English
Description:: Personal correspondence from England written between 1418 and 1680. Compiled as a tool for historical sociolinguistics.
Rights:: Not specified

356. Corpus of Early Literary Finnish

Publisher:: The Research Institute for the Languages of Finland
Type:: corpus
Language:: Finnish
Description:: This corpus contains a variety of works written in Finnish published between 1809 and 1899, such as newspapers, periodicals, almanacs, and decrees. The corpus contains 8,976,561 words and is available for online browsing.
Rights:: Not specified

357. Corpus of Finnish Literary Classics

Publisher:: The Research Institute for the Languages of Finland
Type:: corpus
Language:: Finnish
Description:: The classics of Finnish literature corpus contains works by established Finnish fiction writers from the 1880s to the 1930s. The corpus is part-of-speech tagged and available for online browsing via the concordancer Korp.
Rights:: Not specified

358. Corpus of Italian Emblem Books

Publisher:: University of Glasgow
Type:: corpus
Language:: Italian
Description:: Italian emblem books from the Stirling Maxwell Collection (University of Glasgow). Transcribed text and photographic reproductions. Searchable and browsable online.
Rights:: Not specified

359. Corpus of Old Literary Finnish

Publisher:: The Research Institute for the Languages of Finland
Type:: corpus
Language:: Finnish
Description:: This is a linguistically unannotated corpus of various historical texts written between 1543 and 1809. The corpus consists of 3,428,618 words and is available for online browsing.
Rights:: Not specified

360. Corpus of Old Written Estonian

Publisher:: University of Tartu
Type:: corpus
Language:: Estonian
Description:: Corpus of texts written fully or partly in Estonian, from the 13th to the 19th century; 1.5 million words.
Rights:: Not specified
