Harvested from: LINDAT/CLARIAH-CZ repository / Language: Czech - LINDAT/CLARIAH-CZ Catalog Search Results

Filters: Language: Czech; Harvested from: LINDAT/CLARIAH-CZ repository; Date: 2011

Creator:: Grác, Marek
Publisher:: Masaryk University, NLP Centre
Type:: text and corpus
Subject:: corpus, blogs, annotation, annotators, sentences, and machine learning
Language:: Czech
Description:: At the NLP Centre, text is currently divided into sentences by a tool that uses a rule-based system. To produce enough training data for machine learning, annotators manually split the corpus of contemporary text CBB.blog (1 million tokens) into sentences. Each file contains one hundredth of the whole corpus, and all data were processed in parallel by two annotators. The corpus was created from ten contemporary blogs: hintzu.otaku.cz, modnipeklo.cz, bloc.cz, aleneprokopova.blogspot.com, blog.aktualne.cz, fuchsova.blog.onaidnes.cz, havlik.blog.idnes.cz, blog.aktualne.centrum.cz, klusak.blogspot.cz, and myego.cz/welldone.
Rights:: Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0), http://creativecommons.org/licenses/by-nc-nd/3.0/, and PUB

Creator:: Grác, Marek and Čapek, Tomáš
Publisher:: Masaryk University, NLP Centre
Type:: text, lexicalConceptualResource, and wordnet
Subject:: semantic net and semantic tagging
Language:: Czech
Description:: The semantic net `sholva' contains more than 150 000 records for which there was sufficient agreement among annotators. Individual words are labeled with the following categories: person, person / individual, event and substance.
Rights:: Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0), http://creativecommons.org/licenses/by-nc-nd/3.0/, and PUB
