The dataset contains delimitation of borders of dialect regions, subgroups, areas and types in the Czech Republic. It is the result of an extensive expert revision that was based on various sources and made the delimitation exact and accurate. At the same time, the dataset corresponds to the underlying data of the Mapka application running at https://korpus.cz/mapka/
There are four files in this submission. Two files contain the delimitation of dialect regions ("oblasti"; both in GeoJSON and Shapefile formats) and two files contain the delimitation of smaller dialect areas, i.e. subgroups, areas and types ("oblasti_jemne"; again in GeoJSON and Shapefile formats).
Texts in 107 languages from the W2C corpus (http://hdl.handle.net/11858/00-097C-0000-0022-6133-9), first 1,000,000 tokens per language, tagged by the delexicalized tagger described in Yu et al. (2016, LREC, Portorož, Slovenia).
Texts in 107 languages from the W2C corpus (http://hdl.handle.net/11858/00-097C-0000-0022-6133-9), first 1,000,000 tokens per language, tagged by the delexicalized tagger described in Yu et al. (2016, LREC, Portorož, Slovenia).
Changes in version 1.1:
1. Universal Dependencies tagset instead of the older and smaller Google Universal POS tagset.
2. SVM classifier trained on Universal Dependencies 1.2 instead of HamleDT 2.0.
3. Balto-Slavic languages, Germanic languages and Romance languages were tagged by classifier trained only on the respective group of languages. Other languages were tagged by a classifier trained on all available languages. The "c7" combination from version 1.0 is no longer used.
Coordinates work of a group of linguists selecting appropriate parse trees from many generated ones. Assigns parts of the task, signalling differences in annotation and allowing them to be corrected by a supervisor.
DeriNet is a lexical network which contains derivational relations in Czech modeled as an oriented graph. Nodes correspond to Czech lexemes (a lexeme is a single lemma, possibly with only a subset of its senses – homonyms may have different derivations and are thus represented by several lexemes) and edges represent derivations between them. DeriNet 1.0 contains 968,967 lexemes with 965,535 unique lemmas; connected by 715,729 derivational links. Lexemes in DeriNet 1.0 are sampled from the MorfFlex dictionary.
DeriNet is a lexical network which models derivational relations in the lexicon of Czech. Nodes of the network correspond to Czech lexemes (i.e. single lemmas, possibly with only a subset of their senses), edges represent derivational relations between a derived word and its base word. The present version, DeriNet 1.2, contains 1,003,590 lexemes (sampled from the MorfFlex dictionary) with 1,001,394 unique lemmas, connected by 740,750 derivational links. Both rather technical and linguistic changes were made as compared to the previous version of the data; e.g. new version of the MorfFlex dictionary was used, derived words that contain a consonant and/or vowel alternation (e.g. boží) were connected with their base word (bůh).
DeriNet is a lexical network which models derivational relations in the lexicon of Czech. Nodes of the network correspond to Czech lexemes, while edges represent derivational relations between a derived word and its base word. The present version, DeriNet 1.5, contains 1,011,965 lexemes (sampled from the MorfFlex dictionary) connected by 785,543 derivational links. Besides several rather conservative updates (such as newly identified prefix and suffix verb-to-verb derivations as well as noun-to-adjective derivations manifested by most frequent adjectival suffixes), DeriNet 1.5 is the first version that contains annotations related to compounding (compound words are distinguished by a special mark in their part-of-speech labels).
DeriNet is a lexical network which models derivational relations in the lexicon of Czech. Nodes of the network correspond to Czech lexemes, while edges represent derivational relations between a derived word and its base word. The present version, DeriNet 1.6, contains 1,027,832 lexemes (sampled from the MorfFlex dictionary) connected by 803,404 derivational links. Furthermore, starting with version 1.5, DeriNet contains annotations related to compounding (compound words are distinguished by a special mark in their part-of-speech labels).
Compared to version 1.5, version 1.6 was expanded by extracting potential links from dictionaries available under suitable licences, such as Wiktionary, and by enlarging the number of marked compounds.