DeriNet is a lexical network which contains derivational relations in Czech modeled as an oriented graph. Nodes correspond to Czech lexemes (a lexeme is a single lemma, possibly with only a subset of its senses – homonyms may have different derivations and are thus represented by several lexemes) and edges represent derivations between them. DeriNet 1.0 contains 968,967 lexemes with 965,535 unique lemmas; connected by 715,729 derivational links. Lexemes in DeriNet 1.0 are sampled from the MorfFlex dictionary.
DeriNet is a lexical network which models derivational relations in the lexicon of Czech. Nodes of the network correspond to Czech lexemes (i.e. single lemmas, possibly with only a subset of their senses), edges represent derivational relations between a derived word and its base word. The present version, DeriNet 1.2, contains 1,003,590 lexemes (sampled from the MorfFlex dictionary) with 1,001,394 unique lemmas, connected by 740,750 derivational links. Both rather technical and linguistic changes were made as compared to the previous version of the data; e.g. new version of the MorfFlex dictionary was used, derived words that contain a consonant and/or vowel alternation (e.g. boží) were connected with their base word (bůh).
DeriNet is a lexical network which models derivational relations in the lexicon of Czech. Nodes of the network correspond to Czech lexemes, while edges represent derivational relations between a derived word and its base word. The present version, DeriNet 1.5, contains 1,011,965 lexemes (sampled from the MorfFlex dictionary) connected by 785,543 derivational links. Besides several rather conservative updates (such as newly identified prefix and suffix verb-to-verb derivations as well as noun-to-adjective derivations manifested by most frequent adjectival suffixes), DeriNet 1.5 is the first version that contains annotations related to compounding (compound words are distinguished by a special mark in their part-of-speech labels).
MorfFlex CZ 2.0 is the Czech morphological dictionary developed originally by Jan Hajič as a spelling checker and lemmatization dictionary. MorfFlex is a flat list of lemma-tag-wordform triples. For each wordform, full inflectional information is coded in a positional tag. Wordforms are organized into entries (paradigm instances or paradigms in short) according to their formal morphological behavior. The paradigm (set of wordforms) is identified by a unique lemma. Apart from traditional morphological categories, the description also contains some semantic, stylistic and derivational information. For more details see a comprehensive specification of the Czech morphological annotation http://ufal.mff.cuni.cz/techrep/tr64.pdf .
Czech translation of WordSim353. The Czech translation of English WordSim353 word pairs were obtained from four translators. All translation variants were scored according to the lexical similarity/relatedness annotation instructions for WordSim353 annotators, by 25 Czech annotators. The resulting data set consists of two annotation files: "WordSim353-cs.csv" and "WordSim-cs-Multi.csv". Both files are encoded in UTF-8, have a header, text is enclosed in double quotes, and columns are separated by commas. The rows are numbered. The WordSim-cs-Multi data set has rows numbered from 1 to 634, whereas the row indices in the WordSim353-cs data set reflect the corresponding row numbers in the WordSim-cs-Multi data set.
The WordSim353-cs file contains a one-to-one mapping selection of 353 Czech equivalent pairs whose judgments have proven to be most similar to the judgments of their corresponding English originals (compared by the absolute value of the difference between the means over all annotators in each language counterpart). In one case ("psychology-cognition"), two Czech equivalent pairs had identical means as well as confidence intervals, so we randomly selected one.
The "WordSim-cs-Multi.csv" file contains human judgments for all translation variants.
In both data sets, we preserved all 25 individual scores. In the WordSim353-cs data set, we added a column with their Czech means as well as a column containing the original English means and 95% confidence intervals in separate columns for each mean (computed by the CI function in the Rmisc R package). The WordSim-cs-Multi data set contains only the Czech means and confidence intervals. For the most convenient lexical search, we provided separate columns with the respective Czech and English single words, entire word pairs, and eventually an English-Czech quadruple in both data sets.
The data set also contains an xls table with the four translations and a preliminary selection of the best variants performed by an adjudicator.