Slovak morphological dictionary modeled after the Czech one. It consists of (word form, lemma, POS tag) triples, reusing the Czech morphological system for POS tags and lemma descriptions.
NameTag 2 is a named entity recognition tool. It recognizes named entities (e.g., names, locations, etc.) and can recognize both flat and embedded (nested) entities. NameTag 2 can be used either as a commandline tool or by requesting the NameTag webservice.
NameTag webservice can be found at:
https://lindat.mff.cuni.cz/services/nametag/
NameTag commandline tool can be downloaded from NameTag GitHub repository, branch nametag2:
git clone https://github.com/ufal/nametag -b nametag2
Latest models and documentation can be found at:
https://ufal.mff.cuni.cz/nametag/2
This software subject to the terms of the Mozilla Public License, v. 2.0 (http://mozilla.org/MPL/2.0/). The associated models are distributed under CC BY-NC-SA license.
Please cite as:
Jana Straková, Milan Straka, Jan Hajič (2019): Neural Architectures for Nested NER through Linearization. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 5326-5331, Association for Computational Linguistics, Stroudsburg, PA, USA, ISBN 978-1-950737-48-2 (https://aclweb.org/anthology/papers/P/P19/P19-1527/)
NER models for NameTag 2, named entity recognition tool, for English, German, Dutch, Spanish and Czech. Model documentation including performance can be found here: https://ufal.mff.cuni.cz/nametag/2/models . These models are for NameTag 2, named entity recognition tool, which can be found here: https://ufal.mff.cuni.cz/nametag/2 .
NER models for NameTag 2, named entity recognition tool, for English, German, Dutch, Spanish and Czech. Model documentation including performance can be found here: https://ufal.mff.cuni.cz/nametag/2/models . These models are for NameTag 2, named entity recognition tool, which can be found here: https://ufal.mff.cuni.cz/nametag/2 .
NomVallex 2.0 is a manually annotated valency lexicon of Czech nouns and adjectives, created in the theoretical framework of the Functional Generative Description and based on corpus data (the SYN series of corpora from the Czech National Corpus and the Araneum Bohemicum Maximum corpus). In total, NomVallex is comprised of 1027 lexical units contained in 570 lexemes, covering the following parts-of-speech and derivational categories: deverbal or deadjectival nouns, and deverbal, denominal, deadjectival or primary adjectives. Valency properties of a lexical unit are captured in a valency frame (modeled as a sequence of valency slots, each supplemented with a list of morphemic forms) and documented by corpus examples. In order to make it possible to study the relationship between valency behavior of base words and their derivatives, lexical units of nouns and adjectives in NomVallex are linked to their respective base lexical units (contained either in NomVallex itself or, in case of verbs, in the VALLEX lexicon), linking up to three parts-of-speech (i.e., noun – verb, adjective – verb, noun – adjective, and noun – adjective – verb).
In order to facilitate comparison, this submission also contains abbreviated entries of the base verbs of these nouns and adjectives from the VALLEX lexicon and simplified entries of the covered nouns and adjectives from the PDT-Vallex lexicon.
The NomVallex I. lexicon describes valency of Czech deverbal nouns belonging to three semantic classes, i.e. Communication (dotaz 'question'), Mental Action (plán 'plan') and Psych State (nenávist 'hatred'). It covers both stem-nominals and root-nominals (dotazování se 'asking' and dotaz 'question'). In total, the lexicon includes 505 lexical units in 248 lexemes. Valency properties are captured in the form of valency frames, specifying valency slots and their morphemic forms, and are exemplified by corpus examples.
In order to facilitate comparison, this submission also contains abbreviated entries of the source verbs of these nouns from the Vallex lexicon and simplified entries of the covered nouns from the PDT-Vallex lexicon.
This package comprises eight models of Czech word embeddings trained by applying word2vec (Mikolov et al. 2013) to the currently most extensive corpus of Czech, namely SYN v9 (Křen et al. 2022). The minimum frequency threshold for including a word in the model was 10 occurrences in the corpus. The original lemmatisation and tagging included in the corpus were used for disambiguation. In the case of word embeddings of word forms, units comprise word forms and their tag from a positional tagset (cf. https://wiki.korpus.cz/doku.php/en:pojmy:tag) separated by '>', e.g., kočka>NNFS1-----A----.
The published package provides models trained on both tokens and lemmas. In addition, the models combine training algorithms (CBOW and Skipgram) and dimensions of the resulting vectors (100 or 500), while the training window and negative sampling remained the same during the training. The package also includes files with frequencies of word forms (vocab-frequencies.forms) and lemmas (vocab-frequencies.lemmas).
The ParCzech 3.0 corpus is the third version of ParCzech consisting of stenographic protocols that record the Chamber of Deputies’ meetings held in the 7th term (2013-2017) and the current 8th term (2017-Mar 2021). The protocols are provided in their original HTML format, Parla-CLARIN TEI format, and the format suitable for Automatic Speech Recognition. The corpus is automatically enriched with the morphological, syntactic, and named-entity annotations using the procedures UDPipe 2 and NameTag 2. The audio files are aligned with the texts in the annotated TEI files.
The ParCzech PS7 1.0 corpus is the very first member of the corpus family of data coming from the Parliament of the Czech Republic. ParCzech PS7 1.0 consists of stenographic protocols that record the Chamber of Deputies' meetings held in the 7th term between 2013-2017. The audio recordings are available as well. Transcripts are provided in the original HTML as harvested, and also converted into TEI-derived XML format for use in TEITOK corpus manager. The corpus is automatically enriched with the morphological and named-entity annotations using the procedures MorphoDita and NameTag.
Ministerstvo školství, mládeže a tělovýchovy České republiky@@LM2018101@@LINDAT/CLARIAH-CZ: Digitální výzkumná infrastruktura pro jazykové technologie, umění a humanitní vědy@@nationalFunds@@✖[remove]80