MorfFlex CZ 2.0 is the Czech morphological dictionary developed originally by Jan Hajič as a spelling checker and lemmatization dictionary. MorfFlex is a flat list of lemma-tag-wordform triples. For each wordform, full inflectional information is coded in a positional tag. Wordforms are organized into entries (paradigm instances or paradigms in short) according to their formal morphological behavior. The paradigm (set of wordforms) is identified by a unique lemma. Apart from traditional morphological categories, the description also contains some semantic, stylistic and derivational information. For more details see a comprehensive specification of the Czech morphological annotation http://ufal.mff.cuni.cz/techrep/tr64.pdf .
Slovak morphological dictionary modeled after the Czech one. It consists of (word form, lemma, POS tag) triples, reusing the Czech morphological system for POS tags and lemma descriptions.
NER models for NameTag 2, named entity recognition tool, for English, German, Dutch, Spanish and Czech. Model documentation including performance can be found here: https://ufal.mff.cuni.cz/nametag/2/models . These models are for NameTag 2, named entity recognition tool, which can be found here: https://ufal.mff.cuni.cz/nametag/2 .
The NomVallex I. lexicon describes valency of Czech deverbal nouns belonging to three semantic classes, i.e. Communication (dotaz 'question'), Mental Action (plán 'plan') and Psych State (nenávist 'hatred'). It covers both stem-nominals and root-nominals (dotazování se 'asking' and dotaz 'question'). In total, the lexicon includes 505 lexical units in 248 lexemes. Valency properties are captured in the form of valency frames, specifying valency slots and their morphemic forms, and are exemplified by corpus examples.
In order to facilitate comparison, this submission also contains abbreviated entries of the source verbs of these nouns from the Vallex lexicon and simplified entries of the covered nouns from the PDT-Vallex lexicon.
Data
----
We have collected English-Odia parallel and monolingual data from the
available public websites for NLP research in Odia.
The parallel corpus consists of English-Odia parallel Bible, Odia
digital library, and Odisha Goverment websites. It covers bible,
literature, goverment of Odisha and its policies. We have processed the
raw data collected from the websites, performed alignments (a mix of
manual and automatic alignments) and release the corpus in a form ready
for various NLP tasks.
The Odia monolingual data consists of Odia-Wikipedia and Odia e-magazine
websites. Because the major portion of data is extracted from
Odia-Wikipedia, it covers all kinds of domains. The e-magazines data
mostly cover the literature domain. We have preprocessed the monolingual
data including de-duplication, text normalization, and sentence
segmentation to make it ready for various NLP tasks.
Corpus Formats
--------------
Both corpora are in simple tab-delimited plain text files.
The parallel corpus files have three columns:
- the original book/source of the sentence pair
- the English sentence
- the corresponding Odia sentence
The monolingual corpus has a varying number of columns:
- each line corresponds to one *paragraph* (or related unit) of the
original source
- each tab-delimited unit corresponds to one *sentence* in the paragraph
Data Statistics
----------------
The statistics of the current release is given below.
Parallel Corpus Statistics
---------------------------
Dataset Sentences #English tokens #Odia tokens
------- --------- ---------------- -------------
Train 27136 706567 604147
Dev 948 21912 19513
Test 1262 28488 24365
------- --------- ---------------- -------------
Total 29346 756967 648025
Domain Level Statistics
------------------------
Domain Sentences #English tokens #Odia tokens
------------------ --------- ---------------- -------------
Bible 29069 756861 640157
Literature 424 7977 6611
Goverment policies 204 1411 1257
------------------ --------- ---------------- -------------
Total 29697 766249 648025
Monolingual Corpus Statistics
-----------------------------
Paragraphs Sentences #Odia tokens
---------- --------- ------------
71698 221546 2641308
Domain Level Statistics
-----------------------
Domain Paragraphs Sentences #Odia tokens
-------------- -------------- --------- -------------
General (wiki) 30468 (42.49%) 102085 1320367
Literature 41230 (57.50%) 119461 1320941
-------------- -------------- --------- -------------
Total 71698 221546 2641308
Citation
--------
If you use this corpus, please cite it directly (see above), but please cite also the following paper:
Title: OdiEnCorp: Odia-English and Odia-Only Corpus for Machine Translation
Author: Shantipriya Parida, Ondrej Bojar, and Satya Ranjan Dash
Proceedings of the Third International Conference on Smart Computing & Informatics (SCI) 2018
Series: Smart Innovation, Systems and Technologies (SIST)
Publisher: Springer Singapore
ParaDi 2.0. is a dictionary of single verb paraphrases of Czech verbal multiword expressions - light verb constructions and idiomatic verb constructions. Moreover, it provides an elaborated set of morphological, syntactic and semantic features, including information on aspectual counterparts of verbs or paraphrasability conditions of given verbs.
The format of ParaDi has been designed with respect to both human and machine readability - the dictionary is represented as a plain table in TSV format, as it is a flexible and language-independent data format.
The valency lexicon PDT-Vallex 4.0 has been built in close connection with the annotation of the Prague Dependency Treebank project (PDT) and its successors (mainly the Prague Czech-English Dependency Treebank project, PCEDT, the spoken language corpus (PDTSC) and corpus of user-generated texts in the project Faust). It contains over 14500 valency frames for almost 8500 verbs which occurred in the PDT, PCEDT, PDTSC and Faust corpora. In addition, there are nouns, adjectives and adverbs, linked from the PDT part only, increasing the total to over 17000 valency frames for 13000 words. All the corpora have been published in 2020 as the PDT-C 1.0 corpus with the PDT-Vallex 4.0 dictionary included; this is a copy of the dictionary published as a separate item for those not interested in the corpora themselves. It is available in electronically processable format (XML), and also in more human readable form including corpus examples (see the WEBSITE link below, and the links to its main publications elsewhere in this metadata). The main feature of the lexicon is its linking to the annotated corpora - each occurrence of each verb is linked to the appropriate valency frame with additional (generalized) information about its usage and surface morphosyntactic form alternatives. It replaces the previously published unversioned edition of PDT-Vallex from 2014.
Prague Czech-English Dependency Treebank - Russian translation (PCEDT-R) is a project of translating a subset of Prague Czech-English Dependency Treebank 2.0 (PCEDT 2.0) to Russian and linguistically annotating the Russian translations with emphasis on coreference and cross-lingual alignment of coreferential expressions. Cross-lingual comparison of coreference means is currently the purpose that drives development of this corpus.
The current version 0.5 is a preliminary version, which contains (+ denotes new features):
* complete PCEDT 2.0 documents "wsj_1900"-"wsj_1949"
* Czech-English word alignment of coreferential expressions annotated manually mainly on the t-layer
+ Russian translations of the original English sentences
+ automatic tokenization, part-of-speech tagging and morphological analysis for Russian
+ automatic word alignment between all Czech and Russian words
+ manual alignment between Russian and the other two languages on possessive pronouns
The Prague Czech-English Dependency Treebank 2.0 Coref (PCEDT 2.0 Coref) is a parallel treebank building upon the original PCEDT 2.0 release and enriching it with the extended manual annotation of coreference, as well as with an improved automatic annotation of the coreferential expression alignment.
Ministerstvo školství, mládeže a tělovýchovy České republiky@@LM2015071@@LINDAT/CLARIN: Institut pro analýzu, zpracování a distribuci lingvistických dat@@nationalFunds@@✖[remove]73