This is a trained model for the supervised machine learning tool NameTag 3 (https://ufal.mff.cuni.cz/nametag/3/), trained on the Czech Named Entity Corpus 2.0 (https://ufal.mff.cuni.cz/cnec/cnec2.0). NameTag 3 is an open-source tool for both flat and nested named entity recognition (NER). NameTag 3 identifies proper names in text and classifies them into a set of predefined categories, such as names of persons, locations, organizations, etc. The model documentation can be found at https://ufal.mff.cuni.cz/nametag/3/models#czech-cnec2.
This is a trained model for the supervised machine learning tool NameTag 3 (https://ufal.mff.cuni.cz/nametag/3/), trained jointly on several NE corpora: English CoNLL-2003, German CoNLL-2003, Dutch CoNLL-2002, Spanish CoNLL-2002, Ukrainian Lang-uk, and Czech CNEC 2.0, all harmonized to flat NEs with four labels: PER, ORG, LOC, and MISC. NameTag 3 is an open-source tool for both flat and nested named entity recognition (NER). NameTag 3 identifies proper names in text and classifies them into a set of predefined categories, such as names of persons, locations, organizations, etc. The model documentation can be found at https://ufal.mff.cuni.cz/nametag/3/models#multilingual-conll.
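For illustration, such a model can be queried through the NameTag web service. The following is a minimal Python sketch, assuming the LINDAT REST endpoint, a generic model alias, and a JSON response with a "result" field; all of these are assumptions and should be verified against the NameTag documentation before use.

import requests

# Assumed LINDAT NameTag REST endpoint; check the service's API reference.
URL = "https://lindat.mff.cuni.cz/services/nametag/api/recognize"

response = requests.post(URL, data={
    "data": "Václav Havel byl prezidentem České republiky.",
    "model": "czech",  # assumed model alias; the service also has a default
})
response.raise_for_status()
# The response is assumed to be a JSON object whose "result" field
# holds the annotated text.
print(response.json()["result"])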
NomVallex 2.0 is a manually annotated valency lexicon of Czech nouns and adjectives, created in the theoretical framework of the Functional Generative Description and based on corpus data (the SYN series of corpora from the Czech National Corpus and the Araneum Bohemicum Maximum corpus). In total, NomVallex comprises 1027 lexical units contained in 570 lexemes, covering the following parts of speech and derivational categories: deverbal or deadjectival nouns, and deverbal, denominal, deadjectival or primary adjectives. Valency properties of a lexical unit are captured in a valency frame (modeled as a sequence of valency slots, each supplemented with a list of morphemic forms) and documented by corpus examples. In order to make it possible to study the relationship between the valency behavior of base words and their derivatives, lexical units of nouns and adjectives in NomVallex are linked to their respective base lexical units (contained either in NomVallex itself or, in the case of verbs, in the VALLEX lexicon), linking up to three parts of speech (i.e., noun – verb, adjective – verb, noun – adjective, and noun – adjective – verb).
In order to facilitate comparison, this submission also contains abbreviated entries of the base verbs of these nouns and adjectives from the VALLEX lexicon and simplified entries of the covered nouns and adjectives from the PDT-Vallex lexicon.
The NomVallex I. lexicon describes valency of Czech deverbal nouns belonging to three semantic classes, i.e. Communication (dotaz 'question'), Mental Action (plán 'plan') and Psych State (nenávist 'hatred'). It covers both stem-nominals and root-nominals (dotazování se 'asking' and dotaz 'question'). In total, the lexicon includes 505 lexical units in 248 lexemes. Valency properties are captured in the form of valency frames, specifying valency slots and their morphemic forms, and are exemplified by corpus examples.
In order to facilitate comparison, this submission also contains abbreviated entries of the source verbs of these nouns from the Vallex lexicon and simplified entries of the covered nouns from the PDT-Vallex lexicon.
The NottDeuYTSch corpus contains over 33 million words taken from approximately 3 million YouTube comments from videos published between 2008 and 2018 targeted at a young, German-speaking demographic, and it represents an authentic language snapshot of young German speakers. The corpus was proportionally sampled based on video category and year from a database of 112 popular German-speaking YouTube channels in the DACH region for optimal representativeness and balance, and it contains a considerable amount of associated metadata for each comment that enables further longitudinal and cross-sectional analyses.
Data
----
We have collected English-Odia parallel and monolingual data from
publicly available websites for NLP research in Odia.
The parallel corpus consists of the English-Odia parallel Bible, the
Odia digital library, and Odisha Government websites. It covers the
Bible, literature, and the Government of Odisha and its policies. We
have processed the raw data collected from the websites, performed
alignments (a mix of manual and automatic alignments), and released the
corpus in a form ready for various NLP tasks.
The Odia monolingual data consists of text from Odia Wikipedia and Odia
e-magazine websites. Because the major portion of the data is extracted
from Odia Wikipedia, it covers a wide range of domains. The e-magazine
data mostly covers the literature domain. We have preprocessed the
monolingual data (de-duplication, text normalization, and sentence
segmentation) to make it ready for various NLP tasks.
Corpus Formats
--------------
Both corpora are in simple tab-delimited plain text files.
The parallel corpus files have three columns:
- the original book/source of the sentence pair
- the English sentence
- the corresponding Odia sentence
The monolingual corpus has a varying number of columns:
- each line corresponds to one *paragraph* (or related unit) of the
original source
- each tab-delimited unit corresponds to one *sentence* in the paragraph
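For illustration, both layouts can be read with a few lines of Python; a minimal sketch follows, in which the function names are ours and any file names passed to them are placeholders, not the actual names shipped in the release.

def read_parallel(path):
    """Yield (source, english, odia) triples from a parallel corpus file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            source, english, odia = line.rstrip("\n").split("\t")
            yield source, english, odia

def read_monolingual(path):
    """Yield one list of sentences per paragraph of the monolingual corpus."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield line.rstrip("\n").split("\t")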
Data Statistics
----------------
The statistics of the current release are given below.
Parallel Corpus Statistics
---------------------------
Dataset Sentences #English tokens #Odia tokens
------- --------- ---------------- -------------
Train 27136 706567 604147
Dev 948 21912 19513
Test 1262 28488 24365
------- --------- ---------------- -------------
Total 29346 756967 648025
Domain Level Statistics
------------------------
Domain              Sentences #English tokens #Odia tokens
------------------- --------- --------------- ------------
Bible               29069     756861          640157
Literature          424       7977            6611
Government policies 204       1411            1257
------------------- --------- --------------- ------------
Total               29697     766249          648025
Monolingual Corpus Statistics
-----------------------------
Paragraphs Sentences #Odia tokens
---------- --------- ------------
71698 221546 2641308
Domain Level Statistics
-----------------------
Domain Paragraphs Sentences #Odia tokens
-------------- -------------- --------- -------------
General (wiki) 30468 (42.49%) 102085 1320367
Literature 41230 (57.50%) 119461 1320941
-------------- -------------- --------- -------------
Total 71698 221546 2641308
Citation
--------
If you use this corpus, please cite it directly (see above) and also cite the following paper:
Title: OdiEnCorp: Odia-English and Odia-Only Corpus for Machine Translation
Authors: Shantipriya Parida, Ondrej Bojar, and Satya Ranjan Dash
Proceedings of the Third International Conference on Smart Computing & Informatics (SCI) 2018
Series: Smart Innovation, Systems and Technologies (SIST)
Publisher: Springer Singapore
Data
-----
We have collected English-Odia parallel data for the purposes of NLP
research on the Odia language.
The data for the parallel corpus was extracted from existing parallel
corpora such as OdiEnCorp 1.0 and PMIndia, and from books that contain
both English and Odia text, such as grammar books and bilingual
literature. We also included parallel text from multiple public
websites such as Odia Wikipedia, the Odia digital library, and Odisha
Government websites.
The parallel corpus covers many domains: the Bible, other literature,
Wiki data relating to many topics, Government policies, and general
conversation. We have processed the raw data collected from the books
and websites, performed sentence alignments (a mix of manual and
automatic alignments), and released the corpus in a form suitable for
various NLP tasks.
Corpus Format
-------------
OdiEnCorp 2.0 is stored in simple tab-delimited plain text files, each
with three tab-delimited columns:
- a coarse indication of the domain
- the English sentence
- the corresponding Odia sentence
The corpus is shuffled at the level of sentence pairs.
The coarse domains are:
books ... prose text
dict ... dictionaries and phrasebooks
govt ... partially formal text
odiencorp10 ... OdiEnCorp 1.0 (mix of domains)
pmindia ... PMIndia (the original corpus)
wikipedia ... sentences and phrases from Wikipedia
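For illustration, the three-column layout makes per-domain tallies straightforward; the following is a minimal Python sketch in which the file name is a placeholder.

from collections import Counter

counts = Counter()
with open("train.tsv", encoding="utf-8") as f:  # placeholder file name
    for line in f:
        domain, english, odia = line.rstrip("\n").split("\t")
        counts[domain] += 1

for domain, n in counts.most_common():
    print(f"{domain}\t{n}")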
Data Statistics
---------------
The statistics of the current release are given below.
Note that the statistics differ from those reported in the paper due to
deduplication at the level of sentence pairs. The deduplication was
performed within each of the dev, test, and training sets, taking the
coarse domain indication into account. It is therefore still possible
that the same sentence pair appears more than once within the same set
(dev/test/train) if it came from different domains, and it is also
possible that a sentence pair appears in several sets (dev/test/train).
Parallel Corpus Statistics
--------------------------
                 Dev      Dev      Dev     Test     Test     Test    Train    Train    Train
Domain         Sents     # EN     # OD    Sents     # EN     # OD    Sents     # EN     # OD
books           3523    42011    36723     3895    52808    45383     3129    40461    35300
dict            3342    14580    13838     3437    14807    14110     5900    21591    20246
govt               -        -        -        -        -        -      761    15227    13132
odiencorp10      947    21905    19509     1259    28473    24350    26963   704114   602005
pmindia         3836    70282    61099     3836    68695    59876    30687   551657   486636
wikipedia       1896     9388     9385     1917    21381    20951     1930     7087     7122
Total          13544   158166   140554    14344   186164   164670    69370  1340137  1164441
"Sents" are the counts of the sentence pairs in the given set (dev/test/train)
and domain (books/dict/...).
"# EN" and "# OD" are approximate counts of words (simply space-delimited,
without tokenization) in English and Odia
The total number of sentence pairs (lines) is 13544+14344+69370=97258. Ignoring
the set and domain and deduplicating again, this number drops to 94857.
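For illustration, the deduplicated count can be reproduced by pooling all three sets, dropping the domain column, and counting unique English-Odia pairs; in the following minimal Python sketch, the file names are placeholders for the dev/test/train files of the release.

pairs = set()
for path in ("dev.tsv", "test.tsv", "train.tsv"):  # placeholder file names
    with open(path, encoding="utf-8") as f:
        for line in f:
            domain, english, odia = line.rstrip("\n").split("\t")
            pairs.add((english, odia))
print(len(pairs))  # expected to correspond to the deduplicated total above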
Citation
--------
If you use this corpus, please cite the following paper:
@inproceedings{parida2020odiencorp,
title={OdiEnCorp 2.0: Odia-English Parallel Corpus for Machine Translation},
author={Parida, Shantipriya and Dash, Satya Ranjan and Bojar, Ond{\v{r}}ej and Motlicek, Petr and Pattnaik, Priyanka and Mallick, Debasish Kumar},
booktitle={Proceedings of the WILDRE5--5th Workshop on Indian Language Data: Resources and Evaluation},
pages={14--19},
year={2020}
}
The original SDP 2014 and 2015 data collections were made available under task-specific ‘evaluation’ licenses to registered SemEval participants. In mid-2016, all original data was bundled with system submissions, supporting software, an additional SDP-style collection of semantic dependency graphs, and additional background material (from which some of the SDP target representations were derived) for release through the Linguistic Data Consortium (LDC catalogue number LDC2016T10).
One of the four English target representations (viz. DM) and the entire Czech data (in the PSD target representation) are not derivative of LDC-licensed annotations and, thus, can be made available for direct download (Open SDP; version 1.1; April 2016) under a more permissive licensing scheme, viz. the Creative Commons Attribution-NonCommercial-ShareAlike license. This package also includes some ‘richer’ meaning representations from which the English bi-lexical DM graphs derive, viz. scope-underspecified logical forms and more abstract, non-lexicalized ‘semantic networks’. The latter of these are formally (if not linguistically) similar to Abstract Meaning Representation (AMR) and are available in a range of serializations, including in AMR-like syntax.
Please use the following bibliographic reference for the SDP 2016 data:
@string{C:LREC = {{I}nternational {C}onference on
{L}anguage {R}esources and {E}valuation}}
@string{LREC:16 = {Proceedings of the 10th } # C:LREC}
@string{L:LREC:16 = {Portoro\v{z}, Slovenia}}
@inproceedings{Oep:Kuh:Miy:16,
author = {Oepen, Stephan and Kuhlmann, Marco and Miyao, Yusuke
and Zeman, Daniel and Cinkov{\'a}, Silvie
and Flickinger, Dan and Haji\v{c}, Jan
and Ivanova, Angelina and Ure\v{s}ov{\'a}, Zde\v{n}ka},
title = {Towards Comparability of Linguistic Graph Banks for Semantic Parsing},
booktitle = LREC:16,
year = 2016,
address = L:LREC:16,
pages = {3991--3995}
}
The original SDP 2014 and 2015 data collections were made available under task-specific ‘evaluation’ licenses to registered SemEval participants. In mid-2016, all original data was bundled with system submissions, supporting software, an additional SDP-style collection of semantic dependency graphs, and additional background material (from which some of the SDP target representations were derived) for release through the Linguistic Data Consortium (LDC catalogue number LDC2016T10).
One of the four English target representations (viz. DM) and the entire Czech data (in the PSD target representation) are not derivative of LDC-licensed annotations and, thus, can be made available for direct download (Open SDP; version 1.2; January 2017) under a more permissive licensing scheme, viz. the Creative Commons Attribution-NonCommercial-ShareAlike license. This package also includes some ‘richer’ meaning representations from which the English bi-lexical DM graphs derive, viz. scope-underspecified logical forms and more abstract, non-lexicalized ‘semantic networks’. The latter of these are formally (if not linguistically) similar to Abstract Meaning Representation (AMR) and are available in a range of serializations, including in AMR-like syntax.
Version 1.1 was released in April 2016. Version 1.2 adds the 2015 Turku system, which was accidentally left out of version 1.1.
Please use the following bibliographic reference for the SDP 2016 data:
@string{C:LREC = {{I}nternational {C}onference on
{L}anguage {R}esources and {E}valuation}}
@string{LREC:16 = {Proceedings of the 10th } # C:LREC}
@string{L:LREC:16 = {Portoro\v{z}, Slovenia}}
@inproceedings{Oep:Kuh:Miy:16,
author = {Oepen, Stephan and Kuhlmann, Marco and Miyao, Yusuke
and Zeman, Daniel and Cinkov{\'a}, Silvie
and Flickinger, Dan and Haji\v{c}, Jan
and Ivanova, Angelina and Ure\v{s}ov{\'a}, Zde\v{n}ka},
title = {Towards Comparability of Linguistic Graph Banks for Semantic Parsing},
booktitle = LREC:16,
year = 2016,
address = L:LREC:16,
pages = {3991--3995}
}
OpenLegalData is a free and open platform that makes legal documents and information available to the public. The aim of this platform is to improve the transparency of jurisprudence with the help of open data and to help people without legal training to understand the justice system. The project is committed to the Open Data principles and the Free Access to Justice Movement.
OpenLegalData's dump as of 2022-10-18 was used to create this corpus. The data was cleaned, automatically annotated (TreeTagger: POS and lemma), and grouped into sub-corpora based on the metadata. File names encode the jurisdiction, the BundeslandID, and, where applicable, a sub-corpus index; for example, Verwaltungsgerichtsbarkeit_11_05.cec6.gz denotes administrative jurisdiction, BundeslandID = 11, sub-corpus = 05. Sub-corpora are randomly split into chunks of 50 MB each.
Corpus data is available in the CEC6 format, which can be converted into many other corpus formats; use the CorpusExplorer software (www.CorpusExplorer.de) if necessary.
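For illustration, the naming scheme makes it possible to group the released files by jurisdiction and BundeslandID without unpacking them; the following minimal Python sketch relies only on the file-name pattern described above.

import glob
from collections import defaultdict

groups = defaultdict(list)
for path in glob.glob("*.cec6.gz"):  # run in the corpus directory
    stem = path.split(".")[0]   # e.g. "Verwaltungsgerichtsbarkeit_11_05"
    parts = stem.split("_")     # [jurisdiction, BundeslandID, sub-corpus?]
    jurisdiction, bundesland_id = parts[0], parts[1]
    groups[(jurisdiction, bundesland_id)].append(path)

for (jurisdiction, bundesland_id), files in sorted(groups.items()):
    print(jurisdiction, bundesland_id, len(files))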