The goal of this paper is to provide an overview of the structure and contents of the soon-to-be-available ORAL corpus, which combines previously published corpora (ORAL2006, ORAL2008 and ORAL2013) with newly transcribed material into a single, conveniently accessible and more richly annotated resource of about 6 million running words. The recordings and corresponding transcripts span a decade, from 2002 to 2011; most of them capture interactions between mutually well-acquainted speakers in informal situations and natural settings. The corpus is complemented by a marginal portion of more formal data, mostly public talks. It is tagged and lemmatized, and an effort was made to adapt existing tools (targeted at written language) to yield better results on spoken data. We hope the availability of such a resource will spawn further discussions on the morphological and syntactic analysis of spoken language, perhaps resulting in more radical departures in the future from the part-of-speech classification inherited from the linguistic analysis of written language.
The MORFO system for morphological analysis of Czech consists of four units: the analyzer, the generator, the dictionary editor, and the library with the shared source code for handling dictionary objects.
This multilingual resource contains corpora for 14 languages, gathered on the occasion of edition 1.2 of the PARSEME Shared Task on Semi-supervised Identification of Verbal MWEs (2020). These corpora were meant to serve as additional "raw" corpora, to help discover unseen verbal MWEs.
The corpora are provided in the CoNLL-U format (https://universaldependencies.org/format.html). They contain morphosyntactic annotations (parts of speech, lemmas, morphological features, and syntactic dependencies). Depending on the language, the information comes from treebanks (mostly Universal Dependencies v2.x) or from automatic parsers trained on UD v2.x treebanks (e.g., UDPipe).
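As an illustration of how such files can be consumed, the following is a minimal sketch of a CoNLL-U reader. It assumes the standard ten tab-separated columns, comment lines starting with `#`, and blank lines between sentences; the names `FIELDS`, `read_conllu` and `SAMPLE` are hypothetical and not part of the resource.

```python
from typing import Dict, Iterable, Iterator, List

# The ten CoNLL-U columns, in order (names chosen here for illustration).
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def read_conllu(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield one sentence at a time as a list of token dictionaries."""
    sentence: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Sentence-level comments (e.g. "# text = ...") are skipped.
            continue
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} columns, got {len(cols)}")
        sentence.append(dict(zip(FIELDS, cols)))
    if sentence:
        # The last sentence may lack a trailing blank line.
        yield sentence


# A tiny made-up sentence, built with explicit tabs for readability.
SAMPLE = "\n".join([
    "# text = She gave up",
    "\t".join(["1", "She", "she", "PRON", "_", "_", "2", "nsubj", "_", "_"]),
    "\t".join(["2", "gave", "give", "VERB", "_", "_", "0", "root", "_", "_"]),
    "\t".join(["3", "up", "up", "ADP", "_", "_", "2", "compound:prt", "_", "_"]),
    "",
])

for sent in read_conllu(SAMPLE.splitlines()):
    print([(tok["form"], tok["lemma"], tok["upos"]) for tok in sent])
```

For real files, the same generator can be fed an open file handle instead of `SAMPLE.splitlines()`.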
VMWEs include idioms (let the cat out of the bag), light-verb constructions (make a decision), verb-particle constructions (give up), inherently reflexive verbs (help oneself), and multi-verb constructions (make do).
For the 1.2 shared task edition, the data covers 14 languages, for which VMWEs were annotated according to the universal guidelines. The corpora are provided in the cupt format, inspired by the CoNLL-U format.
Morphological and syntactic information (not necessarily using UD tagsets), including parts of speech, lemmas, morphological features and/or syntactic dependencies, is also provided. Depending on the language, the information comes from treebanks (e.g., Universal Dependencies) or from automatic parsers trained on treebanks (e.g., UDPipe).
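To make the VMWE annotation layer more concrete, here is a minimal sketch of how the annotation column of a cupt sentence could be grouped into expressions. It assumes the layout described in the shared task's format documentation: an eleventh column (PARSEME:MWE) holding `*` for tokens outside any VMWE, `_` for unspecified annotation, and otherwise a `;`-separated list of codes, written `id:category` on the first token of an expression and `id` on the following ones. The function name `group_vmwes` is hypothetical.

```python
from typing import Dict, List


def group_vmwes(mwe_column: List[str]) -> Dict[str, dict]:
    """Group per-token PARSEME:MWE values into VMWEs.

    `mwe_column` holds the eleventh-column value of each token of one
    sentence, in order. Positions are 1-based indices into that list
    (a simplification: they are not the CoNLL-U ID column).
    """
    vmwes: Dict[str, dict] = {}
    for position, value in enumerate(mwe_column, start=1):
        if value in ("*", "_"):
            # Token outside any VMWE, or annotation left unspecified.
            continue
        for code in value.split(";"):
            mwe_id, _, category = code.partition(":")
            entry = vmwes.setdefault(mwe_id, {"category": None, "tokens": []})
            if category:
                # The category is only given on the first token.
                entry["category"] = category
            entry["tokens"].append(position)
    return vmwes


# "She gave up": a verb-particle construction spanning tokens 2 and 3.
print(group_vmwes(["*", "1:VPC.full", "1"]))
```

Keeping the expression identifier as the dictionary key lets discontinuous and overlapping VMWEs be collected without any assumption about token adjacency.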
This item contains training, development and test data, as well as the evaluation tools used in the PARSEME Shared Task 1.2 (2020). The annotation guidelines are available online: http://parsemefr.lif.univ-mrs.fr/parseme-st-guidelines/1.2
Omorfi is a free and open source project containing various tools and data for handling Finnish texts in a linguistically motivated manner. The main components of this repository are:
1) a lexical database containing hundreds of thousands of words (cf. lexical statistics),
2) a collection of scripts to convert the lexical database into formats used by upstream NLP tools (cf. lexical processing),
3) an autotools setup to build and install (or package, or deploy) the scripts, the database, and simple APIs / convenience processing tools, and
4) a collection of relatively simple APIs for a selection of languages and scripts to apply the NLP tools and access the database.
Slovak models for MorphoDiTa, providing morphological analysis, morphological generation and part-of-speech tagging.
The morphological dictionary is created from MorfFlex SK 170914 and the PoS tagger is trained on an automatically translated Prague Dependency Treebank 3.0 (PDT).
The genus Triaenops has been considered monospecific in its African and Middle Eastern range (T. persicus), while three other species have been recognised as endemic to Madagascar (T. menamena, T. furculus, and T. auritus), and another to the western Seychelles (T. pauliani). We analysed representative samples of T. persicus from East Africa and the Middle East using both morphological and molecular genetics approaches and compared them with most of the available type material of species of this genus. Morphological comparisons revealed four distinct morphotypes in the set of examined specimens: one in Africa, the others in the Middle East. The Middle Eastern morphotypes differed mainly in size, while the allopatric African form showed differences in skull shape. Two of the three Arabian morphotypes occur in sympatry. Cytochrome b gene-based molecular analysis revealed significant divergences (K2P distance 6.4–8.1% in the complete cyt b sequence) among most of the morphotypes. Therefore, we propose a split of the current T. persicus rank into three species: T. afer in Africa, and T. persicus and T. parvus sp. nov. in the Middle East. The results of the molecular analysis also indicated a relatively close proximity of the Malagasy T. menamena to Arabian T. persicus, suggesting a northern route of colonisation of Madagascar from populations from the Middle East or north-eastern Africa as a plausible alternative to the presumed colonisation from East Africa. Due to a considerable genetic distance (21.6–26.2% in a 731 bp sequence of cyt b) and substantial morphological differences from the continental forms of Triaenops as well as from Malagasy T. menamena, we propose generic status (Paratriaenops gen. nov.) for the group of Malagasy species, T. furculus, T. auritus, and T. pauliani. We separated the genera Triaenops and Paratriaenops gen. nov. from other hipposiderid bats into Triaenopini trib. nov., recognising their isolated position within the family Hipposideridae Lydekker, 1891.
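For readers unfamiliar with the K2P (Kimura two-parameter) distance quoted above, the following is a minimal sketch of the standard formula, d = -1/2 ln((1 - 2P - Q) sqrt(1 - 2Q)), where P and Q are the proportions of transitional and transversional differences between two aligned sequences. It is not the authors' analysis pipeline; the function name `k2p_distance` and the choice to skip sites with gaps or ambiguity codes are assumptions made for illustration.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq_a: str, seq_b: str) -> float:
    """Kimura two-parameter distance between two aligned DNA sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = compared = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            # Assumption: sites with gaps or ambiguity codes are ignored.
            continue
        compared += 1
        if a == b:
            continue
        pair = {a, b}
        if pair <= PURINES or pair <= PYRIMIDINES:
            transitions += 1      # A<->G or C<->T
        else:
            transversions += 1    # purine <-> pyrimidine
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared
    q = transversions / compared
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


# One transition (A->G) in ten sites: P = 0.1, Q = 0.
print(round(k2p_distance("AAAAACCCCC", "GAAAACCCCC"), 4))  # 0.1116
```

Multiplying the result by 100 gives the percentage form used in the abstract.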
Dictionaries with different representations for various languages. Representations include Brown clusters of different sizes and morphological dictionaries extracted using different morphological analyzers. All representations cover the 250,000 most frequent word types in the Wikipedia edition of the respective language.
Analyzers used: MAGYARLANC (Hungarian, Zsibrita et al. (2013)), FREELING (English and Spanish, Padro and Stanilovsky (2012)), SMOR (German, Schmid et al. (2004)), an MA from Charles University (Czech, Hajic (2001)) and LATMOR (Latin, Springmann et al. (2014)).