Slovak morphological dictionary modeled after the Czech one. It consists of (word form, lemma, POS tag) triples, reusing the Czech morphological system for POS tags and lemma descriptions.
A dictionary of morphologically segmented word forms in Czech. Rules of manual segmentation are described in Pelegrinová, K., Mačutek, J., Čech, R. (2021). The Menzerath-Altmann law as the relation between lengths of words and morphemes in Czech. Jazykovedný časopis, 72, 405-414. The dictionary is based on short stories, fairy tales, letters and studies written by Karel Čapek.
A dictionary of morphologically segmented word forms in Czech. Rules of manual segmentation are described in Pelegrinová, K., Mačutek, J., Čech, R. (2021). The Menzerath-Altmann law as the relation between lengths of words and morphemes in Czech. Jazykovedný časopis, 72, 405-414. The dictionary is based on short stories, fairy tales, letters and studies written by Karel Čapek.
General Information:
Data collector: Jean Costa Silva (University of Georgia)
Date of collection: September-December 2022
Manner of collection: Online questionnaire via Qualtrics
Funding: No
Multilingual lexical database that follows the model proposed by the EuroWordNet project. The MCR integrates into the same EuroWordNet framework wordnets from five different languages (together with four English WordNet versions). It also integrates WordNet Domains and new versions of the Base Concepts and Top Concept Ontology. Overall, it contains 1,642,389 semantic relations between synsets, most of them acquired by automatic means. Information contained: semantics, synonyms, antonyms, definition, equivalents, example of use, morphology.
This resource is a set of 14 vector spaces for single words and Verbal Multiword Expressions (VMWEs) in different languages (German, Greek, Basque, French, Irish, Hebrew, Hindi, Italian, Polish, Brazilian Portuguese, Romanian, Swedish, Turkish, Chinese).
They were trained with the Word2Vec algorithm, in its skip-gram version, on PARSEME raw corpora automatically annotated for morpho-syntax (http://hdl.handle.net/11234/1-3367).
These corpora were annotated by Seen2Seen, a rule-based VMWE identifier, one of the leading tools of the PARSEME shared task version 1.2.
VMWE tokens were merged into single tokens.
The format of the vector space files is that of the original Word2Vec implementation by Mikolov et al. (2013), i.e. a binary format.
For compression, bzip2 was used.
Normalized Arabic Fragments for Inestimable Stemming (NAFIS) is an Arabic stemming gold standard corpus composed by a collection of texts, selected to be representative of Arabic stemming tasks and manually annotated.
The Narrangansett corpus contains a cultural linguistic dictionary and grammatical information on the Narrangansett language, an extinct language of the USA.