Description: This XML file describes the Arabic phonetic constraints (rules) resulting from the analysis of the lexicons (Taj Alarous, Al ain, Lisan Al arab, Alwassit and almoassir). These rules are to be applied to Arabic roots and are classified into a number of categories. Each category holds a certain type of constraint, as follows: The first category specifies that a root must not consist of three identical letters. The second category specifies that a root must not start with two repeated letters. The third category lists the letters that must not occur in the same root, regardless of their order. The fourth category lists the letters that may not be used together in a certain order in a root.
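The four categories can be illustrated with a minimal Python sketch. It is not the format or logic of the XML file itself: the function name `violated_categories`, the placeholder rule tables, and the Latin placeholder letters are all hypothetical, and the real letter lists for categories three and four must be read from the resource. The sketch also assumes that the "certain order" of category four means any earlier/later position in the root, not only adjacent letters.

```python
from itertools import combinations


def violated_categories(root, forbidden_together, forbidden_ordered):
    """Return the sorted category numbers (1-4) that `root` violates.

    forbidden_together: set of frozensets of two letters (category 3, any order).
    forbidden_ordered:  set of (earlier, later) letter tuples (category 4).
    Both tables are assumptions of this sketch, not the actual rule lists.
    """
    violated = set()

    # Category 1: the root must not consist of three identical letters.
    if len(root) == 3 and len(set(root)) == 1:
        violated.add(1)

    # Category 2: the root must not start with two repeated letters.
    if len(root) >= 2 and root[0] == root[1]:
        violated.add(2)

    # combinations() keeps positional order, so `first` precedes `second`.
    for first, second in combinations(root, 2):
        # Category 3: letters that must not co-occur, regardless of order.
        if frozenset((first, second)) in forbidden_together:
            violated.add(3)
        # Category 4: letters that must not co-occur in this order.
        if (first, second) in forbidden_ordered:
            violated.add(4)

    return sorted(violated)


# Placeholder rule tables (hypothetical, Latin letters stand in for Arabic ones).
together = {frozenset("xy")}
ordered = {("p", "q")}
```

With these placeholder tables, `violated_categories("pzq", together, ordered)` reports category 4, while the reversed root `"qzp"` passes, which is the difference between the order-sensitive fourth category and the order-insensitive third.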
ISLRN: 190-535-098-473-3
The corpus contains a pronunciation lexicon and n-gram counts (unigrams, bigrams and trigrams) that can be used for constructing a language model for the air traffic control communication domain. It can be used together with the Air Traffic Control Communication corpus (http://hdl.handle.net/11858/00-097C-0000-0001-CCA1-0). and Technology Agency of the Czech Republic, project No. TA01030476
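As a rough illustration of how n-gram counts feed a language model, the sketch below computes an unsmoothed maximum-likelihood trigram probability from count tables. The in-memory dictionaries, the function name `trigram_probability`, and the toy counts are assumptions made for the example; they do not reflect the file format or the contents of the corpus, and a practical model would add smoothing or back-off.

```python
def trigram_probability(trigram_counts, bigram_counts, w1, w2, w3):
    """Maximum-likelihood estimate of P(w3 | w1, w2) from raw counts."""
    context_count = bigram_counts.get((w1, w2), 0)
    if context_count == 0:
        # Unseen context: no estimate without smoothing or back-off.
        return 0.0
    return trigram_counts.get((w1, w2, w3), 0) / context_count


# Toy counts, invented for illustration only.
bigram_counts = {("cleared", "for"): 4}
trigram_counts = {
    ("cleared", "for", "takeoff"): 3,
    ("cleared", "for", "landing"): 1,
}

p = trigram_probability(trigram_counts, bigram_counts, "cleared", "for", "takeoff")
```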
We have created a test set of syntactic questions, presented in the paper [1], which is more general than Mikolov's [2]. Since we were interested in morphosyntactic relations, we extended only the questions of the syntactic type, with the exception of nationality adjectives, which are already covered completely in Mikolov's test set.
We constructed the pairs more or less manually, taking inspiration from the Czech side of the CzEng corpus [3], where explicit morphological annotation makes it possible to identify various pairs of Czech words (different grades of adjectives, words and their negations, etc.). The word-aligned English words often shared the same properties. Other pairs were acquired from various webpages, usually written for learners of English. For example, for verb tense we relied on a freely available list of English verbs and their morphological variations.
We have included 100-1000 different pairs for each question set. The questions were constructed from the pairs in the same way as in Mikolov's test set: by generating all possible pairs of pairs. This leads to millions of questions, so we randomly selected 1000 instances per question set to keep the test set in the same order of magnitude. Additionally, we decided to extend the set of questions on opposites to cover not only opposites of adjectives but also of nouns and verbs.
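The construction described above can be sketched in Python as follows. This is only an illustration, not the script used to build the test set: the function name `build_questions`, the fixed seed, and the toy word pairs are hypothetical, and the sketch assumes that "pairs of pairs" means ordered combinations of two different pairs.

```python
import random
from itertools import permutations


def build_questions(pairs, limit=1000, seed=0):
    """Combine word pairs into analogy questions (a, b, c, d): a is to b as c is to d.

    Every ordered combination of two different pairs yields one question.
    If more than `limit` questions result, a random sample of that size is kept.
    """
    questions = [(a, b, c, d) for (a, b), (c, d) in permutations(pairs, 2)]
    if len(questions) <= limit:
        return questions
    return random.Random(seed).sample(questions, limit)


# Toy pairs for one hypothetical question set (adjective grades).
pairs = [("good", "better"), ("big", "bigger"), ("small", "smaller")]
questions = build_questions(pairs)
```

Under this assumption, n pairs give n(n-1) questions, so a set of 1000 pairs alone would produce 999,000 questions before sampling.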
General Information:
Data collector: Jean Costa Silva (University of Georgia)
Date of collection: September-December 2022
Manner of collection: Online questionnaire via Qualtrics
Funding: No
A dataset collected from natural dialogs that makes it possible to test the ability of dialog systems to interactively learn new facts from user utterances throughout a dialog. The dataset, consisting of 1900 dialogs, allows simulation of interactively obtaining denotations and question explanations from users, which can be used for interactive learning.
The item contains a list of 2,058 noun/verb conversion pairs along with related formations (word-formation paradigms) provided with linguistic features, including semantic categories that characterize semantic relations between the noun and the verb in each conversion pair. Semantic categories were assigned manually by two human annotators based on a set of sentences containing the noun and the verb from individual conversion pairs. In addition to the list of paradigms, the item contains a set of 739 files (a separate file for each conversion pair) annotated by the annotators in parallel and a set of 2,058 files containing the final annotation, which is included in the list of paradigms.
Embeddings from the word2vec model described in "From Diachronic to Contextual Lexical Semantic Change: Introducing Semantic Difference Keywords (SDKs) for Discourse Studies". Full reference TBC.
Language acquisition is one of the currently much discussed topics in the field of psycholinguistics. Considerable space for future research can be seen in the development of vocabulary in Czech-speaking children. In our case, we are mainly interested in the meaning, i.e. the content of acquired words (concepts), and the role of so-called semantic features in mental representation.
The intended goal of our research is to bring new information from the above-mentioned area, to confirm or disprove some existing theoretical statements and to compare the results of foreign research with data obtained using the Czech language material. Similar research has been conducted in various world languages, but so far there are not many papers that address the issue in the Czech language environment. As part of our work, a comprehensive database of semantic features for selected concepts has been prepared. This database has been statistically processed and subsequently the data has been analyzed and interpreted on the basis of theories about the development of the child's speech competence. This material, obtained from children aged 8-9 (lower primary school) growing up in a Czech language environment, has been used in the next phase of research, in which an experiment with subjects belonging to the same age category has been performed: in a semantic task based on the phenomenon called semantic priming, the effect of featural similarity of two concepts on decision in a speeded task has been observed.
The results of the research expand the range of information published so far in this scientific field in the Czech environment. This research can provide valuable insights into children's language acquisition issues. The data gathered can also be practically beneficial not only for teachers, psychologists and speech therapists, but also for parents, for example.
We release a sizeable monolingual Urdu corpus automatically tagged with part-of-speech tags. We extend the work of Jawaid and Bojar (2012), who use three different taggers and then apply a voting scheme to disambiguate among the different choices suggested by each tagger. We run this complex ensemble on a large monolingual corpus and release both the plain and the tagged corpora. It is supported by the MosesCore project sponsored by the European Commission’s Seventh Framework Programme (Grant Number 288487).
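The idea of voting among three taggers can be sketched as below. This is a simple majority vote with a fallback, written as an assumption for illustration; it is not the voting scheme of Jawaid and Bojar (2012), whose details are not given here. The function names, the `fallback_index` parameter, and the toy tags are hypothetical.

```python
from collections import Counter


def vote(tags, fallback_index=0):
    """Pick the tag proposed by at least two taggers; otherwise trust one tagger.

    tags: the tags proposed for a single token, one per tagger.
    fallback_index: which tagger to trust when all taggers disagree (assumption).
    """
    tag, count = Counter(tags).most_common(1)[0]
    return tag if count > 1 else tags[fallback_index]


def ensemble_tag(tagger_outputs, fallback_index=0):
    """Combine per-tagger tag sequences (all of equal length) token by token."""
    return [vote(tags, fallback_index) for tags in zip(*tagger_outputs)]


# Toy outputs of three hypothetical taggers for a three-token sentence.
outputs = [
    ["NN", "VB", "JJ"],
    ["NN", "NN", "RB"],
    ["PN", "VB", "CC"],
]
combined = ensemble_tag(outputs)
```

For the third token all three taggers disagree, so the sketch falls back to the first tagger's choice.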