This corpus contains all the sentences that represent the Arabic Controlled Language (ACL). It comprises 551 sentences taken from four textbooks and websites dedicated to teaching Arabic to children: a) the first grade book, Republic of Sudan (كتاب الصف الاول جمهورية السودان), b) the Al Jazeera Educational Site (موقع الجزيرة التعليمي), c) the Bella Preparatory School Girls Forum (منتدى مدرسة بيلا الاعدادية بنات), and d) the Albahr website (موقع انا البحر). The sentences conform to 52 ACL rules, with an average of 10.6 sentences per rule. All sentences in the corpus were parsed with the Farasa syntactic parser to confirm that they are analyzed correctly, and the parses were validated manually by expert linguists.
The corpus consists of a header and a body. The header holds metadata describing the corpus, such as the corpus name, the authors and the sources. The body contains the rules. Each rule has a code, a structure and all the sentences that conform to that rule. For each sentence, we store an id, the vowelled and unvowelled text, and the result of parsing with Farasa.
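The header and body layout described above can be pictured as a small data model. The following is only a minimal sketch in Python: the class and field names (`Corpus`, `Rule`, `Sentence`, `farasa_parse`, and so on) are hypothetical and are not the element names used in the distributed file, and the sample values are placeholders rather than real corpus content.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Sentence:
    sentence_id: str
    vowelled: str      # text with diacritics
    unvowelled: str    # the same text without diacritics
    farasa_parse: str  # parser output, kept as an opaque string here


@dataclass
class Rule:
    code: str
    structure: str
    sentences: List[Sentence] = field(default_factory=list)


@dataclass
class Corpus:
    # Header: metadata describing the corpus
    name: str
    authors: List[str]
    sources: List[str]
    # Body: the rules, each with its own sentences
    rules: List[Rule] = field(default_factory=list)

    def sentence_count(self) -> int:
        return sum(len(rule.sentences) for rule in self.rules)

    def average_sentences_per_rule(self) -> float:
        if not self.rules:
            return 0.0
        return self.sentence_count() / len(self.rules)


# Placeholder content only (hypothetical codes and structures).
demo = Corpus(
    name="demo",
    authors=["<author>"],
    sources=["<source>"],
    rules=[
        Rule("R1", "<structure 1>", [
            Sentence("s1", "<vowelled>", "<unvowelled>", "<parse>"),
            Sentence("s2", "<vowelled>", "<unvowelled>", "<parse>"),
        ]),
        Rule("R2", "<structure 2>", [
            Sentence("s3", "<vowelled>", "<unvowelled>", "<parse>"),
        ]),
    ],
)
```

With this model, `demo.average_sentences_per_rule()` gives 1.5 for the placeholder data. The same calculation on the figures reported for the real corpus, 551 sentences over 52 rules, gives approximately 10.6.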
The Dictionary of Medieval Latin in the Czech Lands registers and explains the vocabulary of Medieval Latin as used in the Czech lands from the beginnings of Latin writing in this area (about 1000 CE) to 1500 CE; so far it covers the letters A-M. For more information about the Dictionary, see the webpage of the Department of Medieval Lexicography of the Institute of Philosophy of the Czech Academy of Sciences.
The uploaded data comprise the online version of the dictionary (API and XML data), which makes it possible to run the application on localhost.
A model trained for Czech POS tagging and lemmatization using RobeCzech, a Czech version of the BERT model. The model is trained on data from the Prague Dependency Treebank 3.5. It is part of the master's thesis Czech NLP with Contextualized Embeddings and achieved state-of-the-art performance as of the date the thesis was submitted.
A demo Jupyter notebook is available on the project's GitHub.
Language acquisition is currently one of the most discussed topics in psycholinguistics. There is considerable room for further research on the development of vocabulary in Czech-speaking children. We are mainly interested in meaning, i.e. the content of acquired words (concepts), and in the role of so-called semantic features in mental representation.
The goal of our research is to bring new information from this area, to confirm or refute some existing theoretical claims, and to compare the results of research abroad with data obtained from Czech language material. Similar research has been conducted in various world languages, but so far few papers address the issue in the Czech language environment. As part of our work, a comprehensive database of semantic features for selected concepts has been prepared. The database has been processed statistically, and the data have then been analyzed and interpreted on the basis of theories of the development of the child's speech competence. This material, obtained from children aged 8-9 (lower primary school) growing up in a Czech language environment, was used in the next phase of the research, an experiment with subjects of the same age category: in a semantic task based on semantic priming, we observed the effect of the featural similarity of two concepts on decisions made in a speeded task.
The results expand the range of information published so far in this field in the Czech environment and can provide valuable insights into children's language acquisition. The data gathered can also be of practical use to teachers, psychologists and speech therapists, as well as to parents.
Sentiment analysis models for the Czech language. The models are trained on three Czech sentiment analysis datasets (http://liks.fav.zcu.cz/sentiment/), Mall, CSFD and Facebook, and on the joint data from all three, using RobeCzech, a Czech version of the BERT model.
We present the best model for each dataset. The Mall and CSFD models set a new state of the art on their respective datasets.
A demo Jupyter notebook is available on the project's GitHub.
These models are part of the master's thesis Czech NLP with Contextualized Embeddings.