Dataset collected from natural dialogs which enables to test the ability of dialog systems to interactively learn new facts from user utterances throughout the dialog. The dataset, consisting of 1900 dialogs, allows simulation of an interactive gaining of denotations and questions explanations from users which can be used for the interactive learning.
Restaurant Reviews CZ ABSA - 2.15k reviews with their related target and category
The work done is described in the paper: https://doi.org/10.13053/CyS-20-3-2469
Language acquisition is one of the currently much discussed topics in the field of psycholinguistics. Considerable space for future research can be seen in the development of vocabulary in Czech-speaking children. In our case, we are mainly interested in the meaning, i.e. the content of acquired words (concepts), and the role of so-called semantic features in mental representation.
The intended goal of our research is to bring new information from the above-mentioned area, to confirm or disprove some existing theoretical statements and to compare the results of foreign research with data obtained using the Czech language material. Similar research has been conducted in various world languages, but so far there are not many papers that address the issue in the Czech language environment. As part of our work, a comprehensive database of semantic features for selected concepts has been prepared. This database has been statistically processed and subsequently the data has been analyzed and interpreted on the basis of theories about the development of the child's speech competence. This material, obtained from children aged 8-9 (lower primary school) growing up in a Czech language environment, has been used in the next phase of research, in which an experiment with subjects belonging to the same age category has been performed: in a semantic task based on the phenomenon called semantic priming, the effect of featural similarity of two concepts on decision in a speeded task has been observed.
The results of the research expand the range of information published so far in this scientific field in the Czech environment. This research can provide valuable insights into children's language acquisition issues. The data gathered can also be practically beneficial not only for teachers, psychologists and speech therapists, but also for parents, for example.
Supplementary files for a comparative study of word-formation without the addition of derivational affixes (conversion) in English and Czech.
The two .csv files contain 300 verb-noun conversion pairs in English and 300 verb-noun conversion pairs in Czech, i.e. pairs where either the noun is created from the verb or the verb is created from the noun without the use of derivational affixes. In English, the noun and verb in the conversion pair have the same form. In Czech, the noun and verb in the conversion pair differ in inflectional affixes.
The pairs are supplied with manual semantic annotation based on cognitive event schemata.
A file with the Appendix includes a list of dictionary definition phrases used as a basis for the semantic annotation.
Sentiment analysis models for Czech language. Models are three Czech sentiment analysis datasets(http://liks.fav.zcu.cz/sentiment/): Mall, CSFD, Facebook, and joint data from all three datasets above, using Czech version of BERT model, RobeCzech.
We present the best model for every dataset. Mall and CSFD models are new state-of-the-art for respective data.
Demo jupyter notebook is available on the project GitHub.
These models are a part of Czech NLP with Contextualized Embeddings master thesis.
Semantic net `sholva' contains more than 150 000 records for which there was sufficient agreement among annotators. Indvidual words are labeled in the following categories:
person, person / individual, event and substance.
An XML-based file containing Arabic Stop-words respecting nouns syntax; particle nouns, signal nouns, separated pronouns and connected nouns
Citation: Driss Namly, Yasser Regragui, Karim Bouzoubaa. "Interoperable Arabic language resources building and exploitation in SAFAR platform". 13th ACS/IEEE International Conference on Computer Systems and Applications (AICCSA) November 29th to December 2nd, 2016.
Our Laboratory of Artificial Neural Network Applications (LANNA) in the Czech Technical University in Prague (head of the laboratory is professor Jana Tučková) collaborates on a project with the Department of Paediatric Neurology, 2nd Faculty of Medicine of Charles University in Prague and with the Motol University Hospital (head of clinic is professor Vladimír Komárek), which focuses on the study of children with SLI.
The speech database contains two subgroups of recordings of children's speech from different types of speakers. The first subgroup (healthy) consists of recordings of children without speech disorders; the second subgroup (patients) consists of recordings of children with SLI. These children have different degrees of severity (1 – mild, 2 – moderate, and 3 – severe). The speech therapists and specialists from Motol Hospital decided upon this classification. The children’s speech was recorded in the period 2003-2013. These databases were commonly created in a schoolroom or a speech therapist’s consulting room, in the presence of surrounding background noise. This situation simulates the natural environment in which the children live, and is important for capturing the normal behavior of children. The database of healthy children’s speech was created as a referential database for the computer processing of children’s speech. It was recorded on the SONY digital Dictaphone (sampling frequency, fs = 16 kHz, 16-bit resolution in stereo mode in the standardized wav format) and on the MD SONY MZ-N710 (sampling frequency, fs = 44.1 kHz, 16-bit resolution in stereo mode in the standardized wav format). The corpus was recorded in the natural environment of a schoolroom and in a clinic. This subgroup contains a total of 44 native Czech participants (15 boys, 29 girls) aged 4 to 12 years, and was recorded during the period 2003–2005. The database of children with SLI was recorded in a private speech therapist’s office. The children’s speech is captured by means of a SHURE lapel microphone using the solution by the company AVID (MBox – USB AD/DA converter and ProTools LE software) on an Apple laptop (iBook G4). The sound recordings are saved in the standardized wav format. The sampling frequency is set to 44.1 kHz with 16-bit resolution in mono mode. This subgroup contains a total of 54 native Czech participants (35 boys, 19 girls) aged 6 to 12 years, and was recorded during the period 2009–2013. This package contains wav data sets for development and testing methods for detection children with SLI.
Software pack:
FORANA - was developed the original software FORANA for formants analysis. It is based on the MATLAB programming environment. The development of this software was mainly driven by the need to have the ability to complete formant analysis correctly and full automation of the process of extracting formants from the recorded speech signals. Development of this application is still running. Software was developed in the LANNA at CTU FEE in Prague.
LABELING - the program LABELING is used for segmentation of the speech signal. It is a part of SOMLab program system. Software was developed in the LANNA at CTU FEE in Prague.
PRAAT - is an acoustic analysis software. The Praat program was created by Paul Boersma and David Weenink of the Institute of Phonetics Sciences of the University of Amsterdam. Home page: http://www.praat.org or http://www.fon.hum.uva.nl/praat/.
Talks of Karel Makoň given to his friends in the course of late sixties through early nineties of the 20th century. The topic is mostly christian mysticism.
Simple question answering database version 3.2 (SQAD v3.2) created from Czech Wikipedia. The new version consists of more than 16000 records. Each record of SQAD consists of multiple files - question, answer extraction, answer selection, URL, question metadata, and in some cases, answer context.