We study the differences between two approaches for revealing latent variables in binary data. Both approaches assume that the observed high-dimensional data are driven by a small number of hidden binary sources combined by Boolean superposition. The first approach is Boolean matrix factorization (BMF); the second is Boolean factor analysis (BFA). Two BMF methods are used for comparison: the M8 method from the BMDP statistical software package and the method suggested by Belohlavek \& Vychodil. These are compared to BFA, in particular to the Expectation-Maximization Boolean Factor Analysis we developed earlier, which has been extended here with a binarization step. The well-known bars problem and the mushroom dataset are used to reveal the methods' peculiarities. In particular, the reconstruction ability of the computed factors and the information gain as a measure of dimension reduction are under scrutiny. We show that BFA performs slightly worse than BMF when noise-free signals are analyzed; conversely, BMF loses considerably to BFA when the input signals are noisy.
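As a minimal illustration of the Boolean superposition assumption shared by both approaches, the following Python sketch (the function names and toy matrices are ours, not from the paper) reconstructs binary data as the Boolean product of a loading matrix and a factor matrix, with OR playing the role of addition:

```python
import numpy as np

def boolean_reconstruct(A, B):
    """Boolean matrix product A o B: entry (i, j) is 1 iff some factor l
    has A[i, l] = 1 and B[l, j] = 1 (an OR of ANDs, i.e. Boolean superposition)."""
    return ((A.astype(int) @ B.astype(int)) > 0).astype(int)

def reconstruction_error(X, A, B):
    """Hamming distance between the data and its factor reconstruction."""
    return int(np.sum(X != boolean_reconstruct(A, B)))

# toy example: two hidden "bar"-like factors overlaid by Boolean OR
A = np.array([[1, 0], [0, 1], [1, 1]])        # which factors each object uses
B = np.array([[1, 1, 0, 0], [0, 0, 1, 1]])    # which attributes each factor covers
X = boolean_reconstruct(A, B)                  # noise-free observed data
print(X, reconstruction_error(X, A, B))        # error is 0 by construction
```

Both families of methods can be judged by this kind of Hamming reconstruction error, which is the criterion the comparison above scrutinizes.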
Bayesian Networks (BNs) are graphical models representing multivariate joint probability distributions; they have been applied successfully in many studies across a wide range of application areas. BN learning algorithms can be remarkably effective on many problems. The search space for BN induction, however, grows exponentially with the number of variables, so finding the BN structure that best represents the dependencies among the variables is known to be an NP-hard problem. This work proposes and discusses a hybrid Bayes/genetic collaboration (VOGAC-MarkovPC) designed to induce Conditional Independence Bayesian Classifiers from data. The main contribution is the use of the MarkovPC algorithm to reduce the computational complexity of a Genetic Algorithm (GA) designed to explore Variable Orderings (VOs) in order to optimize the induced classifiers. Experiments performed on a number of datasets revealed that VOGAC-MarkovPC required, on average, less than 25% of the time demanded by VOGAC-PC. Moreover, in terms of classification accuracy, VOGAC-MarkovPC performed as well as VOGAC-PC.
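The genetic component can be pictured as a search over permutations of the variables. The sketch below is hypothetical: `score_ordering` merely stands in for inducing a classifier under a given variable ordering (e.g. with MarkovPC) and measuring its accuracy, and the crossover is a simple order-preserving variant:

```python
import random

def score_ordering(ordering):
    """Hypothetical stand-in: induce a Bayesian classifier under this
    variable ordering and return its accuracy (placeholder score here)."""
    return random.Random(hash(tuple(ordering))).random()

def order_crossover(p1, p2):
    """Keep a slice of parent 1, fill the remaining genes in parent 2's order."""
    a, b = sorted(random.sample(range(len(p1)), 2))
    child = [None] * len(p1)
    child[a:b] = p1[a:b]
    rest = [g for g in p2 if g not in child]
    for i in range(len(child)):
        if child[i] is None:
            child[i] = rest.pop(0)
    return child

def evolve_orderings(n_vars, pop_size=20, generations=30):
    pop = [random.sample(range(n_vars), n_vars) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=score_ordering, reverse=True)   # elitist selection
        survivors = pop[:pop_size // 2]
        children = [order_crossover(*random.sample(survivors, 2))
                    for _ in range(pop_size - len(survivors))]
        pop = survivors + children
    return max(pop, key=score_ordering)

print(evolve_orderings(6))   # best variable ordering found
```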
The focus of this paper is the application of the genetic programming framework to the problem of knowledge discovery in databases, more precisely to the task of classification. Genetic programming possesses certain advantages that make it suitable for application in data mining, such as the robustness of the algorithm and its convenient structure for rule generation, to name a few. This study concentrates on one type of parallel genetic algorithm: the cellular (diffusion) model. Emphasis is placed on improving the efficiency and scalability of the data mining algorithm, which can be achieved by integrating the algorithm with databases and employing a cellular framework. A cellular model of genetic programming that exploits SQL queries is implemented and applied to the classification task. The results achieved are presented and compared with other machine learning algorithms.
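The key implementation idea, evaluating evolved classification rules directly in the database, can be sketched as follows (the schema and the example rule are illustrative assumptions, not the paper's; real GP individuals would be generated and varied by the evolutionary loop). Each candidate rule's fitness is computed with a single SQL query:

```python
import sqlite3

# toy relation standing in for the mined database (hypothetical schema)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (a REAL, b REAL, cls TEXT)")
conn.executemany("INSERT INTO samples VALUES (?, ?, ?)",
                 [(0.1, 2.0, "pos"), (0.9, 1.0, "neg"), (0.2, 3.0, "pos")])

def rule_fitness(condition, target="pos"):
    """Evaluate an evolved rule 'IF condition THEN target' with one SQL
    query: accuracy = (covered positives + uncovered negatives) / total."""
    (correct,) = conn.execute(
        f"SELECT SUM((({condition}) AND cls = ?) OR "
        f"(NOT ({condition}) AND cls <> ?)) FROM samples",
        (target, target)).fetchone()
    (total,) = conn.execute("SELECT COUNT(*) FROM samples").fetchone()
    return correct / total

# a candidate rule antecedent as it might appear in a GP individual
print(rule_fitness("a < 0.5 AND b > 1.5"))
```

Pushing the evaluation into the database engine is what lets the cellular model scale, since the data never has to be materialized in the client.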
This paper addresses the problem of clustering large sets, discussed in the context of financial time series. The goal is to divide stock market trading rules into several classes so that all the trading rules within the same class lead to similar trading decisions under the same stock market conditions. This is achieved using Kohonen self-organizing maps and the K-means algorithm. Several validity indices are used to validate and assess the clustering. Experiments were carried out on 350 stock market trading rules observed over a period of 1300 time instants.
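For illustration, a bare-bones K-means over decision vectors might look like the sketch below (the encoding is our assumption, not the paper's: each trading rule is represented by its vector of decisions over time, and rules whose vectors are close end up in the same class):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Plain K-means: rules with similar decision vectors share a centroid."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # assign each rule to its nearest centroid
        labels = np.argmin(((X[:, None] - centroids) ** 2).sum(-1), axis=1)
        new = np.array([X[labels == j].mean(0) if (labels == j).any()
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids

# toy data: 6 "trading rules", each a vector of decisions at 4 time instants
# (+1 buy, -1 sell, 0 hold); the real setting is 350 rules over 1300 instants
rules = np.array([[1, 1, 0, -1], [1, 1, 0, -1], [1, 0, 0, -1],
                  [-1, -1, 1, 1], [-1, -1, 1, 0], [-1, 0, 1, 1]])
labels, _ = kmeans(rules.astype(float), k=2)
print(labels)   # the two opposing rule families separate into two classes
```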
Data mining is a set of methods for processing data and extracting the non-trivial information contained in them; this information is not apparent at first glance, and even experienced experts are unable to uncover it, mainly due to the size of the datasets or the complexity of the relationships within them. Data mining therefore emerged as a scientific discipline that solves problems of this kind with the aid of modern computing technology.
The use of computational intelligence systems such as neural networks, fuzzy sets, genetic algorithms, etc., for stock market prediction is well established. This paper presents a generic stock price prediction model based on a rough set approach. To increase the efficiency of the prediction process, a rough sets with Boolean reasoning discretization algorithm is used to discretize the data. The rough set reduction technique is applied to find all reducts of the data, i.e., the minimal subsets of attributes that are associated with a class label for prediction. Finally, rough set dependency rules are generated directly from all generated reducts. A rough confusion matrix is used to evaluate the performance of the predicted reducts and classes. Using a dataset consisting of the daily movements of a stock traded on the Kuwait Stock Exchange, a preliminary assessment indicates that rough sets are applicable and effective for this task. For comparison, the results obtained using the rough set approach were compared with those of a neural network algorithm; the rough set approach achieved a higher overall accuracy rate and generated fewer and more compact rules than the neural network.
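The reduct computation mentioned above can be illustrated with a brute-force sketch (the decision table and attribute names are hypothetical, and real reduct search uses far more efficient methods): a subset of condition attributes is a reduct if it preserves the full dependency degree of the decision attribute and is minimal with that property.

```python
from itertools import combinations

def dependency(table, attrs, decision):
    """Rough set dependency degree: the fraction of rows whose equivalence
    class under `attrs` is consistent with a single decision value."""
    blocks = {}
    for row in table:
        blocks.setdefault(tuple(row[a] for a in attrs), set()).add(row[decision])
    consistent = sum(1 for row in table
                     if len(blocks[tuple(row[a] for a in attrs)]) == 1)
    return consistent / len(table)

def reducts(table, attrs, decision):
    """All minimal attribute subsets preserving the full dependency degree."""
    full = dependency(table, attrs, decision)
    found = []
    for r in range(1, len(attrs) + 1):
        for subset in combinations(attrs, r):
            if dependency(table, subset, decision) == full and \
               not any(set(f) <= set(subset) for f in found):
                found.append(subset)
    return found

# toy decision table: two condition attributes and a class label
table = [{"trend": "up", "vol": "hi", "move": "buy"},
         {"trend": "up", "vol": "lo", "move": "buy"},
         {"trend": "dn", "vol": "hi", "move": "sell"}]
print(reducts(table, ("trend", "vol"), "move"))  # -> [('trend',)]
```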
The research reported in this paper is part of a large project aiming at designing an automatic device for the detection of micro-sleep events. In the paper we are interested in the classification of EEG spectrograms with respect to the level of attention (mentation, relaxation, micro-sleep) of a monitored person (a proband). Data mining techniques are used to develop a classification model; namely, the GUHA method is employed for this purpose. It is a method of exploratory data analysis established on logical and statistical foundations that has been continuously developed over the last 40 years in the Czech Republic.
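For readers unfamiliar with GUHA: its procedures evaluate hypotheses of the form "antecedent ~ succedent" on a four-fold contingency table. The Python sketch below (the toy data and its encoding are our assumptions, not the paper's features) shows one classical quantifier, founded implication:

```python
def fourfold(rows, antecedent, succedent):
    """Four-fold table of a GUHA hypothesis: a = both hold, b = only the
    antecedent holds, c = only the succedent holds, d = neither holds."""
    a = b = c = d = 0
    for row in rows:
        if antecedent(row):
            if succedent(row): a += 1
            else: b += 1
        else:
            if succedent(row): c += 1
            else: d += 1
    return a, b, c, d

def founded_implication(a, b, c, d, p=0.9, base=2):
    """Classical GUHA quantifier: confidence a/(a+b) >= p with support a >= base."""
    return a >= base and a / (a + b) >= p

# toy spectrogram segments labelled by attention level (hypothetical encoding)
rows = [{"alpha_hi": 1, "state": "relax"}, {"alpha_hi": 1, "state": "relax"},
        {"alpha_hi": 1, "state": "sleep"}, {"alpha_hi": 0, "state": "ment"}]
t = fourfold(rows, lambda r: r["alpha_hi"] == 1, lambda r: r["state"] == "relax")
print(t, founded_implication(*t, p=0.6))
```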
Cohen’s kappa coefficient is a widely accepted measure of agreement on categorical variables and has replaced some older, simpler measures. Observational and statistical properties of the kappa coefficient in 2 x 2 tables are investigated. The asymmetrical measure “Cohenized implication” is proposed, and the decomposition of the symmetrical measure kappa into two asymmetrical components is shown. These statistically motivated measures are discussed as weakened forms of the strict logical notions of equivalence and implication. Applications of kappa and “Cohenized implication” are recommended: on the one hand in medical research as a supplement to the traditional measures of sensitivity and specificity, and on the other hand as quantifiers in the GUHA procedure ASSOC, providing a statistically contemporary operationalization of weakened equivalence.
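For reference, kappa in a 2 x 2 table with cell counts a, b, c, d and n = a + b + c + d is computed from the observed and chance-expected agreement (this is the standard definition; the “Cohenized implication” itself is defined in the paper):

```latex
% Cohen's kappa for a 2x2 table with cells a, b, c, d and n = a + b + c + d
\[
  p_o = \frac{a + d}{n}, \qquad
  p_e = \frac{(a+b)(a+c) + (c+d)(b+d)}{n^2}, \qquad
  \kappa = \frac{p_o - p_e}{1 - p_e}.
\]
```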
Intrusion detection systems are increasingly a key part of system defense. Various approaches to intrusion detection are currently in use, but they are relatively ineffective. Artificial Intelligence plays a driving role in security services. This paper proposes a dynamic model of an intelligent intrusion detection system based on a specific AI approach to intrusion detection. The techniques investigated include fuzzy logic with network profiling, which uses simple data mining techniques to process the network data. The proposed hybrid system combines anomaly and misuse detection. Simple fuzzy rules allow us to construct if-then rules that reflect common ways of describing security attacks. We use the DARPA dataset for training and benchmarking.
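A minimal sketch of the kind of fuzzy if-then rule described above (the membership functions, feature names, and thresholds are illustrative assumptions, not the paper's actual network profile):

```python
def trapezoid(x, a, b, c, d):
    """Trapezoidal membership function for a network-profile feature."""
    if x <= a or x >= d: return 0.0
    if b <= x <= c: return 1.0
    return (x - a) / (b - a) if x < b else (d - x) / (d - c)

def alert_degree(conn_rate, fail_ratio):
    """One if-then rule: IF connection rate is HIGH AND failed-login ratio
    is HIGH THEN raise an alert; min models AND, as in Mamdani inference."""
    rate_high = trapezoid(conn_rate, 50, 100, 1e9, 2e9)   # hypothetical scale
    fail_high = trapezoid(fail_ratio, 0.2, 0.5, 1.0, 1.1)
    return min(rate_high, fail_high)

print(alert_degree(120, 0.6))  # -> 1.0: both antecedents fully satisfied
print(alert_degree(75, 0.3))   # -> partial degree of attack suspicion
```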
The paper recalls the McNaughton theorem of fuzzy logic and the algorithms underlying its constructive proofs. It then shows how those algorithms can be combined with the algorithm underlying a recent extension of the theorem to piecewise-linear functions with rational coefficients, and points out the potential importance of the resulting combined algorithm for data mining. That result is immediately weakened by a complexity analysis of the algorithm, which reveals that its worst-case complexity is doubly exponential.
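For context, the classical statement being extended (given here as it appears in the standard literature, not quoted from the paper):

```latex
% McNaughton's theorem: a function f : [0,1]^n -> [0,1] is definable by a
% formula of Lukasiewicz propositional logic iff f is continuous and
% piecewise linear, each linear piece having integer coefficients:
\[
  f(x_1,\dots,x_n) \;=\; b_i + \sum_{j=1}^{n} a_{ij}\, x_j
  \quad \text{on the $i$-th piece,} \qquad a_{ij},\, b_i \in \mathbb{Z}.
\]
% The extension the paper builds on admits rational coefficients,
% a_{ij}, b_i \in \mathbb{Q}, in a suitably extended logic.
```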