Data mining nowadays belongs among the most prominent information technologies, experiencing a boom of interest from users and software producers. Traditionally, extracting knowledge from data has been a domain of statisticians, and the largest variety of methods encountered in commercial data mining systems are actually methods for statistical data analysis tasks. One of the most important among them is testing hypotheses about the probability distribution underlying the data. Basically, it consists in checking the null hypothesis that the probability distribution, a priori assumed to belong to a broad set of distributions, actually belongs to one of its narrow subsets, which must be precisely delimited in advance. However, in the situations in which data mining is performed, there are seldom enough clues for such a precise delimitation. That is why generalizations of statistical hypothesis testing to vague hypotheses have been investigated for more than a decade, so far following the most straightforward way: to replace the set defining the null hypothesis by a fuzzy set. In this paper, a principally different generalization is proposed, based on the observational-logic approach to data mining, and in particular to hypothesis testing. Its key idea is to view statistical testing of a fuzzy hypothesis as an application of an appropriate generalized quantifier of a fuzzy predicate calculus to predicates describing the data. The theoretical principles of the approach are elaborated for both crisp and fuzzy significance levels, and illustrated on the quantifier of lower critical implication, well known from the data mining system GUHA. Finally, the implementation of the approach is briefly sketched.
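To make the role of such a quantifier concrete, the following minimal sketch evaluates the crisp lower critical implication quantifier on the frequencies of a four-fold contingency table, using the binomial tail test as it is usually defined in the GUHA literature; the function name and the example frequencies are illustrative assumptions, not part of the paper.

from scipy.stats import binom

def lower_critical_implication(a, b, p=0.9, alpha=0.05):
    # a ... objects satisfying both the antecedent and the succedent
    # b ... objects satisfying the antecedent but not the succedent
    # The quantifier holds if sum_{i=a}^{a+b} C(a+b, i) p^i (1-p)^(a+b-i) <= alpha,
    # i.e. the observed table allows rejecting the null hypothesis
    # P(succedent | antecedent) <= p at significance level alpha.
    n = a + b
    tail = binom.sf(a - 1, n, p)   # P(X >= a) for X ~ Binomial(n, p)
    return tail <= alpha

# Illustrative four-fold table frequencies: a = 47, b = 3
print(lower_critical_implication(47, 3, p=0.8, alpha=0.05))   # True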
The popularity of feed-forward neural networks such as multilayer perceptrons and radial basis function networks is to a large extent due to their universal approximation capability. This paper concerns the theoretical principles of that capability, together with the influence of network architecture and of the distribution of training data on it. Then, the possibility of exploiting this influence to improve the approximation capability of multilayer perceptrons by means of cross-validation and boosting is explained. Although the impact of both methods on the approximation capability of feed-forward networks is known in theory, they are still not common in real-world applications. Therefore, the paper documents the usefulness of both methods on a detailed case study in materials science.
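As a rough illustration of how the two methods enter the training of a multilayer perceptron (the synthetic dataset, network size, and library choices below are our assumptions, not the paper's materials-science case study), cross-validation and boosting can be combined as follows.

from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPRegressor

# Synthetic regression data standing in for the training set
X, y = make_regression(n_samples=500, n_features=8, noise=0.1, random_state=0)

# Cross-validation: estimate how well a single MLP approximates the target
mlp = MLPRegressor(hidden_layer_sizes=(20,), max_iter=2000, random_state=0)
print("single MLP, 5-fold CV R^2:",
      cross_val_score(mlp, X, y, cv=5, scoring="r2").mean())

# Boosting: an additive ensemble of small MLPs
# (the keyword is base_estimator in scikit-learn < 1.2)
boosted = AdaBoostRegressor(estimator=mlp, n_estimators=10, random_state=0)
print("boosted MLPs, 5-fold CV R^2:",
      cross_val_score(boosted, X, y, cv=5, scoring="r2").mean())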
Classifier combining is a popular technique for improving classification quality. Common methods for classifier combining can be further improved by using dynamic classification confidence measures which adapt to the currently classified pattern. However, in the case of dynamic classifier systems, classification confidence measures need to be studied in a broader context: as we show in this paper, the degree of consensus of the whole classifier team plays a key role in the process. We discuss the properties which should hold for a good confidence measure, and we define two methods for predicting the feasibility of a given classification confidence measure for a given classifier team and given data. Experimental results on 6 artificial and 20 real-world benchmark datasets show that for both methods, there is a statistically significant correlation between the feasibility of the measure and the actual improvement in classification accuracy of the whole classifier system; therefore, both feasibility measures can be used in practical applications to choose an optimal classification confidence measure.
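The notions involved can be sketched as follows; the particular confidence measure (margin between the two highest posterior estimates), the consensus measure, and the confidence-weighted combiner are illustrative assumptions, not the definitions or feasibility methods proposed in the paper.

import numpy as np

def dynamic_confidence(proba):
    # Confidence of one classifier on the current pattern: margin between
    # its two highest posterior estimates (a common, simple choice).
    top2 = np.sort(proba)[-2:]
    return top2[1] - top2[0]

def degree_of_consensus(predictions):
    # Fraction of classifiers in the team voting for the majority class.
    _, counts = np.unique(predictions, return_counts=True)
    return counts.max() / len(predictions)

def combine(team_probas):
    # Weighted voting: each classifier's posterior estimate is weighted by
    # its dynamic confidence on the current pattern.
    weights = np.array([dynamic_confidence(p) for p in team_probas])
    if weights.sum() == 0:
        weights = np.ones_like(weights)
    return np.average(team_probas, axis=0, weights=weights).argmax()

# Posterior estimates of a three-classifier team for one pattern, 3 classes
team = np.array([[0.6, 0.3, 0.1],
                 [0.5, 0.4, 0.1],
                 [0.2, 0.7, 0.1]])
print("combined class:", combine(team))
print("degree of consensus:", degree_of_consensus(team.argmax(axis=1)))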