Robust Clustering Methods For Incomplete And Erroneous Data
T. Kärkkäinen & S. Äyrämö
In this paper, reliable methods for clustering erroneous and incomplete data per se (e.g. without imputation) are considered. For this purpose, the usual K-means algorithm is generalized by using robust location estimates and a special projection technique. Numerical comparisons of the resulting methods on simulated data are presented and analyzed.

Keywords: robust clustering, erroneous and incomplete data, K-means.

1 Introduction

Clustering is, by definition, a descriptive technique that is widely used in, for example, statistics, machine learning, pattern recognition, data mining (DM) and Knowledge Discovery in Databases (KDD) [1, 2, 3, 4, 5, 6]. It can undoubtedly be considered a core method of DM and KDD, but the number of different clustering methods is huge. The main idea behind all of these methods is to group similar objects into the same cluster and dissimilar objects into separate clusters. Similarity (or dissimilarity) is measured by a suitable distance function.

However, clustering is a challenging task, since it involves many choices: the decision between basic approaches (hierarchical, partitioning, density-based, model-based, grid-based, fuzzy, etc.), the selection of an initialization method, the choice of a distance measure, and the fixing of a cluster representation technique. All of these depend on the particular context in which the method is to be applied. The variety of application fields is also remarkable.

Although efficient systems for data gathering have been developed, most collected real-world data sets tend to be incomplete and erroneous [7, 6]. Robust techniques are by construction more suitable for such data sets, and they can be expected to give better-quality results than traditional clustering methods. However, as
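To make the idea stated in the abstract concrete, the following is a minimal sketch of a K-means-type iteration that works directly on incomplete data. It is an illustration under stated assumptions, not the algorithm of the paper: the coordinate-wise median is assumed as the robust location estimate, missing values are assumed to be encoded as None, and the "projection" is assumed to mean restricting distances and estimates to the observed components. The names masked_l1 and robust_kmeans are hypothetical.

```python
from statistics import median


def masked_l1(x, c):
    """L1 distance between x and center c over the observed components of x.

    Assumption: a missing value is encoded as None and is simply skipped.
    """
    return sum(abs(xi - ci) for xi, ci in zip(x, c) if xi is not None)


def robust_kmeans(data, centers, n_iter=20):
    """K-means-type iteration with coordinate-wise medians as prototypes.

    Assumption: the median stands in for the robust location estimate;
    each coordinate of a prototype is estimated from observed values only.
    """
    centers = [list(c) for c in centers]
    labels = []
    for _ in range(n_iter):
        # Assignment step: nearest prototype in the masked L1 distance.
        labels = [
            min(range(len(centers)), key=lambda k: masked_l1(x, centers[k]))
            for x in data
        ]
        # Update step: per-coordinate median of the observed values.
        new_centers = []
        for k, old in enumerate(centers):
            members = [x for x, lab in zip(data, labels) if lab == k]
            new = []
            for j, old_j in enumerate(old):
                observed = [x[j] for x in members if x[j] is not None]
                # Keep the old coordinate if nothing is observed for it.
                new.append(median(observed) if observed else old_j)
            new_centers.append(new)
        if new_centers == centers:
            break  # prototypes no longer change
        centers = new_centers
    return centers, labels


# Small simulated example: two groups, missing entries, one gross error.
data = [
    [1.0, 1.0], [1.2, None], [0.8, 1.1],
    [100.0, 1.0],  # erroneous observation
    [5.0, 5.0], [5.1, None], [None, 4.9], [4.9, 5.2],
]
centers, labels = robust_kmeans(data, centers=[[0.0, 0.0], [6.0, 6.0]])
```

In this example the erroneous observation is assigned to the second cluster, yet the median keeps that prototype close to the bulk of the cluster, whereas a sample mean over the same members would be pulled far toward the error. No values are imputed at any point.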