Generalized K-means Algorithm On Nominal Dataset
Free (open access)
43 - 51
S. H. Al-Harbi & A. M. Al-Shahri
Clustering has typically been a problem related to continuous fields. However, in data mining the data values are often nominal and cannot be assigned meaningful continuous substitutes. The main advantage of the k-means algorithm in data mining applications is its efficiency in clustering large data sets. The k-means algorithm usually uses the simple Euclidean metric, which is suitable only for hyperspherical clusters and is limited to numeric data. This paper extends our work on the DCV metric, which was introduced to deal with nominal data, and then demonstrates how the popular k-means clustering algorithm can be profitably modified to work with the DCV metric. Having adapted the k-means algorithm, we implement it with the DCV metric and examine the results. With this development, it is now possible to improve the results of cluster analyses on nominal data sets.

Keywords: clustering, data mining, Mahalanobis metric, DCV metric, Hamming metric, k-means.

1 Introduction

One way of extracting information from a large data set is to cluster it. Clustering involves assigning objects to groups such that the objects in a group are similar to each other but different from the objects in the other groups. Similarity is fundamental to the definition of a cluster, and being able to measure the similarity of two objects in the same feature space is essential to most clustering algorithms. In a metric space, the dissimilarity between two objects is modelled with a distance function that satisfies the triangle inequality. It gives a numerical value to the notion of closeness between two objects in a high-dimensional space. More details of metric spaces can be found, for example, in . Applications of clustering exist in diverse areas, e.g.