WIT Press


Generalized K-means Algorithm On Nominal Dataset

Price

Free (open access)

Volume

40

Pages

9

Page Range

43 - 51

Published

2008

Size

341 kb

Paper DOI

10.2495/DATA080051

Copyright

WIT Press

Author(s)

S. H. Al-Harbi & A. M. Al-Shahri

Abstract

Clustering has typically been treated as a problem on continuous fields. In data mining, however, the data values are often nominal and cannot be assigned meaningful continuous substitutes. The main advantage of the k-means algorithm in data mining applications is its efficiency in clustering large data sets. The k-means algorithm usually uses the simple Euclidean metric, which is only suitable for hyperspherical clusters, and its use is limited to numeric data. This paper extends our work on the DCV metric, which was introduced to deal with nominal data, and then demonstrates how the popular k-means clustering algorithm can be profitably modified to deal with the DCV metric. The adapted k-means algorithm is then implemented with the DCV metric and the results are examined. With this development, it is now possible to improve the results of cluster analyses on nominal data sets.

1 Introduction

A way of extracting information from a large data set is to cluster it. Clustering involves assigning objects to groups such that the objects in a group are similar to each other, but different from the objects in the other groups. Similarity is fundamental to the definition of a cluster, and being able to measure the similarity of two objects in the same feature space is essential to most clustering algorithms. In a metric space, the dissimilarity between two objects is modelled with a distance function that satisfies the triangle inequality. It gives a numerical value to the notion of closeness between two objects in a high-dimensional space. More details of metric spaces can be found, for example, in [3]. Applications of clustering exist in diverse areas, e.g.

Keywords

clustering, data mining, Mahalanobis metric, DCV metric, Hamming metric, k-means.
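
The abstract notes that the usual k-means algorithm relies on the Euclidean metric and is therefore limited to numeric data. As a rough illustration of what changing the metric involves, the sketch below runs a k-means-style loop on nominal records using the Hamming metric (listed in the keywords) in place of the Euclidean one. It does not implement the DCV metric, which this page does not define. Using the attribute-wise mode as the cluster centre is an assumption made for this example, and the names `hamming`, `mode_record`, and `cluster_nominal` and the toy records are hypothetical.

```python
from collections import Counter


def hamming(a, b):
    """Count the attributes on which two nominal records differ."""
    return sum(x != y for x, y in zip(a, b))


def mode_record(records):
    """Attribute-wise most frequent value.

    Assumption for this sketch: the mode stands in for the k-means centroid,
    since a mean is undefined for nominal values.
    """
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*records))


def cluster_nominal(records, centres, max_iter=20):
    """k-means-style loop: assign each record to its nearest centre, then update centres."""
    centres = list(centres)
    labels = []
    for _ in range(max_iter):
        # Assignment step: nearest centre under the Hamming metric.
        labels = [
            min(range(len(centres)), key=lambda j: hamming(r, centres[j]))
            for r in records
        ]
        # Update step: recompute each centre; keep the old one if a cluster is empty.
        new_centres = []
        for j, old in enumerate(centres):
            members = [r for r, lab in zip(records, labels) if lab == j]
            new_centres.append(mode_record(members) if members else old)
        if new_centres == centres:
            break  # centres are stable, so assignments will not change
        centres = new_centres
    return labels, centres


# Hypothetical toy data: three nominal attributes per record.
records = [
    ("red", "small", "round"),
    ("red", "small", "oval"),
    ("blue", "large", "square"),
    ("blue", "large", "round"),
]
labels, centres = cluster_nominal(records, centres=[records[0], records[2]])
print(labels)
```

The Hamming metric treats every mismatch as equally large, so swapping in a different metric for nominal data would only require replacing `hamming`, provided a suitable centre update is also defined.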