WIT Press


On High Dimensional Data Spaces

Price

Free (open access)

Volume

28

Pages

Published

2002

Size

663 kb

Paper DOI

10.2495/DATA020251

Copyright

WIT Press

Author(s)

S Dey & S A Roberts

Abstract

Data mining applications usually encounter high-dimensional data spaces. Most of these dimensions contain ‘uninteresting’ data, which is not only of little value for the discovery of rules or patterns but has been shown to mislead some classification algorithms. Since the computational effort increases very significantly (usually exponentially) with the number of attributes, it is highly desirable that all irrelevant attributes be weeded out at an early stage. Often, patterns of interest are embedded in lower-dimensional subspaces of the data. If the data space S has k attributes {a1, a2, ..., ak}, then an n-dimensional subspace s of S can be formed by selecting a combination of n attributes from the set {a1, a2, ..., ak}, where n < k. It is usual to tackle this problem by having some attributes and subspaces identified by the user (or domain experts). For even a moderately large number of attributes, however, the number of possible subspaces is so large that it is quite unlikely that the ‘experts’ would be able to identify all the ‘interesting’ subspaces.

1 Introduction

The general problem, known as ‘the curse of high dimensionality’, has been studied extensively, and several automatic methods for reduction of dimensionality have been reported in the literature. Data mining applications require that:

- the results be comprehensible by the end user;
- data distributions that potentially do not conform to any of the canonical forms be handled;
- potentially ‘interesting’ subspaces (as opposed to a subset of the original attributes) be identified.
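To make the combinatorial argument concrete: the number of non-trivial subspaces of a k-attribute space is the sum of C(k, n) over n = 1, ..., k-1, which equals 2^k - 2. The following is a minimal Python sketch (illustrative only; the attribute names and k = 5 are assumptions, not taken from the paper) that enumerates and counts these subspaces:

```python
# Illustrative sketch, not from the paper: enumerate the n-dimensional
# subspaces of a data space with k attributes, where each subspace is a
# combination of n attributes and n < k.
from itertools import combinations
from math import comb

attributes = ["a1", "a2", "a3", "a4", "a5"]  # hypothetical k = 5 attributes
k = len(attributes)

for n in range(1, k):
    subspaces = list(combinations(attributes, n))
    print(f"n = {n}: {comb(k, n)} subspaces, e.g. {subspaces[0]}")

# Total number of proper, non-empty subspaces: sum of C(k, n) = 2^k - 2.
total = sum(comb(k, n) for n in range(1, k))
print(f"total subspaces for k = {k}: {total}")  # 30 here; grows as 2^k
```

Even at k = 20 this total exceeds a million, which is why exhaustive inspection by domain experts quickly becomes infeasible.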

Keywords