WIT Press


A Fully Sensitive Correlation Measure For Data Mining

Price

Free (open access)

Volume

40

Pages

7

Page Range

35 - 41

Published

2008

Size

195 kb

Paper DOI

10.2495/DATA080041

Copyright

WIT Press

Author(s)

R. J. G. B. Campello & E. R. Hruschka

Abstract

This paper introduces a novel sequence correlation measure that is fully sensitive to both the ranks and the magnitudes of the sequences under evaluation. The measure can be more appropriate than existing ones in application scenarios where such full sensitivity is desired. The applicability of the new measure to data mining tasks is motivated.

Keywords: correlation indexes, clustering analysis.

1 Introduction

A problem that arises in different contexts of data analysis is that of comparing two sequences A = {a1, a2, ..., an} and B = {b1, b2, ..., bn} for which there is a total order relation (≤) on their elements. This problem can be addressed by means of correlation indexes, such as the well-known Pearson correlation coefficient [1, 2]. Aside from the wide applicability of such indexes in statistics [3, 4], there are several possible scenarios for their application to data mining tasks. One example is the use of sequence correlation indexes for feature selection as a pre-processing step for data clustering or classification [5]. Another is the measurement of similarities in bioinformatics data sets [6]. For example, sequences A and B can refer to the responses of a given pair of genes along a set of experiments (e.g. microarray) [7]. Since the trend of such responses plays a fundamental role in describing the function and behavior of the corresponding genes, correlation indexes have been widely used as measures of similarity when dealing with this sort of data.
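To make the distinction between rank sensitivity and magnitude sensitivity concrete, the following is a minimal sketch in Python that compares two sequences with the Pearson correlation coefficient mentioned above and with a rank-based coefficient. The choice of Spearman's rank correlation as the rank-based example, the helper names (pearson, ranks, spearman) and the sample data are assumptions made for illustration only. This sketch does not implement the new measure proposed in the paper, which is not defined in this excerpt.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("sequences must have the same length (at least 2)")
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_a * var_b)


def ranks(seq):
    """Ranks 1..n by ascending value; tied values share their average rank."""
    order = sorted(range(len(seq)), key=lambda i: seq[i])
    result = [0.0] * len(seq)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of positions holding the same value.
        while j + 1 < len(order) and seq[order[j + 1]] == seq[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(a, b):
    """Rank-based correlation: Pearson applied to the ranks (illustrative choice)."""
    return pearson(ranks(a), ranks(b))


# Hypothetical data: same ordering in both sequences, one extreme magnitude in b.
a = [1, 2, 3, 4, 5]
b = [1, 2, 3, 4, 100]
r_pearson = pearson(a, b)
r_spearman = spearman(a, b)
print(f"Pearson:  {r_pearson:.3f}")
print(f"Spearman: {r_spearman:.3f}")
```

On this sample data the two sequences are ordered identically, so the rank-based coefficient equals 1, while the Pearson coefficient drops to about 0.72 because of the single large value in b. The first index reacts only to ranks and the second reacts to magnitudes, which is the kind of contrast the abstract refers to when it speaks of sensitivity to both.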
