WIT Press


Comparing Dissimilarity Measures For Probabilistic Symbolic Objects

Price

Free (open access)

Paper DOI

10.2495/DATA020041

Volume

28

Pages

Published

2002

Size

729 kb

Author(s)

D Malerba, F Esposito & M Monopoli

Abstract

Symbolic data analysis generalizes some standard statistical data mining methods, such as those developed for classification and clustering tasks, to the case of symbolic objects (SOs). These objects, informally defined as “aggregated data” because they synthesize information concerning a group of individuals of a population, ensure the confidentiality of the original data; nevertheless, they pose new problems that find a solution in symbolic data analysis. A by-product of working with aggregate data is the possibility of dealing with data from complex questionnaires, where multiple answers are possible or constraints among different answers exist. Comparing SOs is an important step of symbolic data analysis. It can be useful to cluster some SOs, to discriminate between them, or even to order SOs according to their degree of generalization. This paper presents a comparative study aiming at evaluating the degree of dissimilarity between the objects of a restricted class of symbolic data, namely Probabilistic Symbolic Objects. To define a ground truth for the empirical evaluation, a data set with understandable and explainable properties has been selected. In the experiment, only two of the seven dissimilarity measures we studied seem to have a more stable behaviour.

1 Symbolic data analysis

Most statistical data mining techniques are designed for a relatively simple situation: the unit for statistical analysis is an individual (e.g., a person or an object) described by a well-defined set of random variables (either qualitative or quantitative), each of which results in just a single value. However, in many situations data analysts cannot access the single individuals (first-order objects). A typical situation is that of census data, which raise privacy issues in all governmental agencies that distribute them. To guarantee that data analysts

Keywords