WIT Press


A Comparison Of Some Classification Techniques

Price

Free (open access)

Paper DOI

10.2495/DATA020551

Volume

28

Pages

Published

2002

Size

561 kb

Author(s)

P S S Coelho & N F F Ebecken

Abstract

Classification assigns labels, or classes, to distinguish groups of objects. In general, these labels are known beforehand from objects that have already been classified. In Data Mining tasks the objects are records, i.e., they are described by a set of attributes, which may be categorical or continuous. The objective is to build models that characterize the classes of the records from their attributes (values, distribution, pattern, etc.). Many different techniques for record classification are available today, and they differ in the heuristics they use. This article compares some of the most popular classification techniques: Decision Trees, Bayesian Algorithms (Statistical Methods), Classification Based on Rule Induction, and Classification Based on Association Rules. Predictive accuracy was the main criterion used to compare these techniques. Speed, robustness, scalability and interpretability are also discussed, but they were not quantified for a mathematical comparison. The classification models were built from two relational tables of real data. The first contains data about meteorological conditions in the region of the International Airport of Rio de Janeiro. This table has 26482 records with 19 variables (one of them is the class label). The second comes from an insurance company and has 130143 records with 63 independent variables (attributes) and one dependent variable (the class label). Both tables were prepared beforehand. The results of the comparison are presented in tables.

1 Introduction

It can be considered that the activities of Data Mining are concentrated in the development of models that represent some knowledge contained in the data
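The predictive accuracy criterion named in the abstract can be illustrated with a minimal sketch. It assumes accuracy is measured as the fraction of held-out records whose predicted class matches the true class; the abstract does not give the authors' exact evaluation protocol. The labels, the function names `predictive_accuracy` and `majority_class`, and the "model" predictions are hypothetical toy values, not data or results from the paper.

```python
from collections import Counter


def predictive_accuracy(true_labels, predicted_labels):
    """Return the fraction of records whose predicted class equals the true class."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("at least one record is required")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def majority_class(train_labels):
    """Return the most frequent class label in the training records."""
    return Counter(train_labels).most_common(1)[0][0]


# Toy, made-up class labels (assumption: NOT the paper's data).
train_labels = ["fog", "clear", "clear", "clear", "fog", "clear"]
test_labels = ["clear", "fog", "clear", "fog", "clear"]

# Baseline: always predict the most frequent training class.
baseline = majority_class(train_labels)
baseline_predictions = [baseline] * len(test_labels)

# Hypothetical output of some trained classifier on the same test records.
model_predictions = ["clear", "fog", "clear", "clear", "clear"]

baseline_accuracy = predictive_accuracy(test_labels, baseline_predictions)
model_accuracy = predictive_accuracy(test_labels, model_predictions)
print(f"baseline: {baseline_accuracy:.2f}, model: {model_accuracy:.2f}")
```

Comparing each technique against a majority-class baseline on the same held-out records is one common way to make accuracy figures comparable across classifiers; whether the paper does this is not stated in the abstract.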

Keywords