WIT Press

On Extending F-measure And G-mean Metrics To Multi-class Problems



R. P. Espíndola & N. F. F. Ebecken


The evaluation of classifiers is not an easy task. There are many ways of testing them and many measures for estimating their performance. The great majority of these measures were defined for two-class problems, and there is no consensus on how to generalize them to multi-class problems. This paper proposes extending the F-measure and the G-mean in the same fashion as has been done with the AUC. Datasets with diverse characteristics are used to generate fuzzy classifiers and C4.5 trees. The most common evaluation metrics are implemented and compared in terms of their output values: the greater the response, the more optimistic the measure. The results suggest that there are two well-behaved measures in opposite roles: one is always optimistic and the other always pessimistic.

Keywords: classification, classifier evaluation, ROC graphs, AUC, F-measure, G-mean.

1 Introduction

Classification [1] is an important task in all fields of knowledge. It consists of assigning elements described by a fixed set of attributes to one of a finite set of categories, or classes: for example, diagnosing a person's disease from medical examinations, or identifying a potential customer for a product from purchase records. Several artificial intelligence approaches have been applied to this problem, such as artificial neural networks, decision trees and production-rule systems. In order to test a classifier or a methodology, a researcher may choose among techniques such as leave-one-out, hold-out, bootstrap and cross-validation. Kohavi [2] performed large-scale experiments comparing two of them, bootstrap and cross-validation, and concluded that 10-fold stratified cross-validation was
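For concreteness, the binary F-measure (harmonic mean of precision and recall) and G-mean (geometric mean of sensitivity and specificity) that the paper sets out to extend can be computed from a confusion matrix. The macro-averaged one-vs-rest generalization sketched below is only one common scheme and is an assumption for illustration here, not necessarily the averaging the authors adopt (the paper follows the style of the multi-class AUC extension):

```python
import math

def per_class_counts(confusion, i):
    # One-vs-rest counts for class i, given an n x n confusion
    # matrix indexed as confusion[true_class][predicted_class].
    total = sum(sum(row) for row in confusion)
    tp = confusion[i][i]
    fn = sum(confusion[i]) - tp
    fp = sum(row[i] for row in confusion) - tp
    tn = total - tp - fn - fp
    return tp, fp, fn, tn

def f_measure(tp, fp, fn):
    # Harmonic mean of precision and recall.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def g_mean(tp, fp, fn, tn):
    # Geometric mean of sensitivity and specificity.
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return math.sqrt(sensitivity * specificity)

def macro_f_measure(confusion):
    # Assumed multi-class extension: average the binary F-measure
    # over every class taken one-vs-rest.
    n = len(confusion)
    return sum(f_measure(*per_class_counts(confusion, i)[:3])
               for i in range(n)) / n

def macro_g_mean(confusion):
    # Same one-vs-rest averaging applied to the G-mean.
    n = len(confusion)
    return sum(g_mean(*per_class_counts(confusion, i))
               for i in range(n)) / n
```

For a two-class confusion matrix such as `[[5, 1], [2, 4]]`, both macro averages reduce to familiar binary quantities, which makes the functions easy to sanity-check before applying them to larger matrices.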

