WIT Press


Feature Selection Using Support Vector Machines

Price

Free (open access)

Volume

28

Pages

Published

2002

Size

763 kb

Paper DOI

10.2495/DATA020271

Copyright

WIT Press

Author(s)

J Brank, M Grobelnik, N Milic-Frayling & D Mladenic

Abstract

Text categorization is the task of classifying natural language documents into a set of predefined categories. Documents are typically represented by sparse vectors under the vector space model, where each word in the vocabulary is mapped to one coordinate axis and its occurrence in the document gives rise to one nonzero component in the vector representing that document. When training classifiers on large collections of documents, both the time and memory requirements of processing these vectors may be prohibitive. This calls for using a feature selection method, not only to reduce the number of features but also to increase the sparsity of document vectors. We propose a feature selection method based on linear Support Vector Machines (SVMs). First, we train the linear SVM on a subset of training data and retain only those features that correspond to highly weighted components (in the absolute-value sense) of the normal to the resulting hyperplane that separates positive and negative examples. This reduced feature space is then used to train a classifier over a larger training set, because more documents now fit into the same amount of memory. In our experiments we compare the effectiveness of the SVM-based feature selection with that of more traditional feature selection methods, such as odds ratio and information gain, in achieving the desired tradeoff between the vector sparsity and the classification performance. Experimental results indicate that, at the same level of vector sparsity, feature selection based on SVM normals yields better classification performance than odds ratio- or information gain-based feature selection when linear SVM classifiers are used.

Introduction

Trends towards personalizing information services and client-based applications have increased the importance of effective and efficient document categorization
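The selection procedure described in the abstract (train a linear SVM on a subset, keep the features whose weights in the hyperplane normal are largest in absolute value, then retrain on a larger set in the reduced space) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration rather than the authors' implementation: the toy documents, the function names, the hyperparameters, and in particular the trainer, which is a plain batch subgradient descent on the regularized hinge loss standing in for a real SVM solver.

```python
# Sparse documents are dicts {word: value}; labels are +1 or -1.

def dot(w, x):
    """Inner product of a weight dict and a sparse document vector."""
    return sum(w.get(f, 0.0) * v for f, v in x.items())

def train_linear_svm(examples, epochs=200, eta=0.1, lam=0.01):
    """Illustrative stand-in for an SVM solver: batch subgradient
    descent on the L2-regularized hinge loss (no bias term)."""
    w = {}
    n = len(examples)
    for _ in range(epochs):
        grad = {}
        for x, y in examples:
            if y * dot(w, x) < 1.0:          # margin violated
                for f, v in x.items():
                    grad[f] = grad.get(f, 0.0) + y * v
        for f in w:                           # shrink (regularization)
            w[f] *= 1.0 - eta * lam
        for f, g in grad.items():             # hinge-loss step
            w[f] = w.get(f, 0.0) + eta * g / n
    return w

def select_features(w, k):
    """Keep the k features with the largest |weight| in the normal."""
    ranked = sorted(w, key=lambda f: (-abs(w[f]), f))
    return set(ranked[:k])

def project(x, keep):
    """Drop all components outside the retained feature set."""
    return {f: v for f, v in x.items() if f in keep}

# Hypothetical toy data: a small subset for selection, a larger set for training.
subset = [
    ({"the": 1.0, "goal": 1.0, "match": 1.0}, 1),
    ({"the": 1.0, "goal": 1.0, "team": 1.0}, 1),
    ({"the": 1.0, "stock": 1.0, "market": 1.0}, -1),
    ({"the": 1.0, "stock": 1.0, "bank": 1.0}, -1),
]
full = subset + [
    ({"the": 1.0, "goal": 1.0, "score": 1.0}, 1),
    ({"the": 1.0, "stock": 1.0, "price": 1.0}, -1),
]

w_subset = train_linear_svm(subset)            # step 1: SVM on the subset
keep = select_features(w_subset, k=2)          # step 2: top-|w| features
reduced = [(project(x, keep), y) for x, y in full]
w_final = train_linear_svm(reduced)            # step 3: retrain, sparser vectors

print(sorted(keep))  # ['goal', 'stock']
```

In this toy setup the word "the" occurs in every document of both classes, so its weight stays at zero and it is the first feature to be discarded, while the projected vectors shrink from three nonzero components to one. Real collections would of course need a proper SVM implementation and a tuned choice of `k`.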

Keywords