WIT Press


Two Novel Term Weighting For Text Categorization

Price

Free (open access)

Volume

40

Pages

10

Page Range

105 - 114

Published

2008

Size

388 KB

Paper DOI

10.2495/DATA080111

Copyright

WIT Press

Author(s)

L. A. Matsunaga & N. F. F. Ebecken

Abstract

In text categorization (TC) based on the vector space model, each document is represented as a vector in which every component is associated with a particular term from the text collection vocabulary. Traditionally, each component value is assigned using the information retrieval (IR) TFIDF measure. While this weighting method seems very appropriate for IR, weighting methods that take into account how well a term discriminates between the categories may provide better results in TC. To apply this idea, this work uses variants of TFIDF weighting in which the idf part is replaced by functions normally used for term selection. In an application to real-world data, the automatic distribution of legislative bills to the committees of the Federal District Legislative Assembly in Brasília, Brazil, replacing the idf part of TFIDF by a new term selection measure, absl-logit, and by bi-normal separation [1] produced the best overall classification results with support vector machines (SVM). These two weightings were compared with TFIDF itself and with variants in which the idf part is replaced by common term selection measures (chi-square, information gain, gain ratio and odds ratio).

1 Introduction

Text categorization (TC) is the task of automatically assigning unlabelled documents to predefined categories. In TC based on the vector space model, a document is represented as a vector dti = [wi1, ..., wip], where p is the size of the text collection vocabulary (the number of terms in the dictionary of terms used).
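The idea of swapping the idf factor for a term selection score can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy counts, and the clipping constant `eps` are assumptions made for illustration, and it treats a single category against the rest. It uses bi-normal separation, taken here as the absolute difference of the inverse standard normal CDF at the true positive and false positive rates; absl-logit is not defined in this excerpt, so it is not shown.

```python
import math
from statistics import NormalDist


def idf(n_docs, doc_freq):
    """Classic inverse document frequency: log(N / df)."""
    return math.log(n_docs / doc_freq)


def bns(tp, fp, pos, neg, eps=0.0005):
    """Bi-normal separation for one term and one category.

    tp / fp: number of in-category / out-of-category documents containing the term.
    pos / neg: total number of in-category / out-of-category documents.
    Rates are clipped to [eps, 1 - eps] (an assumed constant) so that the
    inverse normal CDF stays finite.
    """
    inv = NormalDist().inv_cdf
    tpr = min(max(tp / pos, eps), 1 - eps)
    fpr = min(max(fp / neg, eps), 1 - eps)
    return abs(inv(tpr) - inv(fpr))


def weight_vector(term_counts, vocabulary, term_score):
    """Build the document vector [tf * score(term)] over the vocabulary."""
    return [term_counts.get(t, 0) * term_score[t] for t in vocabulary]


# Hypothetical toy collection: 4 documents in the category, 6 outside it.
vocabulary = ["budget", "health", "the"]
pos, neg = 4, 6
doc_counts = {"budget": (3, 1), "health": (1, 3), "the": (4, 6)}  # (tp, fp)

idf_score = {t: idf(pos + neg, tp + fp) for t, (tp, fp) in doc_counts.items()}
bns_score = {t: bns(tp, fp, pos, neg) for t, (tp, fp) in doc_counts.items()}

# Raw term frequencies of one document.
doc = {"budget": 2, "the": 5}
tfidf_vec = weight_vector(doc, vocabulary, idf_score)
tfbns_vec = weight_vector(doc, vocabulary, bns_score)
```

In this toy setting, "budget" and "health" occur in the same number of documents and therefore receive the same idf, whereas the bi-normal separation score depends on how those documents are split between the category and the rest. That category awareness is the property the abstract argues for.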

Keywords

term weighting, text categorization, text classification.