Two Novel Term Weighting For Text Categorization
105 - 114
L. A. Matsunaga & N. F. F. Ebecken
In text categorization (TC) based on the vector space model, each document is represented as a vector in which every component is associated with a particular term from the text collection vocabulary. Traditionally, each component value is assigned using the information retrieval (IR) TFIDF measure. While this weighting method seems very appropriate for IR, weighting methods that take into account how much a term helps to discriminate between categories may give better results in TC. To apply this idea, this work uses variants of TFIDF weighting in which the idf part is replaced by functions normally used for term selection. In an application to real-world data, the automatic distribution of legislative bills to the committees of the Federal District Legislative Assembly in Brasília, Brazil, replacing the idf part of TFIDF with a new term selection measure (absl-logit) or with bi-normal separation produced the best overall classification results with support vector machines (SVM). The comparison was against standard TFIDF and against variants in which common term selection measures (chi-square, information gain, gain ratio and odds ratio) replace the idf part.

Keywords: term weighting, text categorization, text classification.

1 Introduction

Text categorization (TC) is the task of automatically assigning unlabelled documents to predefined categories. In TC based on the vector space model, a document is represented as a vector dti = [wi1, ..., wip], where p is the size of the text collection vocabulary (the number of terms in the dictionary of terms used).
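
The idea of replacing the idf factor with a term selection score can be illustrated with a small sketch. The code below is not the authors' implementation. It assumes a binary (one-versus-rest) category, uses the raw term count as the tf part, and computes bi-normal separation in its commonly cited form, |F⁻¹(tpr) − F⁻¹(fpr)|, where F⁻¹ is the inverse standard normal CDF. The function names `bns_scores` and `tf_bns_vector`, the clipping value `eps`, and the toy documents are all hypothetical choices made for illustration.

```python
from collections import Counter
from statistics import NormalDist


def bns_scores(docs, labels, eps=0.0005):
    """Bi-normal separation score of each term for one binary category.

    docs   : list of token lists
    labels : list of booleans (True = document belongs to the category)
    eps    : assumed clipping bound that keeps the inverse CDF finite
    Assumes at least one positive and one negative document.
    """
    inv = NormalDist().inv_cdf  # inverse standard normal CDF
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos

    # Document frequency of each term inside and outside the category.
    pos_df, neg_df = Counter(), Counter()
    for tokens, is_pos in zip(docs, labels):
        (pos_df if is_pos else neg_df).update(set(tokens))

    scores = {}
    for term in set(pos_df) | set(neg_df):
        tpr = min(max(pos_df[term] / n_pos, eps), 1 - eps)
        fpr = min(max(neg_df[term] / n_neg, eps), 1 - eps)
        scores[term] = abs(inv(tpr) - inv(fpr))
    return scores


def tf_bns_vector(tokens, vocabulary, scores):
    """Document vector whose components are tf * BNS instead of tf * idf."""
    tf = Counter(tokens)
    return [tf[term] * scores.get(term, 0.0) for term in vocabulary]


# Toy data (invented): two documents in the category, two outside it.
docs = [
    ["tax", "budget", "tax"],
    ["budget", "health"],
    ["health", "hospital"],
    ["tax", "hospital"],
]
labels = [True, True, False, False]

scores = bns_scores(docs, labels)
vocabulary = sorted(scores)  # ['budget', 'health', 'hospital', 'tax']
vec = tf_bns_vector(docs[0], vocabulary, scores)
```

In this toy collection, "tax" occurs equally often inside and outside the category, so its weight is zero even though it is the most frequent term in the first document, whereas "budget" occurs only inside the category and receives a large weight. An idf factor would treat both terms identically, because each appears in two of the four documents.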