WIT Press


Naive Rule Induction For Text Classification Based On Key-phrases

Price

Free (open access)

Volume

35

Pages

7

Published

2005

Size

361 kb

Paper DOI

10.2495/DATA050181

Copyright

WIT Press

Author(s)

N. N. Karanikolas & C. Skourlas

Abstract

In this paper we focus on the induction of naive rules for classifying text documents. We briefly describe an algorithm that creates key-phrases from a given set of documents; these key-phrases are then organized and used as features for the automatic classification of new documents. The algorithm specifies an Authority List of key-phrases, containing those key-phrases that occur frequently within the documents of only one or a few classes in the training set. Within this framework, this last property permits the creation of naive rules that measure the similarity of new documents to the existing classes.

Keywords: text data mining, text classification, instance based learning, rule induction.

1 Introduction

Key-phrases, or search terms, can be defined as sequences of adjacent words within a text window (e.g. five successive words of the text, or a sentence) that form a meaningful, descriptive phrase related to the content of the text document. Such terms can be used as features for classifying (text) documents. Since not every key-phrase is appropriate for discriminating between documents, we have to examine and apply methods for selecting the appropriate ones. Hence, a prerequisite for such a classification method is the use and maintenance of a list of key-phrases, the so-called “Authority List” (Karanikolas and Skourlas [4]). An interesting related problem is the reduction of the search space needed for the extraction of candidate key-phrases.

In classification learning, a learning scheme takes a set of classified examples from which it is expected to learn a way of classifying unseen

Keywords

text data mining, text classification, instance based learning, rule induction.
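
The idea outlined in the abstract (candidate key-phrases taken from a text window, an Authority List restricted to phrases that are frequent in only one or a few classes, and naive rules that score a new document against each class) can be illustrated with a minimal sketch. This is not the authors' algorithm, whose details are not given on this page. The function names, the thresholds `min_docs` and `max_classes`, the two-to-five-word window, the sentence splitting, and the scoring by summed document counts are all assumptions made for illustration.

```python
import re
from collections import defaultdict


def candidate_phrases(text, max_len=5):
    """Return runs of 2..max_len adjacent words that stay inside one sentence.

    Assumption: the text window is a sentence capped at five words.
    """
    phrases = set()
    for sentence in re.split(r"[.!?]+", text.lower()):
        words = re.findall(r"[a-z0-9]+", sentence)
        for n in range(2, max_len + 1):
            for i in range(len(words) - n + 1):
                phrases.add(" ".join(words[i:i + n]))
    return phrases


def build_authority_list(training, min_docs=2, max_classes=1):
    """Keep phrases seen in at most max_classes classes and frequent in one.

    training is a list of (text, label) pairs. The result maps each kept
    phrase to {label: number of training documents containing it}.
    Both thresholds are hypothetical parameters.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for text, label in training:
        for phrase in candidate_phrases(text):
            counts[phrase][label] += 1

    authority = {}
    for phrase, per_class in counts.items():
        frequent = {c: n for c, n in per_class.items() if n >= min_docs}
        if frequent and len(per_class) <= max_classes:
            authority[phrase] = frequent
    return authority


def classify(text, authority):
    """Naive rule: each matched phrase votes for its class(es).

    Returns the best-scoring label, or None if no phrase matches.
    """
    scores = defaultdict(int)
    for phrase in candidate_phrases(text):
        for label, n in authority.get(phrase, {}).items():
            scores[label] += n
    return max(scores, key=scores.get) if scores else None


# Toy training set, invented purely for this example.
training = [
    ("The central bank raised interest rates again.", "finance"),
    ("Analysts expect the central bank to hold interest rates.", "finance"),
    ("The home team won the final match.", "sport"),
    ("Fans celebrated after the final match last night.", "sport"),
]
authority = build_authority_list(training)
print(classify("The central bank met before the final decision.", authority))
```

In this toy setting a phrase such as "central bank" enters the list because it appears in two documents of a single class, whereas "bank raised" appears only once and is dropped. Raising `max_classes` would admit phrases shared by a few classes, at the cost of weaker discrimination.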