Naive Rule Induction For Text Classification Based On Key-phrases
N. N. Karanikolas & C. Skourlas
In this paper we focus on the induction of naive rules for classifying text documents. We briefly describe an algorithm that creates key-phrases from a given set of documents; these key-phrases are organized and used as features for the automatic classification of new documents. The algorithm specifies an Authority List of key-phrases, containing key-phrases that occur frequently within the documents of only one or a few classes in the training set. In this framework, this last property permits the creation of naive rules that measure the similarity of new documents to the existing classes.

Keywords: text data mining, text classification, instance based learning, rule induction.

1 Introduction

Key-phrases, or search terms, can be defined as sequences of adjacent words within a text window (e.g. five successive words of the text, or a sentence) that form a meaningful, descriptive phrase related to the content of the text document. Such terms can be used as features for classifying text documents. Since not every key-phrase is appropriate for discriminating between documents, we have to examine and apply methods for selecting the appropriate ones. Hence, a prerequisite for such a classification method is the use and maintenance of a list of key-phrases, the so-called “Authority List” (Karanikolas and Skourlas). An interesting related problem is the reduction of the search space needed for the extraction of candidate key-phrases.

In classification learning, a learning scheme takes a set of classified examples from which it is expected to learn a way of classifying unseen