A Hybrid Method To Categorize HTML Documents
Free (open access)
M. Khordad, M. Shamsfard & F. Kazemeyni
In this paper we introduce a hybrid method for classifying HTML documents. The method uses statistical, semantic and writing-style features of the text to categorize documents. Categorization can be done in both supervised and unsupervised modes, and categories may be predefined or created dynamically (clustering). The classification system exploits an ontology of topics of interest. The ontology, which contains categories and their hierarchical relations, can be updated automatically during the system's lifetime: newly defined categories can be added to it, and existing categories can be changed according to the documents received. The statistical part of the method is based on the Rocchio algorithm, which has been modified to support dynamic category building, categorization with and without training data, and variable-length feature vectors. The semantic part exploits WordNet to substitute words with their corresponding concepts and performs some word sense disambiguation prior to clustering, so that documents are clustered according to their concepts rather than their words. The third part considers writing-style features of the text, such as bold or italic type, larger fonts, and the occurrence of words and concepts in special places of the document such as the title, headers or hyperlinks. After a brief overview of existing document classification methods, we discuss the proposed method and present experimental results of classifying documents. The experiments show that the hybrid method yields some improvement in classification accuracy.
Keywords: data mining, text categorization, clustering, Rocchio algorithm, ontology.
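To make the combination of statistical and writing-style features more concrete, the following is a minimal Python sketch of a standard Rocchio prototype classifier whose term weights are boosted by HTML markup. It is an illustration under stated assumptions, not the authors' implementation: the values in STYLE_WEIGHTS, the beta and gamma defaults, and all function names are hypothetical, and the paper's modifications (dynamic category building, ontology updates, WordNet concept substitution) are left out.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical boosts for words inside emphasized markup; not taken from the paper.
STYLE_WEIGHTS = {"title": 3.0, "h1": 2.5, "h2": 2.0, "b": 1.5, "strong": 1.5,
                 "i": 1.5, "em": 1.5, "a": 1.5}


class StyledTermCounter(HTMLParser):
    """Accumulates term weights, boosting words found inside styled tags."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.weights = Counter()

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Close the innermost matching tag and anything left open inside it.
            idx = len(self.open_tags) - 1 - self.open_tags[::-1].index(tag)
            del self.open_tags[idx:]

    def handle_data(self, data):
        # A word takes the largest boost among its enclosing tags (1.0 if none).
        boost = max((STYLE_WEIGHTS.get(t, 1.0) for t in self.open_tags), default=1.0)
        for word in re.findall(r"[a-z]+", data.lower()):
            self.weights[word] += boost


def styled_term_weights(html):
    parser = StyledTermCounter()
    parser.feed(html)
    parser.close()
    return parser.weights


def vectorize(html):
    """Style-weighted term vector scaled to unit length."""
    weights = styled_term_weights(html)
    norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
    return {t: w / norm for t, w in weights.items()}


def rocchio_prototypes(labelled, beta=16.0, gamma=4.0):
    """Standard Rocchio: beta * mean(positives) - gamma * mean(negatives)."""
    prototypes = {}
    for cat in {c for _, c in labelled}:
        pos = [v for v, c in labelled if c == cat]
        neg = [v for v, c in labelled if c != cat]
        proto = Counter()
        for vec in pos:
            for term, w in vec.items():
                proto[term] += beta * w / len(pos)
        for vec in neg:
            for term, w in vec.items():
                proto[term] -= gamma * w / len(neg)
        # Negative components are dropped, as is common for Rocchio prototypes.
        prototypes[cat] = {t: w for t, w in proto.items() if w > 0}
    return prototypes


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def classify(html, prototypes):
    """Return the category whose prototype is most similar to the document."""
    vec = vectorize(html)
    return max(prototypes, key=lambda cat: cosine(vec, prototypes[cat]))


# Toy training set, invented purely for illustration.
training = [
    (vectorize("<html><title>Football results</title>"
               "<body>The team won the <b>match</b></body></html>"), "sports"),
    (vectorize("<title>Stock market</title>"
               "<p>Shares and bonds rose in the market</p>"), "finance"),
]
prototypes = rocchio_prototypes(training)
label = classify("<h1>Match report</h1><p>The team played well</p>", prototypes)
```

In this sketch a word in the title counts three times as much as the same word in body text, so markup influences the prototype directly. Replacing words with concept identifiers before counting would be the natural place to add the semantic step described in the abstract.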