A Hybrid Method To Categorize HTML Documents

M. Khordad; M. Shamsfard; F. Kazemeyni

doi:10.2495/DATA050331

WIT Press

Price

Free (open access)

Transaction

WIT Transactions on Information and Communication Technologies

Published

2005

Size

425 kb

Abstract

In this paper we introduce a hybrid method for classifying HTML documents. The method uses the statistical, semantic and writing style features of text to categorize documents. Categorization can be done in both supervised and unsupervised modes, and categories may be predefined or created dynamically (clustering). The classification system exploits an ontology of interesting topics. The ontology, which contains categories and their hierarchical relations, can be updated automatically during the system’s lifetime: newly defined categories can be added, and existing categories can be changed according to the documents received. The statistical part of the method is based on the Rocchio algorithm, modified to handle dynamic category building, categorization with and without training data, and variable-length feature vectors. The semantic part exploits WordNet to substitute words with their corresponding concepts and performs some word sense disambiguation prior to clustering, so that documents are clustered according to their concepts instead of their words. The remaining part of the method considers writing style features of the text, such as bold or italic type, different (bigger) fonts, or the occurrence of words and concepts in special places of the document, such as the title, headers or hyperlinks. After a brief overview of existing methods of document classification, the paper discusses the proposed method and presents some experimental results of classifying documents. The experiments show that the hybrid method yields some improvement in performance (accuracy).

Keywords

data mining, text categorization, clustering, Rocchio algorithm, ontology.
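
The abstract combines a Rocchio-style statistical classifier with writing style cues taken from the HTML markup. The following is a minimal sketch of how those two parts could fit together, not the authors' implementation. The tag weights in `STYLE_WEIGHTS`, the `beta` and `gamma` values, and all function names are assumptions chosen for illustration; the paper's modifications to Rocchio (dynamic categories, unsupervised mode, variable-length vectors) and its WordNet concept substitution are not reproduced here.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical boosts for words in emphasized places (not taken from the paper).
STYLE_WEIGHTS = {"title": 3.0, "h1": 2.5, "h2": 2.0, "h3": 1.5,
                 "b": 1.5, "strong": 1.5, "i": 1.2, "em": 1.2, "a": 1.3}


class StyleWeightedTerms(HTMLParser):
    """Counts words, weighting each by the strongest enclosing style tag."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.weights = Counter()

    def handle_starttag(self, tag, attrs):
        if tag in STYLE_WEIGHTS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Close the most recently opened matching tag.
            idx = len(self.open_tags) - 1 - self.open_tags[::-1].index(tag)
            del self.open_tags[idx]

    def handle_data(self, data):
        boost = max((STYLE_WEIGHTS[t] for t in self.open_tags), default=1.0)
        for word in re.findall(r"[a-z]+", data.lower()):
            self.weights[word] += boost


def html_to_vector(html):
    """Turn an HTML string into a sparse {term: weight} vector."""
    parser = StyleWeightedTerms()
    parser.feed(html)
    parser.close()
    return dict(parser.weights)


def normalize(vec):
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def centroid(vectors):
    total = Counter()
    for v in vectors:
        for t, w in normalize(v).items():
            total[t] += w
    return {t: w / len(vectors) for t, w in total.items()}


def rocchio_prototypes(training, beta=16.0, gamma=4.0):
    """Textbook Rocchio: beta * positive centroid - gamma * negative centroid,
    with negative term weights dropped. beta and gamma are assumed defaults."""
    protos = {}
    for cat, pos in training.items():
        neg = [v for other, vs in training.items() if other != cat for v in vs]
        p = centroid(pos)
        n = centroid(neg) if neg else {}
        proto = {t: beta * p.get(t, 0.0) - gamma * n.get(t, 0.0)
                 for t in set(p) | set(n)}
        protos[cat] = {t: w for t, w in proto.items() if w > 0}
    return protos


def cosine(a, b):
    a, b = normalize(a), normalize(b)
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def classify(html, protos):
    """Assign the category whose prototype is most similar to the document."""
    vec = html_to_vector(html)
    return max(protos, key=lambda c: cosine(vec, protos[c]))


# Toy usage with made-up documents.
training = {
    "sports": [html_to_vector(
        "<title>Football results</title><p>The team won the match</p>")],
    "finance": [html_to_vector(
        "<title>Stock market</title><p>Shares and bonds fell</p>")],
}
protos = rocchio_prototypes(training)
print(classify("<h1>Match report</h1><p>A late goal for the team</p>", protos))
```

Sparse dictionaries are used for the vectors so that documents with different vocabularies can be compared without fixing a feature length in advance, which loosely mirrors the variable-length feature vectors mentioned in the abstract.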
