WIT Press


Cluster Analysis In Document Networks

Price

Free (open access)

Volume

40

Pages

10

Page Range

95 - 104

Published

2008

Size

372 kb

Paper DOI

10.2495/DATA080101

Copyright

WIT Press

Author(s)

C. K. dos Santos, A. G. Evsukoff & B. S. L. P. de Lima

Abstract

Text or document clustering is a subfield of data clustering and has been one of the research hotspots in text mining. Recent studies have also shown that many real systems may be represented as complex networks with astonishingly similar properties. In this work, a document corpus is represented as a complex network of documents, in which the nodes represent the documents and the edges are weighted according to the similarities among documents. The detection of community structures in complex networks can then be seen as cluster analysis in document networks. Recently, community detection algorithms based on spectral properties of the underlying network have shown good results. The main motivation for applying those methods is that they have been shown to be robust to the high dimensionality of the feature space and to the inherent data sparsity resulting from text representation in the vector space model. The aim of this paper is to present the application of community structure algorithms to text mining. Experiments have been carried out on document clustering problems taken from the 20 Newsgroups document corpus to evaluate the performance of the proposed approach.

1 Introduction

Unstructured information in document databases presents intrinsic characteristics such that the classical data mining algorithms can be adapted to solve text mining tasks. One of the most common representations for text mining relies on the vector space information retrieval model of documents [1]. In such a model, the order of words is not considered and each document in a collection is represented by a vector whose components are related to relevant words appearing in

Keywords

text mining, document clustering, complex networks, community detection, spectral clustering.
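
The pipeline outlined in the abstract (bag-of-words vectors, a similarity-weighted document network, spectral community detection) can be illustrated with a minimal Python sketch. Everything below is an assumption made for illustration and is not taken from the paper: the four-document toy corpus, raw term frequencies as vector components, cosine similarity as the edge weight, and a single bisection by the leading eigenvector of the modularity matrix as the spectral method. The function names are hypothetical.

```python
import math
import random
from collections import Counter


def tf_vectors(docs):
    """Bag-of-words term-frequency vectors; word order is ignored."""
    return [Counter(doc.lower().split()) for doc in docs]


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_network(docs):
    """Weighted adjacency matrix: nodes are documents, edges are similarities."""
    vecs = tf_vectors(docs)
    n = len(vecs)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            W[i][j] = W[j][i] = cosine(vecs[i], vecs[j])
    return W


def spectral_bisection(W, iters=500, seed=0):
    """Split the network in two using the sign pattern of the leading
    eigenvector of the modularity matrix B = W - k k^T / (2m).

    Illustrative choice of spectral method, not necessarily the paper's.
    """
    n = len(W)
    k = [sum(row) for row in W]          # weighted degrees
    two_m = sum(k)                       # twice the total edge weight
    if two_m == 0:
        return [0] * n                   # no edges: nothing to split
    B = [[W[i][j] - k[i] * k[j] / two_m for j in range(n)] for i in range(n)]
    # Shift by a Gershgorin bound so that power iteration converges to the
    # algebraically largest eigenvalue of B rather than the largest magnitude.
    shift = max(sum(abs(x) for x in row) for row in B)
    rng = random.Random(seed)
    v = [rng.random() for _ in range(n)]
    for _ in range(iters):
        w = [sum(B[i][j] * v[j] for j in range(n)) + shift * v[i]
             for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    # If all entries share one sign, every document gets the same label,
    # i.e. this sketch finds no split.
    return [0 if x >= 0 else 1 for x in v]


# Hypothetical toy corpus: two documents about space, two about hockey.
docs = [
    "space shuttle orbit launch nasa",
    "nasa orbit satellite launch space",
    "hockey team goal ice season",
    "ice hockey season playoff goal",
]
W = similarity_network(docs)
labels = spectral_bisection(W)
print(labels)
```

On a real corpus, the term-frequency vectors would normally be replaced by a weighted and filtered representation, and the bisection would be applied repeatedly or generalised to obtain more than two clusters. The sketch keeps only the three steps named in the abstract.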