Cluster Analysis In Document Networks
95 - 104
C. K. dos Santos, A. G. Evsukoff & B. S. L. P. de Lima
Text or document clustering is a subfield of data clustering and has been one of the research hotspots in text mining. Recent studies have also shown that many real systems may be represented as complex networks with astonishingly similar properties. In this work a document corpus is represented as a complex network of documents, in which the nodes represent the documents and the edges are weighted according to the similarities among documents. The detection of community structures in complex networks can then be seen as cluster analysis in document networks. Recently, community detection algorithms based on spectral properties of the underlying network have shown good results. The main motivation for applying those methods is that they have been shown to be robust to the high dimensionality of the feature space and to the inherent data sparsity resulting from text representation in the vector space model. The aim of this paper is to present the application of community structure detection algorithms to text mining. Experiments have been carried out on document clustering problems taken from the 20 Newsgroups document corpus to evaluate the performance of the proposed approach.

Keywords: text mining, document clustering, complex networks, community detection, spectral clustering.

1 Introduction

Unstructured information in document databases presents intrinsic characteristics such that the classical data mining algorithms can be adapted to solve text mining tasks. One of the most common representations for text mining relies on the vector space information retrieval model of documents. In such a model the order of words is not considered and each document in a collection is represented by a vector whose components are related to relevant words appearing in
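
The document network described in the abstract (documents as nodes, edges weighted by document similarity) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The raw term counts, the whitespace tokenizer, the cosine similarity measure, and the names `bag_of_words`, `cosine_similarity` and `document_network` are all assumptions made for illustration, since the text above does not say which term weighting or similarity measure is used.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Term-count vector of a document; word order is discarded.
    Assumption: lowercase whitespace tokenization, raw counts."""
    return Counter(text.lower().split())


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def document_network(docs):
    """Symmetric weighted adjacency matrix: entry [i][j] is the
    similarity between documents i and j (no self-loops)."""
    vectors = [bag_of_words(d) for d in docs]
    n = len(vectors)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            w = cosine_similarity(vectors[i], vectors[j])
            weights[i][j] = weights[j][i] = w
    return weights


# Toy corpus (hypothetical): two documents on networks, one on sports.
docs = [
    "graph clustering finds communities in networks",
    "community detection in complex networks",
    "hockey team wins the final game",
]
W = document_network(docs)
```

In this toy corpus the first two documents share terms and are joined by a positive-weight edge, while the third shares no terms with the first and is disconnected from it. A community detection algorithm would take a matrix such as `W` as its input.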