WIT Press


Scalability Issue In Mining Large Data Sets

Price

Free (open access)

Volume

33

Pages

9

Published

2004

Size

199 kb

Paper DOI

10.2495/DATA040181

Copyright

WIT Press

Author(s)

A. Mc Manus & M.-T. Kechadi

Abstract

The most distinctive characteristic of data mining is that it deals with large data sets. This requires the algorithms used in data mining to be highly scalable. However, most algorithms currently used in data mining do not scale well when applied to very large data sets, because they were initially developed and tested on smaller data sets. Data sets are now so large that these algorithms are no longer efficient enough for mining and analysis. In this paper, we address a data clustering problem and develop a scalable algorithm that provides very high precision and recall values. The algorithm is applied to a very large real-world data set, TDT2, and experimental results show that it performs well compared to traditional ones.

1 Introduction

Nowadays, people have a great demand for knowledge and information, while information overload is becoming a serious problem. News media and publishing industries therefore try to meet customers' needs by using electronic information management systems. Document clustering algorithms have been introduced to group similar documents together for easier searching and reading. They have been widely used in the news media and publishing industries, which has established their effectiveness over manual clustering. With labor costs reduced and time saved, document clustering algorithms provide conveniently clustered news for users.

Clustering is an important task that is performed as part of many text mining and information retrieval systems. Clustering can be used for efficiently finding the nearest neighbors of a document [1, 4], for improving the precision or recall in information retrieval systems, for aid in browsing a collection of documents [2],

Keywords