WIT Press


A Genetic Algorithm For Text Mining

Price

Free (open access)

Volume

35

Pages

10

Published

2005

Size

404 kb

Paper DOI

10.2495/DATA050141

Copyright

WIT Press

Author(s)

G. Desjardins, R. Godin & R. Proulx

Abstract

Text workers should find ways of representing huge amounts of text in a more compact form. Textual documents can be represented by concepts. One way to define the concepts is by terms: keywords extracted from the textual documents and cleaned by processes such as stopword removal and stemming. Using the frequencies of the terms, one can quantify the relations between documents or portions of text. These relations can serve many applications, such as information retrieval or automatic text classification. Another way to define the concepts is by sets of correlated terms rather than by raw terms. Correlated terms usually have a more specific meaning. Finding meaningful concepts within a huge collection of corpora in a reasonable timeframe is a difficult task. This paper describes a new text mining process to uncover interesting term correlations. The process uses a genetic algorithm to cope with the combinatorial explosion of the term sets. The genetic algorithm identifies combinations of terms that optimize an objective function, which is the cornerstone of the process. We have tested a function designed to optimize the discriminating power of the term sets. The genetic model was tested on a TREC sub-collection. The parameters were set to discover a thousand combinations of correlated terms. These sets of terms were then added to the basic index and applied to the information retrieval problem. The experiment revealed that the augmented index was unable to improve the effectiveness of the retrieval when compared with the vector space model.
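The idea of a genetic algorithm searching over term sets can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the objective function, the operators, or the parameters, so everything below is an assumption. In particular, `cooccurrence_score` is a hypothetical stand-in that simply counts the documents containing every term of a set (the paper's function targets discriminating power instead), and `evolve_term_sets`, its parameters, and the toy documents are illustrative names and values only.

```python
import random

def cooccurrence_score(term_set, docs):
    # Hypothetical objective (assumption): number of documents
    # that contain every term in the candidate set.
    return sum(1 for doc in docs if term_set <= doc)

def evolve_term_sets(docs, set_size=2, pop_size=20, generations=30,
                     mutation_rate=0.2, seed=0):
    # All parameter names and defaults are illustrative assumptions.
    rng = random.Random(seed)
    vocab = sorted(set().union(*docs))

    def random_set():
        return frozenset(rng.sample(vocab, set_size))

    def crossover(a, b):
        # Child draws its terms from the union of both parents.
        pool = sorted(a | b)
        return frozenset(rng.sample(pool, set_size))

    def mutate(term_set):
        # With probability mutation_rate, swap one term for a term
        # that is not already in the set.
        terms = sorted(term_set)
        candidates = [t for t in vocab if t not in term_set]
        if candidates and rng.random() < mutation_rate:
            terms[rng.randrange(set_size)] = rng.choice(candidates)
        return frozenset(terms)

    population = [random_set() for _ in range(pop_size)]
    for _ in range(generations):
        # Keep the better half unchanged (elitism), breed the rest.
        ranked = sorted(population,
                        key=lambda ts: cooccurrence_score(ts, docs),
                        reverse=True)
        parents = ranked[: pop_size // 2]
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents)))
            for _ in range(pop_size - len(parents))
        ]
        population = parents + children
    return max(population, key=lambda ts: cooccurrence_score(ts, docs))

# Toy corpus: each document is the set of its (already cleaned) terms.
docs = [
    {"genetic", "algorithm", "text"},
    {"genetic", "algorithm", "retrieval"},
    {"text", "mining"},
    {"genetic", "algorithm", "mining"},
]
best = evolve_term_sets(docs)
print(sorted(best), cooccurrence_score(best, docs))
```

The search evaluates only a small population per generation rather than every possible term combination, which is the property that makes this family of methods attractive when the number of candidate term sets explodes. A realistic objective would also need to penalize term sets that occur in too many documents to be discriminating.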

Keywords

genetic algorithm, co-occurrences, information retrieval, text mining.