A Genetic Algorithm For Text Mining
Free (open access)
G. Desjardins, R. Godin & R. Proulx
People who work with text need ways of representing huge amounts of text in a more compact form. Textual documents can be represented by concepts. One way to define the concepts is by terms: keywords extracted from the documents and cleaned by processes such as stopword removal and stemming. Using the frequencies of the terms, one can quantify the relations between documents or portions of text. These relations can serve many applications, such as information retrieval or automatic text classification. Another way to define the concepts is by sets of correlated terms rather than by raw terms. Correlated terms usually have a more specific meaning. Finding meaningful concepts within a huge collection of corpora in a reasonable timeframe is a difficult task. This paper describes a new text mining process to uncover interesting term correlations. The process uses a genetic algorithm to cope with the combinatorial explosion of the term sets. The genetic algorithm identifies combinations of terms that optimize an objective function, which is the cornerstone of the process. We have tested a function designed to optimize the discriminating power of the term sets. The genetic model was tested on a TREC sub-collection. The parameters were set to discover a thousand combinations of correlated terms. These sets of terms were then added to the basic index and applied to the information retrieval problem. The experiment revealed that the augmented index was unable to improve the effectiveness of the retrieval when compared with the vector space model.
Keywords: genetic algorithm, co-occurrences, information retrieval, text mining.
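The search over term sets described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the objective function, the encoding, or the genetic operators, so everything here is an assumption made for illustration. In particular, `fitness` uses a hypothetical IDF-like score (the number of documents containing the whole set, weighted by the log of its inverse document frequency) as a stand-in for "discriminating power", and the names `crossover`, `mutate`, and `evolve`, the tournament selection, the elitism, and the toy `docs` collection are all illustrative choices.

```python
import math
import random


def fitness(term_set, docs):
    """Hypothetical objective (an assumption, not the paper's function).

    df is the number of documents containing every term of the set.
    The score df * log(N / df) is zero when the terms never co-occur
    or co-occur everywhere, and positive in between.
    """
    df = sum(1 for doc in docs if term_set <= doc)
    if df == 0:
        return 0.0
    return df * math.log(len(docs) / df)


def crossover(a, b, size, rng):
    """Child term set: `size` terms sampled from the union of both parents."""
    pool = sorted(a | b)  # sorted for reproducible sampling
    return frozenset(rng.sample(pool, size))


def mutate(term_set, vocab, rate, rng):
    """With probability `rate`, swap one term for a vocabulary term not in the set."""
    if rng.random() >= rate:
        return term_set
    candidates = [t for t in vocab if t not in term_set]
    if not candidates:
        return term_set
    terms = sorted(term_set)
    terms[rng.randrange(len(terms))] = rng.choice(candidates)
    return frozenset(terms)


def evolve(docs, size=2, pop_size=20, generations=30, rate=0.2, seed=0):
    """Evolve fixed-size term sets; return the distinct survivors, best first."""
    rng = random.Random(seed)
    vocab = sorted(set().union(*docs))

    def score(term_set):
        return fitness(term_set, docs)

    def pick(population):
        # Tournament selection: best of three random individuals.
        return max(rng.sample(population, 3), key=score)

    population = [frozenset(rng.sample(vocab, size)) for _ in range(pop_size)]
    for _ in range(generations):
        elite = max(population, key=score)  # elitism: keep the current best
        children = [
            mutate(crossover(pick(population), pick(population), size, rng),
                   vocab, rate, rng)
            for _ in range(pop_size - 1)
        ]
        population = [elite] + children
    return sorted(set(population), key=score, reverse=True)


# Toy collection: each document is the set of its (already cleaned) terms.
docs = [
    {"genetic", "algorithm", "text"},
    {"genetic", "algorithm", "retrieval"},
    {"text", "mining", "retrieval"},
    {"text", "mining", "index"},
    {"index", "retrieval", "text"},
    {"vector", "space", "model"},
]

ranked = evolve(docs)
for term_set in ranked[:3]:
    print(sorted(term_set), round(fitness(term_set, docs), 3))
```

Because each individual is a small set of terms rather than a full enumeration of combinations, the population explores only a sample of the combinatorial space, which is the motivation the abstract gives for using a genetic algorithm. In the process described above, the highest-scoring sets would then be added to the index as additional entries.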