Performance Of Information Retrieval Models Using Term Co-occurrences
G. Desjardins¹, R. Godin¹ & R. Proulx²
Many advanced models have been developed for information retrieval in recent years. These models are built on various artificial intelligence paradigms to improve retrieval precision. Most of them exploit some form of term co-occurrence to improve retrieval quality. In this paper, we compare the retrieval performance of five of these models: the Extended Boolean model, the Generalized Vector Space model, the Frequent Set model, the Rough Set model and a Genetic-Based model. These models are tested on three sub-collections from TREC (Text REtrieval Conference). We analyze the specifics of each model with respect to the form of co-occurrence it introduces, and report on the retrieval performance and the scalability of each model.

Keywords: text mining, information retrieval, co-occurrences, extended Boolean, generalized vector space, frequent set, rough set, genetic algorithm.

1 Introduction

Term co-occurrences carry substantial information about correlations among the documents of a collection. This information can be used to improve precision at the core of a retrieval engine. Many models try to capture this information and incorporate it into their output representation in order to increase the effectiveness of the retrieval engine. For this research, we have selected five retrieval models that exploit term co-occurrences: the Extended Boolean model, the Generalized Vector Space model, the Frequent Set model, the Rough Set model and a Genetic-Based model [1–5].

The next section reviews the principles of each model. Section 3 describes the