Identifying Character Non-independence in Phylogenetic Data Using Parallelized Rule Induction from Coverings
J. Leopold, A. Maglia, M. Thakur, B. Patel & F. Ercal
Undiscovered relationships in a data set may confound analyses, particularly those that assume data independence. Such problems occur when characters used for phylogenetic analyses are not independent of one another. Although a data mining technique known as rule induction from coverings has earlier been shown to be a promising approach for identifying such non-independence, its inherent computational complexity has limited its application for large phylogenetic data sets. Herein we present a parallelized implementation of the rule induction from coverings strategy which overcomes some of these limitations. We also discuss two heuristics that have been applied to the algorithm to further improve its efficiency.

Keywords: data mining, phylogenetics, parallelization.

1 Introduction

Some types of data analyses require and/or assume independence between items in the data set. If this assumption is violated, the results of the analysis may be incorrect. For example, such a problem can occur in phylogenetic analyses (i.e., the reconstruction of evolutionary interrelationships between biological species). A phylogenetic data set consists of rows representing different taxa and columns representing characters or attributes of the taxa. Phylogenetic inference methods such as maximum likelihood and parsimony are based on the assumption that each character in the data set serves as an independent hypothesis of evolution [1, 2]. If this assumption is not true, then correlated or non-independent characters can effectively be overweighted in analyses, and the resulting