A Bayesian Approach For Supervised Discretization
Abstract

In supervised machine learning, some algorithms are restricted to discrete data and thus need to discretize continuous attributes. In this paper, we present a new discretization method called MODL, based on a Bayesian approach. The MODL method relies on a model space of discretizations and on a prior distribution defined on this model space. This allows the setting up of an evaluation criterion for discretizations, which is minimal for the most probable discretization given the data, i.e. the Bayes-optimal discretization. We compare this approach with the MDL approach and with statistical approaches used in other discretization methods, from both a theoretical and an experimental point of view. Extensive experiments show that the MODL method builds high-quality discretizations.

Keywords: supervised learning, data preparation, discretization, Bayesianism.

1 Introduction

While real data often comes in mixed format, discrete and continuous, many induction algorithms rely on discrete attributes and need to discretize continuous attributes, i.e. to slice their domain into a finite number of intervals. More generally, using discretization to preprocess continuous attributes often provides many advantages. Discrete values are generally more understandable than continuous values, both for users and for experts. Many classification algorithms are more accurate and run faster when discretization is used. Discretization of continuous attributes is a problem that has been studied extensively in the past [6, 7, 9, 12, 16]. For example, decision tree algorithms exploit a discretization method to handle continuous attributes. C4.5 uses the information gain based on Shannon entropy. CART applies the Gini
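As an illustration of the entropy-based split evaluation mentioned above, the following sketch computes the information gain of a candidate cut point on a continuous attribute, in the style of C4.5. This is not the MODL criterion of this paper; the function names and the toy data are our own for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels, threshold):
    """Gain of splitting a continuous attribute at `threshold`:
    parent entropy minus the size-weighted entropy of the two intervals."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted

# Toy data: the cut at 5.0 separates the classes perfectly,
# so the gain equals the full parent entropy (1 bit here).
x = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
y = ["a", "a", "a", "b", "b", "b"]
print(information_gain(x, y, 5.0))  # → 1.0
```

A discretization method based on this criterion evaluates every boundary between two sorted values and keeps the cut with the highest gain, possibly recursing on each interval.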