The Importance Of Adequate Data Pre-processing In Early Diagnosis: Classification Of Arrhythmias, A Case Study
Free (open access)
233 - 242
A. Rabasa, A. F. Compañ, J. J. Rodríguez-Sala & L. Noguera
Data management can become very complex in the context of forecasting medical problems. Data collection, storage and analysis require the highest possible level of accuracy. Data mining techniques are being applied successfully to the early diagnosis of diseases and dysfunctions with increasing frequency in the scientific community. However, as with any analytical method, the precision and reliability of the models these techniques produce depend entirely on the input data. If the quality of these data is insufficient, the final accuracy can be reduced to the point that the system is of little practical use. This paper describes the main data quality problems and how they can be properly solved at the pre-processing stage. The issues addressed include, for example: the detection of missing values (due to incomplete records), the identification of outliers (often due to errors in measuring or recording devices), and the discretization of numerical variables (where the context allows or suggests treating numeric values as nominal segments). Using a public arrhythmia database from the UCI Repository, this study uses free data mining software to parameterize and run forecasting models and to execute several computational experiments that show how the accuracy of predictions varies according to how the critical pre-processing stage is implemented. The paper concludes by providing a generic procedure for applying data pre-processing methodically, according to the problems presented by the input data, and by describing how it should be integrated into a global data management process.
Keywords: Data Mining, pre-processing, forecasting, medicine, arrhythmia.
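The three pre-processing issues named in the abstract (missing values, outliers, discretization) can be illustrated with a minimal Python sketch. This is not the authors' procedure or software: the function names, the `"?"` missing-value token, the 1.5 × IQR outlier rule and the equal-width binning are all assumptions chosen for illustration only.

```python
import statistics


def find_missing(records, missing_token="?"):
    """Return indices of records containing at least one missing value.

    Assumption: a missing value is either None or the given token.
    """
    return [
        i for i, row in enumerate(records)
        if any(v is None or v == missing_token for v in row)
    ]


def iqr_outliers(values, k=1.5):
    """Return indices of values outside [Q1 - k*IQR, Q3 + k*IQR].

    Assumption: the IQR rule is used here as one common outlier
    criterion; the paper may use a different one.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]


def equal_width_bins(values, n_bins=3):
    """Map each numeric value to a bin index in 0..n_bins-1.

    Assumption: equal-width binning stands in for whatever
    discretization the context suggests.
    """
    lo, hi = min(values), max(values)
    # Fall back to a width of 1.0 when all values are identical.
    width = (hi - lo) / n_bins or 1.0
    return [min(int((v - lo) / width), n_bins - 1) for v in values]
```

In a workflow of this kind, flagged records would typically be reviewed, imputed or dropped before any forecasting model is trained, and the effect of each choice on prediction accuracy would then be compared experimentally.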