A Statistical And Signal Processing Based System For Data Quality Management
Free (open access)
A. C. H. Dantas, J. M. de Seixas, F. B. Diniz & T. N. Ferreira
Research in data quality is becoming more important as databases in research centers and companies grow larger. Developing new mechanisms to discover knowledge in large databases is therefore as urgent as finding ways to measure and assure the quality of the data. This work describes the development of a Quality Control System intended to operate on large, dynamic databases. The techniques used range from standard statistical tests to signal processing methods such as filtering, wavelet transformation and neural processing. The data used in this work consist of a social database and five years of stock time series for companies in the SP500 index. Generalization tests show that feed-forward neural networks are a suitable tool for tracking pre-processed (filtered) financial series and can be used to define a corridor inside which new data may be considered acceptable. For these data, we were also able to develop a model for the distribution of the differences between consecutive days, which can be combined with neural processing for data acceptance. Tests performed on the social data allowed us to identify probability density functions for a set of variables, making it possible to create an objective test for data quality assessment.

Keywords: data quality, quality control system, statistical tests, signal processing techniques, multidimensional data, financial time series.

1 Introduction

Specialists say that this century is surely the century of data. Poor-quality customer data costs U.S. businesses US$611 billion a year. It is then easy to see that data is a critical asset in the information economy, and that the quality of a company’s data is a good predictor of its future success. This means that we are entering a data-driven era, and that our databases must now be treated as real com-
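The acceptance corridor mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' method: a trailing moving average stands in for both the filtering step and the neural tracker, and the corridor half-width is assumed to be a multiple k of the standard deviation of consecutive-day differences. The function names, the window length and the value of k are all hypothetical choices made for illustration.

```python
from statistics import mean, stdev


def moving_average(series, window):
    """Trailing moving average (assumed stand-in for the filtering step)."""
    return [
        mean(series[max(0, i - window + 1): i + 1])
        for i in range(len(series))
    ]


def corridor(series, window=5, k=3.0):
    """Return (lower, upper) bounds for the next value of the series.

    The centre is the last filtered value. The half-width is k times the
    standard deviation of the differences between consecutive observations.
    Both choices are assumptions, not taken from the paper.
    """
    if len(series) < 3:
        raise ValueError("need at least three observations")
    smoothed = moving_average(series, window)
    diffs = [b - a for a, b in zip(series, series[1:])]
    half_width = k * stdev(diffs)
    centre = smoothed[-1]
    return centre - half_width, centre + half_width


def is_acceptable(series, new_value, window=5, k=3.0):
    """Accept a new observation only if it falls inside the corridor."""
    lower, upper = corridor(series, window, k)
    return lower <= new_value <= upper


# Hypothetical daily closing prices, for illustration only.
prices = [100.0, 101.0, 100.5, 101.5, 102.0, 101.0, 102.5, 103.0]
print(is_acceptable(prices, 103.5))  # a small move from recent values
print(is_acceptable(prices, 130.0))  # a jump far outside recent variation
```

A wider window makes the corridor centre less sensitive to the most recent days, while a larger k trades fewer false rejections for a weaker check.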