Data mining uses automated procedures to extract useful information and insight from large datasets. Outliers and missing or incomplete data are prevalent in such datasets, and they can invalidate the results obtained with standard analysis procedures. Even a few of these anomalies can have a disproportionate influence on analytical results.
The concepts central to this book are data pretreatment and analytical validation. Data pretreatment addresses the detection of outliers and the strategies for treating them. Analytical validation is concerned with the significance of the results obtained. The book presents detailed discussions of the character, sources, and influence of these anomalies, along with procedures that are known to be sensitive to anomalies, procedures for detecting anomalies, and procedures for analyzing datasets that may contain them. Generalized sensitivity analysis (GSA) is used effectively to compare procedures known to be sensitive to anomalies with those known to be anomaly-resistant.
The book is organized into eight chapters. The first chapter introduces data anomalies and GSA. The second chapter presents a detailed look at data imperfections, their sources, and their consequences. The third chapter is devoted to the problem of detecting univariate outliers. The fourth chapter considers three main pretreatment tasks: the elimination of noninformative variables, the treatment of missing data values, and the treatment of outliers. The fifth chapter studies the criteria for a good dataset, employing functional equations and inequalities. The sixth chapter is devoted to a complete discussion of GSA, using exchangeability and iterative procedures. It emphasizes that a good data analysis result should be insensitive to small changes in either the methods or the datasets on which the analysis is based. The seventh chapter describes four subset selection strategies: random subset selection, subset deletion, comparison-based strategies, and systematic approaches that use auxiliary knowledge of a dataset. These strategies play a vital role in the development of computational algorithms for large datasets, the design of moving-window characterizations for time series data, and the stratification of composite datasets. The final chapter presents recent research from a perspective that is useful for the problem of analyzing large datasets. It examines the role of prior knowledge, auxiliary data, and working assumptions in analyzing data. It also mentions some open problems in data analysis.
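The idea that a good result should be insensitive to small changes in the dataset can be illustrated with a small sketch. The Python code below is not the book's GSA procedure. It is a simplified illustration that assumes a hypothetical contaminated dataset and a hypothetical helper, subset_spread, which applies an estimator to many random subsets and reports how widely the results vary.

```python
import random
import statistics

def subset_spread(data, estimator, n_subsets=200, subset_size=50, seed=0):
    """Range of an estimator's values over random subsets of the data.

    Hypothetical helper for illustration only; not taken from the book.
    """
    rng = random.Random(seed)
    values = [estimator(rng.sample(data, subset_size)) for _ in range(n_subsets)]
    return max(values) - min(values)

# Hypothetical dataset: 195 well-behaved values plus 5 gross outliers.
rng = random.Random(42)
data = [rng.gauss(0.0, 1.0) for _ in range(195)] + [100.0] * 5

# Compare an outlier-sensitive estimator (mean) with a resistant one (median).
mean_spread = subset_spread(data, statistics.mean)
median_spread = subset_spread(data, statistics.median)
print(f"spread of mean over subsets:   {mean_spread:.2f}")
print(f"spread of median over subsets: {median_spread:.2f}")
```

Under these assumptions, the mean varies far more from subset to subset than the median does, because its value depends on how many of the outliers happen to fall into each subset.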
This book provides extensive analysis of fresh datasets, in addition to its discussion of published examples. The author has clearly put considerable effort into this book, and its coverage of the material is broad and careful. This seminal work will be stimulating and valuable to researchers developing strategies and tactics for dealing with the critically important data imperfections that must be addressed before useful analysis results can be obtained from large datasets.