Computing Reviews, the leading online review service for computing literature.

Mining imperfect data : dealing with contamination and incomplete records
Pearson R., Society for Industrial and Applied Mathematics, Philadelphia, PA, 2005. 305 pp. Type: Book (9780898715828)

Date Reviewed: Oct 27 2005

Data mining uses automated procedures to extract useful information and insight from large datasets. Outliers and missing or incomplete records are common in such datasets, and they can invalidate the results obtained with standard analysis procedures; even a few anomalies can have a disproportionate influence on analytical results. The concepts central to this book are data pretreatment and analytical validation. Data pretreatment addresses the detection of outliers and the strategies for treating them. Analytical validation is concerned with the significance of the results obtained. The book discusses in detail the character, sources, and influence of these anomalies, and presents procedures that are known to be sensitive to anomalies, procedures for detecting them, and procedures for analyzing datasets that may contain them. Generalized sensitivity analysis (GSA) is employed effectively to compare procedures known to be sensitive to anomalies with those known to be anomaly-resistant.

The book is organized into eight chapters. The first chapter introduces data anomalies and GSA. The second chapter presents a detailed look at data imperfections, their sources, and their consequences. The third chapter is devoted to the problem of detecting univariate outliers. The fourth chapter considers three main pretreatment tasks: the elimination of noninformative variables, the treatment of missing data values, and the treatment of outliers. Chapter 5 studies the criteria of a good dataset, employing functional equations and inequalities. Chapter 6 is devoted to a complete discussion of GSA, using exchangeability and iterative procedures. It emphasizes that a good data analysis result should be insensitive to small changes in either the methods or the datasets on which the analysis is based. The seventh chapter describes four subset selection strategies: random subset selection, subset deletion, comparison-based strategies, and systematic approaches that use auxiliary knowledge of a dataset. These strategies play a vital role in the development of computational algorithms for large datasets, the design of moving-window characterizations for time series data, and the stratification of composite datasets. The final chapter presents recent research from a perspective that is useful for the problem of analyzing large datasets. It examines the role of prior knowledge, auxiliary data, and working assumptions in analyzing data. It also mentions some open problems in data analysis.

In addition to discussing published examples, this book provides extensive analysis of fresh datasets. The author has obviously put a lot of effort into this book, and the coverage of the material is broad and careful. This seminal work will be stimulating and valuable to researchers developing strategies and tactics for dealing with the critically important data imperfections that must be addressed before useful analysis results can be obtained from large datasets.
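
The remark that a few anomalies can have a disproportionate influence can be made concrete with a small Python sketch. It is not taken from the book or from this review; it is only an illustration under stated assumptions. The sample values, the threshold of 3, and the function names are hypothetical. The sketch compares a mean and standard deviation rule with a median and MAD rule (often called the Hampel identifier) on a sample with two gross errors.

```python
import statistics


def three_sigma_outliers(values, t=3.0):
    """Flag points more than t standard deviations from the mean."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [x for x in values if abs(x - mean) > t * sd]


def mad_outliers(values, t=3.0):
    """Flag points more than t scaled MADs from the median."""
    med = statistics.median(values)
    mad = statistics.median(abs(x - med) for x in values)
    # 1.4826 makes the MAD comparable to a standard deviation for Gaussian data.
    scale = 1.4826 * mad
    return [x for x in values if abs(x - med) > t * scale]


# Hypothetical sample: eight plausible readings plus two gross errors.
data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 55.0, 60.0]

print("mean:", statistics.fmean(data), "median:", statistics.median(data))
print("mean/std rule flags:", three_sigma_outliers(data))
print("median/MAD rule flags:", mad_outliers(data))
```

In this sample the two gross errors pull the mean far from the bulk of the data and inflate the standard deviation enough that the mean-based rule flags nothing, while the median-based rule flags both. This is one simple instance of comparing an anomaly-sensitive procedure with an anomaly-resistant one.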

Reviewer: P.R. Parthasarathy	Review #: CR131945 (0609-0904)

Data Mining (H.2.8 ... )

Information Filtering (H.3.3 ... )

Statistical Computing (G.3 ... )

Information Search And Retrieval (H.3.3 )

Probability And Statistics (G.3 )

Other reviews under "Data Mining":	Date

Feature selection and effective classifiers Deogun J. (ed), Choubey S., Raghavan V. (ed), Sever H. (ed) Journal of the American Society for Information Science 49(5): 423-434, 1998. Type: Article	May 1 1999

Rule induction with extension matrices Wu X. (ed) Journal of the American Society for Information Science 49(5): 435-454, 1998. Type: Article	Jul 1 1998

Predictive data mining Weiss S., Indurkhya N., Morgan Kaufmann Publishers Inc., San Francisco, CA, 1998. Type: Book (9781558604032)	Feb 1 1999

Reproduction in whole or in part without permission is prohibited. Copyright 1999-2024 ThinkLoud®