Computing Reviews, the leading online review service for computing literature.

Tradeoffs in the Design of Efficient Algorithm-Based Error Detection Schemes for Hypercube Multiprocessors
Balasubramanian V., Banerjee P. IEEE Transactions on Software Engineering 16(2): 183-196, 1990. Type: Article

Date Reviewed: Jul 1 1991

Since the early days of accounting and computing, such error recovery methods as row and column checksums have been indispensable. Generalized and developed further, these simple techniques provide a new perspective on the design of parallel algorithms for execution on hypercube multiprocessors. It turns out that, in many cases, parallel algorithms can be redesigned to provide low-cost online detection of hardware errors without any hardware modifications. The authors investigate various tradeoffs involved in the design of efficient algorithm-based error detection (ABED) schemes: the choice of system-level data encoding, the choice of location and frequency of encoding, minimization of the performance overhead caused by the error detection mechanism, and maximization of error coverage. The methodology for investigating such tradeoffs is illustrated by an example application (QR factorization) from numerical linear algebra. Experimental results for this application on a commercially available Intel iPSC-2/D4/MX hypercube multiprocessor reveal the most efficient error detection schemes and lead to some surprising conclusions; for example, increasing the number of checks does not necessarily improve the error coverage. While the ABED design methodology proposed in the paper is general, the results concerning the most efficient ABED scheme are application-dependent. Such designs for different applications may give rise to library subroutines that run not only efficiently but also reliably.
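To make the row and column checksum idea concrete, here is a minimal sketch of the classic checksum encoding applied to matrix multiplication. It is an illustration only and is not taken from the paper or the review: the paper's example application is QR factorization, and matrix multiplication is used here as a simpler stand-in. The function names (`matmul`, `add_column_checksums`, `add_row_checksums`, `find_inconsistencies`) and the tolerance `tol` are hypothetical choices made for this sketch.

```python
def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def add_column_checksums(a):
    """Append one extra row holding the sum of each column."""
    return a + [[sum(col) for col in zip(*a)]]


def add_row_checksums(b):
    """Append to each row one extra entry holding the sum of that row."""
    return [row + [sum(row)] for row in b]


def find_inconsistencies(c_full, tol=1e-9):
    """Return (bad_rows, bad_cols): data rows and columns whose sums
    disagree with the stored checksums by more than tol (an assumed
    tolerance; real floating-point codes need a problem-specific one)."""
    n, m = len(c_full) - 1, len(c_full[0]) - 1
    bad_rows = [i for i in range(n)
                if abs(sum(c_full[i][:m]) - c_full[i][m]) > tol]
    bad_cols = [j for j in range(m)
                if abs(sum(c_full[i][j] for i in range(n)) - c_full[n][j]) > tol]
    return bad_rows, bad_cols


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]

# Multiplying the encoded operands yields a product that carries its own
# row and column checksums, so no separate checksum pass is needed.
c_full = matmul(add_column_checksums(a), add_row_checksums(b))

# Simulate a hardware fault by corrupting one element of the result.
faulty = [row[:] for row in c_full]
faulty[0][1] += 1

print(find_inconsistencies(c_full))
print(find_inconsistencies(faulty))
```

In this sketch a single corrupted element shows up as one inconsistent row and one inconsistent column, which locates it. Where and how often such checks are placed in a parallel algorithm is the kind of tradeoff the review describes.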

Reviewer: J. Tepandi	Review #: CR123789

Error Handling And Recovery (D.2.5 ... )

Parallel Algorithms (G.1.0 ... )

Parallel Processors (C.1.2 ... )

General (G.1.0 )

Multiple Data Stream Architectures (Multiprocessors) (C.1.2 )

Numerical Linear Algebra (G.1.3 )
