Computing Reviews
Tradeoffs in the Design of Efficient Algorithm-Based Error Detection Schemes for Hypercube Multiprocessors
Balasubramanian V., Banerjee P. IEEE Transactions on Software Engineering 16(2): 183-196, 1990. Type: Article
Date Reviewed: Jul 1 1991

Since the early days of accounting and computing, such error recovery methods as row and column checksums have been indispensable. Generalized and developed further, these simple techniques provide a new perspective on the design of parallel algorithms for execution on hypercube multiprocessors. It turns out that, in many cases, parallel algorithms can be redesigned to provide a low-cost online scheme for detecting hardware errors without any hardware modifications.
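To make the row and column checksum idea concrete, here is a minimal sketch in Python. It is an illustration only, not code from the paper: the function names `add_checksums` and `locate_error` are hypothetical, and the sketch assumes integer data and at most one corrupted data element.

```python
def add_checksums(matrix):
    """Return a copy of `matrix` with a checksum column and a checksum row appended."""
    # Append the row sum to every row.
    rows = [list(row) + [sum(row)] for row in matrix]
    # Append a final row holding the sum of every column (checksum column included).
    rows.append([sum(col) for col in zip(*rows)])
    return rows


def locate_error(encoded):
    """Return (row, col) of a single corrupted data element, or None if all checks pass."""
    data_rows = encoded[:-1]
    n_cols = len(encoded[0]) - 1
    # A row is suspect when its data no longer adds up to its stored checksum.
    bad_rows = [i for i, row in enumerate(data_rows) if sum(row[:-1]) != row[-1]]
    # A column is suspect when its data no longer adds up to the stored column checksum.
    bad_cols = [j for j in range(n_cols)
                if sum(row[j] for row in data_rows) != encoded[-1][j]]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        # The failing row and failing column intersect at the corrupted element.
        return (bad_rows[0], bad_cols[0])
    raise ValueError("checksum mismatch that is not a single data-element error")
```

For example, encoding `[[1, 2], [3, 4]]` and then overwriting the element at row 1, column 0 makes exactly one row check and one column check fail, so `locate_error` returns `(1, 0)`. A corrupted checksum entry, or several simultaneous errors, falls outside the assumption and raises `ValueError` in this sketch.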

The authors investigate various tradeoffs involved in the design of efficient algorithm-based error detection (ABED) schemes: the choice of system-level data encoding, the choice of location and frequency of encoding, minimization of performance overhead caused by the error detection mechanism, and maximization of error coverage. The methodology for investigating such tradeoffs is illustrated by an example application (QR factorization) from numerical linear algebra. Experimental results for this application on a commercially available Intel iPSC-2/D4/MX hypercube multiprocessor reveal the most efficient error detection schemes and lead to some surprising conclusions; for example, increasing the number of checks does not necessarily improve the error coverage.
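The tradeoff between check frequency and overhead can be sketched on a toy QR-style reduction. The Python sketch below is an assumption-laden illustration, not the authors' scheme and not their hypercube implementation: it runs sequentially, uses Givens rotations, and encodes the matrix with one checksum column of row sums. Because each rotation is a linear combination of rows, every row should keep satisfying "sum of data entries equals checksum entry". The names `qr_with_checksum`, `check_every`, `tol`, and `fault` are hypothetical; `fault` exists only to inject an error for demonstration.

```python
import math


def qr_with_checksum(a, check_every=1, tol=1e-9, fault=None):
    """Reduce `a` to upper-triangular form with Givens rotations while
    carrying a checksum column of row sums. Returns (r, checks_run).
    `fault`, if given, is (column_step, row, col, delta): a hypothetical
    injected error used only to exercise the check."""
    m, n = len(a), len(a[0])
    # Encode: append the sum of each row as an extra column.
    w = [list(row) + [sum(row)] for row in a]
    checks_run = 0
    for k in range(min(m - 1, n)):
        for i in range(k + 1, m):
            r = math.hypot(w[k][k], w[i][k])
            if r == 0.0:
                continue  # nothing to eliminate in this position
            c, s = w[k][k] / r, w[i][k] / r
            # Rotate rows k and i, checksum column included.
            for j in range(n + 1):
                top, bottom = w[k][j], w[i][j]
                w[k][j] = c * top + s * bottom
                w[i][j] = -s * top + c * bottom
        if fault is not None and fault[0] == k:
            w[fault[1]][fault[2]] += fault[3]  # simulated hardware error
        # Check only every `check_every` column steps to limit overhead.
        if (k + 1) % check_every == 0:
            checks_run += 1
            for i, row in enumerate(w):
                if abs(sum(row[:n]) - row[n]) > tol * (1.0 + abs(row[n])):
                    raise ArithmeticError(
                        f"checksum mismatch in row {i} after step {k}")
    return [row[:n] for row in w], checks_run
```

With a 3-by-2 input there are two column steps, so `check_every=1` runs two checks and catches an injected fault, while `check_every=3` runs no checks at all and lets the same fault through. The floating-point tolerance `tol` is itself a design choice: too tight and rounding noise raises false alarms, too loose and small errors escape.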

While the ABED design methodology proposed in the paper is general, the results concerning the most efficient ABED scheme are application-dependent. Such designs for different applications may give rise to library subroutines that run not only efficiently but reliably as well.

Reviewer: J. Tepandi
Review #: CR123789
 
Error Handling And Recovery (D.2.5 ...)
Parallel Algorithms (G.1.0 ...)
Parallel Processors (C.1.2 ...)
General (G.1.0)
Multiple Data Stream Architectures (Multiprocessors) (C.1.2)
Numerical Linear Algebra (G.1.3)
 
Other reviews under "Error Handling And Recovery": Date
(N,K) concept fault tolerance
Krol T. IEEE Transactions on Computers 35(4): 339-350, 1986. Type: Article
Nov 1 1987
Error recovery in asynchronous systems
Campbell R., Randell B. IEEE Transactions on Software Engineering SE-12(9): 811-826, 1986. Type: Article
Jul 1 1987
Static analysis to support the evolution of exception structure in object-oriented systems
Robillard M., Murphy G. ACM Transactions on Software Engineering and Methodology 12(2): 191-221, 2003. Type: Article
Nov 25 2003
Reproduction in whole or in part without permission is prohibited.   Copyright 1999-2024 ThinkLoud®