Since the early days of accounting and computing, such error recovery methods as row and column checksums have been inevitable. Generalized and developed further, these simple techniques provide a new perspective on the design of parallel algorithms for execution on hypercube multiprocessors. It turns out that, in many cases, it is possible to redesign parallel algorithms so as to ensure a low-cost online scheme for hardware error detection without any hardware modifications.
The authors investigate various tradeoffs involved in the design of efficient algorithm-based error detection (ABED) schemes: the choice of system-level data encoding, the choice of location and frequency of encoding, minimization of performance overhead caused by the error detection mechanism, and maximization of error coverage. The methodology for investigating such tradeoffs is illustrated by an example application (QR factorization) from numerical linear algebra. Experimental results for this application on a commercially available Intel iPSC-2/D4/MX hypercube multiprocessor reveal the most efficient error detection schemes and lead to some surprising conclusions; for example, increasing the number of checks does not necessarily improve the error coverage.
While the ABED design methodology proposed in the paper is general, the results concerning the most efficient ABED scheme are application-dependent. Such designs for different applications may give rise to library subroutines that run not only efficiently but reliably as well.