Computing Reviews, the leading online review service for computing literature.


ASSURE: automatic software self-healing using rescue points
Sidiroglou S., Laadan O., Perez C., Viennot N., Nieh J., Keromytis A. ACM SIGPLAN Notices 44(3): 37-48, 2009. Type: Article

Date Reviewed: May 19 2009

Software crashes due to programming bugs are a major problem for systems that need to be available 24 hours a day, seven days a week. Many authors are researching techniques that allow computer applications to detect their own failures and recover from them automatically. “Recovering” means that the application rolls back to a safe state and returns a reasonable error code to the client; recovering in this way keeps the system available while a programmer creates an appropriate patch.

Sidiroglou et al. have devised a tool called ASSURE that allows server applications running on Linux 2.6 systems to recover from their failures. ASSURE is innovative insofar as it can deal with applications that are available in binary form only, run on multiple threads and processes, handle polymorphic or encrypted input, or have deterministic bugs (not necessarily memory leaks); furthermore, it does not require any modifications to the underlying operating system. The authors have tested ASSURE on a number of actual bugs in well-known server applications, such as Apache, Squid, and MySQL. Their results indicate that the tool is very efficient.

ASSURE builds on so-called rescue points, which are functions that return integer error codes or null pointers when a known error is detected. Rescue points and error codes are identified automatically by running the application in a testing environment where it is fed invalid inputs. When a failure is detected for the first time, ASSURE analyzes the problem in a sandbox. First, it determines which function failed. Then, it identifies the closest rescue point to which the application can be rolled back so that it keeps working. A piece of code is then inserted at this rescue point that returns an error code, thus preventing the application from continuing and failing. Finally, the application is restarted.

The paper is not at all difficult to read, although it does not provide enough details for other scientists to repeat the work. The authors’ writing style is didactic and they get to the point very straightforwardly. They also make it very clear what their original contributions are. I recommend this paper and ASSURE for system administrators who need to keep their servers highly available.
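
The rescue point idea described above can be illustrated with a small, self-contained sketch. It is not ASSURE's implementation: per the description above, ASSURE works on binaries, whereas this toy only simulates the behavior in Python by copying an in-memory dictionary. The names `FakeServer`, `with_rescue_point`, and `ERROR_CODE` are hypothetical and chosen for illustration only.

```python
import copy

# Hypothetical error value that the rescue point is known to return on bad input.
ERROR_CODE = -1


class FakeServer:
    """Toy stand-in for a server application (an assumption, not from the paper)."""

    def __init__(self):
        self.state = {"requests_served": 0, "buffer": []}

    def parse_request(self, data):
        # Function containing the simulated bug: it mutates state, then fails.
        self.state["buffer"].append(data)
        if len(data) > 8:
            raise MemoryError("simulated buffer overflow")
        return len(data)

    def handle_request(self, data):
        # Candidate rescue point: already returns ERROR_CODE for a known error.
        if not data:
            return ERROR_CODE
        n = self.parse_request(data)
        self.state["requests_served"] += 1
        self.state["buffer"].clear()
        return n


def with_rescue_point(server, func):
    """Wrap func so that a failure below it rolls state back and returns ERROR_CODE."""

    def wrapper(*args):
        checkpoint = copy.deepcopy(server.state)  # snapshot taken at the rescue point
        try:
            return func(*args)
        except Exception:
            server.state = checkpoint  # roll back side effects of the failed call
            return ERROR_CODE          # force the error return the caller can handle

    return wrapper


server = FakeServer()
server.handle_request = with_rescue_point(server, server.handle_request)

ok = server.handle_request("GET /")       # well-formed request, served normally
failed = server.handle_request("A" * 64)  # triggers the simulated bug
print(ok, failed, server.state)
```

In this sketch the failing request leaves no trace in `server.state`, and the server goes on to serve later requests, which is the availability property the review attributes to rescue points.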

Reviewer: Rafael Corchuelo	Review #: CR136852 (1001-0055)

Error Handling And Recovery (D.2.5 ... )


