Nic Hollingum
University of Sydney

Fault Tolerance in Streaming Systems

The stream programming model is a viable vehicle for providing
high-performance computing on distributed heterogeneous (cloud) systems. A
stream program is given as a directed graph whose nodes represent independent
computational units (actors) that communicate via unidirectional FIFO
buffers (channels). When stream programs run on cloud machines, they are
susceptible to the frequent small-scale failures those machines experience.
We address this problem with fault-tolerance protocols.
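
As an illustration of the model, the following minimal sketch builds a
two-actor pipeline connected by FIFO channels. The names Channel, Actor and
run_pipeline are hypothetical and chosen only for this example; they are not
taken from the framework described here, and the single-threaded scheduler
is an assumption made for brevity.

```python
from collections import deque


class Channel:
    """Unidirectional FIFO buffer between two actors (illustrative only)."""

    def __init__(self):
        self._items = deque()

    def push(self, item):
        self._items.append(item)

    def pop(self):
        return self._items.popleft()

    def __len__(self):
        return len(self._items)


class Actor:
    """Independent computational unit that reads inputs and writes outputs."""

    def __init__(self, fn, inputs, outputs):
        self.fn = fn
        self.inputs = inputs
        self.outputs = outputs

    def can_fire(self):
        # An actor may fire only when every input channel holds a token.
        return all(len(c) > 0 for c in self.inputs)

    def fire(self):
        args = [c.pop() for c in self.inputs]
        result = self.fn(*args)
        for c in self.outputs:
            c.push(result)


def run_pipeline(values):
    """Run source -> double -> increment -> sink over the given values."""
    source, middle, sink = Channel(), Channel(), Channel()
    actors = [
        Actor(lambda x: 2 * x, [source], [middle]),
        Actor(lambda x: x + 1, [middle], [sink]),
    ]
    for v in values:
        source.push(v)

    # Assumed scheduler: keep firing until no actor can make progress.
    progress = True
    while progress:
        progress = False
        for actor in actors:
            if actor.can_fire():
                actor.fire()
                progress = True
    return [sink.pop() for _ in range(len(sink))]
```

Because each channel is FIFO and has exactly one reader, the outputs arrive
at the sink in the same order as the inputs were supplied.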

We examine the provision of static and dynamic fault-tolerance mechanisms to
stream programs running on the cloud.  We assess both general-purpose
mechanisms and ones specific to streaming computations.  Replication
fault-tolerance relies on redundant actors to perform computations.
Checkpoint recovery relies on writing the program's state to
disk and re-reading it in the event of a failure.  Dynamic-switching fault-
tolerance is an application of the Borealis fault-tolerance system modified
for synchronous streaming programs.
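
The checkpoint-recovery idea can be sketched as follows, assuming a single
actor whose state is a running total. The function names, the JSON file
format, the checkpoint interval and the fail_at parameter are all
assumptions made for illustration; they do not describe the mechanism as
implemented in this work.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to disk, replacing the old checkpoint in one step."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def load_checkpoint(path):
    """Re-read the last saved state, or start fresh if none exists."""
    if not os.path.exists(path):
        return {"next_index": 0, "total": 0}
    with open(path) as f:
        return json.load(f)


def run_with_checkpoints(items, path, interval=2, fail_at=None):
    """Sum items, checkpointing every `interval` items (hypothetical actor)."""
    state = load_checkpoint(path)
    for i in range(state["next_index"], len(items)):
        if fail_at is not None and i == fail_at:
            raise RuntimeError("simulated actor failure")
        state["total"] += items[i]
        state["next_index"] = i + 1
        if state["next_index"] % interval == 0:
            save_checkpoint(path, state)
    return state["total"]


def demo():
    items = [1, 2, 3, 4, 5]
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "actor.ckpt")
        try:
            run_with_checkpoints(items, path, fail_at=3)
        except RuntimeError:
            pass  # the failure loses any work done since the last checkpoint
        resumed_from = load_checkpoint(path)["next_index"]
        total = run_with_checkpoints(items, path)
    return resumed_from, total
```

In this sketch the work done between the last checkpoint and the failure is
repeated after recovery, which is the basic trade-off between checkpoint
frequency and recovery cost.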

We develop an experimental framework to examine the trade-offs and overheads
of these mechanisms on distributed systems.  The framework provides a time
simulator which acts as a stand-in for arbitrary streaming computations.
Simulated streaming computations are run with fault-tolerance while parts of
the network are faulted; the throughput and success rates of the computations
are used to reason about the effectiveness of the fault-tolerance mechanisms.