Nic Hollingum University of Sydney Fault Tolerance in Streaming Systems The stream programming model is a viable vehicle for providing high performance computing on distributed heterogeneous (cloud) systems. A stream program is given as a directed graph whose nodes represent independent computational units (actors) and who communicate via unidirectional FIFO buffers (channels). When stream programs are run on cloud machines they are susceptible to the kinds of frequent small-scale failures those machines experience. To solve this problem we use fault-tolerant protocols. We examine the provision of static and dynamic fault-tolerance mechanisms to stream programs running on the cloud. Both general purpose mechanisms, as well as ones specific to streaming computations, are assessed. Replication fault-tolerance relies on having redundant duplicate actors to perform computations. Checkpoint recovery relies on writing the program's state to disk and re-reading it in the event of a failure. Dynamic-switching fault- tolerance is an application of the Borealis fault-tolerance system modified for synchronous streaming programs. We develop an experimental framework to examine the trade-offs and overheads of these mechanisms on distributed systems. The framework provides a time simulator which acts as a stand in for arbitrary streaming computations. Simulated streaming computations are run with fault-tolerance whilst parts of the network are faulted, the throughput and success rates of the computations are used to reason about the effectiveness of the fault-tolerance mechanisms.