Fault-Tolerant Dynamic Task Graph Scheduling

Mehmet Can Kurt, Sriram Krishnamoorthy, Kunal Agrawal, Gagan Agrawal

Research output: Contribution to journalConference articlepeer-review

24 Scopus citations

Abstract

In this paper, we present an approach to fault tolerant execution of dynamic task graphs scheduled using work stealing. In particular, we focus on selective and localized recovery of tasks in the presence of soft faults. From users, we elicit the basic task graph structure in terms of successor and predecessor relationships. The work-stealing-based algorithm to schedule such a task graph is augmented to enable recovery when the data and metadata associated with a task get corrupted. We use this redundancy, and knowledge of the task graph structure, to selectively recover from faults with low space and time overheads. We show that the fault tolerant design retains the essential properties of the underlying work stealing-based task scheduling algorithm, and that the fault tolerant execution is asymptotically optimal when task re-execution is taken into account. Experimental evaluation demonstrates the low cost of recovery under various fault scenarios.

Original languageEnglish
Article number7013046
Pages (from-to)719-730
Number of pages12
JournalInternational Conference for High Performance Computing, Networking, Storage and Analysis, SC
Volume2015-January
Issue numberJanuary
DOIs
StatePublished - Jan 16 2014
EventInternational Conference for High Performance Computing, Networking, Storage and Analysis, SC 2014 - New Orleans, United States
Duration: Nov 16 2014Nov 21 2014

Keywords

  • cilk
  • dag
  • fault tolerance
  • task graphs
  • work stealing

Fingerprint

Dive into the research topics of 'Fault-Tolerant Dynamic Task Graph Scheduling'. Together they form a unique fingerprint.

Cite this