Fault tolerant algorithms for heat transfer problems

Hatem Ltaief, Edgar Gabriel, Marc Garbey

Research output: Contribution to journal › Article › peer-review

27 Scopus citations

Abstract

With the emergence of massively parallel systems in high-performance computing, scientific simulations can run on thousands of processors, and the mean time between failures of large machines is decreasing from several weeks to a few minutes. The ability of hardware and software components to handle these singular events, called process failures, is therefore becoming increasingly important. For a scientific code to continue despite a process failure, the application must be able to retrieve the lost data items. The recovery procedure after failures might be fairly straightforward for elliptic and linear hyperbolic problems. For parabolic problems, however, reversing the time integration appears to be the most challenging part because it is an ill-posed problem. This paper focuses on new fault-tolerant numerical schemes for the time integration of parabolic problems. The new algorithm allows the application to recover from process failures and to reconstruct numerically the lost data of the failed process(es), avoiding the expensive roll-back operation required in most checkpoint/restart schemes. As a fault-tolerant communication library, we use the Fault Tolerant Message Passing Interface developed by the Innovative Computing Laboratory at the University of Tennessee. Experimental results show promising performance: the three-dimensional parabolic benchmark code is able to recover and keep running after failures, adding only a very small penalty to the overall execution time.
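The abstract's remark that going backward in time is ill-posed for parabolic problems can be illustrated with a small sketch. The code below is not the authors' scheme. It assumes a 1D periodic heat equation u_t = u_xx discretized with explicit Euler, and every name in it (`step`, `integrate`, `max_abs_diff`, the grid size, and the noise level `eps`) is hypothetical. It compares how a tiny perturbation behaves when the same stencil is marched forward and then backward in time.

```python
import math

def step(u, r):
    """One explicit Euler step of u_t = u_xx on a periodic grid, with r = dt / dx**2.
    A negative r marches the same stencil backward in time."""
    n = len(u)
    return [u[i] + r * (u[i - 1] - 2.0 * u[i] + u[(i + 1) % n]) for i in range(n)]

def integrate(u, r, steps):
    """Apply `steps` explicit Euler steps and return the final state."""
    for _ in range(steps):
        u = step(u, r)
    return u

def max_abs_diff(a, b):
    """Largest pointwise difference between two grid functions."""
    return max(abs(x - y) for x, y in zip(a, b))

n = 64        # grid points (assumed value)
r = 0.25      # satisfies the forward stability bound r <= 0.5
steps = 60    # number of time steps (assumed value)

# Smooth initial condition, advanced forward in time.
u0 = [math.sin(2.0 * math.pi * i / n) for i in range(n)]
u_later = integrate(u0, r, steps)

# Tiny sawtooth perturbation: the highest frequency the grid can represent.
eps = 1e-12
noisy = [v + eps * (-1) ** i for i, v in enumerate(u_later)]

# Forward in time the perturbation is damped; backward it is amplified at every step.
forward_gap = max_abs_diff(integrate(u_later, r, steps), integrate(noisy, r, steps))
backward_gap = max_abs_diff(integrate(u_later, -r, steps), integrate(noisy, -r, steps))

print(f"forward gap:  {forward_gap:.3e}")
print(f"backward gap: {backward_gap:.3e}")
```

In this sketch the backward march multiplies the sawtooth mode by 2 at every step, so a perturbation at round-off scale grows by many orders of magnitude, while the forward march suppresses it. This is one way to see why lost data cannot simply be recovered by integrating a parabolic problem backward from a later state.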

Original language: English (US)
Pages (from-to): 663-677
Number of pages: 15
Journal: Journal of Parallel and Distributed Computing
Volume: 68
Issue number: 5
State: Published - May 2008

Keywords

  • Parabolic problems
  • Parallel numerical algorithms
  • Process fault tolerance

ASJC Scopus subject areas

  • Software
  • Theoretical Computer Science
  • Hardware and Architecture
  • Computer Networks and Communications
  • Artificial Intelligence
