BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/Chicago
X-LIC-LOCATION:America/Chicago
BEGIN:DAYLIGHT
TZOFFSETFROM:-0600
TZOFFSETTO:-0500
TZNAME:CDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0500
TZOFFSETTO:-0600
TZNAME:CST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20181221T160725Z
LOCATION:D166
DTSTART;TZID=America/Chicago:20181111T103000
DTEND;TZID=America/Chicago:20181111T110000
UID:submissions.supercomputing.org_SC18_sess174_ws_exampi102@linklings.com
SUMMARY:Tree-Based Fault-Tolerant Collective Operations for MPI
DESCRIPTION:Workshop\nExascale, MPI, Networks, System Software, Workshop R
 eg Pass\n\nTree-Based Fault-Tolerant Collective Operations for MPI\n\nMarg
 olin\n\nWith the increase in size and complexity of high-performance compu
 ting systems, the probability of failures and the cost of recovery grow. P
 arallel applications running on these systems should be able to continue r
 unning in spite of node failures at arbitrary times. Collective operations
  are essential for many parallel MPI applications, and are often the first
  to detect such failures. This work presents tree-based fault-tolerant
  collective operations, which combine fault detection and recovery as an
  integral part of each operation. We do this by extending existing
  tree-based algorithms to allow a collective operation to succeed despite
  nodes failing before or during its run. This differs from other
  approaches, where recovery takes place after such operations have failed.
  The paper
  includes a comparison between the performance of the proposed algorithm a
 nd other approaches, as well as a simulator-based analysis of performance 
 at scale.
URL:https://sc18.supercomputing.org/presentation/?id=ws_exampi102&sess=ses
 s174
END:VEVENT
END:VCALENDAR