Abstract

In this paper we describe a new framework for creating distributed programmes that are resilient to cluster node failures. Our main goal is to create a simple and reliable model that ensures continuous execution of parallel programmes without checkpoints, memory dumps or other I/O-intensive activities. To achieve this we introduce a multi-layered system architecture, each layer of which consists of unified entities organised into hierarchies, and then show how this system handles different node failure scenarios. We benchmark our system using a real-world HPC application on both physical and virtual clusters. The results of the experiments show that our approach has low overhead and scales to a large number of cluster nodes.

Original language: English
Title of host publication: Proceedings - 2017 International Conference on High Performance Computing and Simulation, HPCS 2017
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 832-838
Number of pages: 7
ISBN (Electronic): 9781538632505
DOIs
Publication status: Published - 12 Sep 2017
Event: The 2017 International Conference on High Performance Computing and Simulation - Genoa
Duration: 16 Jul 2017 to 20 Jul 2017
Conference number: 15
http://hpcs2017.cisedu.info/

Conference

Conference: The 2017 International Conference on High Performance Computing and Simulation
Abbreviated title: HPCS 2017
Country: Italy
City: Genoa
Period: 16/07/17 to 20/07/17
Internet address

Scopus subject areas

  • Computer Science Applications
  • Information Systems and Management
  • Modelling and Simulation
  • Computer Networks and Communications
  • Computer Science (miscellaneous)

Fingerprint Dive into the research topics of 'Subordination: Providing resilience to simultaneous failure of multiple cluster nodes'. Together they form a unique fingerprint.
