Subordination › SPbU Researchers Portal

DOI: https://doi.org/10.1109/HPCS.2017.126 (final published version)

In this paper we describe a new framework for creating distributed programmes that are resilient to cluster node failures. Our main goal is to create a simple and reliable model that ensures continuous execution of parallel programmes without checkpoints, memory dumps or other I/O-intensive activities. To achieve this we introduce a multi-layered system architecture, each layer of which consists of unified entities organised into hierarchies, and then show how this system handles different node failure scenarios. We benchmark our system using a real-world HPC application on both physical and virtual clusters. The results of the experiments show that our approach has low overhead and scales to a large number of cluster nodes.

Original language	English
Title of host publication	Proceedings - 2017 International Conference on High Performance Computing and Simulation, HPCS 2017
Publisher	Institute of Electrical and Electronics Engineers Inc.
Pages	832-838
Number of pages	7
ISBN (Electronic)	9781538632505
DOIs	https://doi.org/10.1109/HPCS.2017.126
State	Published - 12 Sep 2017
Event	The 2017 International Conference on High Performance Computing and Simulation, Genoa, Italy; Duration: 16 Jul 2017 → 20 Jul 2017; Conference number: 15; http://hpcs2017.cisedu.info/

Conference

Conference	The 2017 International Conference on High Performance Computing and Simulation
Abbreviated title	HPCS 2017
Country/Territory	Italy
City	Genoa
Period	16/07/17 → 20/07/17
Internet address	http://hpcs2017.cisedu.info/

Scopus subject areas

Computer Science Applications
Information Systems and Management
Modelling and Simulation
Computer Networks and Communications
Computer Science (miscellaneous)

ID: 9152966

Subordination: Providing resilience to simultaneous failure of multiple cluster nodes
