Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Research › peer-review
Subordination: Providing resilience to simultaneous failure of multiple cluster nodes. / Gankevich, Ivan; Tipikin, Yuri; Korkhov, Vladimir.
Proceedings - 2017 International Conference on High Performance Computing and Simulation, HPCS 2017. Institute of Electrical and Electronics Engineers Inc., 2017. p. 832-838 8035165.
TY - GEN
T1 - Subordination: Providing resilience to simultaneous failure of multiple cluster nodes
T2 - The 2017 International Conference on High Performance Computing and Simulation
AU - Gankevich, Ivan
AU - Tipikin, Yuri
AU - Korkhov, Vladimir
N1 - Conference code: 15
PY - 2017/9/12
Y1 - 2017/9/12
AB - In this paper we describe a new framework for creating distributed programmes that are resilient to cluster node failures. Our main goal is to create a simple and reliable model that ensures continuous execution of parallel programmes without checkpoints, memory dumps or other I/O-intensive activities. To achieve this we introduce a multi-layered system architecture, each layer of which consists of unified entities organised into hierarchies, and then show how this system handles different node failure scenarios. We benchmark our system using a real-world HPC application on both physical and virtual clusters. The results of the experiments show that our approach has low overhead and scales to a large number of cluster nodes.
UR - http://www.scopus.com/inward/record.url?scp=85032352516&partnerID=8YFLogxK
U2 - 10.1109/HPCS.2017.126
DO - 10.1109/HPCS.2017.126
M3 - Conference contribution
AN - SCOPUS:85032352516
SP - 832
EP - 838
BT - Proceedings - 2017 International Conference on High Performance Computing and Simulation, HPCS 2017
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 16 July 2017 through 20 July 2017
ER -
ID: 9152966