Standard

Subordination: Providing resilience to simultaneous failure of multiple cluster nodes. / Gankevich, Ivan; Tipikin, Yuri; Korkhov, Vladimir.

Proceedings - 2017 International Conference on High Performance Computing and Simulation, HPCS 2017. Institute of Electrical and Electronics Engineers Inc., 2017. p. 832-838 8035165.

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Research › peer-review

Harvard

Gankevich, I, Tipikin, Y & Korkhov, V 2017, Subordination: Providing resilience to simultaneous failure of multiple cluster nodes. in Proceedings - 2017 International Conference on High Performance Computing and Simulation, HPCS 2017, 8035165, Institute of Electrical and Electronics Engineers Inc., pp. 832-838, The 2017 International Conference on High Performance Computing and Simulation, Genoa, Italy, 16/07/17. https://doi.org/10.1109/HPCS.2017.126

APA

Gankevich, I., Tipikin, Y., & Korkhov, V. (2017). Subordination: Providing resilience to simultaneous failure of multiple cluster nodes. In Proceedings - 2017 International Conference on High Performance Computing and Simulation, HPCS 2017 (pp. 832-838). [8035165] Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/HPCS.2017.126

Vancouver

Gankevich I, Tipikin Y, Korkhov V. Subordination: Providing resilience to simultaneous failure of multiple cluster nodes. In Proceedings - 2017 International Conference on High Performance Computing and Simulation, HPCS 2017. Institute of Electrical and Electronics Engineers Inc. 2017. p. 832-838. 8035165 https://doi.org/10.1109/HPCS.2017.126

Author

Gankevich, Ivan; Tipikin, Yuri; Korkhov, Vladimir. / Subordination: Providing resilience to simultaneous failure of multiple cluster nodes. Proceedings - 2017 International Conference on High Performance Computing and Simulation, HPCS 2017. Institute of Electrical and Electronics Engineers Inc., 2017. pp. 832-838

BibTeX

@inproceedings{812ccfeb504441c28533f0251bdcdbb2,
title = "Subordination: Providing resilience to simultaneous failure of multiple cluster nodes",
abstract = "In this paper we describe a new framework for creating distributed programmes which are resilient to cluster node failures. Our main goal is to create a simple and reliable model, that ensures continuous execution of parallel programmes without creation of checkpoints, memory dumps and other I/O intensive activities. To achieve this we introduce multi-layered system architecture, each layer of which consists of unified entities organised into hierarchies, and then show how this system handles different node failure scenarios. We benchmark our system on the example of real-world HPC application on both physical and virtual clusters. The results of the experiments show that our approach has low overhead and scales to a large number of cluster nodes.",
author = "Ivan Gankevich and Yuri Tipikin and Vladimir Korkhov",
year = "2017",
month = sep,
day = "12",
doi = "10.1109/HPCS.2017.126",
language = "English",
pages = "832--838",
booktitle = "Proceedings - 2017 International Conference on High Performance Computing and Simulation, HPCS 2017",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
address = "United States",
note = "The 2017 International Conference on High Performance Computing and Simulation, HPCS 2017 ; Conference date: 16-07-2017 Through 20-07-2017",
url = "http://hpcs2017.cisedu.info/",

}

RIS

TY - GEN

T1 - Subordination: Providing resilience to simultaneous failure of multiple cluster nodes

T2 - The 2017 International Conference on High Performance Computing and Simulation

AU - Gankevich, Ivan

AU - Tipikin, Yuri

AU - Korkhov, Vladimir

N1 - Conference code: 15

PY - 2017/9/12

Y1 - 2017/9/12

AB - In this paper we describe a new framework for creating distributed programmes which are resilient to cluster node failures. Our main goal is to create a simple and reliable model, that ensures continuous execution of parallel programmes without creation of checkpoints, memory dumps and other I/O intensive activities. To achieve this we introduce multi-layered system architecture, each layer of which consists of unified entities organised into hierarchies, and then show how this system handles different node failure scenarios. We benchmark our system on the example of real-world HPC application on both physical and virtual clusters. The results of the experiments show that our approach has low overhead and scales to a large number of cluster nodes.

UR - http://www.scopus.com/inward/record.url?scp=85032352516&partnerID=8YFLogxK

U2 - 10.1109/HPCS.2017.126

DO - 10.1109/HPCS.2017.126

M3 - Conference contribution

AN - SCOPUS:85032352516

SP - 832

EP - 838

BT - Proceedings - 2017 International Conference on High Performance Computing and Simulation, HPCS 2017

PB - Institute of Electrical and Electronics Engineers Inc.

Y2 - 16 July 2017 through 20 July 2017

ER -
