Subordination › Research at SPbU

DOI

https://doi.org/10.1109/HPCS.2017.126
Final published version

In this paper we describe a new framework for creating distributed programmes that are resilient to cluster node failures. Our main goal is to create a simple and reliable model that ensures continuous execution of parallel programmes without creating checkpoints, memory dumps or other I/O-intensive activities. To achieve this we introduce a multi-layered system architecture, each layer of which consists of unified entities organised into hierarchies, and then show how this system handles different node failure scenarios. We benchmark our system using a real-world HPC application on both physical and virtual clusters. The results of the experiments show that our approach has low overhead and scales to a large number of cluster nodes.

Original language	English
Title of host publication	Proceedings - 2017 International Conference on High Performance Computing and Simulation, HPCS 2017
Publisher	Institute of Electrical and Electronics Engineers Inc.
Pages	832-838
Number of pages	7
ISBN (electronic)	9781538632505
DOI	https://doi.org/10.1109/HPCS.2017.126
State	Published - 12 Sep 2017
Event	The 2017 International Conference on High Performance Computing and Simulation - Genoa, Italy. Duration: 16 Jul 2017 → 20 Jul 2017. Conference number: 15. http://hpcs2017.cisedu.info/

Conference

Conference	The 2017 International Conference on High Performance Computing and Simulation
Abbreviated title	HPCS 2017
Country/Territory	Italy
City	Genoa
Period	16/07/17 → 20/07/17
Website	http://hpcs2017.cisedu.info/

Scopus subject areas

Computer Science Applications
Information Systems and Management
Modelling and Simulation
Computer Networks and Communications
Computer Science (miscellaneous)

ID: 9152966

Subordination: Providing resilience to simultaneous failure of multiple cluster nodes
