DOI

Efficient management of a distributed system is a common problem for university and commercial computer centres, and handling node failures is a major aspect of it. Failures that are rare in a small commodity cluster become common at large scale, and there should be a way to overcome them without restarting all parallel processes of an application. The efficiency of existing methods can be improved by forming a hierarchy of distributed processes. That way, only the lower levels of the hierarchy need to be restarted when a leaf node fails, and only the root node needs special treatment. The process hierarchy changes in real time, and the workload is dynamically rebalanced across online nodes. This approach makes it possible to implement efficient partial restart of a parallel application and transactional behaviour for computer centre service tasks.
Original language: English
Title of host publication: Computational Science and Its Applications - ICCSA 2015
Subtitle of host publication: 15th International Conference, Banff, AB, Canada, June 22-25, 2015, Proceedings, Part IV
Publisher: Springer Nature
Pages: 259-271
ISBN (electronic): 978-3-319-21410-8
ISBN (print): 978-3-319-21409-2
Publication status: Published - 2015
Event: 15th International Conference on Computational Science and Its Applications, ICCSA 2015 - Banff, Canada
Duration: 21 Jun 2015 to 24 Jun 2015

Publication series

Name: Lecture Notes in Computer Science
Publisher: Springer Nature
Volume: 9158
ISSN (print): 0302-9743

Conference

Conference: 15th International Conference on Computational Science and Its Applications, ICCSA 2015
Country/Territory: Canada
City: Banff
Period: 21/06/15 to 24/06/15

ID: 71354892