Abstract

Distributed computing clusters are often built from commodity hardware, whose relatively low reliability leads to periodic failures of processing nodes. While worker node fault tolerance is straightforward, fault tolerance of the master node poses a bigger challenge. In this paper, master node failure handling is based on master and worker roles that can be dynamically reassigned to cluster nodes, combined with maintaining a backup of the master node state on one of the worker nodes. In this scheme, no special component is needed to monitor the health of the cluster, and master node failures can be resolved except when the master and the backup fail simultaneously. We present an experimental evaluation of an implementation of the technique, with benchmarks demonstrating that a failure of the master does not affect the running job and that a failure of the backup results in re-computation of only the last job step.
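The role reassignment described in the abstract can be illustrated with a toy model. This is a minimal sketch and not the authors' implementation: the `Cluster` class, its method names, and the node names are hypothetical, and the rollback of one step on backup failure only mirrors the outcome stated in the abstract, not the mechanism used in the paper.

```python
from dataclasses import dataclass


class UnrecoverableFailure(Exception):
    """Master and backup were lost together; the abstract excludes this case."""


@dataclass
class Cluster:
    nodes: list           # names of live nodes (hypothetical)
    steps_done: int = 0   # job progress tracked by the master
    rolled_back: int = 0  # steps that must be computed again

    def __post_init__(self):
        if len(self.nodes) < 2:
            raise ValueError("need at least a master and a backup")
        # Roles are labels on nodes, so they can be moved later.
        self.master, self.backup = self.nodes[0], self.nodes[1]

    def run_step(self):
        self.steps_done += 1

    def _pick_backup(self):
        # Assumption: any live non-master node may hold the backup.
        for node in self.nodes:
            if node != self.master:
                return node
        return None

    def fail(self, *failed):
        lost = set(failed)
        if self.master in lost and self.backup in lost:
            raise UnrecoverableFailure("master and backup failed together")
        self.nodes = [n for n in self.nodes if n not in lost]
        if self.master in lost:
            # The backup takes over the master role; progress is kept.
            self.master = self.backup
            self.backup = self._pick_backup()
        elif self.backup in lost:
            # Assumption: only the last step is redone, as the abstract states.
            self.backup = self._pick_backup()
            if self.steps_done > 0:
                self.steps_done -= 1
                self.rolled_back += 1
```

For example, with nodes `n1` to `n4` and two completed steps, calling `fail("n1")` moves the master role to `n2` without losing progress, while a later `fail("n3")` (the new backup) rolls back exactly one step.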
Original language: English
Pages (from-to): 158-172
Journal: International Journal of Business Intelligence and Data Mining
Volume: 15
Issue number: 2
Early online date: 31 May 2019
DOI: 10.1504/IJBIDM.2017.10007764
Publication status: Published - 2019

Cite this

@article{2a1e77a7c7454f5dad7af8b5f3ea8435,
title = "Master node fault tolerance in distributed big data processing clusters",
abstract = "Distributed computing clusters are often built with commodity hardware which leads to periodic failures of processing nodes due to relatively low reliability of such hardware. While worker node fault-tolerance is straightforward, fault tolerance of master node poses a bigger challenge. In this paper master node failure handling is based on the concept of master and worker roles that can be dynamically re-assigned to cluster nodes along with maintaining a backup of the master node state on one of worker nodes. In such case no special component is needed to monitor the health of the cluster while master node failures can be resolved except for the cases of simultaneous failure of master and backup. We present experimental evaluation of the technique implementation, show benchmarks demonstrating that a failure of a master does not affect running job, and a failure of backup results in re-computation of only the last job step.",
keywords = "parallel computing; Big Data processing; distributed computing; backup node; state transfer; delegation; cluster computing; fault-tolerance",
author = "Ivan Gankevich and Yury Tipikin and Vladimir Korkhov and Vladimir Gaiduchok and Alexander Degtyarev and A. Bogdanov",
year = "2019",
doi = "10.1504/IJBIDM.2017.10007764",
language = "English",
volume = "15",
pages = "158--172",
journal = "International Journal of Business Intelligence and Data Mining",
issn = "1743-8187",
publisher = "Inderscience Enterprises Ltd.",
number = "2",

}

TY  - JOUR
T1  - Master node fault tolerance in distributed big data processing clusters
AU  - Gankevich, Ivan
AU  - Tipikin, Yury
AU  - Korkhov, Vladimir
AU  - Gaiduchok, Vladimir
AU  - Degtyarev, Alexander
AU  - Bogdanov, A.
PY  - 2019
Y1  - 2019
AB  - Distributed computing clusters are often built with commodity hardware which leads to periodic failures of processing nodes due to relatively low reliability of such hardware. While worker node fault-tolerance is straightforward, fault tolerance of master node poses a bigger challenge. In this paper master node failure handling is based on the concept of master and worker roles that can be dynamically re-assigned to cluster nodes along with maintaining a backup of the master node state on one of worker nodes. In such case no special component is needed to monitor the health of the cluster while master node failures can be resolved except for the cases of simultaneous failure of master and backup. We present experimental evaluation of the technique implementation, show benchmarks demonstrating that a failure of a master does not affect running job, and a failure of backup results in re-computation of only the last job step.
KW  - parallel computing; Big Data processing; distributed computing; backup node; state transfer; delegation; cluster computing; fault-tolerance
UR  - http://www.mendeley.com/research/master-node-fault-tolerance-distributed-big-data-processing-clusters
U2  - 10.1504/IJBIDM.2017.10007764
DO  - 10.1504/IJBIDM.2017.10007764
M3  - Article
VL  - 15
SP  - 158
EP  - 172
JO  - International Journal of Business Intelligence and Data Mining
JF  - International Journal of Business Intelligence and Data Mining
SN  - 1743-8187
IS  - 2
ER  - 