A Javaspace-Based Framework for Efficient Fault-Tolerant Master-Worker Distributed Applications

Fault-tolerant coordination services have been widely used in distributed applications in cloud environments. Recent years have witnessed the emergence of time-sensitive applications deployed in edge computing environments, which introduces both challenges and opportunities for coordination services. On one hand, coordination services must recover from failures in a timely manner. On the other hand, edge computing employs local networked platforms that can be exploited to achieve timely recovery. In this work, we first identify the limitations of the leader election and recovery protocols underlying Apache ZooKeeper, the prevailing open-source coordination service. To reduce recovery latency after leader failures, we then design RT-ZooKeeper with a set of novel features, including a fast-convergence election protocol, a quorum channel notification mechanism, and a distributed epoch persistence protocol. We have implemented RT-ZooKeeper based on ZooKeeper version 3.5.8. Empirical evaluation shows that RT-ZooKeeper achieves a 91% reduction in maximum recovery latency compared with ZooKeeper. Furthermore, a case study demonstrates that fast failure recovery in RT-ZooKeeper can reduce message latency in a common messaging service such as Kafka.

Lightweight Fault-tolerant Message Passing System for Parallel and Distributed Applications

International e-Conference of Computer Science 2006 ◽

10.1201/b12168-6 ◽

2007 ◽

pp. 30-33

Keyword(s):

Message Passing ◽

Fault Tolerant ◽

Distributed Applications

A Probabilistic Fault-Tolerant Recovery Mechanism for Task and Result Certification of Large-Scale Distributed Applications

Advances in Grid and Pervasive Computing - Lecture Notes in Computer Science ◽

10.1007/978-3-642-01671-4_42 ◽

2009 ◽

pp. 471-482

Author(s):

Rim Chayeh ◽

Christophe Cerin ◽

Mohamed Jemni

Keyword(s):

Large Scale ◽

Fault Tolerant ◽

Distributed Applications ◽

Recovery Mechanism

Checkpointing Algorithms for Fault-Tolerant Execution of Large-Scale Distributed Applications in Cloud

Wireless Personal Communications ◽

10.1007/s11277-020-07949-0 ◽

2020 ◽

Author(s):

Priti Kumari ◽

Parmeet Kaur

Keyword(s):

Large Scale ◽

Fault Tolerant ◽

Distributed Applications

Visual programming of fault-tolerant distributed applications

Proceedings of Symposium on Visual Languages ◽

10.1109/vl.1995.520799 ◽

2002 ◽

Author(s):

B. Muganga ◽

F. Pacull ◽

K.R. Mazouni ◽

A.-D. Wolff

Keyword(s):

Fault Tolerant ◽

Visual Programming ◽

Distributed Applications

An Adaptable and Generic Fault-Tolerant System for Distributed Applications

2012 International Conference on Advanced Computer Science Applications and Technologies (ACSAT) ◽

10.1109/acsat.2012.63 ◽

2012 ◽

Author(s):

Ouanes Aissaoui ◽

Abdelkrim Amirat ◽

Fadila Atil

Keyword(s):

Fault Tolerant ◽

Distributed Applications ◽

Fault Tolerant System

STAR: a fault-tolerant system for distributed applications

Proceedings of 1993 5th IEEE Symposium on Parallel and Distributed Processing ◽

10.1109/spdp.1993.395471 ◽

2002 ◽

Cited By ~ 2

Author(s):

P. Sens ◽

B. Folliot

Keyword(s):

Fault Tolerant ◽

Distributed Applications ◽

Fault Tolerant System

Programming fault-tolerant distributed applications in HOPS

Proceedings CVPR '89: IEEE Computer Society Conference on Computer Vision and Pattern Recognition ◽

10.1109/pccc.1989.37433 ◽

2003 ◽

Author(s):

J. Silverman ◽

T. Raeuchle ◽

H. Madduri

Keyword(s):

Fault Tolerant ◽

Distributed Applications

Flexible Distributed Workflow Management Systems Design Based on CORBA

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.157-158.839 ◽

2012 ◽

Vol 157-158 ◽

pp. 839-842 ◽

Cited By ~ 3

Author(s):

Ya Li ◽

Hai Rui Wang ◽

Xiong Tong ◽

Li Zhang

Keyword(s):

Fault Tolerant ◽

Workflow Management ◽

Systems Design ◽

Distributed Applications ◽

General Purpose ◽

Management Systems ◽

Workflow Management Systems ◽

Workflow Systems ◽

Workflow System ◽

Distributed Components

The paper addresses the problem of flexible Workflow Management Systems (WFMS) in a distributed environment. In response to the serious lack of flexibility in current workflow systems, we describe how our workflow system meets the requirements of interoperability, scalability, flexibility, dependability and adaptability. An additional route engine adjusts the execution path dynamically according to the execution conditions, improving the flexibility and dependability of the system. A dynamic registration mechanism for domain engines is introduced to improve the scalability and adaptability of the system. The system is general purpose and open: it has been designed and implemented as a set of CORBA services. The system serves as an example of the use of middleware technologies to provide a fault-tolerant execution environment for long-running distributed applications. The system also provides a mechanism for communication among distributed components in order to support inter-organizational WFMS.

Implementing fault-tolerant distributed applications using objects and multi-coloured actions

Proceedings.,10th International Conference on Distributed Computing Systems ◽

10.1109/icdcs.1990.89273 ◽

2002 ◽

Cited By ~ 12

Author(s):

S.K. Shrivastava ◽

S.M. Wheater

Keyword(s):

Fault Tolerant ◽

Distributed Applications
