The flagship project involved designing and implementing an industry-grade fault-tolerant distributed system that uses heartbeats, distributed consensus, total ordering, checkpointing, and logging to provide strong consistency for a replicated application. It supports two replication styles, active (hot-swap) and passive (primary-backup), along with mechanisms that ensure no downtime even as faults are injected.
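Heartbeat-based fault detection is the foundation of the system above. As a minimal sketch (class and parameter names here are illustrative assumptions, not taken from the project code), a detector can mark a replica as failed once it misses a fixed number of consecutive heartbeat deadlines:

```python
import time

# Hypothetical sketch: a replica is considered failed after it goes
# `max_missed` heartbeat intervals without being heard from.
class HeartbeatMonitor:
    def __init__(self, interval=1.0, max_missed=3):
        self.interval = interval      # seconds between expected heartbeats
        self.max_missed = max_missed  # misses tolerated before declaring a fault
        self.last_seen = {}           # replica id -> timestamp of last heartbeat

    def record_heartbeat(self, replica_id, now=None):
        self.last_seen[replica_id] = time.time() if now is None else now

    def is_alive(self, replica_id, now=None):
        now = time.time() if now is None else now
        last = self.last_seen.get(replica_id)
        if last is None:
            return False
        return (now - last) <= self.interval * self.max_missed
```

The timeout (interval times max_missed) trades detection latency against false positives under network jitter; the actual LFD/GFD code may use different values and message formats.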
Run command:
python replicate_manager.py
Run command:
python global_fault_detector.py
- Set rm_ip on line 8 before running the code
- Run on the same machine as the RM
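Conceptually, the GFD turns the per-server reports it receives from the LFDs into a global membership view. A minimal sketch of that bookkeeping (names are assumptions, not the project's actual API):

```python
# Illustrative sketch: the GFD keeps a membership set and returns a new
# view to broadcast whenever an LFD report actually changes membership.
class MembershipTracker:
    def __init__(self):
        self.members = set()

    def report(self, server_id, alive):
        """Apply one LFD report; return the new view if membership changed, else None."""
        changed = False
        if alive and server_id not in self.members:
            self.members.add(server_id)
            changed = True
        elif not alive and server_id in self.members:
            self.members.discard(server_id)
            changed = True
        if changed:
            return sorted(self.members)  # new view to send to RM and clients
        return None
```

Returning None on a no-op report keeps the GFD from rebroadcasting identical views on every heartbeat.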
- Multiple clients can be launched by passing different <client_id> values in the run command.
Run command:
python client.py <client_id>
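Under active replication, every live server answers every request, so a client sees duplicate replies. One common way to handle this (an assumed sketch, not necessarily what client.py does) is to keep the first reply per request id and drop the rest:

```python
# Sketch of client-side duplicate suppression for active replication:
# accept only the first reply for each request id.
class ReplyFilter:
    def __init__(self):
        self.seen = set()

    def accept(self, request_id, reply):
        """Return the reply if it is the first for this request id, else None."""
        if request_id in self.seen:
            return None               # duplicate from another replica
        self.seen.add(request_id)
        return reply
```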
- Change gfd_ip_address on line 12 to the Machine-1 IP address
- Run on the same machine as its server
Run command:
python local_fault_detector.py
Run command:
python server.py
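For replicas to stay consistent, all of them must apply requests in the same total order. One way to enforce this on the server side (a sketch under assumed naming, not necessarily the project's mechanism) is to apply requests strictly by sequence number, buffering any that arrive early:

```python
# Sketch: apply requests in sequence-number order, holding back
# out-of-order arrivals until the gap is filled.
class OrderedApplier:
    def __init__(self):
        self.next_seq = 0
        self.pending = {}
        self.applied = []             # requests applied so far, in total order

    def deliver(self, seq, request):
        self.pending[seq] = request
        # Apply every consecutive request now available.
        while self.next_seq in self.pending:
            self.applied.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
```

Because every replica applies the same sequence, deterministic servers end up in the same state, which is what the fault-free tests below should observe.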
- Launch the RM
- Launch the GFD
- Launch LFD-1 and Server-1
- Launch LFD-2 and Server-2
- Launch LFD-3 and Server-3
------ End of Fault-free Testing ------
------ Start Fault Testing ------
- Kill one of the servers
- Wait for some time
- Bring back the dead server
- Clients and the other two servers should work normally and consistently during these steps, and the membership changes should be broadcast to all clients and existing servers
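The kill-and-recover steps above should produce a predictable sequence of membership views. As a small sketch (server names S1-S3 are assumed; this just replays up/down events, not the real GFD logic):

```python
# Sketch: replay launch/kill/recover events and collect the membership
# view the GFD would broadcast after each one.
def run_scenario(events):
    """events: list of (server_id, 'up' or 'down'); returns the view after each event."""
    members = set()
    views = []
    for server_id, action in events:
        if action == "up":
            members.add(server_id)
        else:
            members.discard(server_id)
        views.append(sorted(members))
    return views
```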
- Launch the RM
- Launch the GFD
- Launch LFD-1 and Server-1 (Primary)
- Launch LFD-2 and Server-2 (Backup-1)
- Launch LFD-3 and Server-3 (Backup-2)
------ End of Fault-free Testing ------
------ Start Fault Testing ------
- Kill one of the backup servers
- Wait for some time
- Bring back the dead backup server
- Kill the primary server
- Wait for some time
- Bring back the dead primary server (it should become a backup now)
- Clients and the other two servers should work normally and consistently during these steps, and the membership changes should be broadcast to all clients and existing servers
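The primary-failure steps above imply a promotion policy. One common choice (an assumed sketch, not necessarily the project's RM logic) is to keep an ordered replica list whose head is the primary: removing a failed head promotes the oldest backup, and a recovered replica rejoins at the tail as a backup.

```python
# Sketch: ordered replica list; order[0] is the primary. Failures remove,
# recoveries append, so a recovered ex-primary comes back as a backup.
class PrimaryBackupView:
    def __init__(self, replicas):
        self.order = list(replicas)   # order[0] is the primary

    def primary(self):
        return self.order[0] if self.order else None

    def fail(self, replica_id):
        self.order.remove(replica_id)  # removing the head promotes the next backup

    def rejoin(self, replica_id):
        self.order.append(replica_id)  # recovered replica becomes a backup
```

Before a backup can serve as primary it also needs the latest state, which is what the checkpointing mentioned in the overview provides.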