Feature: Multiple Schedulers with semi reliable locking and failover #195

Closed
oxalorg opened this issue Oct 30, 2018 · 11 comments

Comments

@oxalorg
Contributor

oxalorg commented Oct 30, 2018

There have been a lot of issues and PRs referencing this, but no one has got it quite right yet. I would like to discuss the things we need to get Multiple Schedulers running for failover.

The RQ set of libraries is amazingly simple and I would love to continue using it. I feel like this might be a deal breaker for a lot of folks in adopting RQ + RQ-Scheduler.

Use Case

The feature I'm most interested in is: Multiple Schedulers running at the same time, but only one scheduler will be active. If the active scheduler dies for whatever reason, an inactive scheduler will become active.

This is a very important feature for us as we're hoping to run the scheduler on multiple servers for failover. (It also makes our deployment easier, as each server stays identical.)

Previous Attempts

#143 seems to be a PR for multiple schedulers, but it introduces a bug where a second scheduler won't even start/register itself.

#170 tries to fix this issue by completely removing the birth/death registration, which may not be ideal as we would no longer keep track of all registered schedulers or of who is active at any given moment.

In both of the above cases (at first glance, so pardon me if I'm wrong), the locking mechanism doesn't seem reliable and may allow multiple schedulers to acquire the lock.

Fix

I would like to propose a fix for these issues, and introduce it as a somewhat reliable feature.

A rough plan I have in mind:

  • Register each scheduler with a unique key, thus keeping track of all registered schedulers
  • Acquire a lock only if no one else has a lock
  • Keep the lock until you die / crash / deregister
  • Other schedulers (who have also registered themselves) will periodically check whether the lock has expired.
  • As soon as they find an expired lock, they will attempt to gain a lock. This must be done without race conditions.
  • We don't want to release the lock easily as that will cause other schedulers to become active. That is not really desirable as our main goal is redundancy and failover.
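
A minimal sketch of the plan above, not the eventual implementation. It assumes the lock is a single key written with "set if not exists" plus a TTL, which the holder keeps renewing. `InMemoryStore`, `FailoverScheduler`, and the `scheduler:lock` key name are all hypothetical; the store only mimics the `set(..., nx=True, px=...)`, `get`, and `pexpire` calls that a redis-py client also offers, so the example runs without a Redis server.

```python
import time
import uuid


class InMemoryStore:
    """Hypothetical stand-in for a Redis client (set with nx/px, get, pexpire)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._data = {}  # key -> (value, expiry time in seconds)

    def _live(self, key):
        entry = self._data.get(key)
        if entry is not None and entry[1] <= self._clock():
            del self._data[key]  # lazily drop an expired key
            entry = None
        return entry

    def set(self, key, value, nx=False, px=None):
        if nx and self._live(key) is not None:
            return None  # key already held, so the NX write is refused
        expires = self._clock() + px / 1000 if px is not None else float("inf")
        self._data[key] = (value, expires)
        return True

    def get(self, key):
        entry = self._live(key)
        return entry[0] if entry else None

    def pexpire(self, key, ms):
        entry = self._live(key)
        if entry is None:
            return False
        self._data[key] = (entry[0], self._clock() + ms / 1000)
        return True


class FailoverScheduler:
    LOCK_KEY = "scheduler:lock"  # assumed key name, for illustration only

    def __init__(self, store, ttl_ms=10_000):
        self.store = store
        self.ttl_ms = ttl_ms
        self.name = uuid.uuid4().hex  # unique key identifying this scheduler
        self.active = False

    def tick(self):
        """Call periodically: renew the lock if we hold it, else try to take it."""
        if self.active and self.store.get(self.LOCK_KEY) == self.name:
            # Assumption: get-then-pexpire is not atomic. Against a real Redis
            # this step would need a Lua script or WATCH/MULTI to avoid races.
            self.store.pexpire(self.LOCK_KEY, self.ttl_ms)
        else:
            # Succeeds only if nobody holds the lock or the old lock expired.
            self.active = bool(
                self.store.set(self.LOCK_KEY, self.name, nx=True, px=self.ttl_ms)
            )
        return self.active
```

Each scheduler would call `tick()` on a timer shorter than the TTL. The active one keeps renewing, and a standby only becomes active once the holder stops renewing and the key expires.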

Please let me know (via a thumbs-up emoji) if a PR like this would be appreciated. @selwin, I'd also like to hear your thoughts on this.

@selwin
Contributor

selwin commented Oct 30, 2018

I agree with your proposal. PR welcome :)

@oxalorg
Contributor Author

oxalorg commented Oct 30, 2018 via email

@russellballestrini

@oxalorg do you have a feature branch I could review?

@oxalorg
Copy link
Contributor Author

oxalorg commented Dec 23, 2018

Hey @russellballestrini I've made multiple schedulers work here: https://github.com/oxalorg/rq-scheduler/tree/feature-multi-schedulers

I'm still debating the way it's implemented, though. At the moment the lock isn't held by a single scheduler continuously. It's only held while a scheduler moves jobs to the worker queues.

So when running multiple schedulers, any one can acquire a lock and process the queues.
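
To illustrate the behaviour described here, this is a rough sketch and not the code from the feature branch. It assumes a short-lived lock that is taken only for one pass of moving due jobs and released right after. `ShortLivedLock` and `run_once` are hypothetical names, and the lock is an in-memory stand-in for a Redis key set with NX and a TTL.

```python
import time


class ShortLivedLock:
    """Hypothetical in-memory stand-in for a Redis key set with NX and a TTL."""

    def __init__(self, ttl, clock=time.monotonic):
        self.ttl = ttl
        self._clock = clock
        self._owner = None
        self._expires_at = 0.0

    def acquire(self, owner):
        now = self._clock()
        if self._owner is not None and self._expires_at > now:
            return False  # someone else holds an unexpired lock
        self._owner, self._expires_at = owner, now + self.ttl
        return True

    def release(self, owner):
        if self._owner == owner:  # only the holder may release
            self._owner = None


def run_once(name, lock, scheduled, queue, now):
    """One scheduler pass: move due jobs to the worker queue under the lock.

    `scheduled` maps job id -> due time; `queue` is a plain list standing in
    for a worker queue. Returns the number of jobs moved.
    """
    if not lock.acquire(name):
        return 0  # another scheduler is handling this pass
    try:
        due = [job for job, run_at in scheduled.items() if run_at <= now]
        for job in due:
            del scheduled[job]
            queue.append(job)
        return len(due)
    finally:
        lock.release(name)  # lock is held only for the duration of the pass
```

With this shape, whichever scheduler wins the lock on a given pass moves the jobs, and the TTL only matters if a scheduler crashes in the middle of a pass.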

Any thoughts if this is how it should be, or should one scheduler hold the lock until it dies/crashes/quits and THEN another scheduler must be allowed to try and get the lock?

Also, we're not really storing in Redis WHICH SCHEDULER holds the lock, which schedulers are registered, etc. Those might be useful as well.
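
One possible shape for that bookkeeping, purely as an assumption for discussion: record a heartbeat per scheduler plus the name of the current lock owner. `SchedulerRegistry` is a hypothetical in-memory class; in Redis the same data could live in a hash of scheduler name to last heartbeat and a string key for the owner.

```python
import time


class SchedulerRegistry:
    """Hypothetical registry of schedulers and the current lock owner."""

    def __init__(self, heartbeat_timeout, clock=time.monotonic):
        self.heartbeat_timeout = heartbeat_timeout
        self._clock = clock
        self._heartbeats = {}  # scheduler name -> time of last heartbeat
        self.lock_owner = None  # name of the scheduler holding the lock

    def register(self, name):
        """Record a scheduler's birth."""
        self._heartbeats[name] = self._clock()

    def heartbeat(self, name):
        """Refresh a scheduler's last-seen time."""
        self._heartbeats[name] = self._clock()

    def deregister(self, name):
        """Record a scheduler's death and drop its ownership, if any."""
        self._heartbeats.pop(name, None)
        if self.lock_owner == name:
            self.lock_owner = None

    def alive(self):
        """Names of schedulers seen within the heartbeat timeout, sorted."""
        cutoff = self._clock() - self.heartbeat_timeout
        return sorted(n for n, seen in self._heartbeats.items() if seen >= cutoff)
```

A monitoring view could then list `alive()` and show `lock_owner` without the schedulers ever talking to each other.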

@russellballestrini

russellballestrini commented Dec 23, 2018

I think what you have is great. Letting all the schedulers race for the lock simplifies the problem and also ensures that a scheduled job doesn't get missed so long as you have at least one functional scheduler.

@russellballestrini

I don't think we need a concept of registering schedulers. They don't need to communicate with each other, and they don't need consensus.

@selwin
Contributor

selwin commented Apr 2, 2019

I think schedulers still need to be registered for troubleshooting/monitoring purposes.

@mattjegan

@oxalorg Is there any update on this at all?

@oxalorg
Contributor Author

oxalorg commented Oct 11, 2019

Hey @mattjegan and @selwin I have created a PR for this feature. 😸 Let me know what you guys think!

@mattjegan

@oxalorg Thanks, I'll give it a shot when I get the chance.

@oxalorg
Contributor Author

oxalorg commented Apr 24, 2020

#212 fixes this Issue and is now merged to master! 🎊
