-
-
Notifications
You must be signed in to change notification settings - Fork 52
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Hybrid-Sleep/Hibernation might break scheduling #172
Comments
Hey, thanks for the report. I was a bit worried about sleep behavior when writing the scheduler. Backrest is configured to poll the system clock every 15 minutes to account for time jumps (which is how a sleep / wake looks to the timers). But it sounds like you're saying that after a sleep it becomes hung and simply never runs any more operations? Just to confirm -- the bug isn't simply a timer overrun but scheduling stops permanently? The screenshot doesn't make the time of the screenshot clear so it's a bit hard to tell how much it's overrun by. It'd also be great to know the rough times of sleep and wakeup so I can cross reference with the log. Lastly, I'm only seeing scheduling events in the log but not the task execution? Are those filtered out? |
Hey, I looked at my second system yesterday and to be honest, to me the scheduling/execution behavior looks kind of erratic. I can't really predict when it will stop. It doesn't always seem to stop after resuming from hybrid-sleep. It also seems to start again sometimes but not always. The screenshot is from 22-03-24, so backrest was stuck since 20-03-24. Sorry, the logs are incomplete. I experimented with different commands for extraction and sent an incomplete one. Here is a just slightly edited version that includes backrest + hybrid-sleep + system start events, so you can tell when backrest started fresh and worked again. Command: |
Ah, I see so you resumed from sleep a few times and backrest's scheduling loop looks completely hung. It never resumed scheduling tasks. It looks like you restarted it at which point scheduling resumes. |
Exactly. If you look at the screenshot and take the last snapshot id c39ef56f (at 2024-03-20T09:36:21.453093+01:00) you see that right before creation the system resumed from hybrid-sleep. Backrest still does run some plans but scheduling doesn't happen any more and consequently no further plans are being executed. Then at 2024-03-20T15:29:44.999662+01:00 you see that several hybrid-sleeps and resumes happened, which means backrest didn't run over two days and at 2024-03-22T10:45:30.522422+01:00 the system has rebooted and backrest starts scheduling and executing plans again. |
Okay, I've put out a new commit (5b7e2b0) which increases the polling frequency and ƒixes a missing log statement which will print whether the tasks are being rescheduled or not (right now we only see when a task runs but not if it was enqueued). If you have the ability to build backrest from source, you might try running this for a few days to see if it catches any scheduling mishap. I increased the polling interval to every 5 minutes b/c honestly 15 is quite long. I see a few blocks of time where it looks like the computer woke up for very short durations and then was suspended again in these logs:
(Let me know if I'm reading that incorrectly) The fact that backrest prints "shutting down orchestrator loop, context cancelled." means the orchestrator loop was still running after suspend. It's possible that, at least during that block of time, the computer was never awake long enough for backrest to detect the system wakeup and run a task. |
Ok, cool! I'll try to find out how to build go projects from source. There is only one gap that would be too small to fit into the 15 min window: 2024-03-21T19:26:06.043882+01:00 Hostname systemd[1]: Stopped target hybrid-sleep.target - Hybrid Suspend+Hibernate.
2024-03-21T19:32:22.478564+01:00 Hostname systemd[1]: Starting systemd-hybrid-sleep.service - Hybrid Suspend+Hibernate... All the other gaps are at least one hour long. This I found weird. The system had resumed from hybrid-standby, backrest had executed the plans but then backrest suddenly stops working for almost six hours until the system enters hybrid-sleep again: 2024-03-20T09:51:35.599982+01:00 Hostname backrest[1663]: 2024-03-20T09:51:35.599+0100#011#033[34mINFO#033[0m#011task finished#011{"task": "index snapshots for plan \"restic-backrest-raspi2\"", "duration": "818.363871ms"}
2024-03-20T15:29:44.999662+01:00 Hostname systemd[1]: Starting systemd-hybrid-sleep.service - Hybrid Suspend+Hibernate... |
Ok, looks like I managed to build the binary and installed it correctly. Will start to monitor over the next few days. |
Hey, thanks for giving this a go! I was going to say that I'd also aim to get a release out this week with the logs, but seeing as you got it building if there are any insights from new logs I'd be really happy to try to patch this for the next release. So far I've spent some time staring at the scheduling logic but I'm not quite sure what could be going on w/your suspend / resume case. It'll be very telling if we see that tasks are getting enqueued in the scheduler but just never fire after the suspend, that'll confirm something's wrong with how timers are firing. |
It's currently 11:00 PM/23:00 on 2024-03-25 and backrest has stopped working. I have two systems where this has happened with the dev-snapshot in place but this post is only about System1. System2 I might post tomorrow. System1 most likely has been using the dev-snapshot from this point in time on: 2024-03-24T15:57:29.332245+01:00 System1 systemd[1]: Started backrest.service - Backrest Service. System1 resumed at 3:43 PM: 2024-03-25T15:43:10.883609+01:00 System1 kernel: [69032.988514] PM: hibernation: hibernation exit Backrest then immediately ran the two Home-plans that it might have scheduled the evening before but wasn't able to run because the system entered hybrid-sleep before they could be executed. After both of those plans ran, neither were the hourly Home plans executed again, nor were the two daily System plans being executed (screenshots/logs). Schedule: I haven't looked at the logs very thoroughly, though. Maybe you can still find something interesting. |
Okay, my leading theory at this point is some kind of deadlock in the scheduler. Thanks for providing all of those logs, they're definitely illuminating. The tasks are all getting scheduled which means the scheduler itself is becoming hung at somepoint. I wonder if related to the hybernation or if it's just coincidental -- it may actually be triggered by the time jump. Try starting backrest with In the mean time, I'm going to try to add some new tests that might cover the situation you're running into (and also see if I can replicate by just suspending a backrest process for some duration of time). Thanks again for working with on me to investigate here! |
Happy to help! I'll take a look at enabling traceback. Here is something unexpected that happened yesterday on System2. Initially it was also stuck but then the auth cookie seemed to have expired and I had to re-login. This made backrest unstuck and it is still executing the backup plans the next day after another hybrid-sleep. 2024-03-26T08:59:29.785751+01:00 Hostname systemd[1]: Stopped target hybrid-sleep.target - Hybrid Suspend+Hibernate.
2024-03-26T10:42:27.127755+01:00 Hostname backrest[8892]: 2024-03-26T10:42:27.127+0100#011#033[33mWARN#033[0m#011auth middleware blocked bad JWT: parse token: token has invalid claims: token is expired
2024-03-26T10:42:29.573821+01:00 Hostname backrest[8892]: 2024-03-26T10:42:29.573+0100#011#033[33mWARN#033[0m#011auth middleware blocked bad JWT: parse token: token has invalid claims: token is expired
2024-03-26T10:42:29.921904+01:00 Hostname backrest[8892]: 2024-03-26T10:42:29.921+0100#011#033[33mWARN#033[0m#011auth middleware blocked bad JWT: parse token: token has invalid claims: token is expired
2024-03-26T10:42:29.958888+01:00 Hostname backrest[8892]: 2024-03-26T10:42:29.958+0100#011#033[33mWARN#033[0m#011auth middleware blocked bad JWT: parse token: token has invalid claims: token is expired
2024-03-26T10:42:30.075747+01:00 Hostname backrest[8892]: 2024-03-26T10:42:30.075+0100#011#033[35mDEBUG#033[0m#011login request#011{"username": "default"}
2024-03-26T10:42:30.075966+01:00 Hostname backrest[8892]: 2024-03-26T10:42:30.075+0100#011#033[33mWARN#033[0m#011failed login attempt#011{"error": "user not found"}
2024-03-26T10:42:34.919477+01:00 Hostname backrest[8892]: 2024-03-26T10:42:34.918+0100#011#033[33mWARN#033[0m#011auth middleware blocked bad JWT: parse token: token has invalid claims: token is expired
2024-03-26T10:42:39.914640+01:00 Hostname backrest[8892]: 2024-03-26T10:42:39.914+0100#011#033[33mWARN#033[0m#011auth middleware blocked bad JWT: parse token: token has invalid claims: token is expired
2024-03-26T10:42:45.063569+01:00 Hostname backrest[8892]: 2024-03-26T10:42:45.063+0100#011#033[33mWARN#033[0m#011auth middleware blocked bad JWT: parse token: token has invalid claims: token is expired
2024-03-26T10:42:49.771088+01:00 Hostname backrest[8892]: 2024-03-26T10:42:49.770+0100#011#033[35mDEBUG#033[0m#011login request#011{"username": "xxx"}
2024-03-26T10:43:36.837565+01:00 Hostname backrest[8892]: 2024-03-26T10:43:36.836+0100#011#033[34mINFO#033[0m#011running task#011{"task": "collect garbage"}
2024-03-26T10:43:36.850289+01:00 Hostname backrest[8892]: 2024-03-26T10:43:36.850+0100#011#033[34mINFO#033[0m#011collecting garbage#011{"forgotten_snapshots": 81, "operations_removed": 0}
2024-03-26T10:43:36.850342+01:00 Hostname backrest[8892]: 2024-03-26T10:43:36.850+0100#011#033[34mINFO#033[0m#011task finished#011{"task": "collect garbage", "duration": "13.289862ms"}
2024-03-26T10:43:36.850368+01:00 Hostname backrest[8892]: 2024-03-26T10:43:36.850+0100#011#033[34mINFO#033[0m#011scheduling task#011{"task": "collect garbage", "runAt": "2024-03-27T10:43:36+01:00"}
2024-03-26T10:43:36.850392+01:00 Hostname backrest[8892]: 2024-03-26T10:43:36.850+0100#011#033[34mINFO#033[0m#011running task#011{"task": "backup for plan \"Home-Raspi2\""}
2024-03-26T10:43:36.856323+01:00 Hostname backrest[8892]: 2024-03-26T10:43:36.856+0100#011#033[34mINFO#033[0m#011Starting backup#011{"plan": "Home-Raspi2", "opId": 1794443628301517874}
2024-03-26T10:43:36.856371+01:00 Hostname backrest[8892]: 2024-03-26T10:43:36.856+0100#011#033[35mDEBUG#033[0m#011repo orchestrator starting backup#011{"repo": "restic-backrest-raspi2"} |
Looks like I was able to enable tracebacks but your suggestion with the environment variable needed a bit extra work. The environment variable needed to be added with Content: [Service]
Environment="GOTRACEBACK=all" If I then kill backrest with Please take a look if this is what you needed. SIGQUIT happened when backrest had gotten stuck. |
Thanks for posting this log! That's fantastic. Looks like it's blocked on Dequeue in the orchestrator, what's interesting is that the queue has only been stuck for ~3 minutes in that log (which is the polling interval) which implies to me that it's still looping but it's not getting any tasks from the queue. This doesn't make much sense seeing as when you post screenshots there's clearly a task scheduled that should also be present in the operation queue. Going to see if I can figure out how that might be happening. |
Tracking this on OSX as well. #189 |
Significantly simplified the task queue implementation in 8b9280e , the refactoring generalizes the implementation and breaks it up for more modular test coverage as the previous implementation was very hard to understand. Hopeful this (and other improvements around scheduling) will have some impact here. |
Ok, cool! Let me know when you need me to re-build and test. |
snip: found some significant bugs in the revised orchestrator, reverting it for now. Going to keep iterating on this. |
Released the new task queue implementation in 0.17.2 which fixes some causes of deadlock that existed in the original implementation. I'd be interested to have you give it a go @ArthurusDent and let me know if you're still seeing backrest stop scheduling. |
@garethgeorge FYI I'm testing this on OSX as well :) |
Ok, going to test on my two systems. Results will need a few days. |
@garethgeorge After some OSX testing, I can report that it still fails to resume the schedule after sleep. After restarting, the previously queued |
Same thing here, backrest still got stuck after a few days. Problem isn't fixed. |
Darn, that's disappointing. If either of you get more logs (or traces to post) that'll continue to be helpful. One "cludge" (in my opinion) work around would be for me to add a monitor thread that configures backrest to restart itself every 24 hours or so (if no operations have run recently) -- but I feel it should be possible to make the scheduling robust enough that that isn't required. I'd like to dig into a few more logs samples. The last few releases have narrowed down the possibilities re: what could be going on here and have added new logs that should give me a better understanding of what the scheduler is doing. |
@garethgeorge I'm happy to provide OSX logs if you can guide me to it 👍 Come to think of it, some sort of log copy/paste from the UI would be a great feature. (Ideally stripping out sensitive info). That way you could add whatever logging you need and users could send it to you. |
Here is the log from my work PC: |
Hey there; I've just started using Backrest v1.1.0 on macOS 13.6 Ventura (via Homebrew), and I think I also have this issue. A backend of mine scheduled to run every hour did not run while the laptop was asleep, and it didn't run after being woken up. The operation would say, e.g., "8:00 pending", even if it's long after 8. "Editing" the settings (pressing Save without changes) or restarting Backrest help. @garethgeorge you said you need logs and stuff; will the default log be enough? Do you want me to compile backrest with some debug options? |
Hey there, i also have the same problem described above! |
This issue is likely related #315. My workaround is to close all browser tabs to Backrest before letting the computer go to sleep. |
Thanks for the callout and agree re: #315 likely related, it's worth note that this is fixed as of version 1.2.0 . |
I've been running 1.5.0 for at least a week now and it didn't improve scheduling on my system. Backup jobs sometimes don't run for several days before they suddenly are being executed and start running according to plan again. Do you need new logs and a stacktrace? |
I am on 1.51 and this is still happening. The scheduled time is a day and a half ago and my guess is since that time will never hit now there is no other scheduled backup in the queue. It's the same for |
Sorry for the late reply— on vacation at the moment, but want to dig into this when I’m back (perhaps by adding a watch dog thread / some recovery— but still wanting to understand if there is a more fundamental issue with the scheduling code here). New stack traces would be hugely helpful (following the steps earlier in this thread) if you are able to collect them. |
Of course. I will try. |
New stack trace from version 1.5.0, I hope I caught the right moment: |
Thanks for the logs -- appreciated. Unfortunately, I'll admit I'm still mystified why this pops up specifically with hybrid sleep / hibernation. As an attempt at patching this / the related issues of tasks being scheduled at the wrong time when the system clock changes I've added 66a5241 which re-schedules all tasks when the clock is detected to have changed (by simple heuristic). Something is clearly getting into a bad state as a result of the time jump / timers being suspended for a long period of time, I'm hopeful this will patch the problem. Similarly, #221 will mean re-implementing task scheduling to use a pool of timers (one per task) rather than a global queue. Knowing I'll re-implement it means I'll probably hold off on investing more on debugging the queues in favor of pulling them out and replacing them for this issue. |
Pinging on this bug after a bit, curious if you’ve been trying the latest version and if you’ve seen any of the hung scheduling? |
I have just checked upto Sep 22 and haven't seen any scheduled backup job from past stuck. Even the one I had seen (which I had commented about above) around 24-25 Sep is gone. |
I still need a bit more time to test. I've just upgraded to Kubuntu 24.10 and my system now sometimes crashes when entering standby. I don't suspect backrest being the culprit but for real testing, my system needs to run for maybe three days with hibernating and sleeping in between, which has been difficult to achieve. |
Hey, sorry for taking so long. Had to revert to Kubuntu 24.04 because of many issues, which took some time. I've tested v1.6.1 and my system has been up for almost three days. Backrest is still executing plans with no problem so to me it looks like the new scheduler works a lot better! I will keep monitoring its behaviour. |
Describe the bug
It seems that after entering hybrid-sleep on my two Kubuntu 23.10 machines and turning them back on, backup scheduling stops to work at some point.
After a reboot, scheduling works as expected again.
To Reproduce
I'm not entirely sure right now.
My machines are configured to enter hybrid-sleep after a certain amount of time. This means they enter hybrid-sleep for various amounts of time. Either when I take a longer break, like lunch, or when I don't return to work and cut the energy supply over night.
I don't know whether breaking of scheduling is dependent on how long the machine was off.
I have seen repeatedly that my two hourly backup plans (local and remote) often run once in the morning when I start the machines again and a next run for the backup plan is being scheduled but it never gets executed (see screenshot).
I'm using KDE's power management but if you need to test, those commands work on my system:
$ sudo systemctl suspend
$ sudo systemctl hibernate
$ sudo systemctl hybrid-sleep
Expected behavior
Backrest should not be thrown off by any energy savings measures.
Screenshots
Desktop (please complete the following information):
Additional context
backrest-scheduling.log
The text was updated successfully, but these errors were encountered: