Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hybrid-Sleep/Hibernation might break scheduling #172

Open
ArthurusDent opened this issue Mar 22, 2024 · 39 comments
Open

Hybrid-Sleep/Hibernation might break scheduling #172

ArthurusDent opened this issue Mar 22, 2024 · 39 comments
Labels
bug Something isn't working p1

Comments

@ArthurusDent
Copy link

Describe the bug
It seems that after entering hybrid-sleep on my two Kubuntu 23.10 machines and turning them back on, backup scheduling stops to work at some point.

After a reboot, scheduling works as expected again.

To Reproduce
I'm not entirely sure right now.

My machines are configured to enter hybrid-sleep after a certain amount of time. This means they enter hybrid-sleep for various amounts of time. Either when I take a longer break, like lunch, or when I don't return to work and cut the energy supply over night.

I don't know whether breaking of scheduling is dependent on how long the machine was off.

I have seen repeatedly that my two hourly backup plans (local and remote) often run once in the morning when I start the machines again and a next run for the backup plan is being scheduled but it never gets executed (see screenshot).

I'm using KDE's power management but if you need to test, those commands work on my system:
$ sudo systemctl suspend
$ sudo systemctl hibernate
$ sudo systemctl hybrid-sleep

Expected behavior
Backrest should not be thrown off by any energy savings measures.

Screenshots
image

Desktop (please complete the following information):

  • OS: Kubuntu 23.10
  • Browser Firefox 123.0
  • Backrest Version 0.15.1

Additional context
backrest-scheduling.log

@ArthurusDent ArthurusDent added the bug Something isn't working label Mar 22, 2024
@garethgeorge
Copy link
Owner

garethgeorge commented Mar 22, 2024

Hey, thanks for the report. I was a bit worried about sleep behavior when writing the scheduler. Backrest is configured to poll the system clock every 15 minutes to account for time jumps (which is how a sleep / wake looks to the timers). But it sounds like you're saying that after a sleep it becomes hung and simply never runs any more operations?

Just to confirm -- the bug isn't simply a timer overrun but scheduling stops permanently? The screenshot doesn't make the time of the screenshot clear so it's a bit hard to tell how much it's overrun by. It'd also be great to know the rough times of sleep and wakeup so I can cross reference with the log. Lastly, I'm only seeing scheduling events in the log but not the task execution? Are those filtered out?

@ArthurusDent
Copy link
Author

Hey, I looked at my second system yesterday and to be honest, to me the scheduling/execution behavior looks kind of erratic. I can't really predict when it will stop. It doesn't always seem to stop after resuming from hybrid-sleep. It also seems to start again sometimes but not always.

The screenshot is from 22-03-24, so backrest was stuck since 20-03-24.

Sorry, the logs are incomplete. I experimented with different commands for extraction and sent an incomplete one. Here is a just slightly edited version that includes backrest + hybrid-sleep + system start events, so you can tell when backrest started fresh and worked again.

Command:
cat /var/log/syslog | grep 'hybrid-sleep\|backrest\|Linux version' > backrest_hybrid-sleep_system-start.log

Output:
backrest_hybrid-sleep_system-start.log

@garethgeorge
Copy link
Owner

garethgeorge commented Mar 23, 2024

Ah, I see so you resumed from sleep a few times and backrest's scheduling loop looks completely hung. It never resumed scheduling tasks. It looks like you restarted it at which point scheduling resumes.

@ArthurusDent
Copy link
Author

Exactly. If you look at the screenshot and take the last snapshot id c39ef56f (at 2024-03-20T09:36:21.453093+01:00) you see that right before creation the system resumed from hybrid-sleep. Backrest still does run some plans but scheduling doesn't happen any more and consequently no further plans are being executed.

Then at 2024-03-20T15:29:44.999662+01:00 you see that several hybrid-sleeps and resumes happened, which means backrest didn't run over two days and at 2024-03-22T10:45:30.522422+01:00 the system has rebooted and backrest starts scheduling and executing plans again.

@garethgeorge
Copy link
Owner

garethgeorge commented Mar 23, 2024

Okay, I've put out a new commit (5b7e2b0) which increases the polling frequency and ƒixes a missing log statement which will print whether the tasks are being rescheduled or not (right now we only see when a task runs but not if it was enqueued). If you have the ability to build backrest from source, you might try running this for a few days to see if it catches any scheduling mishap.

I increased the polling interval to every 5 minutes b/c honestly 15 is quite long. I see a few blocks of time where it looks like the computer woke up for very short durations and then was suspended again in these logs:

2024-03-20T09:51:33.934193+01:00 Hostname backrest[1663]: 2024-03-20T09:51:33.934+0100#011#033[34mINFO#033[0m#011scheduling task#011{"task": "forget for plan \"System-Raspi2\"", "runAt": "2024-03-20T09:51:33+01:00"}
2024-03-20T09:51:33.934243+01:00 Hostname backrest[1663]: 2024-03-20T09:51:33.934+0100#011#033[34mINFO#033[0m#011scheduling task#011{"task": "index snapshots for plan \"restic-backrest-raspi2\"", "runAt": "2024-03-20T09:51:33+01:00"}
2024-03-20T09:51:33.940197+01:00 Hostname backrest[1663]: 2024-03-20T09:51:33.940+0100#011#033[34mINFO#033[0m#011task finished#011{"task": "backup for plan \"System-Raspi2\"", "duration": "1m33.842470459s"}
2024-03-20T09:51:33.946185+01:00 Hostname backrest[1663]: 2024-03-20T09:51:33.946+0100#011#033[34mINFO#033[0m#011running task#011{"task": "forget for plan \"System-Raspi2\""}
2024-03-20T09:51:34.775294+01:00 Hostname backrest[1663]: 2024-03-20T09:51:34.775+0100#011#033[35mDEBUG#033[0m#011Forgot snapshots#011{"plan": "System-Raspi2", "count": 0, "policy": "keep_hourly:24  keep_daily:7  keep_weekly:6  keep_monthly:12  keep_yearly:1"}
2024-03-20T09:51:34.781519+01:00 Hostname backrest[1663]: 2024-03-20T09:51:34.781+0100#011#033[34mINFO#033[0m#011task finished#011{"task": "forget for plan \"System-Raspi2\"", "duration": "835.323148ms"}
2024-03-20T09:51:34.781574+01:00 Hostname backrest[1663]: 2024-03-20T09:51:34.781+0100#011#033[34mINFO#033[0m#011running task#011{"task": "index snapshots for plan \"restic-backrest-raspi2\""}
2024-03-20T09:51:35.592531+01:00 Hostname backrest[1663]: 2024-03-20T09:51:35.592+0100#011#033[35mDEBUG#033[0m#011Indexed known (and not forgotten) snapshot IDs for plan restic-backrest-raspi2 in 911.649µs
2024-03-20T09:51:35.599899+01:00 Hostname backrest[1663]: 2024-03-20T09:51:35.599+0100#011#033[35mDEBUG#033[0m#011Indexed snapshots#011{"repo": "restic-backrest-raspi2", "duration": "7.683602ms", "alreadyIndexed": 163, "newlyAdded": 1, "markedForgotten": 0}
2024-03-20T09:51:35.599982+01:00 Hostname backrest[1663]: 2024-03-20T09:51:35.599+0100#011#033[34mINFO#033[0m#011task finished#011{"task": "index snapshots for plan \"restic-backrest-raspi2\"", "duration": "818.363871ms"}
2024-03-20T15:29:44.999662+01:00 Hostname systemd[1]: Starting systemd-hybrid-sleep.service - Hybrid Suspend+Hibernate...
2024-03-20T15:29:45.189020+01:00 Hostname systemd-sleep[52379]: Entering sleep state 'hybrid-sleep'...
2024-03-21T09:28:47.215540+01:00 Hostname systemd[1]: systemd-hybrid-sleep.service: Deactivated successfully.
2024-03-21T09:28:47.215632+01:00 Hostname systemd[1]: Finished systemd-hybrid-sleep.service - Hybrid Suspend+Hibernate.
2024-03-21T09:28:47.215669+01:00 Hostname systemd[1]: systemd-hybrid-sleep.service: Consumed 6.321s CPU time.
2024-03-21T09:28:47.215871+01:00 Hostname systemd[1]: Reached target hybrid-sleep.target - Hybrid Suspend+Hibernate.
2024-03-21T09:28:47.251374+01:00 Hostname systemd[1]: Stopped target hybrid-sleep.target - Hybrid Suspend+Hibernate.
2024-03-21T14:58:59.675155+01:00 Hostname systemd[1]: Starting systemd-hybrid-sleep.service - Hybrid Suspend+Hibernate...
2024-03-21T14:58:59.901816+01:00 Hostname systemd-sleep[76676]: Entering sleep state 'hybrid-sleep'...
2024-03-21T19:26:06.004831+01:00 Hostname systemd[1]: systemd-hybrid-sleep.service: Deactivated successfully.
2024-03-21T19:26:06.004957+01:00 Hostname systemd[1]: Finished systemd-hybrid-sleep.service - Hybrid Suspend+Hibernate.
2024-03-21T19:26:06.005566+01:00 Hostname systemd[1]: systemd-hybrid-sleep.service: Consumed 8.359s CPU time.
2024-03-21T19:26:06.005640+01:00 Hostname systemd[1]: Reached target hybrid-sleep.target - Hybrid Suspend+Hibernate.
2024-03-21T19:26:06.043882+01:00 Hostname systemd[1]: Stopped target hybrid-sleep.target - Hybrid Suspend+Hibernate.
2024-03-21T19:32:22.478564+01:00 Hostname systemd[1]: Starting systemd-hybrid-sleep.service - Hybrid Suspend+Hibernate...
2024-03-21T19:32:22.761336+01:00 Hostname systemd-sleep[77305]: Entering sleep state 'hybrid-sleep'...
2024-03-22T09:08:25.066256+01:00 Hostname systemd[1]: systemd-hybrid-sleep.service: Deactivated successfully.
2024-03-22T09:08:25.066442+01:00 Hostname systemd[1]: Finished systemd-hybrid-sleep.service - Hybrid Suspend+Hibernate.
2024-03-22T09:08:25.066492+01:00 Hostname systemd[1]: systemd-hybrid-sleep.service: Consumed 4.275s CPU time.
2024-03-22T09:08:25.066799+01:00 Hostname systemd[1]: Reached target hybrid-sleep.target - Hybrid Suspend+Hibernate.
2024-03-22T09:08:25.090609+01:00 Hostname systemd[1]: Stopped target hybrid-sleep.target - Hybrid Suspend+Hibernate.
2024-03-22T10:45:04.439577+01:00 Hostname systemd[1]: Stopping backrest.service - Backrest Service...
2024-03-22T10:45:04.442812+01:00 Hostname backrest[1663]: 2024-03-22T10:45:04.442+0100#011#033[34mINFO#033[0m#011shutting down orchestrator loop, context cancelled.
2024-03-22T10:45:04.442862+01:00 Hostname backrest[1663]: 2024-03-22T10:45:04.442+0100#011#033[34mINFO#033[0m#011HTTP gateway shutdown

(Let me know if I'm reading that incorrectly)

The fact that backrest prints "shutting down orchestrator loop, context cancelled." means the orchestrator loop was still running after suspend. It's possible that, at least during that block of time, the computer was never awake long enough for backrest to detect the system wakeup and run a task.

@ArthurusDent
Copy link
Author

Ok, cool! I'll try to find out how to build go projects from source.

There is only one gap that would be too small to fit into the 15 min window:

2024-03-21T19:26:06.043882+01:00 Hostname systemd[1]: Stopped target hybrid-sleep.target - Hybrid Suspend+Hibernate.
2024-03-21T19:32:22.478564+01:00 Hostname systemd[1]: Starting systemd-hybrid-sleep.service - Hybrid Suspend+Hibernate...

All the other gaps are at least one hour long.

This I found weird. The system had resumed from hybrid-standby, backrest had executed the plans but then backrest suddenly stops working for almost six hours until the system enters hybrid-sleep again:

2024-03-20T09:51:35.599982+01:00 Hostname backrest[1663]: 2024-03-20T09:51:35.599+0100#011#033[34mINFO#033[0m#011task finished#011{"task": "index snapshots for plan \"restic-backrest-raspi2\"", "duration": "818.363871ms"}
2024-03-20T15:29:44.999662+01:00 Hostname systemd[1]: Starting systemd-hybrid-sleep.service - Hybrid Suspend+Hibernate...

@ArthurusDent
Copy link
Author

Ok, looks like I managed to build the binary and installed it correctly. Will start to monitor over the next few days.

@garethgeorge
Copy link
Owner

garethgeorge commented Mar 25, 2024

Hey, thanks for giving this a go! I was going to say that I'd also aim to get a release out this week with the logs, but seeing as you got it building if there are any insights from new logs I'd be really happy to try to patch this for the next release.

So far I've spent some time staring at the scheduling logic but I'm not quite sure what could be going on w/your suspend / resume case. It'll be very telling if we see that tasks are getting enqueued in the scheduler but just never fire after the suspend, that'll confirm something's wrong with how timers are firing.

@ArthurusDent
Copy link
Author

ArthurusDent commented Mar 25, 2024

It's currently 11:00 PM/23:00 on 2024-03-25 and backrest has stopped working.

I have two systems where this has happened with the dev-snapshot in place but this post is only about System1. System2 I might post tomorrow.

backrest-System1.log

System1 most likely has been using the dev-snapshot from this point in time on:

2024-03-24T15:57:29.332245+01:00 System1 systemd[1]: Started backrest.service - Backrest Service.

System1 resumed at 3:43 PM:

2024-03-25T15:43:10.883609+01:00 System1 kernel: [69032.988514] PM: hibernation: hibernation exit

Backrest then immediately ran the two Home-plans that it might have scheduled the evening before but wasn't able to run because the system entered hybrid-sleep before they could be executed. After both of those plans ran, neither were the hourly Home plans executed again, nor were the two daily System plans being executed (screenshots/logs).

Schedule:
Home-Local: 00:06 (hourly)
Home-RaspberryPi2: 00:16 (hourly)
System-Local: 10:26 PM/22:26 (daily)
System-RaspberryPi2: 10:46 PM/22:46 (daily)

I haven't looked at the logs very thoroughly, though. Maybe you can still find something interesting.

grafik

grafik

grafik

grafik

@garethgeorge
Copy link
Owner

garethgeorge commented Mar 26, 2024

Okay, my leading theory at this point is some kind of deadlock in the scheduler. Thanks for providing all of those logs, they're definitely illuminating. The tasks are all getting scheduled which means the scheduler itself is becoming hung at somepoint. I wonder if related to the hybernation or if it's just coincidental -- it may actually be triggered by the time jump.

Try starting backrest with GOTRACEBACK=all set as an env variable, next time you see it hung terminate with SIGQUIT and let's try to get a dump of the inflight coroutines. https://stackoverflow.com/questions/19094099/how-to-dump-goroutine-stacktraces shows a bit of how to do this.

In the mean time, I'm going to try to add some new tests that might cover the situation you're running into (and also see if I can replicate by just suspending a backrest process for some duration of time).

Thanks again for working with on me to investigate here!

@ArthurusDent
Copy link
Author

Happy to help!

I'll take a look at enabling traceback.

Here is something unexpected that happened yesterday on System2. Initially it was also stuck but then the auth cookie seemed to have expired and I had to re-login. This made backrest unstuck and it is still executing the backup plans the next day after another hybrid-sleep.

2024-03-26T08:59:29.785751+01:00 Hostname systemd[1]: Stopped target hybrid-sleep.target - Hybrid Suspend+Hibernate.
2024-03-26T10:42:27.127755+01:00 Hostname backrest[8892]: 2024-03-26T10:42:27.127+0100#011#033[33mWARN#033[0m#011auth middleware blocked bad JWT: parse token: token has invalid claims: token is expired
2024-03-26T10:42:29.573821+01:00 Hostname backrest[8892]: 2024-03-26T10:42:29.573+0100#011#033[33mWARN#033[0m#011auth middleware blocked bad JWT: parse token: token has invalid claims: token is expired
2024-03-26T10:42:29.921904+01:00 Hostname backrest[8892]: 2024-03-26T10:42:29.921+0100#011#033[33mWARN#033[0m#011auth middleware blocked bad JWT: parse token: token has invalid claims: token is expired
2024-03-26T10:42:29.958888+01:00 Hostname backrest[8892]: 2024-03-26T10:42:29.958+0100#011#033[33mWARN#033[0m#011auth middleware blocked bad JWT: parse token: token has invalid claims: token is expired
2024-03-26T10:42:30.075747+01:00 Hostname backrest[8892]: 2024-03-26T10:42:30.075+0100#011#033[35mDEBUG#033[0m#011login request#011{"username": "default"}
2024-03-26T10:42:30.075966+01:00 Hostname backrest[8892]: 2024-03-26T10:42:30.075+0100#011#033[33mWARN#033[0m#011failed login attempt#011{"error": "user not found"}
2024-03-26T10:42:34.919477+01:00 Hostname backrest[8892]: 2024-03-26T10:42:34.918+0100#011#033[33mWARN#033[0m#011auth middleware blocked bad JWT: parse token: token has invalid claims: token is expired
2024-03-26T10:42:39.914640+01:00 Hostname backrest[8892]: 2024-03-26T10:42:39.914+0100#011#033[33mWARN#033[0m#011auth middleware blocked bad JWT: parse token: token has invalid claims: token is expired
2024-03-26T10:42:45.063569+01:00 Hostname backrest[8892]: 2024-03-26T10:42:45.063+0100#011#033[33mWARN#033[0m#011auth middleware blocked bad JWT: parse token: token has invalid claims: token is expired
2024-03-26T10:42:49.771088+01:00 Hostname backrest[8892]: 2024-03-26T10:42:49.770+0100#011#033[35mDEBUG#033[0m#011login request#011{"username": "xxx"}
2024-03-26T10:43:36.837565+01:00 Hostname backrest[8892]: 2024-03-26T10:43:36.836+0100#011#033[34mINFO#033[0m#011running task#011{"task": "collect garbage"}
2024-03-26T10:43:36.850289+01:00 Hostname backrest[8892]: 2024-03-26T10:43:36.850+0100#011#033[34mINFO#033[0m#011collecting garbage#011{"forgotten_snapshots": 81, "operations_removed": 0}
2024-03-26T10:43:36.850342+01:00 Hostname backrest[8892]: 2024-03-26T10:43:36.850+0100#011#033[34mINFO#033[0m#011task finished#011{"task": "collect garbage", "duration": "13.289862ms"}
2024-03-26T10:43:36.850368+01:00 Hostname backrest[8892]: 2024-03-26T10:43:36.850+0100#011#033[34mINFO#033[0m#011scheduling task#011{"task": "collect garbage", "runAt": "2024-03-27T10:43:36+01:00"}
2024-03-26T10:43:36.850392+01:00 Hostname backrest[8892]: 2024-03-26T10:43:36.850+0100#011#033[34mINFO#033[0m#011running task#011{"task": "backup for plan \"Home-Raspi2\""}
2024-03-26T10:43:36.856323+01:00 Hostname backrest[8892]: 2024-03-26T10:43:36.856+0100#011#033[34mINFO#033[0m#011Starting backup#011{"plan": "Home-Raspi2", "opId": 1794443628301517874}
2024-03-26T10:43:36.856371+01:00 Hostname backrest[8892]: 2024-03-26T10:43:36.856+0100#011#033[35mDEBUG#033[0m#011repo orchestrator starting backup#011{"repo": "restic-backrest-raspi2"}

backrest_hibernation_system2.log

@ArthurusDent
Copy link
Author

ArthurusDent commented Mar 31, 2024

Looks like I was able to enable tracebacks but your suggestion with the environment variable needed a bit extra work. The environment variable needed to be added with systemctl edit backrest.service.

Content:

[Service]
Environment="GOTRACEBACK=all"

If I then kill backrest with kill -SIGQUIT %jobID it doesn't quit with "core dump" but it seems to do write something like it into the log file.

Please take a look if this is what you needed. SIGQUIT happened when backrest had gotten stuck.
backrest_stuckt_with-traceback.log

@garethgeorge
Copy link
Owner

Thanks for posting this log! That's fantastic. Looks like it's blocked on Dequeue in the orchestrator, what's interesting is that the queue has only been stuck for ~3 minutes in that log (which is the polling interval) which implies to me that it's still looping but it's not getting any tasks from the queue. This doesn't make much sense seeing as when you post screenshots there's clearly a task scheduled that should also be present in the operation queue. Going to see if I can figure out how that might be happening.

@benallfree
Copy link

Tracking this on OSX as well. #189

@garethgeorge
Copy link
Owner

Significantly simplified the task queue implementation in 8b9280e , the refactoring generalizes the implementation and breaks it up for more modular test coverage as the previous implementation was very hard to understand.

Hopeful this (and other improvements around scheduling) will have some impact here.

@ArthurusDent
Copy link
Author

Ok, cool! Let me know when you need me to re-build and test.

@garethgeorge
Copy link
Owner

garethgeorge commented Apr 12, 2024

snip: found some significant bugs in the revised orchestrator, reverting it for now. Going to keep iterating on this.

@garethgeorge
Copy link
Owner

Released the new task queue implementation in 0.17.2 which fixes some causes of deadlock that existed in the original implementation. I'd be interested to have you give it a go @ArthurusDent and let me know if you're still seeing backrest stop scheduling.

@benallfree
Copy link

@garethgeorge FYI I'm testing this on OSX as well :)

@ArthurusDent
Copy link
Author

Released the new task queue implementation in 0.17.2 which fixes some causes of deadlock that existed in the original implementation. I'd be interested to have you give it a go @ArthurusDent and let me know if you're still seeing backrest stop scheduling.

Ok, going to test on my two systems. Results will need a few days.

@benallfree
Copy link

@garethgeorge After some OSX testing, I can report that it still fails to resume the schedule after sleep. Back up now also has no effect. Restarting the daemon process fixes everything, and a normal SIGINT works (not Force Quit).

After restarting, the previously queued waiting items were cleared and new ones were created. Then everything started working normally.

@ArthurusDent
Copy link
Author

Same thing here, backrest still got stuck after a few days. Problem isn't fixed.

@garethgeorge
Copy link
Owner

Darn, that's disappointing. If either of you get more logs (or traces to post) that'll continue to be helpful. One "cludge" (in my opinion) work around would be for me to add a monitor thread that configures backrest to restart itself every 24 hours or so (if no operations have run recently) -- but I feel it should be possible to make the scheduling robust enough that that isn't required.

I'd like to dig into a few more logs samples. The last few releases have narrowed down the possibilities re: what could be going on here and have added new logs that should give me a better understanding of what the scheduler is doing.

@benallfree
Copy link

@garethgeorge I'm happy to provide OSX logs if you can guide me to it 👍

Come to think of it, some sort of log copy/paste from the UI would be a great feature. (Ideally stripping out sensitive info). That way you could add whatever logging you need and users could send it to you.

@ArthurusDent
Copy link
Author

Here is the log from my work PC:
backrest_hybrid-sleep-v0.17.2.log

@kytta
Copy link

kytta commented Jun 3, 2024

Hey there; I've just started using Backrest v1.1.0 on macOS 13.6 Ventura (via Homebrew), and I think I also have this issue. A backend of mine scheduled to run every hour did not run while the laptop was asleep, and it didn't run after being woken up. The operation would say, e.g., "8:00 pending", even if it's long after 8. "Editing" the settings (pressing Save without changes) or restarting Backrest help.

@garethgeorge you said you need logs and stuff; will the default log be enough? Do you want me to compile backrest with some debug options?

@nbca2
Copy link

nbca2 commented Jun 18, 2024

Hey there; I've just started using Backrest v1.1.0 on macOS 13.6 Ventura (via Homebrew), and I think I also have this issue. A backend of mine scheduled to run every hour did not run while the laptop was asleep, and it didn't run after being woken up. The operation would say, e.g., "8:00 pending", even if it's long after 8. "Editing" the settings (pressing Save without changes) or restarting Backrest help.

Hey there, i also have the same problem described above!

@buddydvd
Copy link

This issue is likely related #315.

My workaround is to close all browser tabs to Backrest before letting the computer go to sleep.

@garethgeorge
Copy link
Owner

Thanks for the callout and agree re: #315 likely related, it's worth note that this is fixed as of version 1.2.0 .

@ArthurusDent
Copy link
Author

I've been running 1.5.0 for at least a week now and it didn't improve scheduling on my system. Backup jobs sometimes don't run for several days before they suddenly are being executed and start running according to plan again.

Do you need new logs and a stacktrace?

@amarendra
Copy link

I am on 1.51 and this is still happening. The scheduled time is a day and a half ago and my guess is since that time will never hit now there is no other scheduled backup in the queue. It's the same for prune and check as well. It seems because I have selected Interval Days and Hours for scheduling.

@garethgeorge
Copy link
Owner

Sorry for the late reply— on vacation at the moment, but want to dig into this when I’m back (perhaps by adding a watch dog thread / some recovery— but still wanting to understand if there is a more fundamental issue with the scheduling code here).

New stack traces would be hugely helpful (following the steps earlier in this thread) if you are able to collect them.

@garethgeorge garethgeorge added the p1 label Oct 7, 2024
@amarendra
Copy link

New stack traces would be hugely helpful

Of course. I will try.
I guess I will have to wait this to happen again or I will try to find logs from around that time in log backup (is there an easy way to clean directory/file names from the log - or is there a log that has no such info at all?)

@ArthurusDent
Copy link
Author

New stack trace from version 1.5.0, I hope I caught the right moment:
backrest_tracelog_16-10-24.log

@garethgeorge
Copy link
Owner

Thanks for the logs -- appreciated. Unfortunately, I'll admit I'm still mystified why this pops up specifically with hybrid sleep / hibernation.

As an attempt at patching this / the related issues of tasks being scheduled at the wrong time when the system clock changes I've added 66a5241 which re-schedules all tasks when the clock is detected to have changed (by simple heuristic). Something is clearly getting into a bad state as a result of the time jump / timers being suspended for a long period of time, I'm hopeful this will patch the problem.

Similarly, #221 will mean re-implementing task scheduling to use a pool of timers (one per task) rather than a global queue. Knowing I'll re-implement it means I'll probably hold off on investing more on debugging the queues in favor of pulling them out and replacing them for this issue.

@garethgeorge
Copy link
Owner

Pinging on this bug after a bit, curious if you’ve been trying the latest version and if you’ve seen any of the hung scheduling?

@amarendra
Copy link

Pinging on this bug after a bit, curious if you’ve been trying the latest version and if you’ve seen any of the hung scheduling?

I have just checked upto Sep 22 and haven't seen any scheduled backup job from past stuck. Even the one I had seen (which I had commented about above) around 24-25 Sep is gone.

@ArthurusDent
Copy link
Author

I still need a bit more time to test. I've just upgraded to Kubuntu 24.10 and my system now sometimes crashes when entering standby. I don't suspect backrest being the culprit but for real testing, my system needs to run for maybe three days with hibernating and sleeping in between, which has been difficult to achieve.

@ArthurusDent
Copy link
Author

Hey, sorry for taking so long. Had to revert to Kubuntu 24.04 because of many issues, which took some time.

I've tested v1.6.1 and my system has been up for almost three days. Backrest is still executing plans with no problem so to me it looks like the new scheduler works a lot better! I will keep monitoring its behaviour.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working p1
Projects
None yet
Development

No branches or pull requests

7 participants