-
Notifications
You must be signed in to change notification settings - Fork 691
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Clear Ansible SSH control path on reboot #4366
Conversation
Tested as follows:
On the second run, I get the following error at the very end of the run:
This should resolve for first-time installs, but since the reboot task is not invoked on subsequent installs, it might not completely resolve the issue. |
Maddening. I tested on hardware, three runs of direct SSH and two with SSH over Tor, and couldn't get the error to happen at firewall configuration. |
Now that we're using Ansible 2.7, we may want to look into the reboot module for handling reboots more gracefully. We could also reduce |
Looking at reboot module - works well in most cases but not for the reboot after a switch from direct SSH to SSH-over-Tor, so probably need the old method to hang around for that. |
reboot module works fine. Reducing ControlPersist actually results in more connection errors for me, I'd recommend bumping it up to 1200sec rather than the opposite. Two other points:
|
More timeout fun! the SSH plugin always sets the ConnectTimeout option after setting args defined in the |
@rmol Could you clarify the state of this PR, given recent Ansible changes? Should we still aim to pick it up again in a future sprint? |
@eloquence I think it's worth putting on a sprint, as there are still reports of timeouts at reboots. I'd like to incorporate the reboot module changes @zenmonkeykstop tried, and to see if we can do better than removing the control sockets. |
@rmol reports he's not seen the "waiting for privilege escalating prompt" recently. Let's monitor playbook runs doing QA, and prioritize accordingly. Bumping off sprint but keeping backlog for visibility. |
Sadly @rmol confirms that the issue is still present, so we should still revisit this soon, in the next sprint at the latest. |
Per discussion at sprint planning, let's try to land a minimal version of this that may mitigate, e.g., by clearing the control path after any playbook failure. Not a must-have for 1.5.0 so not adding to milestone. |
Deferring further investigation again, though there may be a relation with the timeouts we're seeing in #5401, which we'll poke at during the 7/23-8/5 sprint. |
Given that we'll be poking admins to finish up the v3 migration, it seems worth investing some effort to reduce likelihood of Ansible failures; tentatively added to 1.7.0 milestone (may need to be timeboxed). |
The server alive messages make detection of failed connections quicker.
TL;DR -- I finally had some luck testing this, and have updated the branch with a change I think will mitigate the problem. I wrote a simple playbook that reboots the servers, debugs the result, and updates the apt cache. Numerous runs were completely boring. Then mon took twice as long to reboot as app (~80 seconds versus ~40). That happened a couple of times. Then one run saw the app reboot work fine, again, but the playbook just froze on the mon reboot. The mon SSH control master process was still running 45 minutes later. The Ansible reboot ssh command ( I tried a separate SSH command using the same Then the playbook completed successfully. The mon server reboot was reported as taking 32 seconds. 🤯 The reboot task was using the default I tore down the SSH control master connections, and on the next run, the reboot SSH commands for both servers were hanging. Both had in fact rebooted and were available. I waited over an hour, and the SSH processes sending the shutdown command to each server still existed. I tried using the persistent connections and again, after the Going through the Ansible issues on GitHub, there are many suspects. "It's According to that comment, the clean way to destroy the persistent connections is Alternatively, SSH can be told to check the connection periodically, with I added those SSH options to A full install and subsequent reinstall both ran without incident. So I've updated this branch to just use the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Impressive research, @rmol! Given your explanations, this sounds like a solid improvement to the connection handling. Given that this sets approximately a 30s window for retrying connections, I'd like to confirm that it works on reboots that are much longer—for instance, the 1U servers we do QA on usually take 2-3 minutes to reboot. We don't need to block merge on that condition, as normal QA procedures for 1.7.0 should suffice.
Codecov Report
@@ Coverage Diff @@
## develop #4366 +/- ##
========================================
Coverage 81.41% 81.41%
========================================
Files 53 53
Lines 3965 3965
Branches 496 496
========================================
Hits 3228 3228
Misses 632 632
Partials 105 105 Continue to review full report at Codecov.
|
In the admittedly few tests I ran, my NUCs took longer than 30s to reboot, enough that the SSH connections were terminated, and everything worked. The 600-second reboot module timeouts should be what determines the success here, so I think we should see the same result on machines that take longer to boot, but yes, I'll be interested to see how this works with those servers. |
I can reproduce the issue with running |
Status
Ready for review
Description of Changes
When rebooting the SecureDrop servers in
securedrop-admin install
, remove the local Ansible SSH control path directory, ensuring stale connections won't break later steps.Fixes #4364.
Testing
I was able to reliably induce a failure doing a clean installation of the release/0.12.2 branch. The installation should get an error like
Timeout (62s) waiting for privilege escalating prompt
atSet sysctl flags for grsecurity
.With this change in place, the installation should make it past that step.
Deployment
This shouldn't affect the installed product; it only manipulates the filesystem on the admin workstation, reusing logic from
restart-tor-carefully.yml
.Checklist
If you made changes to
securedrop-admin
:make -C admin test
) pass in the admin development containerIf you made non-trivial code changes: