docs: Provide troubleshooting information and remediation instructions for failed expansions #2603

Open
gdemonet opened this issue Jun 5, 2020 · 0 comments
Labels

  • complexity:medium (Something that requires one or few days to fix)
  • kind:enhancement (New feature or request)
  • priority:medium (Medium priority issues, should only be postponed if no other option)
  • topic:deployment (Bugs in or enhancements to deployment stages)
  • topic:docs (Documentation)
  • topic:etcd (Anything related to etcd)
  • topic:lifecycle (Issues related to upgrade or downgrade of MetalK8s)
  • topic:operations (Operations-related issues)

Comments

gdemonet commented Jun 5, 2020

Component: docs, kubernetes, etcd, systemd, containers, ...

Why this is needed:

Recently, a failed expansion in production left a cluster badly broken. Wiping and reinstalling the new machines was out of the question, so we needed a manual clean-up procedure.
No such procedure exists in our documentation today; having one would have saved both developers and support teams a lot of time.

What should be done:

Describe procedures for:

  • rolling back a failed expansion on a node (remove manifests, certificates, disable services, reboot...)
  • resetting a cluster back to the bootstrap stage
  • removing a failed etcd member
  • troubleshooting Unauthorized errors in the kubelet journal (with more examples of logs seen when something is broken)
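
As a starting point for the "removing a failed etcd member" item, here is a minimal sketch of how the member to remove could be identified. It assumes the default comma-separated output of `etcdctl member list` (ID, status, name, peer URLs, client URLs, learner flag) and only prints the `etcdctl member remove <ID>` command rather than running it. The helper names and the sample output are hypothetical, and how etcdctl is invoked in a MetalK8s cluster (endpoints, certificates) is deliberately left out.

```python
from typing import List, NamedTuple
from urllib.parse import urlparse


class EtcdMember(NamedTuple):
    member_id: str
    status: str
    name: str
    peer_urls: str
    client_urls: str


def parse_member_list(output: str) -> List[EtcdMember]:
    """Parse `etcdctl member list` output, assuming fields are joined by ', '."""
    members = []
    for line in output.splitlines():
        fields = line.strip().split(", ")
        if len(fields) < 5:
            continue  # skip blank or unexpected lines
        members.append(EtcdMember(*fields[:5]))
    return members


def removal_commands(output: str, node_ip: str) -> List[str]:
    """Return the removal command for each member whose peer URL host is node_ip."""
    commands = []
    for member in parse_member_list(output):
        hosts = [urlparse(url).hostname for url in member.peer_urls.split(",")]
        if node_ip in hosts:
            commands.append(f"etcdctl member remove {member.member_id}")
    return commands


# Made-up sample: a healthy bootstrap member and a member that never started.
SAMPLE = (
    "8e9e05c52164694d, started, bootstrap, https://10.0.0.1:2380, https://10.0.0.1:2379, false\n"
    "2f1a3c9b7d4e5f60, unstarted, , https://10.0.0.2:2380, , false\n"
)

if __name__ == "__main__":
    for command in removal_commands(SAMPLE, "10.0.0.2"):
        print(command)
```

Matching on the peer URL host rather than the member name is a deliberate choice here, since a member that never started may have an empty name in the listing.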

Implementation proposal (strongly recommended):

Write all of this in a Troubleshooting guide, and reference it throughout the Installation and Operation guides.

gdemonet added the kind:enhancement, topic:operations, topic:docs, topic:deployment, topic:etcd, topic:lifecycle, complexity:medium, and priority:medium labels on Jun 5, 2020