docs: Provide troubleshooting information and remediation instructions for failed expansions #2603
Labels
complexity:medium
Something that requires one or few days to fix
kind:enhancement
New feature or request
priority:medium
Medium priority issues, should only be postponed if no other option
topic:deployment
Bugs in or enhancements to deployment stages
topic:docs
Documentation
topic:etcd
Anything related to etcd
topic:lifecycle
Issues related to upgrade or downgrade of MetalK8s
topic:operations
Operations-related issues
Component: docs, kubernetes, etcd, systemd, containers, ...
Why this is needed:
Recently, a failed expansion in production led to a very broken cluster, and wiping and reinstalling new machines was out of the question, so we needed a manual clean-up procedure.
No such procedure exists in our documentation today; having one would have saved both developers and support teams a lot of time.
What should be done:
Describe procedures for:
Unauthorized in kubelet journal (and more examples of logs when something is broken)
Implementation proposal (strongly recommended):
Write all this in a Troubleshooting guide, reference it throughout Installation and Operation guides.