Automate operations to be done after a node is removed from cluster #87

Open
kmova opened this issue Aug 16, 2021 · 6 comments
Labels: enhancement (New feature or request), help wanted (Extra attention is needed)

Comments

@kmova
Member

kmova commented Aug 16, 2021

Describe the problem/challenge you have

When a node running a stateful pod backed by a Local PV goes out of the cluster, the pod gets into a Pending state and stays there. The administrator or an automated operator then has to run some manual steps to bring the pod back online. The operations to be performed may vary depending on the way storage is connected to the nodes, but a few are common across different stateful operators. The general actions to be performed are:

  • Run checks to ensure the PV is really dead (a sketch of one such check follows this list):
    • The node on which the PV was provisioned is gone, i.e. the node named in the PV's node affinity is no longer present in the cluster.
    • The application for which the PV was provisioned has been deleted, and the PV remained in the system due to some failure during the delete operations.
    • The application is in a state to add a new replica (delete the old PV/replica and let the stateful operator create a new PV/replica).
    • Other application-specific checks.
  • Delete the PV and the application PVC/replica:
    • If the reclaimPolicy is Delete, delete the PV.
    • If the reclaimPolicy is Retain, remove the references.
  • Wait for a new PV to be created, then run post-operations specific to the application, such as:
    • Running a command/API on the application to rebalance.
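Below is a minimal Go sketch of the first check, assuming a client-go based operator; `IsPVOrphaned` and its logic are illustrative, not an existing OpenEBS API. It reports a Local PV as orphaned only when every node pinned by its required nodeAffinity has left the cluster:

```go
// Package precheck sketches the "node is gone" check for a Local PV.
package precheck

import (
	"context"
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// IsPVOrphaned (hypothetical) returns true when the node(s) referenced by
// the PV's required nodeAffinity via kubernetes.io/hostname no longer exist.
func IsPVOrphaned(ctx context.Context, cs kubernetes.Interface, pvName string) (bool, error) {
	pv, err := cs.CoreV1().PersistentVolumes().Get(ctx, pvName, metav1.GetOptions{})
	if err != nil {
		return false, err
	}
	na := pv.Spec.NodeAffinity
	if na == nil || na.Required == nil {
		return false, fmt.Errorf("PV %q has no required nodeAffinity; not a Local PV?", pvName)
	}
	checked := false
	for _, term := range na.Required.NodeSelectorTerms {
		for _, expr := range term.MatchExpressions {
			if expr.Key != "kubernetes.io/hostname" {
				continue
			}
			checked = true
			for _, nodeName := range expr.Values {
				_, err := cs.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
				if err == nil {
					return false, nil // node still in the cluster; PV may come back
				}
				if !apierrors.IsNotFound(err) {
					return false, err
				}
			}
		}
	}
	if !checked {
		return false, fmt.Errorf("PV %q pins no hostname in its nodeAffinity", pvName)
	}
	return true, nil // every referenced node is gone
}
```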

Describe the solution you'd like
A Kubernetes operator that can be launched into the cluster with one or more ConfigMaps that can specify (a sketch of such a configuration follows the list):

  • annotations or configuration in the PV spec that uniquely identify the PVs to be acted upon
  • enable/disable pre-checks
  • enable/disable post-hooks
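As a rough sketch of what that configuration could look like, here is a hypothetical Go schema the operator might unmarshal from a ConfigMap key; the field names (pvAnnotationSelector, enablePreChecks, enablePostHooks) are made up for illustration, not an agreed-upon API:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// OperatorConfig is a hypothetical shape for the ConfigMap payload
// described above; all field names are illustrative only.
type OperatorConfig struct {
	// PVAnnotationSelector matches annotations on PV specs that mark
	// the volumes this operator is allowed to act upon.
	PVAnnotationSelector map[string]string `json:"pvAnnotationSelector"`
	// EnablePreChecks toggles the "is the PV really dead" checks.
	EnablePreChecks bool `json:"enablePreChecks"`
	// EnablePostHooks toggles application-specific post operations
	// (e.g. a rebalance command) after the new PV comes up.
	EnablePostHooks bool `json:"enablePostHooks"`
}

func main() {
	// The operator would read this JSON from a key in the ConfigMap's data.
	raw := `{
	  "pvAnnotationSelector": {"local.openebs.io/auto-recover": "true"},
	  "enablePreChecks": true,
	  "enablePostHooks": true
	}`
	var cfg OperatorConfig
	if err := json.Unmarshal([]byte(raw), &cfg); err != nil {
		panic(err)
	}
	fmt.Printf("loaded config: %+v\n", cfg)
}
```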

Anything else you would like to add:
It should be possible to either run this operator independently or embed it as a controller into other stateful operators.

@kmova added the enhancement (New feature or request) and help wanted (Extra attention is needed) labels Aug 16, 2021
@jimil749

Hey @kmova! Came across this via the LFX Mentorship repo. Would love to get involved and apply for the same! I have previous experience working on open-source projects (I participated in GSoC '21 under CERN) and would love to continue learning! I had a question regarding the LFX mentorship application: do we need to draft a proposal for the project?

@AmitKumarDas
Member

Can we reproduce this problem at will?

> When a node running a stateful pod backed by a Local PV goes out of the cluster, the pod gets into a Pending state and stays there.

It would be good to provide reproduction steps, along with the error messages and the state of the system when this issue happens.

@frezbo

frezbo commented May 26, 2022

I faced the same issue while testing Jiva: if the disk backing the Local PV on a node gets removed/re-installed, the new local mount path comes up as an empty formatted volume, and the replica pod remains in a Pending state complaining that the local path doesn't exist.

@vsabavat

@niladrih do you have any update on this issue? Happy to help; the issue is also easily demo'able using k3d.

@niladrih
Member

niladrih commented Jun 5, 2024

For applications which are deployed with high availability and can recover/rebuild the data from the lost node, persistentVolumeClaimRetentionPolicy seems like a possible solution (a sketch follows below). It is in beta (k8s v1.27 onwards) and would need the cluster admin/app operator to delete the Pod which was scheduled on the lost node.
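For reference, a minimal sketch of that field using the k8s.io/api/apps/v1 types; the field itself is real (beta since v1.27), while the Delete/Delete choice here is just one possible policy:

```go
package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
)

func main() {
	// With Delete on both triggers, removing or scaling down the StatefulSet
	// also removes the stuck PVC, so the replacement replica gets a fresh PV
	// instead of one pinned to the lost node.
	spec := appsv1.StatefulSetSpec{
		PersistentVolumeClaimRetentionPolicy: &appsv1.StatefulSetPersistentVolumeClaimRetentionPolicy{
			WhenDeleted: appsv1.DeletePersistentVolumeClaimRetentionPolicyType,
			WhenScaled:  appsv1.DeletePersistentVolumeClaimRetentionPolicyType,
		},
	}
	fmt.Printf("retention policy: %+v\n", *spec.PersistentVolumeClaimRetentionPolicy)
}
```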

@frezbo @vsabavat WDYT?

@avishnu modified the milestones: v4.2, v4.3 Oct 3, 2024
@avishnu
Member

avishnu commented Oct 3, 2024

Assigning this to milestone v4.3 with design as the scope.

@avishnu removed this from the v4.3 milestone Jan 8, 2025