Pods with PV on rook-ceph don't start on the second node when the first node is offline #14993
Replies: 6 comments 1 reply
-
Try this doc on handling node loss.
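For reference, the manual recovery that node-loss docs describe usually boils down to fencing the dead node in Ceph and then cleaning up the stuck Kubernetes objects. A minimal sketch, with placeholders instead of real names:
# fence the lost node so its stale Ceph clients can no longer write
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd blocklist add <node-ip>
# force-remove the pods stuck in Terminating so their replacements can start
kubectl delete pod <stuck-pod> --grace-period=0 --force
# drop the stale VolumeAttachment still pointing at the lost node
kubectl delete volumeattachment <attachment-name>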
-
I was hoping there was something automatic in the rook-ceph cluster that could resolve these types of issues/locks, so that when one node goes down the pod spins up on the other node and the application restarts without any problems, all without manual intervention.
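The closest thing to that built into Kubernetes is non-graceful node shutdown (GA since v1.28): once you are certain a node is really dead, tainting it makes the control plane force-delete its pods and detach their volumes so they can restart elsewhere. A sketch, assuming the failed node is kubeclient1:
# declare the dead node out of service; its pods are force-deleted and volumes detached
kubectl taint nodes kubeclient1 node.kubernetes.io/out-of-service=nodeshutdown:NoExecute
# remove the taint once the node is back and clean
kubectl taint nodes kubeclient1 node.kubernetes.io/out-of-service=nodeshutdown:NoExecute-
Note that someone (or some external controller) still has to apply the taint: Kubernetes deliberately refuses to fence a node it merely cannot reach.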
-
I have isolated the node kubeclient2 with systemctl stop kubelet.
pod/grafana-69d855495d-l7z2b 1/1 Running 0 17m 10.244.123.172 kubeclient1
kubectl describe pod/wordpress-mysql-65cd85d4d7-srb48
Events:
Normal Scheduled 18m default-scheduler Successfully assigned default/wordpress-mysql-65cd85d4d7-srb48 to kubeclient1
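For anyone reproducing this, note the timing: after kubelet stops, the node only turns NotReady after the node-monitor grace period (40s by default), and its pods are evicted via the node.kubernetes.io/unreachable taint only after the default toleration of 300s, so it is worth watching both transitions:
# watch the node go NotReady and the pods get marked for deletion
kubectl get nodes -w
kubectl get pods -o wide -w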
-
Hello Maduh-1,
default pod/wordpress-5b9ddb4b9d-6sph9 1/1 Running 0 23s 10.244.151.45 kubeclient2
NAMESPACE NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS VOLUMEATTRIBUTESCLASS REASON AGE VOLUMEMODE
bash-5.1$ ceph status
(services/data/io output truncated)
root@kubemaster:~# kubectl get nodes
root@kubemaster:~# kubectl get pods,nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
Logs of mysql:
2024-11-12 15:42:41 1 [Note] Binlog end
root@kubemaster:~# kubectl get pv,pvc -o wide
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS VOLUMEATTRIBUTESCLASS AGE VOLUMEMODE
At this point I decided to start kubelet on node kubeclient2, and I decided to kill pod/rook-ceph-operator-5c49669f69-j255j.
bash-5.1$ ceph status
(services/data/io output truncated)
But now I have to remove one mon from the quorum:
bash-5.1$ ceph mon rm q
bash-5.1$ ceph status
(services/data/io output truncated)
In short, several problems here and there.
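A hedged aside on the mon step: in a Rook cluster the operator normally replaces a failed mon itself, so removing one by hand can race with it. If you do need to inspect quorum manually, the toolbox commands are roughly (a sketch, assuming the standard rook-ceph-tools deployment):
# open a shell in the toolbox
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- bash
# check quorum before touching anything
ceph quorum_status --format json-pretty
ceph mon stat
# remove a dead mon only if the operator has not already replaced it
ceph mon rm <mon-id>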
-
Hello all, bye and thanks.
-
Hello all, after applying the latest rook-ceph patch, the failover now works: when the original node fails, the Oracle database switches over and opens on the other node. Bye, Gabriele
-
I'm trying this test in my Kubernetes test environment.
I try to simulate a failure on one of the workers,
and I would like the pods on the good node to be up and running,
but actually they are not.
The application is on node kubeclient1:
root@kubemaster:~/Node_Check_Operator# kubectl get pods,nodes -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod/grafana-69d855495d-qjfgd 1/1 Running 0 14m 10.244.151.57 kubeclient1
pod/wordpress-5b9ddb4b9d-ld6lv 1/1 Running 7 (5m27s ago) 14m 10.244.151.43 kubeclient1
pod/wordpress-mysql-65cd85d4d7-bf79h 1/1 Running 4 (6m39s ago) 14m 10.244.151.34 kubeclient1
systemctl stop kubelet on node kubeclient1
The pods on kubeclient1 go into Terminating state but don't release the lock on the PV,
and the pods on node kubeclient2 go into a crash loop.
kubectl get pods
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod/grafana-69d855495d-hpfhp 1/1 Terminating 0 89m 10.244.123.163 kubeclient1
pod/grafana-69d855495d-qjfgd 1/1 Running 0 8m56s 10.244.151.57 kubeclient2
pod/wordpress-5b9ddb4b9d-ld6lv 0/1 CrashLoopBackOff 5 (2m44s ago) 8m56s 10.244.151.43 kubeclient2
pod/wordpress-5b9ddb4b9d-lr2rs 1/1 Terminating 4 (85m ago) 88m 10.244.123.149 kubeclient1
pod/wordpress-mysql-65cd85d4d7-26p28 1/1 Terminating 1 (86m ago) 89m 10.244.123.170 kubeclient1
pod/wordpress-mysql-65cd85d4d7-bf79h 1/1 Running 4 (43s ago) 8m56s 10.244.151.34 kubeclient2
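This is expected Kubernetes behaviour: the node controller taints the unreachable node and marks its pods for deletion, but with kubelet down nothing ever confirms the deletion, so the pods sit in Terminating and their volumes stay attached. The taint that triggered the eviction can be inspected with:
# show the taints placed on the dead node (expect node.kubernetes.io/unreachable)
kubectl get node kubeclient1 -o jsonpath='{.spec.taints}'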
kubectl get volumeattachment
NAME ATTACHER PV NODE ATTACHED AGE
csi-0211bb93fad967b204c6254e34680757cae2c93000977ab37d11e51a596d4fed rook-ceph.cephfs.csi.ceph.com pvc-46508706-ec05-4ab1-954a-54462f0e425c kubeclient2 true 10m
csi-590226f253850369b23ee9210ea2224f4a6cffe5c969eb2e46d494b9f334bea5 rook-ceph.cephfs.csi.ceph.com pvc-cd07016b-b5e6-4b87-b1c2-cf9bd913750d kubeclient1 true 90m
csi-9f3a64fd4123fd6c014ddc69f55db0d9ff05beb6ecae5b8c869ebdac1aa6c374 rook-ceph.cephfs.csi.ceph.com pvc-4c701e0b-8447-4312-8e38-495375bcfd98 kubeclient1 true 90m
csi-b9b3787eac769b8374ff4a8e96c531762f11f653f32da1c320bb12104f5e3da3 rook-ceph.cephfs.csi.ceph.com pvc-cd07016b-b5e6-4b87-b1c2-cf9bd913750d kubeclient2 true 10m
csi-c07f7a4420565d25008d365914d4d6a1f0227b3f10f11ce49830707c7fb55e7d rook-ceph.cephfs.csi.ceph.com pvc-46508706-ec05-4ab1-954a-54462f0e425c kubeclient1 true 90m
csi-e579a076da144ca7e95b768e2cc21cdd78dc8c870cd235540d67e0fefb767fb5 rook-ceph.cephfs.csi.ceph.com pvc-4c701e0b-8447-4312-8e38-495375bcfd98 kubeclient2 true 10m
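Since these are CephFS RWX volumes, an attachment on both nodes is legal in itself; the stale entries are the three still pointing at kubeclient1. One manual cleanup path (a sketch, not tested here, reusing the attachment names from the listing above) would be to delete exactly those:
kubectl delete volumeattachment csi-590226f253850369b23ee9210ea2224f4a6cffe5c969eb2e46d494b9f334bea5
kubectl delete volumeattachment csi-9f3a64fd4123fd6c014ddc69f55db0d9ff05beb6ecae5b8c869ebdac1aa6c374
kubectl delete volumeattachment csi-c07f7a4420565d25008d365914d4d6a1f0227b3f10f11ce49830707c7fb55e7d
Bear in mind this only cleans up the Kubernetes-side bookkeeping; it does not by itself break any lock the old client still holds inside Ceph.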
So, to recover the situation, I executed the following command, although I thought Kubernetes/rook-ceph would resolve this automatically:
kubectl delete pod/grafana-69d855495d-hpfhp pod/wordpress-5b9ddb4b9d-lr2rs pod/wordpress-mysql-65cd85d4d7-26p28 --force
After doing this, the situation is as follows:
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod/grafana-69d855495d-qjfgd 1/1 Running 0 9m49s 10.244.151.57 kubeclient2
pod/wordpress-5b9ddb4b9d-ld6lv 0/1 CrashLoopBackOff 6 (24s ago) 9m49s 10.244.151.43 kubeclient2
pod/wordpress-mysql-65cd85d4d7-bf79h 1/1 Running 4 (96s ago) 9m49s 10.244.151.34 kubeclient2
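The remaining CrashLoopBackOff fits an application-level lock rather than a Kubernetes one: kubelet on kubeclient1 is down, but the old containers there may still be running and holding files open on the shared CephFS volume. The crashing pod's own logs are the quickest way to confirm what it is failing on (pod name taken from the listing above):
kubectl logs pod/wordpress-5b9ddb4b9d-ld6lv --previous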
To get the pods to start definitively, it is necessary
to execute systemctl start kubelet on node kubeclient1.
Only then is the lock released and the pods start correctly on kubeclient2.
How can this problem be solved?
I would like the application that uses the PV to start on the other node automatically,
i.e. I would like the lock held by the first node to be released automatically if the node goes down
or becomes unreachable.
Many thanks for the help
Gabriele
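For the automatic-release part: besides the out-of-service taint mentioned above, Ceph-CSI ships network fencing through the csi-addons NetworkFence CRD, which blocklists a node's IP in Ceph so its stale clients lose access and their locks can be reclaimed. A hedged sketch only; verify the field names against the csi-addons version your Rook release installs, and note the node IP, secret, and clusterID below are placeholders for a typical rook-ceph setup:
apiVersion: csiaddons.openshift.io/v1alpha1
kind: NetworkFence
metadata:
  name: fence-kubeclient1
spec:
  driver: rook-ceph.cephfs.csi.ceph.com
  fenceState: Fenced
  cidrs:
    - <node-ip>/32
  secret:
    name: rook-csi-cephfs-provisioner
    namespace: rook-ceph
  parameters:
    clusterID: rook-ceph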