[BUG][v1.8.0-rc2] DR volume becomes faulted after encountering restoration error: error initiating incremental backup restore: cannot find .cfg in backupstore #10105
Comments
@COLDTURNIP Can you help investigate the issue? Thanks.
Hi @yangchiu, as @derekbit mentioned, do you have the support bundles of the source clusters? Could it be a scenario like this:
@COLDTURNIP, could you check whether this scenario caused the issue?
Yes,
I have tried to reproduce the problem in a local environment:
Unfortunately, the issue did not occur during an overnight trial. We are now trying to insert some delay in the manager to verify whether there is a possible data race between the clusters and the backup store.
@yangchiu Please help check whether @COLDTURNIP's steps are correct, and reproduce a few more times to make sure. cc @longhorn/qa
Unable to reproduce the issue, even after adding more delay while fetching backups from the backup store.
Thanks @COLDTURNIP. Moved to ready-for-testing, add
This should be reproducible by manually deleting the
@yangchiu
Yes, it easily simulates
No, it happens naturally, but it's not reproducible at this point.
From the investigation, it was deleted by the recurring job. Let's close this for now, as it is not reproducible at this time.
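For future reference, a minimal sketch of the manual-deletion approach suggested above, assuming an NFS backup target mounted locally at /mnt/nfs and a source volume named pvc-example; the mount path, volume name, and file layout are illustrative assumptions, not taken from this issue:

```bash
# Assumptions: the NFS backup target is mounted at /mnt/nfs and the source volume
# is named pvc-example; adjust both to the actual environment.
VOLUME=pvc-example

# The backupstore shards each volume under two hash-derived subdirectories, so locate
# the volume directory rather than computing the path by hand.
VOLUME_DIR=$(find /mnt/nfs/backupstore/volumes -maxdepth 3 -type d -name "$VOLUME")

# List the per-backup .cfg files, then remove one while the DR volume is between
# incremental restores, so the next restore attempt cannot find the .cfg it expects.
ls "$VOLUME_DIR/backups/"
rm "$(ls "$VOLUME_DIR"/backups/backup_*.cfg | head -n 1)"
```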
No worries. Let's keep an eye on DR and incremental backup and test those functionalities more.
@COLDTURNIP Before closing the issue, can you set up the destination and source clusters and automatically execute #10105 (comment) for one day to make sure the issue won't happen? cc @yangchiu, let's run a long-term test first.
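As a rough sketch of such a long-term run, assuming the steps from the referenced comment are wrapped in a hypothetical repro-steps.sh script:

```bash
# Hypothetical driver: repeat the reproduction steps for roughly 24 hours.
# repro-steps.sh is an assumed wrapper around the steps in the referenced comment.
end=$(( $(date +%s) + 24 * 60 * 60 ))
while [ "$(date +%s)" -lt "$end" ]; do
  ./repro-steps.sh
  sleep 60
done
```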
I replicated a similar situation this morning. I provided the support bundle to @COLDTURNIP and also shared the environment with him to see if he could replicate it again. On
statefulset.yaml

apiVersion: v1
kind: Service
metadata:
  name: nginx-state-rwo
  labels:
    app: nginx-state-rwo
spec:
  ports:
  - port: 80
    name: web-state-rwo
  selector:
    app: nginx-state-rwo
  type: NodePort
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web-state-rwo
spec:
  selector:
    matchLabels:
      app: nginx-state-rwo # has to match .spec.template.metadata.labels
  serviceName: "nginx-state-rwo"
  replicas: 1 # by default is 1
  template:
    metadata:
      labels:
        app: nginx-state-rwo # has to match .spec.selector.matchLabels
    spec:
      restartPolicy: Always
      terminationGracePeriodSeconds: 10
      containers:
      - name: nginx-state-rwo
        image: nginx:stable
        livenessProbe:
          exec:
            command:
            - ls
            - /usr/share/nginx/html/lost+found
          initialDelaySeconds: 5
          periodSeconds: 5
        ports:
        - containerPort: 80
          name: web-state-rwo
        volumeMounts:
        - name: www
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: www
    spec:
      accessModes: [ "ReadWriteOnce" ]
      # accessModes:
      # - ReadWriteOnce
      # - ReadWriteMany
      storageClassName: "longhorn"
      resources:
        requests:
          storage: 1Gi
---
apiVersion: v1
kind: Service
metadata:
  name: nginx-state-rwx
  labels:
    app: nginx-state-rwx
spec:
  ports:
  - port: 80
    name: web-state-rwx
  selector:
    app: nginx-state-rwx
  type: NodePort
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web-state-rwx
spec:
  selector:
    matchLabels:
      app: nginx-state-rwx # has to match .spec.template.metadata.labels
  serviceName: "nginx-state-rwx"
  replicas: 1 # by default is 1
  template:
    metadata:
      labels:
        app: nginx-state-rwx # has to match .spec.selector.matchLabels
    spec:
      restartPolicy: Always
      terminationGracePeriodSeconds: 10
      containers:
      - name: nginx-state-rwx
        image: nginx:stable
        livenessProbe:
          exec:
            command:
            - ls
            - /usr/share/nginx/html/lost+found
          initialDelaySeconds: 5
          periodSeconds: 5
        ports:
        - containerPort: 80
          name: web-state-rwx
        volumeMounts:
        - name: www
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: www
    spec:
      accessModes: [ "ReadWriteMany" ]
      storageClassName: "longhorn"
      resources:
        requests:
          storage: 1Gi
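The manifest above would presumably be applied to the source cluster before running the workload script; a minimal sketch, assuming it is saved locally as statefulset.yaml:

```bash
# Deploy both StatefulSets and wait for their pods to become ready.
kubectl apply -f statefulset.yaml
kubectl rollout status statefulset/web-state-rwo
kubectl rollout status statefulset/web-state-rwx
```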
node-disconnect.sh

#!/bin/bash

if [ $# -ne 1 ]; then
  echo "Usage: $0 <file_size_in_MBs>"
  exit 1
fi

file_size=$1

# Write some data to the volume and sync
kubectl exec -it web-state-rwo-0 -- /bin/bash -c "dd if=/dev/urandom of=/usr/share/nginx/html/"$file_size"m bs=1M count=$file_size oflag=direct status=progress && md5sum /usr/share/nginx/html/"$file_size"m"
kubectl exec -it web-state-rwx-0 -- /bin/bash -c "dd if=/dev/urandom of=/usr/share/nginx/html/"$file_size"m bs=1M count=$file_size oflag=direct status=progress && md5sum /usr/share/nginx/html/"$file_size"m"
kubectl exec -it web-state-rwo-0 -- /bin/bash -c "sync"
kubectl exec -it web-state-rwx-0 -- /bin/bash -c "sync"
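A hypothetical invocation matching the script's usage line (the 512 MB size is arbitrary):

```bash
# Write ~512 MB of random data into each StatefulSet volume and print the checksums.
./node-disconnect.sh 512
```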
On
Added back to 1.8.0 first.
@roger-ryao Can you check whether the issue happens in v1.7.2 as well?
Okay, I will set up another environment and follow the same steps to run it for a day. I will update the results tomorrow afternoon.
I was able to reproduce the issue on volume name:
Okay, this is not a regression in 1.8.0 but an existing issue present at least since v1.7.2.
Describe the bug
The DR volume becomes permanently faulted after encountering the restoration error: error initiating incremental backup restore: cannot find .cfg in backupstore:
To Reproduce
every minute
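The reproduction steps are truncated here. Based on the "every minute" fragment and the recurring-job discussion above, the setup presumably includes a recurring backup job on the source volume that fires every minute; a sketch of such a job, where the name, group, and retain count are assumed values:

```bash
# Hypothetical recurring backup job that runs every minute; the name, group, and
# retain count below are assumptions, not values taken from this issue.
kubectl apply -f - <<'EOF'
apiVersion: longhorn.io/v1beta2
kind: RecurringJob
metadata:
  name: backup-every-minute
  namespace: longhorn-system
spec:
  name: backup-every-minute
  task: backup
  cron: "* * * * *"
  groups:
  - default
  retain: 3
  concurrency: 1
EOF
```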
Expected behavior
Support bundle for troubleshooting
supportbundle_dc5cb6f7-5584-4153-8d1e-1b823b393fe2_2024-12-31T04-05-09Z.zip
Environment
Additional context
Workaround and Mitigation