
[BUG] Duplicate channel close error in the backing image management related components #4865

Closed
shuo-wu opened this issue Nov 14, 2022 · 14 comments
Assignees
Labels
area/backing-image (backing image related), backport/1.3.3, component/longhorn-manager (Longhorn manager, control plane), kind/bug, priority/0 (must be implemented or fixed in this release, managed by PO), reproduce/always (100% reproducible), severity/1 (function broken: a critical incident with very high impact, e.g. data corruption, failed upgrade)
Milestone

Comments

@shuo-wu
Contributor

shuo-wu commented Nov 14, 2022

Describe the bug (🐛 if you encounter this issue)

There is a duplicate channel close that leads to a longhorn-manager panic.
For example, when the backing image data source is unreachable for a while and the monitor fails to reconnect to it, the stop channel gets closed both in the sync function and in a separate stop-monitoring function.
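
A minimal Go sketch of the mechanism (hypothetical code, not the actual longhorn-manager implementation): once two code paths share the same stop channel, whichever one closes it second triggers the runtime panic shown in the log below.

    package main

    func main() {
        stop := make(chan struct{})

        // First code path (e.g. the sync function giving up after the max retry count)
        // closes the stop channel to end monitoring.
        close(stop)

        // Second code path (e.g. a separate stop-monitoring function) closes the same
        // channel again, and the runtime panics with "close of closed channel".
        close(stop)
    }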

To Reproduce

Steps to reproduce the behavior:

  1. Launch a large backing image
  2. While the backing image is downloading and the backing image data source pod is running, freeze the process or cut the network for the data source pod, then wait for a while
  3. Check the longhorn-manager pod; it shows the error panic: close of closed channel

Expected behavior

The backing image download will be marked as failed without any panic

Log or Support bundle

longhorn-manager-bkx9h longhorn-manager time="2022-11-14T01:49:45Z" level=error msg="There's no available disk for replica pvc-0c193aed-19fb-47dc-804d-e41d48094646-r-cd5ed9e9, size 107374182400"
longhorn-manager-bkx9h longhorn-manager time="2022-11-14T01:49:45Z" level=error msg="unable to schedule replica" accessMode=rwo controller=longhorn-volume frontend=blockdev migratable=false node=harvester-v7tfx owner=harvester-v7tfx replica=pvc-0c193aed-19fb-47dc-804d-e41d48094646-r-cd5ed9e9 state=attached volume=pvc-0c193aed-19fb-47dc-804d-e41d48094646
longhorn-manager-bkx9h longhorn-manager time="2022-11-14T01:49:48Z" level=error msg="failed to get default-image-5jcqf info from backing image data source server: get failed, err: Get \"http://10.52.1.23:8000/v1/file\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" backingImageDataSource=default-image-5jcqf controller=longhorn-backing-image-data-source diskUUID=95813962-0cf4-4224-8378-49a0cec8ae93 node=harvester-v7tfx nodeID=harvester-v7tfx parameters="map[url:http://harvester-vm-import-controller.harvester-system.svc:8080/gm-rke-default-disk-0.img]" sourceType=download
longhorn-manager-bkx9h longhorn-manager time="2022-11-14T01:49:48Z" level=warning msg="Stop monitoring since monitor default-image-5jcqf sync reaches the max retry count 10" backingImageDataSource=default-image-5jcqf controller=longhorn-backing-image-data-source diskUUID=95813962-0cf4-4224-8378-49a0cec8ae93 node=harvester-v7tfx nodeID=harvester-v7tfx parameters="map[url:http://harvester-vm-import-controller.harvester-system.svc:8080/gm-rke-default-disk-0.img]" sourceType=download
longhorn-manager-bkx9h longhorn-manager time="2022-11-14T01:49:48Z" level=info msg="Stopping monitoring" backingImageDataSource=default-image-5jcqf controller=longhorn-backing-image-data-source node=harvester-v7tfx
longhorn-manager-bkx9h longhorn-manager panic: close of closed channel
longhorn-manager-bkx9h longhorn-manager 
longhorn-manager-bkx9h longhorn-manager goroutine 3867 [running]:
longhorn-manager-bkx9h longhorn-manager github.com/longhorn/longhorn-manager/controller.(*BackingImageDataSourceController).stopMonitoring(0xc0004c0000, {0xc00370e270, 0x13})
longhorn-manager-bkx9h longhorn-manager         /go/src/github.com/longhorn/longhorn-manager/controller/backing_image_data_source_controller.go:930 +0x145
longhorn-manager-bkx9h longhorn-manager github.com/longhorn/longhorn-manager/controller.(*BackingImageDataSourceController).startMonitoring.func1()
longhorn-manager-bkx9h longhorn-manager         /go/src/github.com/longhorn/longhorn-manager/controller/backing_image_data_source_controller.go:970 +0x4c
longhorn-manager-bkx9h longhorn-manager created by github.com/longhorn/longhorn-manager/controller.(*BackingImageDataSourceController).startMonitoring
longhorn-manager-bkx9h longhorn-manager         /go/src/github.com/longhorn/longhorn-manager/controller/backing_image_data_source_controller.go:968 +0x465
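
For context, a common Go pattern for making a stop channel safe against concurrent or repeated closers is to guard the close with sync.Once. This is only an illustrative sketch of that pattern (the monitor type and stop method below are hypothetical), not the actual fix applied in longhorn-manager.

    package main

    import "sync"

    // monitor is a hypothetical stand-in for the per-backing-image monitor state.
    type monitor struct {
        stopCh   chan struct{}
        stopOnce sync.Once
    }

    // stop may be called from multiple code paths (e.g. sync retry exhaustion and an
    // explicit stop-monitoring call); the channel is closed exactly once.
    func (m *monitor) stop() {
        m.stopOnce.Do(func() {
            close(m.stopCh)
        })
    }

    func main() {
        m := &monitor{stopCh: make(chan struct{})}
        m.stop()
        m.stop() // safe: the second call is a no-op instead of a panic
    }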

Environment

  • Longhorn version: v1.3.2 (Harvester v1.1.0)
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl):

Additional context

Thanks to @ibrokethecloud for reporting!

@shuo-wu added the kind/bug, component/longhorn-manager, and area/backing-image labels on Nov 14, 2022
@innobead added the priority/0, severity/1, and reproduce/always labels on Nov 14, 2022
@innobead added this to the v1.4.0 milestone on Nov 14, 2022
@innobead assigned weizhe0422 and unassigned shuo-wu on Nov 28, 2022
@roger-ryao

Hi @innobead
I will verify this issue after @weizhe0422 fixes it.

@longhorn-io-github-bot

longhorn-io-github-bot commented Dec 7, 2022

Pre Ready-For-Testing Checklist

  • Where are the reproduce/test steps documented?
    The reproduce/test steps are:
  1. Launch a large backing image and check that the BackingImageDataSource pod is launched
    • Create a large volume, and create a backing image via Export from a Longhorn volume
  2. Block all ingress and egress traffic by applying a network policy, and check that the progress status is paused after a few seconds.
    kind: NetworkPolicy
    apiVersion: networking.k8s.io/v1
    metadata:
      name: deny-pod-all-packages
      namespace: longhorn-system
    spec:
      podSelector:
        matchLabels:
          longhorn.io/backing-image-data-source: <your-backing-image-name>
          longhorn.io/component: backing-image-data-source
      policyTypes:
      - Ingress
      - Egress
  3. Check the logs of the longhorn-manager pod; they should not show the error panic: close of closed channel. And after the monitoring is stopped because it cannot connect to the data source pod, the next round of monitoring will continue.

@weizhe0422
Contributor

Update: the e2e test result for the backing image is a pass.

@roger-ryao

Verified on master-head 20221208

The test steps

Scenario 1
Ref #4865 (comment)

Scenario 2

  1. Upload a local file as a backing image and check that the BackingImageDataSource pod is launched
  2. Access the BackingImageDataSource pod's node
  3. Find the process ID and freeze the process:
    ps aux | grep 'backing-image-manager --debug data-source' |grep -v grep | awk '{print $2}' | xargs -i kill -STOP {}

Result
Scenario 1

  1. Unblock the BackingImageDataSource pod's ingress and egress traffic
  2. The backing image download continues and is marked as ready


Scenario 2

  1. Unfreeze the process
    ps aux | grep 'backing-image-manager --debug data-source' |grep -v grep | awk '{print $2}' | xargs -i kill -CONT {}
  2. The backing image download is marked as failed without any panic

@w13915984028

@roger-ryao @shuo-wu : what is the largest file ever tested?

Users are reporting that uploading a 75GB local file failed.
harvester/harvester#3450

Could it be related to pod resource restrictions or disk space pressure? Thanks.
cc @chrisho

@innobead
Member

innobead commented Feb 14, 2023

@w13915984028 Which Longhorn version is the user using? This fix is only going into 1.4.0 and 1.3.3.

@w13915984028

w13915984028 commented Feb 14, 2023

@innobead
The issue is reported in Harvester v1.1.1, which has longhorn: helm.sh/chart: longhorn-1.3.2.

This fix, #4865, should not be in Harvester v1.1.1.

Furthermore, could a big file cause potential issues for Longhorn? That's why I want to know the largest single file size ever tested. Thanks.

@innobead
Member

innobead commented Feb 14, 2023

I see. If that's streaming, I don't think that's a problem. Let's see the update from @roger-ryao / @shuo-wu.

@roger-ryao


Hi @w13915984028
I verified this issue by uploading a 5GB local file.

@w13915984028

@roger-ryao @innobead

We suspect that large image files (e.g. 50GB ~ 100GB) may put pressure on the pods in the uploading path. When possible, please test with such big image files, and also add them to the automated tests.

Longhorn should then have a specification stating which image sizes can be handled smoothly.

thanks.

@shuo-wu
Contributor Author

shuo-wu commented Feb 15, 2023

@w13915984028 Is the large file upload failure caused by this kind of issue?
#4902

@w13915984028

@shuo-wu
The straightforward issue is harvester/harvester#3450: some pod crashed in the middle of uploading a 75GB single-file image.

#4902 touches more components; the vm-importer is an upper-layer controller that utilizes the uploading functionality.

@innobead
Member

@w13915984028 Let's create another issue to track large file uploading instead, to clarify this further.

@w13915984028

w13915984028 commented Feb 15, 2023

OK, let's discuss further in #5395, thanks.
