
[BUG] Duplicate channel close error in the backing image management related components #4865

Closed
shuo-wu opened this issue Nov 14, 2022 · 14 comments
Assignees
Labels
area/backing-image (backing image related), backport/1.3.3, component/longhorn-manager (Longhorn manager, control plane), kind/bug, priority/0 (must be implemented or fixed in this release, managed by PO), reproduce/always (100% reproducible), severity/1 (function broken: a critical incident with very high impact, e.g. data corruption, failed upgrade)
Milestone

Comments

@shuo-wu
Contributor

shuo-wu commented Nov 14, 2022

Describe the bug (🐛 if you encounter this issue)

There is a duplicate channel close that leads to a longhorn-manager panic.
For example, when the backing image data source is unreachable for a while and the monitor fails to reconnect to it, the stop channel gets closed both in the sync function and in a separate stop-monitoring function.
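
A minimal Go sketch of the mechanism (hypothetical code, not the actual longhorn-manager implementation): once two code paths share the same stop channel, whichever one closes it second triggers the runtime panic shown in the log below.

    package main

    func main() {
        stop := make(chan struct{})

        // First code path (e.g. the sync function giving up after the max retry count)
        // closes the stop channel to end monitoring.
        close(stop)

        // Second code path (e.g. a separate stop-monitoring function) closes the same
        // channel again, and the runtime panics with "close of closed channel".
        close(stop)
    }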

To Reproduce

Steps to reproduce the behavior:

  1. Launch a large backing image
  2. While the backing image is downloading and the backing image data source pod is running, freeze the process or cut the network for the data source pod, then wait for a while
  3. Check the longhorn-manager pod; it shows the error panic: close of closed channel

Expected behavior

The backing image download will be marked as failed without any panic

Log or Support bundle

longhorn-manager-bkx9h longhorn-manager time="2022-11-14T01:49:45Z" level=error msg="There's no available disk for replica pvc-0c193aed-19fb-47dc-804d-e41d48094646-r-cd5ed9e9, size 107374182400"
longhorn-manager-bkx9h longhorn-manager time="2022-11-14T01:49:45Z" level=error msg="unable to schedule replica" accessMode=rwo controller=longhorn-volume frontend=blockdev migratable=false node=harvester-v7tfx owner=harvester-v7tfx replica=pvc-0c193aed-19fb-47dc-804d-e41d48094646-r-cd5ed9e9 state=attached volume=pvc-0c193aed-19fb-47dc-804d-e41d48094646
longhorn-manager-bkx9h longhorn-manager time="2022-11-14T01:49:48Z" level=error msg="failed to get default-image-5jcqf info from backing image data source server: get failed, err: Get \"http://10.52.1.23:8000/v1/file\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" backingImageDataSource=default-image-5jcqf controller=longhorn-backing-image-data-source diskUUID=95813962-0cf4-4224-8378-49a0cec8ae93 node=harvester-v7tfx nodeID=harvester-v7tfx parameters="map[url:http://harvester-vm-import-controller.harvester-system.svc:8080/gm-rke-default-disk-0.img]" sourceType=download
longhorn-manager-bkx9h longhorn-manager time="2022-11-14T01:49:48Z" level=warning msg="Stop monitoring since monitor default-image-5jcqf sync reaches the max retry count 10" backingImageDataSource=default-image-5jcqf controller=longhorn-backing-image-data-source diskUUID=95813962-0cf4-4224-8378-49a0cec8ae93 node=harvester-v7tfx nodeID=harvester-v7tfx parameters="map[url:http://harvester-vm-import-controller.harvester-system.svc:8080/gm-rke-default-disk-0.img]" sourceType=download
longhorn-manager-bkx9h longhorn-manager time="2022-11-14T01:49:48Z" level=info msg="Stopping monitoring" backingImageDataSource=default-image-5jcqf controller=longhorn-backing-image-data-source node=harvester-v7tfx
longhorn-manager-bkx9h longhorn-manager panic: close of closed channel
longhorn-manager-bkx9h longhorn-manager 
longhorn-manager-bkx9h longhorn-manager goroutine 3867 [running]:
longhorn-manager-bkx9h longhorn-manager github.com/longhorn/longhorn-manager/controller.(*BackingImageDataSourceController).stopMonitoring(0xc0004c0000, {0xc00370e270, 0x13})
longhorn-manager-bkx9h longhorn-manager         /go/src/github.com/longhorn/longhorn-manager/controller/backing_image_data_source_controller.go:930 +0x145
longhorn-manager-bkx9h longhorn-manager github.com/longhorn/longhorn-manager/controller.(*BackingImageDataSourceController).startMonitoring.func1()
longhorn-manager-bkx9h longhorn-manager         /go/src/github.com/longhorn/longhorn-manager/controller/backing_image_data_source_controller.go:970 +0x4c
longhorn-manager-bkx9h longhorn-manager created by github.com/longhorn/longhorn-manager/controller.(*BackingImageDataSourceController).startMonitoring
longhorn-manager-bkx9h longhorn-manager         /go/src/github.com/longhorn/longhorn-manager/controller/backing_image_data_source_controller.go:968 +0x465
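
For context, a common Go pattern for making a stop channel safe against concurrent or repeated closers is to guard the close with sync.Once. This is only an illustrative sketch of that pattern (the monitor type and stop method below are hypothetical), not the actual fix applied in longhorn-manager.

    package main

    import "sync"

    // monitor is a hypothetical stand-in for the per-backing-image monitor state.
    type monitor struct {
        stopCh   chan struct{}
        stopOnce sync.Once
    }

    // stop may be called from multiple code paths (e.g. sync retry exhaustion and an
    // explicit stop-monitoring call); the channel is closed exactly once.
    func (m *monitor) stop() {
        m.stopOnce.Do(func() {
            close(m.stopCh)
        })
    }

    func main() {
        m := &monitor{stopCh: make(chan struct{})}
        m.stop()
        m.stop() // safe: the second call is a no-op instead of a panic
    }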

Environment

  • Longhorn version: v1.3.2 (Harvester v1.1.0)
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl):

Additional context

Thanks to @ibrokethecloud for reporting!

@shuo-wu added the kind/bug, component/longhorn-manager, and area/backing-image labels on Nov 14, 2022
@innobead added the priority/0, severity/1, and reproduce/always labels on Nov 14, 2022
@innobead added this to the v1.4.0 milestone on Nov 14, 2022
@innobead assigned weizhe0422 and unassigned shuo-wu on Nov 28, 2022
@roger-ryao

Hi @innobead
I will verify this issue after @weizhe0422 fixes it.

@longhorn-io-github-bot

longhorn-io-github-bot commented Dec 7, 2022

Pre Ready-For-Testing Checklist

  • Where are the reproduce/test steps documented?
    The reproduce/test steps are:
  1. Launch a large backing image and check that the BackingImageDataSource pod is launched
    • Create a large volume, and create a backing image via Export from a Longhorn volume
  2. Block all ingress and egress traffic by applying a network policy, and check that the progress status is paused after a few seconds.
    kind: NetworkPolicy
    apiVersion: networking.k8s.io/v1
    metadata:
      name: deny-pod-all-packages
      namespace: longhorn-system
    spec:
      podSelector:
        matchLabels:
          longhorn.io/backing-image-data-source: <your-backing-image-name>
          longhorn.io/component: backing-image-data-source
      policyTypes:
      - Ingress
      - Egress
  3. Check the logs of the longhorn-manager pod; they should not show the error panic: close of closed channel. And after the monitoring is stopped because it cannot connect to the data source pod, the next round of monitoring will continue.

@weizhe0422
Contributor

Update: the e2e test result for the backing image is a pass.

@roger-ryao

Verified on master-head 20221208

The test steps

Scenario 1
Ref #4865 (comment)

Scenario 2

  1. Upload a local file as a backing image and check that the BackingImageDataSource pod is launched
  2. Access the BackingImageDataSource pod's node
  3. Find the process ID and freeze the process:
    ps aux | grep 'backing-image-manager --debug data-source' |grep -v grep | awk '{print $2}' | xargs -i kill -STOP {}

Result
Scenario 1

  1. Unblock the BackingImageDataSource pod's ingress and egress traffic
  2. The backing image download continues and is marked as ready


Scenario 2

  1. Unfreeze the process
    ps aux | grep 'backing-image-manager --debug data-source' |grep -v grep | awk '{print $2}' | xargs -i kill -CONT {}
  2. The backing image download is marked as failed without any panic

@w13915984028

@roger-ryao @shuo-wu : what is the largest file ever tested?

Users are reporting that uploading a 75GB local file failed.
harvester/harvester#3450

Could it be related to pod resource restrictions or disk space pressure? Thanks.
cc @chrisho

@innobead
Member

innobead commented Feb 14, 2023

@w13915984028 Which Longhorn version is the user using? This fix is only going into 1.4.0 and 1.3.3.

@w13915984028

w13915984028 commented Feb 14, 2023

@innobead
The issue is reported in Harvester v1.1.1, which has longhorn: helm.sh/chart: longhorn-1.3.2.

This fix, #4865, should not be in Harvester v1.1.1.

Furthermore, could a big file cause potential issues for Longhorn? That's why I want to know the largest single file size ever tested. Thanks.

@innobead
Member

innobead commented Feb 14, 2023

I see. If that's streaming, I don't think that's a problem. Let's see the update from @roger-ryao / @shuo-wu.

@roger-ryao


Hi @w13915984028
I verified this issue by uploading a 5GB local file.

@w13915984028

@roger-ryao @innobead

We suspect that large image files (e.g. 50GB ~ 100GB) may put pressure on the pods in the uploading path. When possible, please test with such big image files, and also add them to the automated tests.

Longhorn should then have a specification stating which image sizes can be handled smoothly.

thanks.

@shuo-wu
Contributor Author

shuo-wu commented Feb 15, 2023

@w13915984028 Is the large file upload failure caused by this kind of issue?
#4902

@w13915984028

@shuo-wu
The straightforward issue is harvester/harvester#3450: some pod crashed in the middle of uploading a 75GB single-file image.

#4902 touches more components; the vm-importer is an upper-layer controller that utilizes the uploading functionality.

@innobead
Member

@w13915984028 Let's create another issue to track large file uploading instead, to clarify this further.

@w13915984028

w13915984028 commented Feb 15, 2023

OK, let's discuss further in #5395, thanks.
