
[BUG] Longhorn Helm Chart doesn't have tolerations for CSI plugin and Engine image daemonsets #6606

Open
KostLinux opened this issue Aug 29, 2023 · 6 comments
Labels: area/install-uninstall-upgrade, area/setting, investigation-needed, kind/bug, priority/0, require/qa-review-coverage, severity/3

@KostLinux

Describe the bug (🐛 if you encounter this issue)

After upgrading Longhorn via the Helm chart, the engine-image and CSI plugin daemonsets do not have tolerations, so those components are not scheduled onto the tainted storage nodes.

To Reproduce

  1. Create a few workers, with three of them labeled as Longhorn nodes.
  2. Apply taints on the three Longhorn nodes.
  3. Upgrade Longhorn using the Helm chart, ensuring to include tolerations.
  4. Go to the Longhorn UI and navigate to Settings > Engine Image tab.
  5. Click on the engine image.
  6. Observe that the engine image is only deployed on untainted nodes.
  7. Add tolerations to the engine image using kubectl edit (see the command sketch after this list).
  8. Verify that the engine image is now deployed on the storage nodes.
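
A minimal sketch of steps 2 and 7, assuming a Storage=true:NoSchedule taint; the node names and the engine-image daemonset suffix are placeholders, since the suffix varies per install:

# Step 2: taint the three Longhorn nodes (hypothetical node names).
kubectl taint nodes worker-1 worker-2 worker-3 Storage=true:NoSchedule

# Step 7: look up the engine-image daemonset name, then add the missing
# toleration by hand (ei-xxxxxxxx stands in for your install's hash).
kubectl -n longhorn-system get daemonsets
kubectl -n longhorn-system patch daemonset engine-image-ei-xxxxxxxx --type merge -p '
spec:
  template:
    spec:
      tolerations:
      - key: Storage
        operator: Equal
        value: "true"
        effect: NoSchedule'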

Expected behavior

When upgrading Longhorn via the Helm chart, the engine-image and CSI plugin daemonsets should automatically receive the configured tolerations and be deployed on the storage nodes.

Support bundle for troubleshooting

Environment

  • Longhorn version: v1.4.2
  • Installation method: Rancher Catalog App
  • Kubernetes distro and version: RKE / Kubernetes v1.24.8
  • Number of management nodes in the cluster: 3
  • Number of worker nodes in the cluster: 8
  • Node Configuration:
    • OS type and version: Oracle Linux 8
    • CPU per node: 8
    • Memory per node: 16 GB
    • Disk type: 200 GB
  • Network bandwidth between the nodes: 1 Gbps
  • Underlying Infrastructure: VMware
  • Number of Longhorn volumes in the cluster: 500

Additional context

It is important to note that some PVCs may come from an older version of Longhorn. In such cases, the engine images for those PVCs will not be updated until the daemonsets are manually edited. Ensure that the engine image automatic update option is enabled.
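
For reference, a values sketch of enabling that option; the key name below is an assumption based on the chart mapping camelCase defaultSettings keys onto Longhorn's concurrent-automatic-engine-upgrade-per-node-limit setting, so verify it against your chart version's values.yaml:

defaultSettings:
  # Assumed key for the automatic engine image upgrade option mentioned
  # above; a value greater than 0 enables automatic upgrades (0 disables).
  concurrentAutomaticEngineUpgradePerNodeLimit: 3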

Also, the reason I've created another bug issue is that the previous one, #6103, was ignored.

@innobead
Member

Thanks for reporting this. I closed the original, so we can keep this.

This part should not have had any recent changes. cc @ChanYiLin to follow up.

c3y1huang added the area/install-uninstall-upgrade label on Sep 26, 2023
@sybnex

sybnex commented Jan 29, 2024

I ran into this issue today ... is this still on the roadmap?

@kreeger

kreeger commented Sep 2, 2024

I ran into this today as well with a fresh Helm install. I'm setting defaultSettings.taintToleration: Storage=true:NoSchedule in my values (with Storage=true:NoSchedule as the taint applied to my 3 Longhorn-only nodes), and it doesn't seem to get applied to the engine-image-ei-04c05bf8 or longhorn-csi-plugin DaemonSets.

defaultSettings:
  createDefaultDiskLabeledNodes: true
  defaultReplicaCount: 2
  taintToleration: Storage=true:NoSchedule
  removeSnapshotsDuringFilesystemTrim: enabled

I can see it clear as day in my longhorn-default-setting ConfigMap, for what it's worth, but I don't think Longhorn actually picks up this ConfigMap: if I browse to the Longhorn UI's settings, they all have their default values (including an empty toleration in the Danger Zone), and if I add a toleration there, it works. All of the Setting objects in my longhorn-system namespace likewise have their default values. Here's my ConfigMap:

create-default-disk-labeled-nodes: true
default-replica-count: 2
taint-toleration: Storage=true:NoSchedule
priority-class: longhorn-critical
disable-revision-counter: true
remove-snapshots-during-filesystem-trim: enabled

Happy to help troubleshoot; I'm on v1.30.4+rke2r1. One would think I've hit #2562, but I'm using only settings that are defined in the v1.7.0 values.yaml.
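
A quick way to compare what the chart rendered against what Longhorn actually applied (a sketch assuming the default longhorn-system namespace and the stock ConfigMap name from the chart):

# ConfigMap the chart renders from defaultSettings:
kubectl -n longhorn-system get configmap longhorn-default-setting -o yaml
# Value Longhorn actually applied for the toleration setting:
kubectl -n longhorn-system get settings.longhorn.io taint-toleration -o jsonpath='{.value}'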


  • Longhorn version: v1.7.0
  • Installation method: Helm
  • Kubernetes distro and version: v1.30.4+rke2r1
  • Number of management nodes in the cluster: 3
  • Number of worker nodes in the cluster: 6
  • Node Configuration:
    • OS type and version: Debian 12 Bookworm
    • CPU per node: 8
    • Memory per node: workers 18 GB, management 8 GB
    • Disk type: 128 GB
  • Network bandwidth between the nodes: 1 Gbps
  • Underlying Infrastructure: Proxmox
  • Number of Longhorn volumes in the cluster: 0

innobead added this to the v1.8.0 milestone on Sep 2, 2024
innobead added the area/setting, investigation-needed, priority/0, and severity/3 labels on Sep 2, 2024
@kreeger

kreeger commented Sep 2, 2024

Ah-hah: it was my Helm value for removeSnapshotsDuringFilesystemTrim causing this problem; it needed to be true or false, not enabled (which is an acceptable value in the persistence section of the values chart, but not in the defaultSettings section). I discovered this while writing a loop of Ansible patch operations to modify the Setting objects after an initial install. With the updated section in my values:

defaultSettings:
  createDefaultDiskLabeledNodes: true
  defaultReplicaCount: '2'
  taintToleration: Storage=true:NoSchedule
  removeSnapshotsDuringFilesystemTrim: true # Note this line!

…my Setting objects are now properly populated, including my taint-toleration Setting.

NAME                                      VALUE                     APPLIED   AGE
create-default-disk-labeled-nodes         true                      true      5m13s
default-replica-count                     2                         true      5m13s
taint-toleration                          Storage=true:NoSchedule   true      5m13s
remove-snapshots-during-filesystem-trim   true                      true      5m12s

So, to those hitting this issue when installing from Helm: check the values you have set for all of your defaultSettings keys in your Helm chart; some of them may not map to acceptable Longhorn values or value types.
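
One quick way to spot a rejected value (a sketch that keys off the APPLIED column in the output above, so it assumes setting values without embedded spaces):

# List any Setting that Longhorn refused to apply:
kubectl -n longhorn-system get settings.longhorn.io | awk '$3 == "false"'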

To the Longhorn maintainers: since these defaultSettings values are picked up and applied the way they are, each key could use additional specific documentation about acceptable values. Many do, but not all!

@KostLinux
Author

@mantissahz will this be fixed?
This issue has already been open for a year :D

derekbit modified the milestones: v1.8.0 → v1.9.0 on Nov 25, 2024
mantissahz moved this from Analysis and Design to New Issues in Longhorn Sprint on Nov 28, 2024
@opethema

opethema commented Jan 6, 2025

I can only see the issue when doing helm upgrade; with a fresh helm install, the tolerations are added as expected.
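
Until the chart handles this on upgrade, one possible workaround (a sketch, assuming the Setting object accepts direct edits as the earlier comments suggest; note this is a Danger Zone setting, so Longhorn may require volumes to be detached) is to patch the toleration setting after each helm upgrade:

# Hypothetical post-upgrade step; the value matches the example taint above.
kubectl -n longhorn-system patch settings.longhorn.io taint-toleration \
  --type merge -p '{"value": "Storage=true:NoSchedule"}'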
