
[BUG] LH continuously reports invalid customized default setting taint-toleration #4554

Closed
w13915984028 opened this issue Sep 13, 2022 · 10 comments
Labels
area/admission-webhook Kubernetes validation and mutating admission webhooks · backport/1.3.3 · component/longhorn-manager Longhorn manager (control plane) · kind/bug · priority/1 Highly recommended to implement or fix in this release (managed by PO) · reproduce/always 100% reproducible · severity/4 Function working but has a minor issue (a minor incident with low impact)
Milestone
v1.4.0

@w13915984028

Describe the bug


longhorn-manager continuously prints this error:

time="2022-09-13T10:16:30Z" level=error msg="invalid customized default setting taint-toleration with value kubevirt.io/drain:NoSchedule, will continue applying other customized settings" error="fail to set the setting taint-toleration with invalid value kubevirt.io/drain:NoSchedule: cannot modify toleration setting before all volumes are detached"

After Harvester is installed, there are 3 internal PVCs (used by rancher-monitoring, created in the bootstrap stage), no user config, and the system is in its default state.

harv2:~ # kk get volume -A
NAMESPACE         NAME                                       STATE      ROBUSTNESS   SCHEDULED   SIZE          NODE    AGE
longhorn-system   pvc-72423bf6-3805-4fb3-827b-eda326127e79   attached   healthy                  53687091200   harv2   20h
longhorn-system   pvc-cc9e1079-f5dc-409d-abd2-7d6b5c4ce23b   attached   healthy                  5368709120    harv2   20h
longhorn-system   pvc-de4ddfba-0d6a-4d7b-82fc-4b917424d025   attached   healthy                  2147483648    harv2   20h
harv2:~ # 

To Reproduce

Steps to reproduce the behavior:

  1. Install Harvester master-head, with Longhorn v1.3.1
  2. After the installation finishes, check the longhorn-manager log
  3. The error message above is printed repeatedly.

Expected behavior


Such an error message should not appear.

Log or Support bundle


Environment

  • Longhorn version: v1.3.1
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): With Harvester master-head, single-node cluster.
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version:
    • Number of management node in the cluster:
    • Number of worker node in the cluster:
  • Node config
    • OS type and version:
    • CPU per node:
    • Memory per node:
    • Disk type(e.g. SSD/NVMe):
    • Network bandwidth between the nodes:
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal):
  • Number of Longhorn volumes in the cluster:

Additional context


Related code:

// ValidateSetting checks the given setting value types and condition
func (s *DataStore) ValidateSetting(name, value string) (err error) {
	defer func() {
		err = errors.Wrapf(err, "fail to set the setting %v with invalid value %v", name, value)
	}()
...
	// This check rejects any taint-toleration change while volumes are still attached;
	// re-validating the unchanged customized default against it produces the repeated error above.
	case types.SettingNameTaintToleration:
		list, err := s.ListVolumesRO()
		if err != nil {
			return errors.Wrapf(err, "failed to list volumes before modifying toleration setting")
		}
		for _, v := range list {
			if v.Status.State != longhorn.VolumeStateDetached {
				return fmt.Errorf("cannot modify toleration setting before all volumes are detached")
			}
		}
@joshimoo
Contributor

@w13915984028 please share the default settings .yaml/Helm values file that you used, as well as the current settings. You can provide the current settings by simply uploading a support bundle.

@derekbit
Member

@w13915984028
Can you show the output of kubectl -n longhorn-system get setting taint-toleration?
If the value is set correctly, then the message is a false alarm. We will improve it.

@joshimoo joshimoo added the component/longhorn-manager Longhorn manager (control plane) label Sep 14, 2022
@innobead innobead added severity/4 Function working but has a minor issue (a minor incident with low impact) area/admission-webhook Kubernetes validation and mutating admission webhooks labels Sep 14, 2022
@innobead innobead added this to the v1.4.0 milestone Sep 14, 2022
@innobead innobead added the reproduce/always 100% reproducible label Sep 14, 2022
@w13915984028
Author

@derekbit :

time="2022-09-14T07:51:30Z" level=error msg="invalid customized default setting taint-toleration with value kubevirt.io/drain:NoSchedule, will continue applying other customized settings" error="fail to set the setting taint-toleration with invalid value kubevirt.io/drain:NoSchedule: cannot modify toleration setting before all volumes are detached"
harv41:~ # kubectl -n longhorn-system get settings.longhorn.io taint-toleration
NAME               VALUE                          AGE
taint-toleration   kubevirt.io/drain:NoSchedule   12h

harv41:~ # kubectl -n longhorn-system get settings.longhorn.io 
NAME                                                 VALUE                                              AGE
allow-node-drain-with-last-healthy-replica           false                                              12h
allow-recurring-job-while-volume-detached            false                                              12h
allow-volume-creation-with-degraded-availability     true                                               12h
auto-cleanup-system-generated-snapshot               true                                               12h
auto-delete-pod-when-volume-detached-unexpectedly    true                                               12h
auto-salvage                                         true                                               12h
backing-image-cleanup-wait-interval                  60                                                 12h
backing-image-recovery-wait-interval                 300                                                12h
backup-target                                                                                           12h
backup-target-credential-secret                                                                         12h
backupstore-poll-interval                            300                                                12h
concurrent-automatic-engine-upgrade-per-node-limit   3                                                  12h
concurrent-replica-rebuild-per-node-limit            5                                                  12h
crd-api-version                                      longhorn.io/v1beta2                                12h
create-default-disk-labeled-nodes                    false                                              12h
current-longhorn-version                             v1.3.1                                             12h
default-backing-image-manager-image                  longhornio/backing-image-manager:v3_20220808       12h
default-data-locality                                disabled                                           12h
default-data-path                                    /var/lib/harvester/defaultdisk                     12h
default-engine-image                                 longhornio/longhorn-engine:v1.3.1                  12h
default-instance-manager-image                       longhornio/longhorn-instance-manager:v1_20220808   12h
default-longhorn-static-storage-class                longhorn-static                                    12h
default-replica-count                                3                                                  12h
default-share-manager-image                          longhornio/longhorn-share-manager:v1_20220808      12h
disable-replica-rebuild                              false                                              12h
disable-revision-counter                             false                                              12h
disable-scheduling-on-cordoned-node                  true                                               12h
guaranteed-engine-cpu                                                                                   12h
guaranteed-engine-manager-cpu                        12                                                 12h
guaranteed-replica-manager-cpu                       12                                                 12h
kubernetes-cluster-autoscaler-enabled                false                                              12h
latest-longhorn-version                              v1.3.1                                             12h
mkfs-ext4-parameters                                                                                    12h
node-down-pod-deletion-policy                        do-nothing                                         12h
orphan-auto-deletion                                 false                                              12h
priority-class                                                                                          12h
registry-secret                                                                                         12h
replica-auto-balance                                 disabled                                           12h
replica-replenishment-wait-interval                  600                                                12h
replica-soft-anti-affinity                           false                                              12h
replica-zone-soft-anti-affinity                      true                                               12h
stable-longhorn-versions                             v1.1.3,v1.2.5,v1.3.1                               12h
storage-minimal-available-percentage                 25                                                 12h
storage-network                                                                                         12h
storage-over-provisioning-percentage                 200                                                12h
system-managed-components-node-selector                                                                 12h
system-managed-pods-image-pull-policy                if-not-present                                     12h
taint-toleration                                     kubevirt.io/drain:NoSchedule                       12h
upgrade-checker                                      true                                               12h
harv41:~ # 

@derekbit
Member

@w13915984028
Thanks for the confirmation.
The taint-toleration is set correctly. Sorry for the misleading message. We will improve this part.

cc @weizhe0422

@innobead innobead added the priority/1 Highly recommended to implement or fix in this release (managed by PO) label Oct 21, 2022
@weizhe0422
Contributor

I can reproduce it in my environment.
[screenshot]

I think we could add logic to check whether the setting value has already been applied; if so, skip validating the value (see the sketch after the snippet below).

func (s *DataStore) filterCustomizedDefaultSettings(customizedDefaultSettings map[string]string) map[string]string {
	availableCustomizedDefaultSettings := make(map[string]string)

	for name, value := range customizedDefaultSettings {
		// Proposed change: check whether this customized default setting value
		// already exists (i.e. is already applied). If so, bypass ValidateSetting;
		// if not, run ValidateSetting as before.
		if err := s.ValidateSetting(string(name), value); err == nil {
			availableCustomizedDefaultSettings[name] = value
		} else {
			logrus.WithError(err).Errorf("invalid customized default setting %v with value %v, will continue applying other customized settings", name, value)
		}
	}
	return availableCustomizedDefaultSettings
}
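
A minimal sketch of that check (the helper name and the GetSetting-based comparison are my assumptions, not the final implementation): look up the existing Setting CR first, and only run ValidateSetting when the customized default actually differs.

// shouldApplyCustomizedDefaultSetting is an illustrative helper, not the merged fix:
// it skips validation when the customized default value is already applied.
func (s *DataStore) shouldApplyCustomizedDefaultSetting(name types.SettingName, value string) (bool, error) {
	existing, err := s.GetSetting(name)
	if err != nil {
		return false, err
	}
	// The value is already applied: nothing to validate or update.
	if existing.Value == value {
		return false, nil
	}
	// Otherwise validate as before.
	if err := s.ValidateSetting(string(name), value); err != nil {
		return false, err
	}
	return true, nil
}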

@derekbit @innobead WDYT

@derekbit
Member

Sounds good to me.

@innobead
Member

Sounds good to me, but we also need to figure out why the ConfigMap informer uses the default resync time instead of 0. Also, I suggest changing the message to a warning instead of an error, because this condition is recognized intentionally rather than being a real error.
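
For context, in client-go a resync period of 0 disables periodic resync entirely, so handlers fire only on actual changes. A general sketch of that pattern (not Longhorn's actual informer wiring; the package name and namespace are illustrative):

package example

import (
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
)

// newConfigMapInformerFactory shows the general client-go pattern: a resync
// period of 0 means the ConfigMap handlers only fire on real create/update/delete
// events instead of being re-triggered every resync interval.
func newConfigMapInformerFactory(client kubernetes.Interface) informers.SharedInformerFactory {
	factory := informers.NewSharedInformerFactoryWithOptions(client, 0*time.Second,
		informers.WithNamespace("longhorn-system"))
	// Register the ConfigMap informer; the caller would add event handlers and start the factory.
	_ = factory.Core().V1().ConfigMaps().Informer()
	return factory
}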

@kladiv

kladiv commented Oct 21, 2022

+1 same issue.
My helm values are:

    defaultSettings:
      ...
      taintToleration: "node-role.example.com/storage:NoSchedule"
      systemManagedComponentsNodeSelector: "node-role.kubernetes.io/storage:true"
      ...

Settings:

$ kubectl -n longhorn-system get settings | egrep 'system-managed-components-node-selector|taint-toleration'
system-managed-components-node-selector              node-role.kubernetes.io/storage:true                  34h
taint-toleration                                     node-role.example.com/storage:NoSchedule            34h

@longhorn-io-github-bot

longhorn-io-github-bot commented Oct 25, 2022

Pre Ready-For-Testing Checklist

  • Where are the reproduce steps/test steps documented?

task items

  1. Customized default settings won't be updated if the setting already exists with the same value
  2. Change the customized default settings message from error to warning, because the condition is recognized intentionally rather than thrown as a failure (see the sketch after this list)
  3. Disable periodic ConfigMap re-synchronization
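
An illustrative snippet for item 2 (not the merged patch): the same logrus call as in filterCustomizedDefaultSettings above (logrus here is github.com/sirupsen/logrus, already used in that code), downgraded from Errorf to Warnf.

// warnSkippedCustomizedSetting sketches task item 2: log the skipped customized
// default setting as a warning rather than an error, since the condition (e.g.
// volumes still attached) is expected and handled, not a failure.
func warnSkippedCustomizedSetting(err error, name, value string) {
	logrus.WithError(err).Warnf("Invalid customized default setting %v with value %v, will continue applying other customized settings", name, value)
}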

The reproduce steps/test steps are at:

  • Step 1. Install Longhorn, e.g., via Helm
    • 1.1 Clone the Helm chart locally
    • 1.2 Add the toleration values in values.yaml as below:
          defaultSettings:
            taintToleration: "kubevirt.io/drain:NoSchedule"
          longhornManager:
            tolerations:
            - key: "kubevirt.io/drain"
              operator: "Equal"
              value: ""
              effect: "NoSchedule"
          longhornDriver:
            tolerations:
            - key: "kubevirt.io/drain"
              operator: "Equal"
              value: ""
              effect: "NoSchedule"
          longhornUI:
            tolerations:
            - key: "kubevirt.io/drain"
              operator: "Equal"
              value: ""
              effect: "NoSchedule"
    • 1.3 Start the installation: helm install longhorn . --namespace longhorn-system --create-namespace
  • Step 2. Check whether the setting was applied successfully
    • setting CR: kubectl get settings.longhorn.io taint-toleration -n longhorn-system
    • Instance-manager: kubectl get pod instance-manager-r-047a4c1a -n longhorn-system -o yaml -> spec/tolerations
  • Step 3. Create a volume and attach it to a node
  • Step 4. Validate Item 2
    1. Run kubectl get cm longhorn-default-setting -n longhorn-system -o yaml > configMap.yaml
    2. Remove taint-toleration from configMap.yaml as below:
    [screenshot]
    3. Apply configMap.yaml again: kubectl apply -f configMap.yaml
    4. Run kubectl logs -f longhorn-manager-wzxt6 -n longhorn-system | grep "Invalid customized default setting "
    5. The output shows a warning message like below:
    [screenshot]
  • Step 5. Detach all volumes
  • Step 6. Update the tolerations value
    • Export current value: kubectl get cm longhorn-default-setting -n longhorn-system -o yaml > default-setting-cm.yaml
    • Modify the value and apply it: kubectl apply -f default-setting-cm.yaml
  • Step 7. Validate Item 1 and Item 3
    • Run kubectl logs longhorn-manager-wzxt6 -n longhorn-system | grep "Invalid customized default setting "
    • Check that the message doesn't keep appearing every 30 seconds

@roger-ryao

Verified on master-head 20221117

  • longhorn master-head (dab0603)
  • longhorn-manager master-head (ed1f742)

The test steps

Ref #4554 (To Reproduce)

Result: Passed

  1. When the volume was attached to a node, we applied the new ConfigMap.
    We got the warning message as below:
    [screenshot]

  2. After the volume was detached, we applied the new ConfigMap. The warning message no longer appeared.
