
[BUG] LH continuously reports invalid customized default setting taint-toleration #4554

Closed
w13915984028 opened this issue Sep 13, 2022 · 10 comments
Labels
area/admission-webhook Kubernetes validation and mutating admission webhooks · backport/1.3.3 · component/longhorn-manager Longhorn manager (control plane) · kind/bug · priority/1 Highly recommended to implement or fix in this release (managed by PO) · reproduce/always 100% reproducible · severity/4 Function working but has a minor issue (a minor incident with low impact)
Milestone
v1.4.0

@w13915984028

Describe the bug


longhorn-manager continuously prints this error:

time="2022-09-13T10:16:30Z" level=error msg="invalid customized default setting taint-toleration with value kubevirt.io/drain:NoSchedule, will continue applying other customized settings" error="fail to set the setting taint-toleration with invalid value kubevirt.io/drain:NoSchedule: cannot modify toleration setting before all volumes are detached"

After Harvester is installed, there are 3 internal PVCs (used by rancher-monitoring, created in the bootstrap stage), no user config, and the system is in its default state.

harv2:~ # kk get volume -A
NAMESPACE         NAME                                       STATE      ROBUSTNESS   SCHEDULED   SIZE          NODE    AGE
longhorn-system   pvc-72423bf6-3805-4fb3-827b-eda326127e79   attached   healthy                  53687091200   harv2   20h
longhorn-system   pvc-cc9e1079-f5dc-409d-abd2-7d6b5c4ce23b   attached   healthy                  5368709120    harv2   20h
longhorn-system   pvc-de4ddfba-0d6a-4d7b-82fc-4b917424d025   attached   healthy                  2147483648    harv2   20h
harv2:~ # 

To Reproduce

Steps to reproduce the behavior:

  1. Install Harvester master-head, with Longhorn v1.3.1
  2. After the installation finishes, check the longhorn-manager log
  3. The error message above is printed repeatedly.

Expected behavior


Such an error message should not appear.

Log or Support bundle


Environment

  • Longhorn version: v1.3.1
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): With Harvester master-head, single-node cluster.
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version:
    • Number of management node in the cluster:
    • Number of worker node in the cluster:
  • Node config
    • OS type and version:
    • CPU per node:
    • Memory per node:
    • Disk type(e.g. SSD/NVMe):
    • Network bandwidth between the nodes:
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal):
  • Number of Longhorn volumes in the cluster:

Additional context


Related code:

// ValidateSetting checks the given setting value types and condition
func (s *DataStore) ValidateSetting(name, value string) (err error) {
	defer func() {
		err = errors.Wrapf(err, "fail to set the setting %v with invalid value %v", name, value)
	}()
...
	// This check rejects any taint-toleration change while volumes are still attached;
	// re-validating the unchanged customized default against it produces the repeated error above.
	case types.SettingNameTaintToleration:
		list, err := s.ListVolumesRO()
		if err != nil {
			return errors.Wrapf(err, "failed to list volumes before modifying toleration setting")
		}
		for _, v := range list {
			if v.Status.State != longhorn.VolumeStateDetached {
				return fmt.Errorf("cannot modify toleration setting before all volumes are detached")
			}
		}
@joshimoo
Contributor

@w13915984028 please share the default settings .yaml/Helm values file that you used, as well as the current settings. You can provide the current settings by simply uploading a support bundle.

@derekbit
Member

@w13915984028
Can you show the output of kubectl -n longhorn-system get setting taint-toleration?
If the value is set correctly, then the message is a false alarm. We will improve it.

@joshimoo joshimoo added the component/longhorn-manager Longhorn manager (control plane) label Sep 14, 2022
@innobead innobead added severity/4 Function working but has a minor issue (a minor incident with low impact) area/admission-webhook Kubernetes validation and mutating admission webhooks labels Sep 14, 2022
@innobead innobead added this to the v1.4.0 milestone Sep 14, 2022
@innobead innobead added the reproduce/always 100% reproducible label Sep 14, 2022
@w13915984028
Author

@derekbit :

time="2022-09-14T07:51:30Z" level=error msg="invalid customized default setting taint-toleration with value kubevirt.io/drain:NoSchedule, will continue applying other customized settings" error="fail to set the setting taint-toleration with invalid value kubevirt.io/drain:NoSchedule: cannot modify toleration setting before all volumes are detached"
harv41:~ # kubectl -n longhorn-system get settings.longhorn.io taint-toleration
NAME               VALUE                          AGE
taint-toleration   kubevirt.io/drain:NoSchedule   12h

harv41:~ # kubectl -n longhorn-system get settings.longhorn.io 
NAME                                                 VALUE                                              AGE
allow-node-drain-with-last-healthy-replica           false                                              12h
allow-recurring-job-while-volume-detached            false                                              12h
allow-volume-creation-with-degraded-availability     true                                               12h
auto-cleanup-system-generated-snapshot               true                                               12h
auto-delete-pod-when-volume-detached-unexpectedly    true                                               12h
auto-salvage                                         true                                               12h
backing-image-cleanup-wait-interval                  60                                                 12h
backing-image-recovery-wait-interval                 300                                                12h
backup-target                                                                                           12h
backup-target-credential-secret                                                                         12h
backupstore-poll-interval                            300                                                12h
concurrent-automatic-engine-upgrade-per-node-limit   3                                                  12h
concurrent-replica-rebuild-per-node-limit            5                                                  12h
crd-api-version                                      longhorn.io/v1beta2                                12h
create-default-disk-labeled-nodes                    false                                              12h
current-longhorn-version                             v1.3.1                                             12h
default-backing-image-manager-image                  longhornio/backing-image-manager:v3_20220808       12h
default-data-locality                                disabled                                           12h
default-data-path                                    /var/lib/harvester/defaultdisk                     12h
default-engine-image                                 longhornio/longhorn-engine:v1.3.1                  12h
default-instance-manager-image                       longhornio/longhorn-instance-manager:v1_20220808   12h
default-longhorn-static-storage-class                longhorn-static                                    12h
default-replica-count                                3                                                  12h
default-share-manager-image                          longhornio/longhorn-share-manager:v1_20220808      12h
disable-replica-rebuild                              false                                              12h
disable-revision-counter                             false                                              12h
disable-scheduling-on-cordoned-node                  true                                               12h
guaranteed-engine-cpu                                                                                   12h
guaranteed-engine-manager-cpu                        12                                                 12h
guaranteed-replica-manager-cpu                       12                                                 12h
kubernetes-cluster-autoscaler-enabled                false                                              12h
latest-longhorn-version                              v1.3.1                                             12h
mkfs-ext4-parameters                                                                                    12h
node-down-pod-deletion-policy                        do-nothing                                         12h
orphan-auto-deletion                                 false                                              12h
priority-class                                                                                          12h
registry-secret                                                                                         12h
replica-auto-balance                                 disabled                                           12h
replica-replenishment-wait-interval                  600                                                12h
replica-soft-anti-affinity                           false                                              12h
replica-zone-soft-anti-affinity                      true                                               12h
stable-longhorn-versions                             v1.1.3,v1.2.5,v1.3.1                               12h
storage-minimal-available-percentage                 25                                                 12h
storage-network                                                                                         12h
storage-over-provisioning-percentage                 200                                                12h
system-managed-components-node-selector                                                                 12h
system-managed-pods-image-pull-policy                if-not-present                                     12h
taint-toleration                                     kubevirt.io/drain:NoSchedule                       12h
upgrade-checker                                      true                                               12h
harv41:~ # 

@derekbit
Member

@w13915984028
Thanks for the confirmation.
The taint-toleration is set correctly. Sorry for the misleading message. We will improve this part.

cc @weizhe0422

@innobead innobead added the priority/1 Highly recommended to implement or fix in this release (managed by PO) label Oct 21, 2022
@weizhe0422
Contributor

I can reproduce it in my environment.
[screenshot]

I think we could add logic to check whether the setting value has already been applied; if so, skip validating the value (see the sketch after the snippet below).

func (s *DataStore) filterCustomizedDefaultSettings(customizedDefaultSettings map[string]string) map[string]string {
	availableCustomizedDefaultSettings := make(map[string]string)

	for name, value := range customizedDefaultSettings {
		// Proposed change: check whether this customized default setting value
		// already exists (i.e. is already applied). If so, bypass ValidateSetting;
		// if not, run ValidateSetting as before.
		if err := s.ValidateSetting(string(name), value); err == nil {
			availableCustomizedDefaultSettings[name] = value
		} else {
			logrus.WithError(err).Errorf("invalid customized default setting %v with value %v, will continue applying other customized settings", name, value)
		}
	}
	return availableCustomizedDefaultSettings
}
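
A minimal sketch of that check (the helper name and the GetSetting-based comparison are my assumptions, not the final implementation): look up the existing Setting CR first, and only run ValidateSetting when the customized default actually differs.

// shouldApplyCustomizedDefaultSetting is an illustrative helper, not the merged fix:
// it skips validation when the customized default value is already applied.
func (s *DataStore) shouldApplyCustomizedDefaultSetting(name types.SettingName, value string) (bool, error) {
	existing, err := s.GetSetting(name)
	if err != nil {
		return false, err
	}
	// The value is already applied: nothing to validate or update.
	if existing.Value == value {
		return false, nil
	}
	// Otherwise validate as before.
	if err := s.ValidateSetting(string(name), value); err != nil {
		return false, err
	}
	return true, nil
}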

@derekbit @innobead WDYT

@derekbit
Member

Sounds good to me.

@innobead
Member

Sounds good to me, but we also need to figure out why the ConfigMap informer uses the default resync time instead of 0. Also, I suggest changing the message to a warning instead of an error, because this condition is recognized intentionally rather than being a real error.
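
For context, in client-go a resync period of 0 disables periodic resync entirely, so handlers fire only on actual changes. A general sketch of that pattern (not Longhorn's actual informer wiring; the package name and namespace are illustrative):

package example

import (
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
)

// newConfigMapInformerFactory shows the general client-go pattern: a resync
// period of 0 means the ConfigMap handlers only fire on real create/update/delete
// events instead of being re-triggered every resync interval.
func newConfigMapInformerFactory(client kubernetes.Interface) informers.SharedInformerFactory {
	factory := informers.NewSharedInformerFactoryWithOptions(client, 0*time.Second,
		informers.WithNamespace("longhorn-system"))
	// Register the ConfigMap informer; the caller would add event handlers and start the factory.
	_ = factory.Core().V1().ConfigMaps().Informer()
	return factory
}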

@kladiv

kladiv commented Oct 21, 2022

+1 same issue.
My helm values are:

    defaultSettings:
      ...
      taintToleration: "node-role.example.com/storage:NoSchedule"
      systemManagedComponentsNodeSelector: "node-role.kubernetes.io/storage:true"
      ...

Settings:

$ kubectl -n longhorn-system get settings | egrep 'system-managed-components-node-selector|taint-toleration'
system-managed-components-node-selector              node-role.kubernetes.io/storage:true                  34h
taint-toleration                                     node-role.example.com/storage:NoSchedule            34h

@longhorn-io-github-bot

longhorn-io-github-bot commented Oct 25, 2022

Pre Ready-For-Testing Checklist

  • Where are the reproduce steps/test steps documented?

task items

  1. Customized default settings won't be updated if the setting already exists with the same value
  2. Change the customized default settings message from error to warning, because the condition is recognized intentionally rather than thrown as a failure (see the sketch after this list)
  3. Disable periodic ConfigMap re-synchronization
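
An illustrative snippet for item 2 (not the merged patch): the same logrus call as in filterCustomizedDefaultSettings above (logrus here is github.com/sirupsen/logrus, already used in that code), downgraded from Errorf to Warnf.

// warnSkippedCustomizedSetting sketches task item 2: log the skipped customized
// default setting as a warning rather than an error, since the condition (e.g.
// volumes still attached) is expected and handled, not a failure.
func warnSkippedCustomizedSetting(err error, name, value string) {
	logrus.WithError(err).Warnf("Invalid customized default setting %v with value %v, will continue applying other customized settings", name, value)
}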

The reproduce steps/test steps are at:

  • Step 1. Install Longhorn, e.g., via Helm
    • 1.1 Clone the Helm chart locally
    • 1.2 Add the toleration values in values.yaml as below:
          defaultSettings:
            taintToleration: "kubevirt.io/drain:NoSchedule"
          longhornManager:
            tolerations:
            - key: "kubevirt.io/drain"
              operator: "Equal"
              value: ""
              effect: "NoSchedule"
          longhornDriver:
            tolerations:
            - key: "kubevirt.io/drain"
              operator: "Equal"
              value: ""
              effect: "NoSchedule"
          longhornUI:
            tolerations:
            - key: "kubevirt.io/drain"
              operator: "Equal"
              value: ""
              effect: "NoSchedule"
    • 1.3 Start the installation: helm install longhorn . --namespace longhorn-system --create-namespace
  • Step 2. Check whether the setting was applied successfully
    • setting CR: kubectl get settings.longhorn.io taint-toleration -n longhorn-system
    • Instance-manager: kubectl get pod instance-manager-r-047a4c1a -n longhorn-system -o yaml -> spec/tolerations
  • Step 3. Create a volume and attach it to a node
  • Step 4. Validate Item 2
    1. Run kubectl get cm longhorn-default-setting -n longhorn-system -o yaml > configMap.yaml
    2. Remove taint-toleration from configMap.yaml as below:
    [screenshot]
    3. Apply configMap.yaml again: kubectl apply -f configMap.yaml
    4. Run kubectl logs -f longhorn-manager-wzxt6 -n longhorn-system | grep "Invalid customized default setting "
    5. The output shows a warning message like below:
    [screenshot]
  • Step 5. Detach all volumes
  • Step 6. Update the tolerations value
    • Export current value: kubectl get cm longhorn-default-setting -n longhorn-system -o yaml > default-setting-cm.yaml
    • Modify the value and apply it: kubectl apply -f default-setting-cm.yaml
  • Step 7. Validate Item 1 and Item 3
    • Run kubectl logs longhorn-manager-wzxt6 -n longhorn-system | grep "Invalid customized default setting "
    • Check that the message doesn't keep appearing every 30 seconds

@roger-ryao

Verified on master-head 20221117

  • longhorn master-head (dab0603)
  • longhorn-manager master-head (ed1f742)

The test steps

Ref #4554 (To Reproduce)

Result: Passed

  1. When the volume was attached to a node, we applied the new ConfigMap.
    We got the warning message as below:
    [screenshot]

  2. After the volume was detached, we applied the new ConfigMap. The warning message no longer appeared.
