Failure cluster [c199077e...]: Pod InPlace Resize Container pod-resize-resource-quota-test #128840

Open
pohly opened this issue Nov 18, 2024 · 5 comments
Assignees
Labels
kind/failing-test, kind/flake, sig/node, triage/accepted

Comments

@pohly
Contributor

pohly commented Nov 18, 2024

Failure cluster c199077e785ee1358fcf

Error text:
[FAILED] exceeded quota: resize-resource-quota, requested: memory=350Mi, used: memory=700Mi, limited: memory=800Mi
Expected an error to have occurred.  Got:
    <nil>: nil
In [It] at: k8s.io/kubernetes/test/e2e/node/pod_resize.go:172 @ 11/13/24 12:32:16.302

Another failure:

container[c1] status resources mismatch: Expected
    <v1.ResourceRequirements>: 
        limits:
          cpu: 300m
          memory: 300Mi
        requests:
          cpu: 300m
          memory: 300Mi
to equal
    <v1.ResourceRequirements>: 
        limits:
          cpu: 400m
          memory: 400Mi
        requests:
          cpu: 400m
          memory: 400Mi

Recent failures:

11/18/2024, 3:48:51 AM ci-kubernetes-e2e-gci-gce
11/17/2024, 5:34:50 PM ci-kubernetes-e2e-gci-gce-ip-alias
11/17/2024, 11:42:35 AM ci-containerd-e2e-ubuntu-gce
11/17/2024, 5:48:18 AM ci-kubernetes-e2e-kind-alpha-beta-features
11/16/2024, 12:34:32 PM ci-containerd-e2e-ubuntu-gce

/kind failing-test
/kind flake
/sig node

@k8s-ci-robot added the kind/failing-test, kind/flake, and sig/node labels Nov 18, 2024
@k8s-ci-robot added the needs-triage label Nov 18, 2024
@esotsal

esotsal commented Nov 20, 2024

/triage accepted

@k8s-ci-robot added the triage/accepted label and removed the needs-triage label Nov 20, 2024
@esotsal

esotsal commented Nov 20, 2024

/assign

@hshiina
Contributor

hshiina commented Nov 22, 2024

Though I haven't found the root cause yet, this issue doesn't appear to have been caused by the InPlacePodVerticalScaling feature.

In ci-kubernetes-e2e-gci-gce #1858341567471816704:

According to kube-controller-manager.log, the Resource Quota Monitor detected the pod updates:

I1118 03:11:10.954180      11 resource_quota_monitor.go:364] "QuotaMonitor process object" logger="resourcequota-controller" resource="/v1, Resource=pods" namespace="pod-resize-resource-quota-test-1773" name="testpod1" uid="e84bdf17-9172-4237-ac7d-fbd80fdd2a3c" eventType="update"
<snip>
I1118 03:11:11.789725      11 resource_quota_monitor.go:364] "QuotaMonitor process object" logger="resourcequota-controller" resource="/v1, Resource=pods" namespace="pod-resize-resource-quota-test-1773" name="testpod2" uid="d31d201c-21c9-4336-9489-d664e716ea7b" eventType="update"
<snip>
I1118 03:11:13.266386      11 resource_quota_monitor.go:364] "QuotaMonitor process object" logger="resourcequota-controller" resource="/v1, Resource=pods" namespace="pod-resize-resource-quota-test-1773" name="testpod1" uid="e84bdf17-9172-4237-ac7d-fbd80fdd2a3c" eventType="update"
<snip>
I1118 03:11:14.074603      11 resource_quota_monitor.go:364] "QuotaMonitor process object" logger="resourcequota-controller" resource="/v1, Resource=pods" namespace="pod-resize-resource-quota-test-1773" name="testpod1" uid="e84bdf17-9172-4237-ac7d-fbd80fdd2a3c" eventType="update"

However, the Resource Quota Controller did not update the resource quota until much later:

I1118 03:11:28.055522      11 resource_quota_controller.go:333] "Finished syncing resource quota" logger="resourcequota-controller" queueKey="pod-resize-resource-quota-test-1773/resize-resource-quota" key="pod-resize-resource-quota-test-1773/resize-resource-quota" duration="8.68877ms"

It looks like Sync() in the Resource Quota Controller got stuck at WaitForNamedCacheSync() while holding workerLock:

rq.workerLock.Lock()
defer rq.workerLock.Unlock()

// Something has changed, so track the new state and perform a sync.
if loggerV := logger.V(2); loggerV.Enabled() {
    loggerV.Info("syncing resource quota controller with updated resources from discovery", "diff", printDiff(oldResources, newResources))
}

// Perform the monitor resync and wait for controllers to report cache sync.
if err := rq.resyncMonitors(ctx, newResources); err != nil {
    utilruntime.HandleError(fmt.Errorf("failed to sync resource monitors: %v", err))
    return
}

// at this point, we've synced the new resources to our monitors, so record that fact.
oldResources = newResources

// wait for caches to fill for a while (our sync period).
// this protects us from deadlocks where available resources changed and one of our informer caches will never fill.
// informers keep attempting to sync in the background, so retrying doesn't interrupt them.
// the call to resyncMonitors on the reattempt will no-op for resources that still exist.
if rq.quotaMonitor != nil &&
    !cache.WaitForNamedCacheSync(
        "resource quota",
        waitForStopOrTimeout(ctx.Done(), period),
        func() bool { return rq.quotaMonitor.IsSynced(ctx) },
    ) {
    utilruntime.HandleError(fmt.Errorf("timed out waiting for quota monitor sync"))
    return
}

I1118 03:10:58.043208      11 resource_quota_controller.go:476] "syncing resource quota controller with updated resources from discovery" logger="resourcequota-controller" diff="added: [crd-validation-ratcheting.example.com/v1, Resource=e2e-test-crd-validation-ratcheting-1729-crds], removed: []"
<snip>
I1118 03:10:58.043524      11 resource_quota_monitor.go:294] "quota monitor not synced" logger="resourcequota-controller" resource="crd-validation-ratcheting.example.com/v1, Resource=e2e-test-crd-validation-ratcheting-1729-crds"
<snip>
I1118 03:10:58.144676      11 resource_quota_monitor.go:294] "quota monitor not synced" logger="resourcequota-controller" resource="crd-validation-ratcheting.example.com/v1, Resource=e2e-test-crd-validation-ratcheting-1729-crds"
<snip>
I1118 03:11:27.944287      11 resource_quota_monitor.go:294] "quota monitor not synced" logger="resourcequota-controller" resource="crd-validation-ratcheting.example.com/v1, Resource=e2e-test-crd-validation-ratcheting-1729-crds"
<snip>
E1118 03:11:28.045524      11 resource_quota_controller.go:498] "Unhandled Error" err="timed out waiting for quota monitor sync" logger="UnhandledError"
I1118 03:11:28.055522      11 resource_quota_controller.go:333] "Finished syncing resource quota" logger="resourcequota-controller" queueKey="pod-resize-resource-quota-test-1773/resize-resource-quota" key="pod-resize-resource-quota-test-1773/resize-resource-quota" duration="8.68877ms"

So the workers that update resource quotas in the Resource Quota Controller were blocked trying to take the read lock:

rq.workerLock.RLock()
defer rq.workerLock.RUnlock()
logger := klog.FromContext(ctx)
logger = klog.LoggerWithValues(logger, "queueKey", key)
ctx = klog.NewContext(ctx, logger)
err := rq.syncHandler(ctx, key)

Sync() had gotten stuck before pod-resize-resource-quota-test started.
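
If that reading is right, the delay reduces to a plain sync.RWMutex interaction: Sync() holds the write lock for the whole cache-sync wait, so the workers cannot even take the read lock until the wait times out. A minimal standalone sketch of that pattern (not the actual controller code; the 30s wait is only illustrative of the 03:10:58 to 03:11:28 window in the logs above):

package main

import (
    "fmt"
    "sync"
    "time"
)

func main() {
    var workerLock sync.RWMutex

    // Stand-in for Sync(): take the write lock, then spend the whole sync
    // period waiting for a cache that never syncs (as WaitForNamedCacheSync
    // did above) before releasing the lock.
    go func() {
        workerLock.Lock()
        defer workerLock.Unlock()
        time.Sleep(30 * time.Second)
    }()

    time.Sleep(100 * time.Millisecond) // let the writer grab the lock first

    // Stand-in for a quota worker: it only needs the read lock, but RLock
    // blocks while a writer holds the lock, so the quota status update is
    // delayed until the writer gives up.
    start := time.Now()
    workerLock.RLock()
    fmt.Printf("worker resumed after %v\n", time.Since(start))
    workerLock.RUnlock()
}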

@pohly
Contributor Author

pohly commented Nov 22, 2024

The "Expected an error to have occurred. Got: : nil" looks suspiciously like the flake that I got in an E2E test for DRA. Does the test assume that creating some object is reflected in the v1.ResourceQuota immediately and then quota is enforced by the apiserver?

Further debugging of the DRA failure led to the (so far tentative) conclusion that the quota mechanism is not perfectly reliable - see #128899 (comment).
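
If the test really does depend on the quota status being up to date before the next step, one way to make it robust would be to poll the ResourceQuota status until the expected usage shows up instead of assuming the controller has already caught up. A rough client-go sketch (the helper name, interval, and timeout are placeholders, not what the e2e test currently does):

import (
    "context"
    "time"

    v1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/api/resource"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/util/wait"
    clientset "k8s.io/client-go/kubernetes"
)

// waitForQuotaUsage polls the ResourceQuota until its status reports at least
// wantMem of used memory, so whatever follows is evaluated against up-to-date
// usage. Sketch only.
func waitForQuotaUsage(ctx context.Context, c clientset.Interface, ns, name string, wantMem resource.Quantity) error {
    return wait.PollUntilContextTimeout(ctx, 1*time.Second, 2*time.Minute, true,
        func(ctx context.Context) (bool, error) {
            rq, err := c.CoreV1().ResourceQuotas(ns).Get(ctx, name, metav1.GetOptions{})
            if err != nil {
                return false, err
            }
            used, ok := rq.Status.Used[v1.ResourceMemory]
            return ok && used.Cmp(wantMem) >= 0, nil
        })
}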

@hshiina
Contributor

hshiina commented Nov 22, 2024

At a glance, there seems to be another race with resource quota. This issue looks to have been triggered by a CRDValidationRatcheting test:

var _ = SIGDescribe("CRDValidationRatcheting [Privileged:ClusterAdmin]", framework.WithFeatureGate(apiextensionsfeatures.CRDValidationRatcheting), func() {

In kube-apiserver.log, the CRD e2e-test-crd-validation-ratcheting-1729-crds was being deleted while the Resource Quota Controller was stuck:

I1118 03:10:58.970709      10 httplog.go:134] "HTTP" verb="DELETE" URI="/apis/apiextensions.k8s.io/v1/customresourcedefinitions/e2e-test-crd-validation-ratcheting-1729-crds.crd-validation-ratcheting.example.com" latency="12.902084ms" userAgent="e2e.test/v1.32.0 (linux/amd64) kubernetes/cf480a3 -- [sig-api-machinery] CRDValidationRatcheting [Privileged:ClusterAdmin] [FeatureGate:CRDValidationRatcheting] [Beta] MUST fail to update a resource due to JSONSchema errors on changed fields" audit-ID="849f2686-857e-4c60-92d3-3f775a03c1f3" srcIP="34.123.190.167:41820" apf_pl="exempt" apf_fs="exempt" apf_iseats=1 apf_fseats=0 apf_additionalLatency="0s" apf_execution_time="12.255503ms" resp=200
<snip>
I1118 03:10:59.056142      10 controller.go:277] Deleting customresourcedefinition "e2e-test-crd-validation-ratcheting-1729-crds.crd-validation-ratcheting.example.com"
I1118 03:10:59.056875      10 customresource_handler.go:553] Removing dead CRD storage for crd-validation-ratcheting.example.com/e2e-test-crd-validation-ratcheting-1729-crd
I1118 03:10:59.056919      10 customresource_discovery_controller.go:391] Deleting customresourcedefinition "e2e-test-crd-validation-ratcheting-1729-crds.crd-validation-ratcheting.example.com"
I1118 03:10:59.057745      10 handler.go:286] Adding GroupVersion crd-validation-ratcheting.example.com v1 to ResourceManager
I1118 03:10:59.056948      10 controller.go:296] Deleting customresourcedefinition "e2e-test-crd-validation-ratcheting-1729-crds.crd-validation-ratcheting.example.com"

As described in #128840 (comment), the Resource Quota Controller looks to have been trying to sync the already-deleted resource e2e-test-crd-validation-ratcheting-1729-crds:

I1118 03:10:58.043208      11 resource_quota_controller.go:476] "syncing resource quota controller with updated resources from discovery" logger="resourcequota-controller" diff="added: [crd-validation-ratcheting.example.com/v1, Resource=e2e-test-crd-validation-ratcheting-1729-crds], removed: []"
<snip>
I1118 03:10:58.043524      11 resource_quota_monitor.go:294] "quota monitor not synced" logger="resourcequota-controller" resource="crd-validation-ratcheting.example.com/v1, Resource=e2e-test-crd-validation-ratcheting-1729-crds"
<snip>
I1118 03:10:58.144676      11 resource_quota_monitor.go:294] "quota monitor not synced" logger="resourcequota-controller" resource="crd-validation-ratcheting.example.com/v1, Resource=e2e-test-crd-validation-ratcheting-1729-crds"
<snip>
I1118 03:11:27.944287      11 resource_quota_monitor.go:294] "quota monitor not synced" logger="resourcequota-controller" resource="crd-validation-ratcheting.example.com/v1, Resource=e2e-test-crd-validation-ratcheting-1729-crds"
<snip>
E1118 03:11:28.045524      11 resource_quota_controller.go:498] "Unhandled Error" err="timed out waiting for quota monitor sync" logger="UnhandledError"
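
That would explain the roughly 30-second stall visible in the logs: once the monitor list includes a resource whose CRD has just been deleted, its informer can never sync, so WaitForNamedCacheSync simply runs until the stop/timeout channel fires. A tiny standalone sketch of that behaviour (the 5s period and the neverSynced stub are illustrative, not the controller's real values):

package main

import (
    "fmt"
    "time"

    "k8s.io/client-go/tools/cache"
)

func main() {
    // Stand-in for waitForStopOrTimeout(ctx.Done(), period): a channel that
    // closes once the sync period has elapsed.
    stop := make(chan struct{})
    go func() {
        time.Sleep(5 * time.Second)
        close(stop)
    }()

    // Stand-in for rq.quotaMonitor.IsSynced: the monitor for the deleted CRD
    // can never list the resource, so it never reports synced.
    neverSynced := func() bool { return false }

    start := time.Now()
    ok := cache.WaitForNamedCacheSync("resource quota", stop, neverSynced)
    fmt.Printf("synced=%v after %v\n", ok, time.Since(start)) // synced=false after ~5s
}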
