Failure cluster [c199077e...]: Pod InPlace Resize Container pod-resize-resource-quota-test #128840

Open
pohly opened this issue Nov 18, 2024 · 5 comments
Assignees
Labels
kind/failing-test, kind/flake, sig/node, triage/accepted

Comments

@pohly
Contributor

pohly commented Nov 18, 2024

Failure cluster c199077e785ee1358fcf

Error text:
[FAILED] exceeded quota: resize-resource-quota, requested: memory=350Mi, used: memory=700Mi, limited: memory=800Mi
Expected an error to have occurred.  Got:
    <nil>: nil
In [It] at: k8s.io/kubernetes/test/e2e/node/pod_resize.go:172 @ 11/13/24 12:32:16.302

Another failure:

container[c1] status resources mismatch: Expected
    <v1.ResourceRequirements>: 
        limits:
          cpu: 300m
          memory: 300Mi
        requests:
          cpu: 300m
          memory: 300Mi
to equal
    <v1.ResourceRequirements>: 
        limits:
          cpu: 400m
          memory: 400Mi
        requests:
          cpu: 400m
          memory: 400Mi

Recent failures:

11/18/2024, 3:48:51 AM ci-kubernetes-e2e-gci-gce
11/17/2024, 5:34:50 PM ci-kubernetes-e2e-gci-gce-ip-alias
11/17/2024, 11:42:35 AM ci-containerd-e2e-ubuntu-gce
11/17/2024, 5:48:18 AM ci-kubernetes-e2e-kind-alpha-beta-features
11/16/2024, 12:34:32 PM ci-containerd-e2e-ubuntu-gce

/kind failing-test
/kind flake
/sig node

@k8s-ci-robot added the kind/failing-test, kind/flake, and sig/node labels Nov 18, 2024
@k8s-ci-robot added the needs-triage label Nov 18, 2024
@esotsal

esotsal commented Nov 20, 2024

/triage accepted

@k8s-ci-robot added the triage/accepted label and removed the needs-triage label Nov 20, 2024
@esotsal

esotsal commented Nov 20, 2024

/assign

@hshiina
Contributor

hshiina commented Nov 22, 2024

Though I haven't found the root cause yet, this issue doesn't appear to have been caused by the InPlacePodVerticalScaling feature.

In ci-kubernetes-e2e-gci-gce #1858341567471816704:

According to kube-controller-manager.log, the Resource Quota Monitor detected the pod updates:

I1118 03:11:10.954180      11 resource_quota_monitor.go:364] "QuotaMonitor process object" logger="resourcequota-controller" resource="/v1, Resource=pods" namespace="pod-resize-resource-quota-test-1773" name="testpod1" uid="e84bdf17-9172-4237-ac7d-fbd80fdd2a3c" eventType="update"
<snip>
I1118 03:11:11.789725      11 resource_quota_monitor.go:364] "QuotaMonitor process object" logger="resourcequota-controller" resource="/v1, Resource=pods" namespace="pod-resize-resource-quota-test-1773" name="testpod2" uid="d31d201c-21c9-4336-9489-d664e716ea7b" eventType="update"
<snip>
I1118 03:11:13.266386      11 resource_quota_monitor.go:364] "QuotaMonitor process object" logger="resourcequota-controller" resource="/v1, Resource=pods" namespace="pod-resize-resource-quota-test-1773" name="testpod1" uid="e84bdf17-9172-4237-ac7d-fbd80fdd2a3c" eventType="update"
<snip>
I1118 03:11:14.074603      11 resource_quota_monitor.go:364] "QuotaMonitor process object" logger="resourcequota-controller" resource="/v1, Resource=pods" namespace="pod-resize-resource-quota-test-1773" name="testpod1" uid="e84bdf17-9172-4237-ac7d-fbd80fdd2a3c" eventType="update"

However, the Resource Quota Controller did not update the resource quota until much later:

I1118 03:11:28.055522      11 resource_quota_controller.go:333] "Finished syncing resource quota" logger="resourcequota-controller" queueKey="pod-resize-resource-quota-test-1773/resize-resource-quota" key="pod-resize-resource-quota-test-1773/resize-resource-quota" duration="8.68877ms"

It looks like Sync() in the Resource Quota Controller got stuck at WaitForNamedCacheSync() while holding workerLock:

rq.workerLock.Lock()
defer rq.workerLock.Unlock()

// Something has changed, so track the new state and perform a sync.
if loggerV := logger.V(2); loggerV.Enabled() {
    loggerV.Info("syncing resource quota controller with updated resources from discovery", "diff", printDiff(oldResources, newResources))
}

// Perform the monitor resync and wait for controllers to report cache sync.
if err := rq.resyncMonitors(ctx, newResources); err != nil {
    utilruntime.HandleError(fmt.Errorf("failed to sync resource monitors: %v", err))
    return
}

// at this point, we've synced the new resources to our monitors, so record that fact.
oldResources = newResources

// wait for caches to fill for a while (our sync period).
// this protects us from deadlocks where available resources changed and one of our informer caches will never fill.
// informers keep attempting to sync in the background, so retrying doesn't interrupt them.
// the call to resyncMonitors on the reattempt will no-op for resources that still exist.
if rq.quotaMonitor != nil &&
    !cache.WaitForNamedCacheSync(
        "resource quota",
        waitForStopOrTimeout(ctx.Done(), period),
        func() bool { return rq.quotaMonitor.IsSynced(ctx) },
    ) {
    utilruntime.HandleError(fmt.Errorf("timed out waiting for quota monitor sync"))
    return
}

I1118 03:10:58.043208      11 resource_quota_controller.go:476] "syncing resource quota controller with updated resources from discovery" logger="resourcequota-controller" diff="added: [crd-validation-ratcheting.example.com/v1, Resource=e2e-test-crd-validation-ratcheting-1729-crds], removed: []"
<snip>
I1118 03:10:58.043524      11 resource_quota_monitor.go:294] "quota monitor not synced" logger="resourcequota-controller" resource="crd-validation-ratcheting.example.com/v1, Resource=e2e-test-crd-validation-ratcheting-1729-crds"
<snip>
I1118 03:10:58.144676      11 resource_quota_monitor.go:294] "quota monitor not synced" logger="resourcequota-controller" resource="crd-validation-ratcheting.example.com/v1, Resource=e2e-test-crd-validation-ratcheting-1729-crds"
<snip>
I1118 03:11:27.944287      11 resource_quota_monitor.go:294] "quota monitor not synced" logger="resourcequota-controller" resource="crd-validation-ratcheting.example.com/v1, Resource=e2e-test-crd-validation-ratcheting-1729-crds"
<snip>
E1118 03:11:28.045524      11 resource_quota_controller.go:498] "Unhandled Error" err="timed out waiting for quota monitor sync" logger="UnhandledError"
I1118 03:11:28.055522      11 resource_quota_controller.go:333] "Finished syncing resource quota" logger="resourcequota-controller" queueKey="pod-resize-resource-quota-test-1773/resize-resource-quota" key="pod-resize-resource-quota-test-1773/resize-resource-quota" duration="8.68877ms"

So the workers that update resource quotas in the Resource Quota Controller were blocked trying to take the read lock:

rq.workerLock.RLock()
defer rq.workerLock.RUnlock()
logger := klog.FromContext(ctx)
logger = klog.LoggerWithValues(logger, "queueKey", key)
ctx = klog.NewContext(ctx, logger)
err := rq.syncHandler(ctx, key)

Sync() had gotten stuck before pod-resize-resource-quota-test started.
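
If that reading is right, the delay reduces to a plain sync.RWMutex interaction: Sync() holds the write lock for the whole cache-sync wait, so the workers cannot even take the read lock until the wait times out. A minimal standalone sketch of that pattern (not the actual controller code; the 30s wait is only illustrative of the 03:10:58 to 03:11:28 window in the logs above):

package main

import (
    "fmt"
    "sync"
    "time"
)

func main() {
    var workerLock sync.RWMutex

    // Stand-in for Sync(): take the write lock, then spend the whole sync
    // period waiting for a cache that never syncs (as WaitForNamedCacheSync
    // did above) before releasing the lock.
    go func() {
        workerLock.Lock()
        defer workerLock.Unlock()
        time.Sleep(30 * time.Second)
    }()

    time.Sleep(100 * time.Millisecond) // let the writer grab the lock first

    // Stand-in for a quota worker: it only needs the read lock, but RLock
    // blocks while a writer holds the lock, so the quota status update is
    // delayed until the writer gives up.
    start := time.Now()
    workerLock.RLock()
    fmt.Printf("worker resumed after %v\n", time.Since(start))
    workerLock.RUnlock()
}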

@pohly
Contributor Author

pohly commented Nov 22, 2024

The "Expected an error to have occurred. Got: : nil" looks suspiciously like the flake that I got in an E2E test for DRA. Does the test assume that creating some object is reflected in the v1.ResourceQuota immediately and then quota is enforced by the apiserver?

Further debugging of the DRA failure led to the (so far tentative) conclusion that the quota mechanism is not perfectly reliable - see #128899 (comment).
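
If the test really does depend on the quota status being up to date before the next step, one way to make it robust would be to poll the ResourceQuota status until the expected usage shows up instead of assuming the controller has already caught up. A rough client-go sketch (the helper name, interval, and timeout are placeholders, not what the e2e test currently does):

import (
    "context"
    "time"

    v1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/api/resource"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/util/wait"
    clientset "k8s.io/client-go/kubernetes"
)

// waitForQuotaUsage polls the ResourceQuota until its status reports at least
// wantMem of used memory, so whatever follows is evaluated against up-to-date
// usage. Sketch only.
func waitForQuotaUsage(ctx context.Context, c clientset.Interface, ns, name string, wantMem resource.Quantity) error {
    return wait.PollUntilContextTimeout(ctx, 1*time.Second, 2*time.Minute, true,
        func(ctx context.Context) (bool, error) {
            rq, err := c.CoreV1().ResourceQuotas(ns).Get(ctx, name, metav1.GetOptions{})
            if err != nil {
                return false, err
            }
            used, ok := rq.Status.Used[v1.ResourceMemory]
            return ok && used.Cmp(wantMem) >= 0, nil
        })
}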

@hshiina
Contributor

hshiina commented Nov 22, 2024

At a glance, there seems to be another race with resource quota. This issue looks to have been triggered by a CRDValidationRatcheting test:

var _ = SIGDescribe("CRDValidationRatcheting [Privileged:ClusterAdmin]", framework.WithFeatureGate(apiextensionsfeatures.CRDValidationRatcheting), func() {

In kube-apiserver.log, the CRD e2e-test-crd-validation-ratcheting-1729-crds was being deleted while the Resource Quota Controller was stuck:

I1118 03:10:58.970709      10 httplog.go:134] "HTTP" verb="DELETE" URI="/apis/apiextensions.k8s.io/v1/customresourcedefinitions/e2e-test-crd-validation-ratcheting-1729-crds.crd-validation-ratcheting.example.com" latency="12.902084ms" userAgent="e2e.test/v1.32.0 (linux/amd64) kubernetes/cf480a3 -- [sig-api-machinery] CRDValidationRatcheting [Privileged:ClusterAdmin] [FeatureGate:CRDValidationRatcheting] [Beta] MUST fail to update a resource due to JSONSchema errors on changed fields" audit-ID="849f2686-857e-4c60-92d3-3f775a03c1f3" srcIP="34.123.190.167:41820" apf_pl="exempt" apf_fs="exempt" apf_iseats=1 apf_fseats=0 apf_additionalLatency="0s" apf_execution_time="12.255503ms" resp=200
<snip>
I1118 03:10:59.056142      10 controller.go:277] Deleting customresourcedefinition "e2e-test-crd-validation-ratcheting-1729-crds.crd-validation-ratcheting.example.com"
I1118 03:10:59.056875      10 customresource_handler.go:553] Removing dead CRD storage for crd-validation-ratcheting.example.com/e2e-test-crd-validation-ratcheting-1729-crd
I1118 03:10:59.056919      10 customresource_discovery_controller.go:391] Deleting customresourcedefinition "e2e-test-crd-validation-ratcheting-1729-crds.crd-validation-ratcheting.example.com"
I1118 03:10:59.057745      10 handler.go:286] Adding GroupVersion crd-validation-ratcheting.example.com v1 to ResourceManager
I1118 03:10:59.056948      10 controller.go:296] Deleting customresourcedefinition "e2e-test-crd-validation-ratcheting-1729-crds.crd-validation-ratcheting.example.com"

As described in #128840 (comment), the Resource Quota Controller looks to have been trying to sync the already-deleted resource e2e-test-crd-validation-ratcheting-1729-crds:

I1118 03:10:58.043208      11 resource_quota_controller.go:476] "syncing resource quota controller with updated resources from discovery" logger="resourcequota-controller" diff="added: [crd-validation-ratcheting.example.com/v1, Resource=e2e-test-crd-validation-ratcheting-1729-crds], removed: []"
<snip>
I1118 03:10:58.043524      11 resource_quota_monitor.go:294] "quota monitor not synced" logger="resourcequota-controller" resource="crd-validation-ratcheting.example.com/v1, Resource=e2e-test-crd-validation-ratcheting-1729-crds"
<snip>
I1118 03:10:58.144676      11 resource_quota_monitor.go:294] "quota monitor not synced" logger="resourcequota-controller" resource="crd-validation-ratcheting.example.com/v1, Resource=e2e-test-crd-validation-ratcheting-1729-crds"
<snip>
I1118 03:11:27.944287      11 resource_quota_monitor.go:294] "quota monitor not synced" logger="resourcequota-controller" resource="crd-validation-ratcheting.example.com/v1, Resource=e2e-test-crd-validation-ratcheting-1729-crds"
<snip>
E1118 03:11:28.045524      11 resource_quota_controller.go:498] "Unhandled Error" err="timed out waiting for quota monitor sync" logger="UnhandledError"
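
That would explain the roughly 30-second stall visible in the logs: once the monitor list includes a resource whose CRD has just been deleted, its informer can never sync, so WaitForNamedCacheSync simply runs until the stop/timeout channel fires. A tiny standalone sketch of that behaviour (the 5s period and the neverSynced stub are illustrative, not the controller's real values):

package main

import (
    "fmt"
    "time"

    "k8s.io/client-go/tools/cache"
)

func main() {
    // Stand-in for waitForStopOrTimeout(ctx.Done(), period): a channel that
    // closes once the sync period has elapsed.
    stop := make(chan struct{})
    go func() {
        time.Sleep(5 * time.Second)
        close(stop)
    }()

    // Stand-in for rq.quotaMonitor.IsSynced: the monitor for the deleted CRD
    // can never list the resource, so it never reports synced.
    neverSynced := func() bool { return false }

    start := time.Now()
    ok := cache.WaitForNamedCacheSync("resource quota", stop, neverSynced)
    fmt.Printf("synced=%v after %v\n", ok, time.Since(start)) // synced=false after ~5s
}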
