
[FG:InPlacePodVerticalScaling] pull-kubernetes-e2e-capz-windows-master test fail with InPlacePodVerticalScaling Beta #128897

Open
esotsal opened this issue Nov 20, 2024 · 20 comments · Fixed by #129214 · May be fixed by #128936 or #129242
Labels
kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/windows Categorizes an issue or PR as relevant to SIG Windows. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@esotsal

esotsal commented Nov 20, 2024

Which jobs are failing?

pull-kubernetes-e2e-capz-windows-master

https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/128880/pull-kubernetes-e2e-capz-windows-master/1859240005449289728

Which tests are failing?

Kubernetes e2e suite: [It] [sig-storage]

  • Projected secret should be consumable from pods in volume [NodeConformance] [Conformance]
  • Secrets should be consumable from pods in volume [NodeConformance] [Conformance]
  • ConfigMap should be consumable from pods in volume with mappings [NodeConformance] [Conformance] 23s
  • Secrets should be consumable from pods in volume with mappings [NodeConformance] [Conformance] 33s
  • Projected configMap should be consumable from pods in volume with mappings [NodeConformance] [Conformance] 25s
  • Projected secret should be able to mount in a volume regardless of a different secret existing with same name in different namespace [NodeConformance] 25s
  • Projected configMap should be consumable from pods in volume as non-root [NodeConformance] [Conformance] 27s
  • Projected configMap should be consumable from pods in volume [NodeConformance] [Conformance] 31s
  • Projected secret should be consumable in multiple volumes in a pod [NodeConformance] [Conformance] 27s
  • Projected secret should be consumable from pods in volume with mappings [NodeConformance] [Conformance] 27s
  • Secrets should be able to mount in a volume regardless of a different secret existing with same name in different namespace [NodeConformance] [Conformance] 25s
  • Secrets should be consumable in multiple volumes in a pod [NodeConformance] [Conformance] 27s
  • ConfigMap should be consumable from pods in volume [NodeConformance] [Conformance] 1m9s
  • ConfigMap should be consumable from pods in volume as non-root [NodeConformance] [Conformance] 25s
  • ConfigMap should be consumable from pods in volume with mappings as non-root [NodeConformance] [Conformance] 27s
  • Projected configMap should be consumable from pods in volume with mappings as non-root [NodeConformance] [Conformance]

Kubernetes e2e suite: [It] [sig-windows] [Feature:Windows]

  • Windows volume mounts check volume mount permissions container should have readOnly permissions on emptyDir 35s

Kubernetes e2e suite: [It] [sig-node]

  • Container Runtime blackbox test on terminated container should report termination message if TerminationMessagePath is set as non-root user and at a non-default path [NodeConformance] [Conformance] 10m3s

Since when has it been failing?

Last successful run: 12 November 20:20 (1856416666905219072)
https://prow.k8s.io/view/gs/kubernetes-ci-logs/logs/ci-kubernetes-e2e-capz-master-windows/1856416666905219072

First failed run: 12 November 23:20 (1856461965182898176)
https://prow.k8s.io/view/gs/kubernetes-ci-logs/logs/ci-kubernetes-e2e-capz-master-windows/1856461965182898176

Testgrid link

https://testgrid.k8s.io/sig-release-master-informing#capz-windows-master

Reason for failure (if possible)

No response

Anything else we need to know?

No response

Relevant SIG(s)

/sig-node /sig-storage /sig-windows

@esotsal esotsal added the kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. label Nov 20, 2024
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 20, 2024
@dims
Member

dims commented Nov 20, 2024

capz-windows-master has recovered https://testgrid.k8s.io/sig-release-master-informing#capz-windows-master&width=20

@kannon92
Contributor

I think this ticket is for investigating why in-place resize broke this job.

/triage accepted
/priority important-soon

This was the main reason I was +1 on the revert: I think we had a decent understanding of the test problem, but this one is perplexing to me.

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 21, 2024
@kannon92
Contributor

/sig node windows

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. sig/windows Categorizes an issue or PR as relevant to SIG Windows. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Nov 21, 2024
@esotsal
Author

esotsal commented Nov 21, 2024

> This was the main reason I was +1 on the revert: I think we had a decent understanding of the test problem, but this one is perplexing to me.

Correct, and the kubetest2 failure discussed here is the other reason. We need to solve both of those problems before moving InPlacePodVerticalScaling back to beta. I strongly suspect the kubetest2 failure relates to the test code.

/cc @tallclair @AnishShah

@esotsal
Author

esotsal commented Nov 21, 2024

Please check here and here for updates on kubetest2.

@carlory
Member

carlory commented Nov 22, 2024

I ran some tests on Windows via #128876 and #128927.

Each test creates a pod with a ConfigMap volume and checks the permissions of the volume.

This log shows that the permissions of the volume are correct: the expected permission is rwxrwxr-x and the actual permission is rwxrwxr-x. https://storage.googleapis.com/kubernetes-ci-logs/pr-logs/pull/128876/pull-kubernetes-e2e-capz-windows-master/1859544395364175872/artifacts/windows-tests-4712/pod-configmaps-99afca0f-ec2f-4eb1-84d9-cff78011da98/agnhost-container/logs.txt

Path   : Microsoft.PowerShell.Core\FileSystem::C:\etc\configmap-volume\path\to\
         data-2
Owner  : BUILTIN\Administrators
Group  : NT AUTHORITY\SYSTEM
Access : NT AUTHORITY\Authenticated Users Allow  Modify, Synchronize
         NT AUTHORITY\SYSTEM Allow  FullControl
         BUILTIN\Administrators Allow  FullControl
         BUILTIN\Users Allow  ReadAndExecute, Synchronize
Audit  : 
Sddl   : O:BAG:SYD:AI(A;ID;0x1301bf;;;AU)(A;ID;FA;;;SY)(A;ID;FA;;;BA)(A;ID;0x12
         00a9;;;BU)

This log shows that the permissions of the volume are incorrect: the expected permission is rwxrwxr-x but the actual permission is rwx---r-x. https://storage.googleapis.com/kubernetes-ci-logs/pr-logs/pull/128927/pull-kubernetes-e2e-capz-windows-master/1859786417295593472/artifacts/windows-tests-8460/pod-configmaps-38456fd5-e545-48fc-94c0-465f699797b1/agnhost-container/logs.txt

Path   : Microsoft.PowerShell.Core\FileSystem::C:\etc\configmap-volume\path\to\
         data-2
Owner  : BUILTIN\Administrators
Group  : NT AUTHORITY\SYSTEM
Access : BUILTIN\Administrators Allow  FullControl
         S-1-5-21-3344892312-2374761293-1590377129-513 Allow  ReadAndExecute, 
         Synchronize
         BUILTIN\Users Allow  ReadAndExecute, Synchronize
Audit  : 
Sddl   : O:BAG:SYD:AI(A;ID;FA;;;BA)(A;ID;0x1200a9;;;S-1-5-21-3344892312-2374761
         293-1590377129-513)(A;ID;0x1200a9;;;BU)

We can see the difference in the access control lists. I don't know why other tests affected the directory permissions; #128880 may not fix the root cause of the issue.
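For readers less familiar with Windows ACLs, the access-mask values embedded in the two Sddl strings can be mapped onto POSIX-style bits. A minimal Go sketch (the bit constants are the standard Windows FILE_* access flags; the rwx helper is illustrative, not kubelet code):

```go
package main

import "fmt"

// rwx maps a Windows file access mask onto POSIX-style permission bits.
// Illustrative only: it checks just the three basic flags.
func rwx(mask uint32) string {
	s := []byte("---")
	if mask&0x01 != 0 { // FILE_READ_DATA
		s[0] = 'r'
	}
	if mask&0x02 != 0 { // FILE_WRITE_DATA
		s[1] = 'w'
	}
	if mask&0x20 != 0 { // FILE_EXECUTE
		s[2] = 'x'
	}
	return string(s)
}

func main() {
	// 0x1301bf is the mask granted to Authenticated Users in the good run
	// (Modify); 0x1200a9 is the mask granted to the group SID in the bad run
	// (ReadAndExecute), i.e. the write bit is missing.
	fmt.Println(rwx(0x1301bf)) // rwx
	fmt.Println(rwx(0x1200a9)) // r-x
}
```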

cc @esotsal @pacoxu @aojea

@esotsal
Author

esotsal commented Nov 22, 2024

Thanks @carlory. Comparing the runs from your tests, I am a bit puzzled now, looking at your PR runs without IPPVS and your PR runs with IPPVS.

If I understand correctly, you don't have successful runs with IPPVS, but in #128880 we had some green runs. Does this mean that it might not fix the root cause, but reduces the flakiness? In other words, is InPlacePodVerticalScaling not the root cause itself, but something that increases the number of flaky tests?

@carlory
Member

carlory commented Nov 22, 2024

> Does this mean that it might not fix the root cause, but reduces the flakiness?

@esotsal Yes. #128880 only changes the test code, not the feature itself. Users may encounter the same issue when their clusters enable IPPVS.

To be honest, I'm very confused about why #128880 reduces the flakiness or resolves this issue, but the fact is that it does, because the CI is green again.

The Access list I mentioned in the previous comment may help some developers understand what happened in the test.

@hshiina
Contributor

hshiina commented Nov 22, 2024

Though I'm not sure whether this is related to the failures, kubelet.log contains the following messages in failed cases:

E1120 14:36:37.747170    2972 kuberuntime_manager.go:681] "Unable to get resource configuration" pod="cloud-node-manager-windows-n9klf"
E1120 14:36:37.747170    2972 pod_workers.go:1301] "Error syncing pod, skipping" err="failed to SyncPod: unable to get resource configuration processing resize for pod cloud-node-manager-windows-n9klf" pod="kube-system/cloud-node-manager-windows-n9klf" podUID="fe6bd320-6e2e-481f-9377-c55cb58a4907"

It looks like doPodResizeAction() was called unexpectedly on Windows:

func (m *kubeGenericRuntimeManager) doPodResizeAction(pod *v1.Pod, podContainerChanges podActions, result *kubecontainer.PodSyncResult) {
    pcm := m.containerManager.NewPodContainerManager()
    //TODO(vinaykul,InPlacePodVerticalScaling): Figure out best way to get enforceMemoryQoS value (parameter #4 below) in platform-agnostic way
    podResources := cm.ResourceConfigForPod(pod, m.cpuCFSQuota, uint64((m.cpuCFSQuotaPeriod.Duration)/time.Microsecond), false)
    if podResources == nil {
        klog.ErrorS(nil, "Unable to get resource configuration", "pod", pod.Name)
        result.Fail(fmt.Errorf("unable to get resource configuration processing resize for pod %s", pod.Name))
        return
    }

There may be something wrong with this feature regarding Windows.
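One plausible direction, sketched here purely as an assumption (not necessarily what #128936 does), is to fence the resize path off on non-Linux platforms: ResourceConfigForPod builds a cgroup resource config, and cgroups exist only on Linux, so on Windows it returns nil and the pod sync fails.

```go
package main

import "fmt"

// resizeSupported is a hypothetical guard, not kubelet code: in-place pod
// resize depends on cgroup resource configuration, which exists only on
// Linux, so the resize action should be skipped on every other OS. A real
// implementation would consult runtime.GOOS instead of taking a parameter.
func resizeSupported(goos string) bool {
	return goos == "linux"
}

func main() {
	for _, goos := range []string{"linux", "windows"} {
		fmt.Printf("attempt in-place resize on %s: %v\n", goos, resizeSupported(goos))
	}
}
```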

@esotsal
Author

esotsal commented Nov 22, 2024

> There may be something wrong with this feature regarding Windows.

Thanks @hshiina. This is definitely not correct; Windows is not supported for InPlacePodVerticalScaling! Perhaps #128936 might fix this? /cc @tallclair @vinaykul

https://github.com/kubernetes/kubernetes/pull/128936/files

You have just answered the question of this issue, @hshiina. Thanks!

@esotsal
Author

esotsal commented Nov 22, 2024

Should we increase the priority to critical-urgent, @pacoxu?

Even though InPlacePodVerticalScaling is not beta, an end user who enables InPlacePodVerticalScaling will hit the same consequences on a Windows system.
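For context, a user opts in via feature-gate configuration on the kubelet; a minimal KubeletConfiguration fragment (illustrative only, using the kubelet.config.k8s.io/v1beta1 featureGates field):

```yaml
# Illustrative fragment: turning the gate on like this on a Windows node
# would hit the doPodResizeAction failure described above.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  InPlacePodVerticalScaling: true
```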

@kannon92
Contributor

> Should we increase the priority to critical-urgent, @pacoxu?

Given that the feature is alpha, we don't cherry-pick bug fixes for alpha features, so I don't see the need at the moment.

/priority important-soon

This means we're targeting 1.33, which I think is what we want.

@esotsal
Author

esotsal commented Nov 22, 2024

> Should we increase the priority to critical-urgent, @pacoxu?
>
> Given that the feature is alpha, we don't cherry-pick bug fixes for alpha features, so I don't see the need at the moment.
>
> /priority important-soon
>
> This means we're targeting 1.33, which I think is what we want.

Thanks for clarifying!

@esotsal esotsal changed the title pull-kubernetes-e2e-capz-windows-master failed with InPlacePodVerticalScaling Beta? [FG:InPlacePodVerticalScaling] pull-kubernetes-e2e-capz-windows-master fail with InPlacePodVerticalScaling Beta Nov 28, 2024
@esotsal esotsal changed the title [FG:InPlacePodVerticalScaling] pull-kubernetes-e2e-capz-windows-master fail with InPlacePodVerticalScaling Beta [FG:InPlacePodVerticalScaling] pull-kubernetes-e2e-capz-windows-master test fail with InPlacePodVerticalScaling Beta Nov 28, 2024
@kannon92 kannon92 moved this from Triage to Issues - In progress in SIG Node CI/Test Board Dec 4, 2024
@pacoxu
Member

pacoxu commented Dec 12, 2024

Could this be related to #129083?

I'll try some tests based on the current master.

@pacoxu
Member

pacoxu commented Dec 12, 2024

> Could this be related to #129083?
>
> I'll try some tests based on the current master.

Confirmed: the issue is not related. Current master plus enabling VPA beta still fails; see https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/128040/pull-kubernetes-e2e-capz-windows-master/1867106924625924096.

@tallclair
Member

I think #129214 should address this.

@AnishShah
Contributor

/reopen due to #129217

@k8s-ci-robot k8s-ci-robot reopened this Dec 14, 2024
@k8s-ci-robot
Contributor

@AnishShah: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@github-project-automation github-project-automation bot moved this from Done to Triage in SIG Node CI/Test Board Dec 14, 2024
@kannon92 kannon92 moved this from Triage to Issues - In progress in SIG Node CI/Test Board Dec 20, 2024