OCPBUGS-44438: Remove non-matching feature-gated CVO manifests from payload #5093

Merged
merged 3 commits into from
Dec 12, 2024

@petr-muller petr-muller commented Nov 8, 2024

What this PR does / why we need it:

CVO manifests contain some feature-gated ones:

  • since at least 4.16, there are feature-gated ClusterVersion CRDs
  • the UpdateStatus feature is delivered through the DevPreview (now) and TechPreview (later) feature sets

We observed HyperShift CI jobs failing when we added DevPreview-gated deployment manifests, which was unexpected. Investigating further, we discovered that HyperShift applies these manifests even though they are feature-gated:

cluster-version-operator-665c5789d5-8sr59-bootstrap.log:

error: error parsing /var/payload/manifests/0000_00_update-status-controller_03_deployment-DevPreviewNoUpgrade.yaml: error converting YAML to JSON: yaml: invalid map key: map[interface {}]interface {}{".ReleaseImage":interface {}(nil)}
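
The `.ReleaseImage` key in this error looks like an unrendered Go template placeholder: a YAML parser reads a bare `{{.ReleaseImage}}` as a nested flow mapping, which cannot serve as a map key. The following is a minimal sketch of that reading, assuming the manifest is meant to be rendered with Go's `text/template` before it is applied. The `image:` line, the `params` type, the `render` helper and the image reference are hypothetical and are not taken from the actual manifest.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// params is a hypothetical stand-in for the values a renderer would supply.
type params struct {
	ReleaseImage string
}

// render fills the template placeholders in a manifest fragment.
func render(manifest string, p params) (string, error) {
	tmpl, err := template.New("manifest").Parse(manifest)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, p); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	// Hypothetical fragment. Left unrendered, a YAML parser would read
	// "{{.ReleaseImage}}" as a nested flow mapping rather than a string.
	raw := "image: {{.ReleaseImage}}\n"
	out, err := render(raw, params{ReleaseImage: "registry.example.com/release:tag"})
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```

Applying the raw file directly skips this rendering step, which would be consistent with the parse error above.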

But even without these added manifests, this happens for existing ClusterVersion CRD manifests present in the payload:

$ ls -1 manifests/*clusterversions*crd.yaml
manifests/0000_00_cluster-version-operator_01_clusterversions-CustomNoUpgrade.crd.yaml
manifests/0000_00_cluster-version-operator_01_clusterversions-Default.crd.yaml
manifests/0000_00_cluster-version-operator_01_clusterversions-DevPreviewNoUpgrade.crd.yaml
manifests/0000_00_cluster-version-operator_01_clusterversions-TechPreviewNoUpgrade.crd.yaml

In a passing HyperShift CI job, the same log shows that all four manifests are applied instead of just one:

customresourcedefinition.apiextensions.k8s.io/clusterversions.config.openshift.io created
customresourcedefinition.apiextensions.k8s.io/clusterversions.config.openshift.io configured
customresourcedefinition.apiextensions.k8s.io/clusterversions.config.openshift.io configured
customresourcedefinition.apiextensions.k8s.io/clusterversions.config.openshift.io configured

This likely means that HyperShift hosted clusters end up using the TechPreviewNoUpgrade ClusterVersion CRD, which sorts last among the four and would therefore be applied last.
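
To illustrate this hypothesis, here is a minimal sketch that assumes the manifests are applied in lexical filename order, the same order as in the listing above. The `lastApplied` helper is hypothetical; the actual apply order is not confirmed anywhere in this conversation.

```go
package main

import (
	"fmt"
	"sort"
)

// lastApplied returns the manifest that would be applied last if files were
// processed in lexical filename order (an assumption, not confirmed behavior).
// It returns "" for an empty input.
func lastApplied(manifests []string) string {
	if len(manifests) == 0 {
		return ""
	}
	// Sort a copy so the caller's slice is left untouched.
	sorted := append([]string(nil), manifests...)
	sort.Strings(sorted)
	return sorted[len(sorted)-1]
}

func main() {
	manifests := []string{
		"manifests/0000_00_cluster-version-operator_01_clusterversions-CustomNoUpgrade.crd.yaml",
		"manifests/0000_00_cluster-version-operator_01_clusterversions-Default.crd.yaml",
		"manifests/0000_00_cluster-version-operator_01_clusterversions-DevPreviewNoUpgrade.crd.yaml",
		"manifests/0000_00_cluster-version-operator_01_clusterversions-TechPreviewNoUpgrade.crd.yaml",
	}
	// Every file defines the same CRD, so the last one applied wins.
	fmt.Println(lastApplied(manifests))
}
```

Under that assumption, the TechPreviewNoUpgrade file is the last one applied, which matches the one "created" and three "configured" lines in the log.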

The proper fix is probably to wire the FeatureGate with the desired feature set through CVOParams, but that would need a bit of selection logic. For now, we can simply remove the entropy by deleting all feature-gated manifests instead of stumbling over them.

Which issue(s) this PR fixes:

OCPBUGS-44438

Checklist

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

@openshift-ci openshift-ci bot requested review from csrwng and hasueki November 8, 2024 18:27
@openshift-ci openshift-ci bot added area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release and removed do-not-merge/needs-area labels Nov 8, 2024
@petr-muller
Member Author

/retest

@petr-muller
Member Author

https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_hypershift/5093/pull-ci-openshift-hypershift-main-e2e-aks/1855048275799838720#1:build-log.txt%3A154-164

         --------------------------------------------------------------------------------
        RESPONSE 409: 409 Conflict
        ERROR CODE: StorageAccountAlreadyTaken
        --------------------------------------------------------------------------------
        {
          "error": {
            "code": "StorageAccountAlreadyTaken",
            "message": "The storage account named clusterpdlcb is already taken."
          }
        }
        -------------------------------------------------------------------------------- 

https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_hypershift/5093/pull-ci-openshift-hypershift-main-e2e-kubevirt-aws-ovn-reduced/1855048276131188736#1:build-log.txt%3A57

failed to acquire lease for "hypershift-quota-slice": status 502 Bad Gateway, status code 502 

/test e2e-aks e2e-kubevirt-aws-ovn-reduced

@petr-muller petr-muller changed the title Remove feature-gated CVO manifests from payload OCPBUGS-44438: Remove feature-gated CVO manifests from payload Nov 11, 2024
@openshift-ci-robot openshift-ci-robot added jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Nov 11, 2024
@openshift-ci-robot

@petr-muller: This pull request references Jira Issue OCPBUGS-44438, which is invalid:

  • expected the bug to target the "4.18.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@petr-muller
Member Author

/hold

Holding until branch cut

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 11, 2024
@petr-muller
Member Author

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 26, 2024
@petr-muller petr-muller changed the title OCPBUGS-44438: Remove feature-gated CVO manifests from payload OCPBUGS-44438: Remove non-matching feature-gated CVO manifests from payload Nov 26, 2024
@petr-muller
Member Author

petr-muller commented Nov 26, 2024

I have adopted Seth's approach from #5096. It is closer to the desired behavior than my original proposal. I only tweaked it so that it removes all non-matching manifests, instead of relying on a Default manifest always being present with the feature-set one applied over it (which would not happen for CustomNoUpgrade).

This means that CVO must always provide a manifest for a given feature set if a resource needs to be applied there; it cannot rely on the Default one. Fortunately, this is exactly the convention that the generated CRDs already follow. CVO manifests will need to either provide a non-gated manifest (without a Default/TechPreviewNoUpgrade/... suffix), which will always be applied, or provide a gated manifest for each feature set where the resource needs to be present.
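
The selection rule described above can be sketched in Go. This is only an illustration under stated assumptions, not the code from this PR: the `gatedFeatureSet` and `keepManifest` helpers, the `-<FeatureSet>.` filename pattern, and the ungated example filename are all hypothetical, and the feature set names are the ones visible in the CRD listing in the description.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// featureSets lists the feature set names that appear as filename suffixes
// in the ClusterVersion CRD listing.
var featureSets = []string{
	"Default",
	"CustomNoUpgrade",
	"DevPreviewNoUpgrade",
	"TechPreviewNoUpgrade",
}

// gatedFeatureSet returns the feature set encoded in a manifest filename, or
// "" when the file is not feature-gated. The "-<FeatureSet>." pattern is an
// assumption derived from the filenames shown in this PR.
func gatedFeatureSet(path string) string {
	base := filepath.Base(path)
	for _, fs := range featureSets {
		if strings.Contains(base, "-"+fs+".") {
			return fs
		}
	}
	return ""
}

// keepManifest reports whether a manifest should stay in the payload:
// non-gated manifests always stay, gated ones only when they match.
func keepManifest(path, desired string) bool {
	fs := gatedFeatureSet(path)
	return fs == "" || fs == desired
}

func main() {
	manifests := []string{
		"manifests/0000_00_cluster-version-operator_01_clusterversions-CustomNoUpgrade.crd.yaml",
		"manifests/0000_00_cluster-version-operator_01_clusterversions-Default.crd.yaml",
		"manifests/0000_00_cluster-version-operator_01_clusterversions-DevPreviewNoUpgrade.crd.yaml",
		"manifests/0000_00_cluster-version-operator_01_clusterversions-TechPreviewNoUpgrade.crd.yaml",
		"manifests/0000_00_example_ungated.yaml", // hypothetical non-gated manifest
	}
	for _, m := range manifests {
		if keepManifest(m, "Default") {
			fmt.Println(m)
		}
	}
}
```

With `Default` as the desired feature set, only the Default CRD and the non-gated manifest are kept. With `CustomNoUpgrade`, the Default CRD is removed as well, which is why a resource needs its own gated manifest for every feature set where it should exist.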

@petr-muller
Member Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Nov 26, 2024
@openshift-ci-robot

@petr-muller: This pull request references Jira Issue OCPBUGS-44438, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.19.0) matches configured target version for branch (4.19.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira (jiezhao@redhat.com), skipping review request.

The bug has been updated to refer to the pull request using the external bug tracker.


@muraee
Contributor

muraee commented Nov 26, 2024

please also add those changes here:

func preparePayloadScript(platformType hyperv1.PlatformType, oauthEnabled bool) string {

@petr-muller petr-muller force-pushed the devpreview-cvo branch 3 times, most recently from f3a038d to 895e5be November 28, 2024 19:36
Contributor

openshift-ci bot commented Dec 4, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: csrwng, petr-muller


@sjenning
Contributor

sjenning commented Dec 4, 2024

/retest-required

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 857ccab and 2 for PR HEAD d075673 in total

@petr-muller
Member Author

/retest-required

petr-muller and others added 3 commits December 4, 2024 23:51
CVO manifests contain some feature-gated ones:
- since at least 4.16, there are feature-gated `ClusterVersion` CRDs
- the `UpdateStatus` feature is delivered through the DevPreview (now) and
  TechPreview (later) feature sets

We observed HyperShift CI jobs failing when we added DevPreview-gated
deployment manifests, which was unexpected. Investigating further,
we discovered that HyperShift applies these manifests even though they
are feature-gated:

[cluster-version-operator-665c5789d5-8sr59-bootstrap.log](https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_cluster-version-operator/1091/pull-ci-openshift-cluster-version-operator-master-e2e-hypershift-conformance/1853764751801192448/artifacts/e2e-hypershift-conformance/dump/artifacts/namespaces/clusters-c01d0e18fc19f1e0757b/core/pods/logs/cluster-version-operator-665c5789d5-8sr59-bootstrap.log):

```
error: error parsing /var/payload/manifests/0000_00_update-status-controller_03_deployment-DevPreviewNoUpgrade.yaml: error converting YAML to JSON: yaml: invalid map key: map[interface {}]interface {}{".ReleaseImage":interface {}(nil)}
```

But even without these added manifests, this happens for existing `ClusterVersion`
CRD manifests present in the payload:

```console
$ ls -1 manifests/*clusterversions*crd.yaml
manifests/0000_00_cluster-version-operator_01_clusterversions-CustomNoUpgrade.crd.yaml
manifests/0000_00_cluster-version-operator_01_clusterversions-Default.crd.yaml
manifests/0000_00_cluster-version-operator_01_clusterversions-DevPreviewNoUpgrade.crd.yaml
manifests/0000_00_cluster-version-operator_01_clusterversions-TechPreviewNoUpgrade.crd.yaml
```

In a passing HyperShift CI job, the [same log](https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_cluster-version-operator/1105/pull-ci-openshift-cluster-version-operator-master-e2e-hypershift-conformance/1854708481941049344/artifacts/e2e-hypershift-conformance/dump/artifacts/namespaces/clusters-a5d1b5c3fcb2445935f2/core/pods/logs/cluster-version-operator-96cdfbf7c-cxt9r-bootstrap.log)
shows that all four manifests are applied instead of just one:

```
customresourcedefinition.apiextensions.k8s.io/clusterversions.config.openshift.io created
customresourcedefinition.apiextensions.k8s.io/clusterversions.config.openshift.io configured
customresourcedefinition.apiextensions.k8s.io/clusterversions.config.openshift.io configured
customresourcedefinition.apiextensions.k8s.io/clusterversions.config.openshift.io configured
```

This likely means that HyperShift hosted clusters end up using the
TechPreviewNoUpgrade `ClusterVersion` CRD, which sorts last among the four
and would therefore be applied last.

Fix the problem by deleting all manifests whose filename indicates that
they belong to a feature set other than the desired one.

Co-authored-by: Seth Jennings <sjenning@redhat.com>
Co-authored-by: Petr Muller <muller@redhat.com>

The fixtures are exact copies of the original, default featureset ones. This is to allow adding the new test separately so it can update the default fixtures and produce a clear difference from the defaults.

Fixtures updated with:

```
$ UPDATE=true go test ./control-plane-operator/controllers/hostedcontrolplane/...
```
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Dec 4, 2024
@petr-muller
Member Author

/retest-required

Contributor

openshift-ci bot commented Dec 5, 2024

@petr-muller: The /retest command does not accept any targets.
The following commands are available to trigger required jobs:

/test e2e-aws
/test e2e-aws-4-18
/test e2e-aws-upgrade-hypershift-operator
/test e2e-kubevirt-aws-ovn-reduced
/test images
/test security
/test unit
/test verify

The following commands are available to trigger optional jobs:

/test e2e-aks
/test e2e-aws-metrics
/test e2e-azure-aks-ovn-conformance
/test e2e-conformance
/test e2e-kubevirt-aws-ovn
/test e2e-kubevirt-azure-ovn
/test e2e-kubevirt-metal-conformance
/test e2e-openstack
/test e2e-openstack-conformance
/test e2e-openstack-csi-cinder
/test e2e-openstack-csi-manila
/test e2e-openstack-nfv
/test okd-scos-e2e-aws-ovn
/test okd-scos-images

Use /test all to run the following jobs that were automatically triggered:

pull-ci-openshift-hypershift-main-e2e-aws
pull-ci-openshift-hypershift-main-e2e-aws-upgrade-hypershift-operator
pull-ci-openshift-hypershift-main-e2e-kubevirt-aws-ovn-reduced
pull-ci-openshift-hypershift-main-images
pull-ci-openshift-hypershift-main-okd-scos-e2e-aws-ovn
pull-ci-openshift-hypershift-main-security
pull-ci-openshift-hypershift-main-unit
pull-ci-openshift-hypershift-main-verify

In response to this:

/retest required

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@petr-muller
Member Author

/retest-required

@petr-muller
Member Author

/retest-required

@petr-muller
Member Author

/test all

@petr-muller
Member Author

/test e2e-aws-upgrade-hypershift-operator

Failed just on destroy 🤞

@sjenning
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Dec 10, 2024
@petr-muller
Member Author

/retest

@petr-muller
Member Author

LOL I was not expecting Konflux to piggyback on Prow commands

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 07b4f6f and 2 for PR HEAD 8374ac1 in total

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 1c35133 and 2 for PR HEAD 8374ac1 in total

@petr-muller
Member Author

/retest-required

Contributor

openshift-ci bot commented Dec 12, 2024

@petr-muller: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-aks | d075673 | link | false | /test e2e-aks |
| ci/prow/okd-scos-e2e-aws-ovn | 8374ac1 | link | false | /test okd-scos-e2e-aws-ovn |


@openshift-merge-bot openshift-merge-bot bot merged commit c5e01a9 into openshift:main Dec 12, 2024
11 of 12 checks passed
@openshift-ci-robot

@petr-muller: Jira Issue OCPBUGS-44438: Some pull requests linked via external trackers have merged:

The following pull requests linked via external trackers have not merged:

These pull requests must merge or be unlinked from the Jira bug in order for it to move to the next state. Once unlinked, request a bug refresh with /jira refresh.

Jira Issue OCPBUGS-44438 has not been moved to the MODIFIED state.


@petr-muller petr-muller deleted the devpreview-cvo branch December 12, 2024 12:09
@petr-muller
Member Author

/jira refresh

@openshift-ci-robot

@petr-muller: Jira Issue OCPBUGS-44438: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-44438 has been moved to the MODIFIED state.


@openshift-bot

[ART PR BUILD NOTIFIER]

Distgit: hypershift
This PR has been included in build ose-hypershift-container-v4.19.0-202412121208.p0.gc5e01a9.assembly.stream.el9.
All builds following this will include this PR.
