[Umbrella] Implement improvements to protect CI and releases from changes to release tooling #816
Comments
FWIW, what you can do today is set
This & mock releases should be safe enough and a way forward to start backfilling tests, refactoring, ... Just to be clear: I am not advocating to keep using this and a mock stage & release as "the test". We need way faster feedback. But we can start with that.
on the side of the kubeadm release-master-informing jobs, the failures started happening because the kubeadm jobs could not find the CI artifacts they need, such as:
examining the bucket:
as @dims outlined, the prow job
this is the first kubeadm job failure we saw: i could not map the time to a problematic commit in k/release.
i think this is a great idea and such an e2e can compensate for the lack of unit tests. the problems i saw - e.g.
indicate issues in the staging process.
@neolit123 -- that's super useful; thanks for the data!
@hoegaarden -- I didn't know you could do that. Do we have it documented somewhere?
There is one such presubmit now (pull-release-cluster-up), but that doesn't cover everything used by, e.g., the kubeadm job.
@hoegaarden @justaugustus that gcbmgr script needs access to the release-test-dev bucket too
@dims -- Could you clarify what you mean by access to That snippet only shows that you don't have access to the
It is not documented; it came in somewhat recently with 636269f.
You still need access to the
@justaugustus sorry, i typo-ed it :) like @hoegaarden and you pointed out, it's access to
To make sure we capture the various failure modes here, I did a little digging, layered on top of the diagnoses we've already received from a few folks on this thread.

Download failures
Before we dive in, it's helpful to understand what the ci-kubernetes-build job does. Here's its job config:

periodics:
<snip>
- interval: 1h
  name: ci-kubernetes-build
  labels:
    preset-service-account: "true"
    preset-dind-enabled: "true"
  annotations:
    fork-per-release: "true"
    fork-per-release-replacements: "--extra-publish-file=k8s-master -> --extra-publish-file=k8s-beta"
    fork-per-release-generic-suffix: "true"
    testgrid-dashboards: sig-release-master-blocking
    testgrid-tab-name: build-master
    testgrid-alert-email: "kubernetes-release-team@googlegroups.com"
  spec:
    containers:
    - image: gcr.io/k8s-testimages/bootstrap:v20190703-1f4d616
      args:
      - --repo=k8s.io/kubernetes
      - --repo=k8s.io/release
      - --root=/go/src
      - --timeout=180
      - --scenario=kubernetes_build
      - --
      - --allow-dup
      - --extra-publish-file=k8s-master
      - --hyperkube
      - --registry=gcr.io/kubernetes-ci-images
      # docker-in-docker needs privileged mode
      securityContext:
        privileged: true
      resources:
        requests:
          cpu: 4
          memory: "8Gi"

Focusing on the image

We use the bootstrap image for this test.
Looking at the entrypoint, we execute:

<snip>
/usr/local/bin/runner.sh \
  ./test-infra/jenkins/bootstrap.py \
    --job="${JOB_NAME}" \
    --service-account="${GOOGLE_APPLICATION_CREDENTIALS}" \
    --upload='gs://kubernetes-jenkins/logs' \
    "$@"

We now know that a script called bootstrap.py is run.

args

Bootstrap takes the following args:
The second set is passed into a scenario (a Python script which executes a specific test scenario) called kubernetes_build.py:

def main(args):
    <snip>
    push_build_args = ['--nomock', '--verbose', '--ci']
    <snip>
    check('make', 'clean')
    if args.fast:
        check('make', 'quick-release')
    else:
        check('make', 'release')
    check(args.push_build_script, *push_build_args)

So ultimately, push-build.sh from kubernetes/release is what ends up being run.
Let's now take a look at a failure. In this instance, the command executed was:

W0702 00:35:33.049] Run: ('../release/push-build.sh', '--nomock', '--verbose', '--ci', '--release-kind=kubernetes', '--docker-registry=gcr.io/kubernetes-ci-images', '--extra-publish-file=k8s-master', '--allow-dup')

In a successful CI build, we expect to see the following sections run:
Walking the individual failures:

1
Checking the diff between failure and reverted state:

@@ -136,7 +132,9 @@ gitlib::github_api_token () {
##############################################################################
# Checks github ACLs
# returns 1 on failure
-PROGSTEP[gitlib::github_acls]="CHECK GITHUB AUTH"
+# Disable shellcheck for dynamically defined variable
+# shellcheck disable=SC2154
+export PROGSTEP[gitlib::github_acls]="CHECK GITHUB AUTH"
gitlib::github_acls () {
gitlib::github_api_token || return 1
@@ -161,7 +159,6 @@ gitlib::git_config_for_gcb () {

Explanation of PROGSTEP:

# Print PROGSTEPs as bolded headers within scripts.
# PROGSTEP is a globally defined dictionary (associative array) that can
# take a function name or integer as its key.
# The function indexes the dictionary in the order the items are added (by
# calling the function) so that progress can be shown during script execution
# (1/4, 2/4...4/4).
# If a PROGSTEP dictionary is empty, common::stepheader() will just show the
# text passed in.
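To make the mechanics concrete, here's a minimal, hypothetical sketch of the pattern described above (an associative array plus a stepheader-style helper). Names and behavior are illustrative only, not the actual lib/common.sh code:

```bash
#!/usr/bin/env bash
# Hypothetical, simplified sketch of the PROGSTEP pattern; bash 4+ required.
declare -A PROGSTEP
PROGSTEP[build_release]="BUILD RELEASE ARTIFACTS"
PROGSTEP[push_release]="PUSH RELEASE ARTIFACTS"

step=0
stepheader() {
  local key=$1
  if ((${#PROGSTEP[@]} > 0)); then
    # PROGSTEP has entries: show the header with progress, e.g. "(1/2) ..."
    ((step++))
    echo "=== (${step}/${#PROGSTEP[@]}) ${PROGSTEP[$key]:-$key} ==="
  else
    # Empty PROGSTEP: just show the text passed in
    echo "=== $key ==="
  fi
}

stepheader build_release   # === (1/2) BUILD RELEASE ARTIFACTS ===
stepheader push_release    # === (2/2) PUSH RELEASE ARTIFACTS ===
```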
Fix

None required, post-revert.

2

I0702 09:02:50.192] push-build.sh::release::gcs::push_release_artifacts(): /google-cloud-sdk/bin/gsutil -qm cp -rc /go/src/k8s.io/kubernetes/_output/gcs-stage/v1.16.0-alpha.0.1787+b3c6e2189576c9/* gs://kubernetes-release-dev/ci/v1.16.0-alpha.0.1787+b3c6e2189576c9/
W0702 09:02:50.292] tar: *: Cannot stat: No such file or directory
W0702 09:02:50.293] tar: Exiting with failure status due to previous errors
W0702 09:02:50.293] tar: /go/src/k8s.io/kubernetes/_output/release-stage/client/-- darwin-386 darwin-amd64 linux-386 linux-amd64 linux-arm linux-arm64 linux-ppc64le linux-s390x windows-386 windows-amd64/kubernetes/client/bin: Cannot open: No such file or directory
W0702 09:02:50.294] tar: Error is not recoverable: exiting now
W0702 09:02:50.294] tar: This does not look like a tar archive
W0702 09:02:50.294] tar: Exiting with failure status due to previous errors
I0702 09:02:52.032] OK

Checking the diff between failure and reverted state:

@@ -597,12 +611,12 @@ release::gcs::locally_stage_release_artifacts() {
logecho "Locally stage release artifacts..."
- logrun rm -rf $gcs_stage || return 1
- logrun mkdir -p $gcs_stage || return 1
+ logrun rm -rf "$gcs_stage" || return 1
+ logrun mkdir -p "$gcs_stage" || return 1
# Stage everything in release directory
logecho "- Staging locally to ${gcs_stage##$build_output/}..."
- release::gcs::stage_and_hash $gcs_stage $release_tars/* . || return 1
+ release::gcs::stage_and_hash "$gcs_stage" "$release_tars"/* . || return 1
if [[ "$release_kind" == "kubernetes" ]]; then
local gce_path=$release_stage/full/kubernetes/cluster/gce
@@ -621,20 +635,20 @@ release::gcs::locally_stage_release_artifacts() {

release::gcs::stage_and_hash "$gcs_stage" "$release_tars"/* . || return 1

causes the failure. It probably should've been:

release::gcs::stage_and_hash "${gcs_stage}" "${release_tars}/*" . || return 1

So here, it's entirely possible that the tars were on the workspace during the CI build, but they never made it to GCS.

Fix
3
Checking the diff between failure and reverted state:

@@ -931,14 +947,16 @@ release::docker::release () {
logecho "Send docker containers from release-images to $push_registry..."
- arches=($(cd "$release_images"; echo *))
- for arch in ${arches[@]}; do
- for tarfile in $release_images/$arch/*.tar; do
+ mapfile -t arches < <(
+ cd "$release_images" || return 1
+ find . -mindepth 1 -maxdepth 1 -type d | cut -c 3-)
+ for arch in "${arches[@]}"; do
+ for tarfile in "$release_images/$arch"/*.tar; do
# There may be multiple tags; just get the first
- orig_tag=$(tar xf $tarfile manifest.json -O | jq -r '.[0].RepoTags[0]')
+ orig_tag=$(tar xf "$tarfile" manifest.json -O | jq -r '.[0].RepoTags[0]')
if [[ ! "$orig_tag" =~ ^.+/(.+):.+$ ]]; then
logecho "$FAILED: malformed tag in $tarfile:"
- logecho $orig_tag
+ logecho "$orig_tag"
return 1
fi
binary=${BASH_REMATCH[1]}
@@ -947,14 +965,15 @@ release::docker::release () {
Seems to be another miss on globbing here:

for tarfile in "$release_images/$arch"/*.tar; do

It probably should've been:

for tarfile in "${release_images}/${arch}/*.tar"; do

Fix
More to come!
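For reference, here is a small, self-contained illustration (with made-up paths) of how quoting interacts with glob expansion in the snippets discussed for failures 2 and 3. It demonstrates only the shell behavior, not the actual stage_and_hash or docker::release code paths:

```bash
#!/usr/bin/env bash
# Made-up demo directory; not a real release path.
release_tars=$(mktemp -d)
touch "$release_tars/kubernetes.tar.gz" "$release_tars/kubernetes-client-linux-amd64.tar.gz"

# Quoting only the variable leaves the glob unquoted, so the shell expands it:
printf '%s\n' "$release_tars"/*.tar.gz
# -> .../kubernetes-client-linux-amd64.tar.gz
# -> .../kubernetes.tar.gz

# Quoting the whole word keeps the '*' literal; no expansion happens:
printf '%s\n' "$release_tars/*.tar.gz"
# -> .../*.tar.gz

rm -r "$release_tars"
```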
ci-kubernetes-build-canary builds and publishes artifacts to a canary GCS bucket using the tooling on the master branch of kubernetes/release. The goal here is to allow Release Engineering to improve the current toolset without impacting CI for the entire project. While this job should closely mirror the configuration of ci-kubernetes-build, it should differ in a few ways:
- runs on CI ref of kubernetes/kubernetes
- runs on master branch of kubernetes/release
- publishes artifacts to a different GCS bucket
- alerts Release Managers instead of the Release Team
ref: kubernetes/release/issues/816
Signed-off-by: Stephen Augustus <saugustus@vmware.com>
Adding a canary job to discuss here: kubernetes/test-infra#13340
Issues go stale after 90d of inactivity. If this issue is safe to close now, please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle frozen
/milestone v1.18
/priority important-soon
This issue points to the release toolbox (the bash=>Go work). Currently this only needs an update and then a close.
/remove-priority critical-urgent
Triage conclusion: closing as we've been developing in a different direction and the open items no longer seem to be relevant. If this is not the case, feel free to reopen or carve out the open items via new, updated, rescoped issues.
(Also sent to k-dev, SIG Release MLs, and SIG Testing: https://groups.google.com/d/topic/kubernetes-dev/OuLMzqZkdtw/discussion)
Following an attempt to improve the semantics of the release tooling via shellcheck (#726), we found that we were unable to stage releases.
Multiple fixes were merged in an attempt to bring us to a usable state.
An unintended and unexpected side effect of this was a cascading failure of multiple release-blocking jobs. A few examples:
Ultimately, it was decided that the right course of action was to revert back to a known good state in the repo (#814) to stop the bleeding.
This implies that, in our current state, it is inadvisable to make any changes to the tooling in this repo.
As such, I'm advising the following course of action (h/t to @nikhita, @liggitt, and @BenTheElder for being a sounding board):
(this will require repo admins to explicitly approve and override the blockade to merge changes to critical tooling)
(they are using something in k/release; we need to figure out what those somethings are)
(this locks in a known good state of k/release that doesn't need to be master)
At this point, we will have gotten to a place where we can safely make changes to k/release without impacting CI. We will then:
For longer term goals, we should seek to:
(lib/{common,gitlib,releaselib}) and call these new tools in the existing release tooling (this allows us to get some immediate benefit of a more robust language w/o having to completely refactor)
(Some historical references: kubernetes/kubernetes#28922, kubernetes/kubernetes#16529, kubernetes/kubernetes#15560, kubernetes/kubernetes#8686)
Please take this as an initial assessment of the situation and feel free to provide feedback. :)
/assign
/milestone v1.16
/area release-eng
/sig release
/kind bug
/priority critical-urgent
cc @kubernetes/sig-release-admins @kubernetes/release-engineering @dims @neolit123 @pswica