Skip to content

Commit

Permalink
Merge pull request #3532 from mikebrow/ensure-secret-pulled-images-pe…
Browse files Browse the repository at this point in the history
…rsist

#2535:ensure secret pulled images - move kep targets; add persist
  • Loading branch information
k8s-ci-robot authored Oct 5, 2023
2 parents e31c14e + 013a444 commit 7ae1c1a
Show file tree
Hide file tree
Showing 2 changed files with 119 additions and 36 deletions.
147 changes: 115 additions & 32 deletions keps/sig-node/2535-ensure-secret-pulled-images/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -15,7 +15,8 @@
- [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
- [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
- [Test Plan](#test-plan)
- [Integration tests](#integration-tests)
- [e2e tests](#e2e-tests)
- [Graduation Criteria](#graduation-criteria)
- [Alpha](#alpha)
- [Deprecation](#deprecation)
Expand Down Expand Up @@ -62,7 +63,7 @@ Items marked with (R) are required *prior to targeting to a milestone / release*

## Summary

We will add support in kubelet for the pullIfNotPresent image pull policy, for
We will add support in kubelet for the `pullIfNotPresent` image pull policy, for
ensuring images pulled with pod imagePullSecrets are re-authenticated for other
pods that do not have the same imagePullSecret/auths used to successfully pull
the images in the first place.
Expand All @@ -89,6 +90,8 @@ order to use a present image.
This means that the image pull policy alwaysPull would no longer be required in
every scenario to ensure image access rights by pods.

*** The issue and these changes improving the security posture without requiring the forcing of pull always, will be documented in the kubernetes image pull policy documentation. The new feature gate should also be documented in release notes. ***

## Motivation

There have been customer requests for improving upon kubernetes' ability to
Expand Down Expand Up @@ -132,10 +135,10 @@ use un-encrypted...

## Proposal

For alpha `kubelet` will keep a list, since boot, of container images that required
authentication and a list of the authentications that successfully pulled the image.
For beta the list will be persisted across reboot of host, and restart of kubelet.
Additionally, an API will be considered to manage the ensure metadata.
For alpha `kubelet` will keep a list, across reboots of host and restart of
kubelet, of container images that required authentication and a list of the
authentications that successfully pulled the image.
For beta an API will be considered to manage the ensure metadata.

`kubelet` will ensure any image in the list is always pulled if an authentication
used is not present, thus enforcing authentication / re-authentication.
Expand All @@ -160,17 +163,17 @@ to set the feature gate to true to gain these this Secure by Default benefit.
### Risks and Mitigations

Image authentications with a registry may expire. To mitigate expirations a
a timeout could be used to force re-authentication. The timeout could be a
a timeout will be used to force re-authentication. The timeout could be a
container runtime feature or a `kubelet` feature. If at the container runtime,
images would not be present during the EnsureImagesExist step, thus would have
to be pulled and authenticated if necessary. This timeout feature will be
implemented in beta.
implemented in alpha.

Since images can be pre-loaded, loaded outside the `kubelet` process, and
garbage collected.. the list of images that required authentication in `kubelet`
will not be a source of truth for how all images were pulled that are in the
container runtime cache. To mitigate, images can be garbage collected at boot.
And for beta, we will persist ensure metadata across reboot of host, and restart
And we will persist ensure metadata across reboot of host, and restart
of kubelet, and possibly look at a way to add ensure metadata for images loaded
outside of kubelet. In beta we will add a switch to enable re-auth on boot for
admins seeking that instead of having to garbage collect where they do not use
Expand All @@ -179,15 +182,47 @@ or expect preloaded images since boot.

## Design Details

Kubelet will track, in memory, a hash map for the credentials that were successfully used to pull an image. The hash map
will not be persisted to disk, in alpha. For alpha explicitly, we will not reuse or add other state manager concepts to kubelet.
Kubelet will track, in memory, a hash map for the credentials that were successfully used to pull an image. It has been decided that the hash map will be persisted to disk, in alpha.

The persisted "cache" will undergo cleanup operations on a timely basis (by default once an hour).

The persistence of the on storage cache is mainly for restarting kubelet and/or node reboot.

The max size of the cache will scale with the number of unique cache entries * the number of unique images that have not been garbage collected. It is not expected that this will be a significant number of bytes. Will be verified by actual use in Alpha and subsequent metrics in Beta.

See `/var/lib/kubelet/image_manager_state` in [kubernetes/kubernetes#114847](https://github.com/kubernetes/kubernetes/pull/114847)

> ```
> {
> "images": {
> "sha256:eb6cbbefef909d52f4b2b29f8972bbb6d86fc9dba6528e65aad4f119ce469f7a": {
> "authHash": { ** per review comment use SHA256 here vs hash **
> "115b8808c3e7f073": {
> "ensured": true,
> "dueDate": "2023-05-30T05:26:53.76740982+08:00"
> }
> },
> "name": "daocloud.io/daocloud/dce-registry-tool:3.0.8"
> }
> }
> }
> ```
See PR for detailed design / behavior documentation.
See PR linked above for detailed design / behavior documentation.
### Test Plan
[x] I/we understand the owners of the involved components may require updates to
existing tests to make this code solid enough prior to committing the changes
necessary to implement this enhancement.
##### Prerequisite testing updates
##### Unit tests
For alpha, exhaustive Kubelet unit tests will be provided. Functions affected by the feature gate will be run with the feature gate on and with the feature gate off. Unit buckets will be provided for:
- HashAuth - (new, small) returns a hash code for a CRI pull image auth [link](https://github.com/kubernetes/kubernetes/pull/94899/files#diff-ca08601dfd2fdf846f066d0338dc332beddd5602ab3a71b8fac95b419842da63R704-R751)
- HashAuth - (new, small) returns a hash code for a CRI pull image auth [link](https://github.com/kubernetes/kubernetes/pull/94899/files#diff-ca08601dfd2fdf846f066d0338dc332beddd5602ab3a71b8fac95b419842da63R704-R751) ** per review comment will use SHA256 **
- shouldPullImage - (modified, large sized change) determines if image should be pulled based on presence, and image pull policy, and now with the feature gate on if the image has been pulled/ensured by a secret. A unit test bucket did not exist for this function. The unit bucket will cover a matrix for:
```
pullIfNotPresent := &v1.Container{
Expand All @@ -214,7 +249,48 @@ For alpha, exhaustive Kubelet unit tests will be provided. Functions affected by
```
[TestShouldPullImage link](https://github.com/kubernetes/kubernetes/pull/94899/files#diff-7297f08c72da9bf6479e80c03b45e24ea92ccb11c0031549e51b51f88a91f813R311-R438)
At beta we should revisit if integration buckets are warranted for e2e node and/or cri-tools/critest, and after gathering feedback.
PersistHashMeta() ** will be persisting SHA256 entries vs hash **
Additionally, for Alpha we will update this readme with an enumeration of the core packages being touched by the PR to implement this enhancement and provide the current unit coverage for those in the form of:
- <package>: <date> - <current test coverage>
The data will be read from:
https://testgrid.k8s.io/sig-testing-canaries#ci-kubernetes-coverage-unit
##### Integration tests
At beta we will revisit if integration buckets are warranted for cri-tools/critest, and after gathering feedback.
<!--
Integration tests are contained in k8s.io/kubernetes/test/integration.
Integration tests allow control of the configuration parameters used to start the binaries under test.
This is different from e2e tests which do not allow configuration of parameters.
Doing this allows testing non-default options and multiple different and potentially conflicting command line options.
-->
<!--
This question should be filled when targeting a release.
For Alpha, describe what tests will be added to ensure proper quality of the enhancement.
For Beta and GA, add links to added tests together with links to k8s-triage for those tests:
https://storage.googleapis.com/k8s-triage/index.html
-->
- <test>: <link to test coverage> (TBD)
##### e2e tests
<!--
This question should be filled when targeting a release.
For Alpha, describe what tests will be added to ensure proper quality of the enhancement.
For Beta and GA, add links to added tests together with links to k8s-triage for those tests:
https://storage.googleapis.com/k8s-triage/index.html
We expect no non-infra related flakes in the last month as a GA graduation criteria.
-->
At beta we will revisit if e2e buckets are warranted for e2e node, and after gathering feedback.
- <test>: <link to test coverage> (TBD)
### Graduation Criteria
Expand All @@ -226,16 +302,21 @@ At beta we should revisit if integration buckets are warranted for e2e node and/
#### Deprecation
N/A in alpha
TBD subsequent to alpha
### Upgrade / Downgrade Strategy
### Version Skew Strategy
N/A for alpha
TBD subsequent to alpha
## Production Readiness Review Questionnaire
### Feature Enablement and Rollback
- At Alpha this feature will be disabled by default with a feature gate.
- At Beta this feature will be enabled by default with the feature gate.
- At GA the ability to gate the feature will be removed leaving the feature enabled.
###### How can this feature be enabled / disabled in a live cluster?
Expand All @@ -261,35 +342,37 @@ Will go back to working as designed.
Yes, tests run both enabled and disabled.
### Rollout, Upgrade and Rollback Planning
N/A
TBD
###### How can a rollout or rollback fail? Can it impact already running workloads?
N/A
TBD
###### What specific metrics should inform a rollback?
N/A
TBD needed for Beta
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
N/A
TBD
###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
N/A
TBD
### Monitoring Requirements
N/A
TBD
###### How can an operator determine if the feature is in use by workloads?
Can check if images pulled with credentials by a first pod, are also pulled with credentials by a second pod that is
For alpha can check if images pulled with credentials by a first pod, are also pulled with credentials by a second pod that is
using the pull if not present image pull policy. Will show up as network events. Though only the manifests will be
revalidated against the container image repository, large contents will not be pulled. Thus one could monitor traffic
to the registry.
For beta will add metrics allowing an admin to determine how often an image has been reauthenticated to an image registry because of cache expiration or due to reuse across pods that have different authentication information. Success metrics will also be provided highlighting cache hits.
###### How can someone using this feature know that it is working for their instance?
Can test for an image pull failure event coming from a second pod that does not have credentials to pull the image
Expand All @@ -301,27 +384,27 @@ where the image is present and the image pull policy is if not present.
###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
N/A
TBD
###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
N/A
TBD
###### Are there any missing metrics that would be useful to have to improve observability of this feature?
N/A
TBD needed for Beta
### Dependencies
N/A for alpha
TBD
###### Does this feature depend on any specific services running in the cluster?
No.
### Scalability
N/A
TBD
###### Will enabling / using this feature result in any new API calls?
Expand Down Expand Up @@ -352,27 +435,27 @@ When switched on see above.
### Troubleshooting
N/A
TBD
###### How does this feature react if the API server and/or etcd is unavailable?
N/A
TBD
###### What are other known failure modes?
N/A
TBD
###### What steps should be taken if SLOs are not being met to determine the problem?
Check logs.
## Implementation History
tbd
TBD
## Drawbacks [optional]
Why should this KEP _not_ be implemented. N/A
Why should this KEP _not_ be implemented. TBD
## Alternatives [optional]
Expand All @@ -384,4 +467,4 @@ ensure the image instead of kubelet.
## Infrastructure Needed [optional]
tbd
TBD
8 changes: 4 additions & 4 deletions keps/sig-node/2535-ensure-secret-pulled-images/kep.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -16,11 +16,11 @@ approvers:
- "@dchen1107"
- "@derekwaynecarr"
stage: alpha
latest-milestone: "v1.24"
latest-milestone: "v1.29"
milestone:
alpha: "v1.24"
beta: "v1.25"
stable: "v1.27"
alpha: "v1.29"
beta: "v1.30"
stable: "v1.32"
feature-gates:
- name: KubeletEnsureSecretPulledImages
components:
Expand Down

0 comments on commit 7ae1c1a

Please sign in to comment.