- Release Signoff Checklist
- Acknowledgements
- Summary
- Motivation
- Proposal
- Design Details
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Alternatives
- Infrastructure Needed (Optional)
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as implementable
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
- (R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
- (R) Production readiness review completed
- (R) Production readiness review approved
- "Implementation History" section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
This proposal is heavily based on this community enhancement, whose underlying problem was never addressed. The purpose of this document is to modernize the proposal, both in process (updating the doc to meet the new KEP guidelines) and in implementation (updating the proposal to change the CRI instead of the now-dropped dockershim).
A lot of credit goes to the authors of the previous proposal.
The purpose of this KEP is to outline changes to the Kubelet and Container Runtime Interface (CRI) that move the way the Kubelet learns about pod state changes to a List/Watch model, which polls less frequently and so reduces overhead. Specifically, the Kubelet will listen for gRPC server-streaming events from the CRI implementation and use them to generate pod lifecycle events.
The overarching goal of this effort is to reduce the Kubelet and CRI implementation's steady state CPU usage.
In Kubernetes, Kubelet is a per-node daemon that manages the pods on the node, driving the pod states to match their pod specifications (specs). To achieve this, Kubelet needs to react to changes in both (1) pod specs and (2) container states. For the former, Kubelet watches for pod spec changes from multiple sources; for the latter, Kubelet polls the container runtime periodically for the latest states of all containers. The current hardcoded default polling interval is 1s.
Polling incurs non-negligible overhead as the number of pods/containers increases, and is exacerbated by Kubelet's parallelism: one worker (goroutine) per pod, each of which queries the container runtime individually. A large number of periodic, concurrent requests causes CPU usage spikes (even when there is no spec/state change), poor performance, and reliability problems due to an overwhelmed container runtime. Ultimately, it limits Kubelet's scalability.
- Reduce unnecessary work during inactivity (no spec/state changes)
- In other words, reduce steady-state CPU usage of Kubelet and CRI implementation by reducing frequent polling of the container statuses.
- Completely eliminate polling altogether.
- This proposal does not advocate completely removing the polling. We cannot solely rely on the upstream container events due to the possibility of missing events. PLEG should relist at reduced frequency to ensure no events are missed.
- Addressing container image relisting via CRI events is out of scope for this enhancement at this point in time.
This proposal aims to replace the periodic polling with a pod lifecycle event watcher. Currently, the Kubelet relies on CRI calls of the form List*, such as ListContainers and ListPodSandbox. Each of these is used to populate the Kubelet's perspective of the state of the node.
As the number of pods on a node increases, the time the Kubelet and CRI implementation spend generating and reading these lists increases linearly. What is needed is a way for the Kubelet to be notified when a container changes state in a way the Kubelet did not trigger.
There should only be two such cases, and in normal operation, only one would happen frequently:
- The first, and clearest, case of a container changing state without the Kubelet triggering that state change is when a container stops. Containers can exit gracefully, or be OOM killed, and the Kubelet would not know.
- We will also introduce events for when the container is created and when it is started. This will help us reduce the relisting that takes place while the kubelet waits for the container to start.
- Although the kubelet initiates container deletion, for the sake of increased validation we are also introducing an event from the runtime to denote it.
- The second, and less likely case is when another entity comes and changes the state of the node.
- For container related events (such as a container creating, starting, stopping or being killed), this can appear as a user calling crictl manually, or even using the runtime directly.
The Kubelet currently covers each of these cases quite easily: by listing all of the resources on the node, it has an accurate picture within one poll interval. For each of these cases, a new CRI-based events API can be made, using gRPC server streaming. This way, the entity closest to the activity of the containers and pods (the CRI implementation) can be responsible for informing the Kubelet of their behavior directly.
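As a rough illustration of the consumer side of such a stream, the following Go sketch drains events until the stream ends or fails. The `containerEvent` type, the `eventStream` interface, and `fakeStream` are hypothetical stand-ins introduced for this example (a real client would use the stream type generated from the CRI protobuf definitions); only the receive-until-`io.EOF` loop is the point.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// containerEvent is a hypothetical, trimmed-down stand-in for the CRI event message.
type containerEvent struct {
	ContainerID string
	Type        string
}

// eventStream is a hypothetical stand-in for a server-streaming client:
// Recv blocks until the next event arrives and returns io.EOF when the stream ends.
type eventStream interface {
	Recv() (*containerEvent, error)
}

// fakeStream replays a fixed list of events, then reports io.EOF.
type fakeStream struct {
	events []containerEvent
	next   int
}

func (s *fakeStream) Recv() (*containerEvent, error) {
	if s.next >= len(s.events) {
		return nil, io.EOF
	}
	ev := &s.events[s.next]
	s.next++
	return ev, nil
}

// consume forwards every received event to handle. It returns nil on a clean
// end of stream and the underlying error otherwise, so the caller can decide
// whether to reconnect or rely on relisting.
func consume(stream eventStream, handle func(*containerEvent)) error {
	for {
		ev, err := stream.Recv()
		if errors.Is(err, io.EOF) {
			return nil
		}
		if err != nil {
			return err
		}
		handle(ev)
	}
}

func main() {
	stream := &fakeStream{events: []containerEvent{
		{ContainerID: "c1", Type: "started"},
		{ContainerID: "c1", Type: "stopped"},
	}}
	err := consume(stream, func(ev *containerEvent) {
		fmt.Println(ev.ContainerID, ev.Type)
	})
	fmt.Println("stream ended, err =", err)
}
```

Returning the stream error to the caller, rather than handling it inside the loop, keeps the decision about reconnecting or falling back to relisting in one place.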
- As a cluster administrator I want to enable the Evented PLEG feature of the kubelet for better performance with as little infrastructure overhead as possible.
- PLEG is very core to the container status handling in the kubelet. Hence any miscalculation there would result in unpredictable behaviour not just for the node but for an entire cluster.
- To reduce the risk of regression, this feature initially will be available only as an opt-in.
- Users can disable this feature to make kubelet use existing relisting based PLEG.
- Another risk is the CRI implementation could have a buggy event emitting system, and miss pod lifecycle events.
- A mitigation is a kube_pod_missed_events metric, which the Kubelet could report when a lifecycle event is registered that wasn't triggered by an event, but rather by changes of state between lists.
- While using the Evented implementation, the periodic relisting functionality would still be used with an increased interval, which should work as a fallback mechanism for missed events in case of any disruptions.
- Evented PLEG will need to update the global cache timestamp periodically in order to make sure pod workers don't get stuck at GetNewerThan in case Evented PLEG misses the event for any unforeseen reason.
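One way the missed-event count could be derived is sketched below in Go: compare the container states from two consecutive relists and count the changes for which no streamed event was seen in between. The function name `countMissed`, the string-valued states, and the `seen` set are all assumptions made for illustration; the KEP does not prescribe how the metric is computed.

```go
package main

import "fmt"

// countMissed compares container states from two consecutive relists and
// counts the containers whose state changed without a matching streamed event.
// seen holds the IDs of containers for which an event arrived in between.
// All names here are hypothetical.
func countMissed(prev, curr map[string]string, seen map[string]bool) int {
	missed := 0
	// New containers and containers whose state differs from the last relist.
	for id, state := range curr {
		if old, ok := prev[id]; (!ok || old != state) && !seen[id] {
			missed++
		}
	}
	// Containers that disappeared between relists also count as state changes.
	for id := range prev {
		if _, ok := curr[id]; !ok && !seen[id] {
			missed++
		}
	}
	return missed
}

func main() {
	prev := map[string]string{"c1": "running", "c2": "running", "c3": "running"}
	curr := map[string]string{"c1": "exited", "c2": "exited", "c4": "running"}
	seen := map[string]bool{"c1": true} // only c1's stop was reported by an event
	fmt.Println("missed events:", countMissed(prev, curr, seen))
}
```

A value that keeps growing would suggest the runtime is dropping events, which is the situation the metric is meant to surface.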
Kubelet generates PodLifecycleEvent using relisting. These PodLifecycleEvents get used in kubelet's sync loop to infer the state of the container, e.g. to determine if the container has died.
The idea behind this enhancement is that the kubelet will receive the CRI events mentioned above from the CRI runtime and generate the corresponding PodLifecycleEvent. This will reduce the kubelet's dependency on relisting to generate PodLifecycleEvent, and that event will be immediately available within the sync loop instead of waiting for relisting to finish. Kubelet will still relist, but with a reduced frequency.
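The translation step can be pictured with a small Go sketch. The types below are local, hypothetical stand-ins, and the particular mapping (for example, treating a created container as a generic "changed" event) is an assumption for illustration rather than the mapping the kubelet is required to use.

```go
package main

import "fmt"

// containerEventType mirrors the four CRI event kinds named in this proposal.
// It is a local stand-in, not the generated CRI type.
type containerEventType int

const (
	containerCreated containerEventType = iota
	containerStarted
	containerStopped
	containerDeleted
)

// podLifecycleEvent is a simplified, hypothetical version of the kubelet's
// pod lifecycle event.
type podLifecycleEvent struct {
	PodID       string
	Type        string
	ContainerID string
}

// toLifecycleEvent converts a CRI event into a pod lifecycle event.
// The second return value is false for event types the sketch does not know.
func toLifecycleEvent(podID, containerID string, t containerEventType) (podLifecycleEvent, bool) {
	var kind string
	switch t {
	case containerCreated:
		kind = "ContainerChanged" // assumed mapping
	case containerStarted:
		kind = "ContainerStarted"
	case containerStopped:
		kind = "ContainerDied"
	case containerDeleted:
		kind = "ContainerRemoved"
	default:
		return podLifecycleEvent{}, false
	}
	return podLifecycleEvent{PodID: podID, Type: kind, ContainerID: containerID}, true
}

func main() {
	ev, ok := toLifecycleEvent("pod-a", "c1", containerStopped)
	fmt.Println(ev.Type, ok)
}
```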
This feature can only be used when the EventedPLEG feature gate is enabled.
Kubelet cache saves the pod status with the timestamp. The value of this timestamp is calculated within the kubelet process. This works fine when there is only Generic PLEG at work as it will calculate the timestamp first and then fetch the PodStatus to save it in the cache.
As of today, the PodStatus is saved in the cache without any validation of the existing status against the current timestamp. This works well when there is only Generic PLEG setting the PodStatus in the cache.
If multiple entities, such as Evented PLEG, set the PodStatus in the cache, we may run into racy timestamps if each of them calculates the timestamp in its own execution flow. While Generic PLEG calculates this timestamp and then gets the PodStatus, Evented PLEG can only calculate the corresponding timestamp after the event has been received by the Kubelet. Any disruption in getting the events, such as errors in the gRPC connection, might skew the time calculated in the kubelet for the Evented PLEG.
In order to address the issues above, we propose that the existing Generic PLEG as well as Evented PLEG should rely on the CRI Runtime for the timestamp of the PodStatus. This way the PodStatus timestamp would also be closer to the actual time when the statuses of the Sandboxes and Containers were provided by the CRI Runtime. It will enable us to correctly compare the timestamps before saving them in the cache, to avoid the erroneous behaviour. This should also prevent any old buffered PodStatus (consolidated during any disruptions or failures) from overriding the newer entry in the cache.
Instead of getting the Sandbox and Container statuses independently and using the timestamp calculated from the kubelet process, Generic PLEG can fetch the PodStatus directly from the CRI Runtime using the modified PodSandboxStatus rpc of the RuntimeService.
The modified PodSandboxStatusRequest will have a field includeContainers to indicate if PodSandboxStatusResponse should have ContainerStatuses and the corresponding timestamp.
message PodSandboxStatusRequest {
    // ID of the PodSandbox for which to retrieve status.
    string pod_sandbox_id = 1;
    // Verbose indicates whether to return extra information about the pod sandbox.
    bool verbose = 2;
    // IncludeContainers indicates whether to include ContainerStatuses and timestamp in the PodSandboxStatusResponse
    bool includeContainers = 3;
}
message PodSandboxStatusResponse {
    // Status of the PodSandbox.
    PodSandboxStatus status = 1;
    // Info is extra information of the PodSandbox. The key could be arbitrary string, and
    // value should be in json format. The information could include anything useful for
    // debug, e.g. network namespace for linux container based container runtime.
    // It should only be returned non-empty when Verbose is true.
    map<string, string> info = 2;
    // ContainerStatuses needs to be included if includeContainers is set true in PodSandboxStatusRequest
    repeated ContainerStatus containerStatuses = 3;
    // Timestamp needs to be included if includeContainers is set true in PodSandboxStatusRequest
    int64 timestamp = 4;
}
Another RPC will be introduced in the CRI Runtime Service,
// GetContainerEvents gets container events from the CRI runtime
rpc GetContainerEvents(GetEventsRequest) returns (stream ContainerEventResponse) {}
message ContainerEventResponse {
    // ID of the container
    string container_id = 1;
    // Type of the container event
    ContainerEventType container_event_type = 2;
    // Creation timestamp of this event
    int64 created_at = 3;
    // Metadata of the pod sandbox
    PodSandboxMetadata pod_sandbox_metadata = 4;
    // Sandbox status of the pod
    PodSandboxStatus pod_sandbox_status = 5;
    // Container statuses of the pod
    repeated ContainerStatus containers_statuses = 6;
}
The creation timestamp of the event will be used when saving the PodStatus in the kubelet cache.
enum ContainerEventType {
    // Container created
    CONTAINER_CREATED_EVENT = 0;
    // Container started
    CONTAINER_STARTED_EVENT = 1;
    // Container stopped
    CONTAINER_STOPPED_EVENT = 2;
    // Container deleted
    CONTAINER_DELETED_EVENT = 3;
}
While using Evented PLEG, the existing Generic PLEG is set to relist with the increased period. But if Evented PLEG faces temporary disruptions in the gRPC connection with the runtime, there is a chance that, once the connection is restored, the incoming buffered events (which are now outdated) might end up overwriting the latest pod status in the cache updated by the Generic PLEG. Having a cache setter that only updates if the pod status in the cache is older than the current pod status helps in mitigating this issue.
At present kubelet updates the cache using the Set function.
Pod status should be updated in the cache only if the new status update has a timestamp newer than the timestamp of the entry already present in the cache.
func (c *cache) Set(id types.UID, status *PodStatus, err error, timestamp time.Time) (updated bool) {
	c.lock.Lock()
	defer c.lock.Unlock()
	// Set the value in the cache only if it's not present already
	// or the timestamp in the cache is older than the current update timestamp
	if val, ok := c.pods[id]; !ok || val.modified.Before(timestamp) {
		c.pods[id] = &data{status: status, err: err, modified: timestamp}
		c.notify(id, timestamp)
		return true
	}
	return false
}
This has no impact on the existing Generic PLEG when used without Evented PLEG, because it is the only entity that sets the cache and it does so every second (if needed) for a given pod.
For this feature to work, Kubelet needs to be used with a compatible CRI Runtime that is capable of generating CRI Events. If the Kubelet detects during startup that the CRI Runtime doesn't support generating and streaming CRI Events, it should automatically fall back to using Generic PLEG.
[X] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
- kubernetes/kubernetes/tree/master/pkg/kubelet: 15-Jun-2022 - 64.5
- Ensure the PodLifecycleEvent is generated by the kubelet when the CRI events are received.
- Verify the Pod status is updated correctly when the CRI events are received.
- Existing Pod Lifecycle tests must pass even after increasing the relisting period.
- E2E Node Conformance non-blocking presubmit job
- E2E Node Conformance non-blocking periodic job
- Feature implemented behind a feature flag
- Existing node e2e tests around pod lifecycle must pass
- Add E2E Node Conformance presubmit job in CI
- Add E2E Node Conformance periodic job in CI
To test the performance and scalability of Evented PLEG, it is necessary to generate a large number of CRI Events by creating and deleting a significant number of containers within a short period of time. The following steps outline the stress test:
Since this is a disruptive stress test, it should be part of a node e2e Serial job. CRI Events are generated per container, and therefore, the test should create a substantial number of containers within a single pod. After creation, these containers should run to completion and then be removed by the kubelet. This process will ensure the generation of CONTAINER_CREATED_EVENT, CONTAINER_STARTED_EVENT, CONTAINER_STOPPED_EVENT, and CONTAINER_DELETED_EVENT.
The test should continue to create these containers until the histogram metric evented_pleg_connection_latency_seconds begins to show distinct latency values in its 1-second bucket. This indicates that it is taking 1 second or longer for an event to be observed by the kubelet after being generated by the runtime. Typical values for this latency are around 0.001 seconds, so a latency of 1 second is a safe indication that the system is under stress.
Once evented_pleg_connection_latency_seconds is observed to be greater than 1 second, new container creation is halted, and the rest of the already created containers are run to completion. At this point, kubelet_evented_pleg_connection_latency_seconds_count can be used to determine the total number of CRI Events generated during this test.
To test the ability of the Kubelet to recover the latest state of a container after a restart, a disruption test should be included in the node e2e Serial job. The test should involve creating a container with a sufficient time to completion (e.g. sleep 20), and then immediately stopping the Kubelet once the container enters the Running state. The CRI runtime should emit CRI events indicating the change in container state, but the Kubelet will miss the CONTAINER_STOPPED_EVENT for that container.
To validate the Kubelet's ability to recover the latest state of the container, the test should query the CRI endpoint to confirm that the container has run to completion successfully. Once the Kubelet is started again, it should be able to query the CRI runtime and update its cache with the latest state of the container. If the Kubelet accurately reports the state of the container as Completed, the test will be considered passed.
Currently, the Kubelet attempts to reconnect five times before falling back on Generic PLEG when it encounters errors on the streaming connection with the CRI Runtime. However, in situations where the CRI Runtime is taken down for maintenance, the Kubelet may exhaust all of its reconnection attempts and never try again, resulting in the use of Generic PLEG despite the CRI Runtime's compatibility with Evented PLEG. To address this issue, backoff logic with an exponentially increasing delay and an upper limit should be implemented for re-establishing the connection. Once the upper limit is reached, the Kubelet should keep retrying periodically at that interval. By doing so, the Kubelet will be able to reconnect to the CRI Runtime even after multiple attempts have failed, and it will be able to utilize Evented PLEG when possible. For example:
Retry immediately
Retry after 1 second
Retry after 2 seconds
Retry after 4 seconds
Retry after 8 seconds
Retry after 16 seconds
Retry after 32 seconds
Retry after 64 seconds
Retry every 60 seconds indefinitely
Make sure existing jobs in the following test grid tabs that use Generic PLEG continue to use it by making sure that Evented PLEG is disabled for them.
- https://testgrid.k8s.io/sig-node-release-blocking
- https://testgrid.k8s.io/sig-node-kubelet
- https://testgrid.k8s.io/sig-node-containerd
- https://testgrid.k8s.io/sig-node-cri-o
- https://testgrid.k8s.io/sig-node-presubmits
N/A
N/A.
Since this feature alters only the way kubelet determines the container statuses, this section is irrelevant to this feature.
- Feature gate (also fill in values in kep.yaml)
- Feature gate name: EventedPLEG
- Components depending on the feature gate: kubelet
- CRI runtime must enable/disable this feature as well for it to work properly.
This feature does not introduce any user-facing changes. However, after enabling it, users should notice reduced overhead in the kubelet and the CRI runtime.
Yes, kubelet needs to be restarted to disable this feature.
If reenabled, kubelet will again start updating container statuses using CRI events instead of relisting. Every time this feature is enabled or disabled, the kubelet will need to be restarted. Hence, the kubelet will start from a clean state.
These unit tests perform a health check on Evented PLEG.
This feature relies on the CRI runtime events to determine the container statuses. If the CRI runtime is not upgraded to a version which emits those CRI events before enabling this feature, the kubelet will not be able to determine the container statuses immediately. However, we aren't getting rid of the existing relisting altogether. So the kubelet should eventually reconcile the container statuses using relisting, albeit less frequently due to the increased relisting period that comes with this feature.
If users observe inconsistency in the container statuses reported by the kubelet and the CRI runtime (e.g. using a tool like crictl) after enabling this feature, they should consider rolling back the feature.
Apart from that, cluster admins can monitor the state of Evented PLEG's connection with the CRI runtime using the following metrics:
- evented_pleg_connection_error_count: The count of errors encountered during the establishment of the streaming connection with the CRI runtime.
- evented_pleg_connection_success_count: The count of successful streaming connections with the CRI runtime.
- evented_pleg_connection_latency_seconds: The latency of the streaming connection with the CRI runtime, measured in seconds.
- evented_pleg_notifications_received: The number of notifications received through the streaming connection with the CRI runtime.
The following scenarios were tested manually:
Scenario 1: Kubelet Upgrade without Corresponding CRI Runtime Upgrade
- Step 1: Kubelet is upgraded but CRI runtime remains unchanged. Kubelet falls back to using the Generic PLEG as the CRI runtime does not emit any CRI events.
- Step 2: Kubelet is downgraded, but the CRI runtime version remains the same. Kubelet continues to work with the existing Generic PLEG.
- Step 3: If the Kubelet is upgraded again, it behaves similarly to step 1.
Scenario 2: Kubelet and CRI Runtime Upgrade Together
- Step 1: Both the Kubelet and CRI runtime are upgraded. Since the CRI runtime emits CRI events, Kubelet uses the Evented PLEG with an increased relisting period for the Generic PLEG.
- Step 2: Kubelet and CRI runtime are downgraded. Kubelet defaults to using the Generic PLEG.
- Step 3: If the Kubelet is upgraded again, it behaves similarly to Scenario 1, Step 1.
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
No.
- Add a metric kube_pod_missed_events that describes when a pod changed state between relisting periods without a corresponding event.
- This is to catch situations where a CRI implementation is buggy and is not properly emitting events.
This feature is not going to be used directly by workloads. It is an optimization for how the kubelet determines container statuses.
However, users can use existing pod lifecycle related pod metrics such as kube_pod_start_time or kube_pod_completion_time and compare them with the timestamps reported in the CRI runtime (e.g. CRI-O or containerd) logs. The time difference must always be less than the relisting period.
- Events
- Event Reason:
- API .status
- Condition name:
- Other field:
- Other (treat as last resort)
- Details: In the kubelet logs look for PodLifecycleEvent getting generated from the received CRI runtime event. This is a good indicator that the feature is working.
- The time between pod status change and Kubelet reporting the pod status change must decrease on average from the current polling interval of 1 second.
- The number listed in the kube_pod_missed_events metric should remain low (ideally zero or at least near-zero).
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- Metrics
- Metric name: kube_pod_start_time
- Aggregation method: Compare against the start time reported in the CRI runtime logs.
- Components exposing the metric: Kubelet
- Metric name: kube_pod_completion_time
- Aggregation method: Compare against the container exit time reported in the CRI runtime logs.
- Components exposing the metric: Kubelet
- Other (treat as last resort)
- Details: Admins can also look for the PodLifecycleEvent getting generated from the received CRI runtime event in the kubelet logs. This is a good indicator that the feature is working.
Are there any missing metrics that would be useful to have to improve observability of this feature?
Kubelet already has metrics for the pod status update times (e.g. kube_pod_start_time and kube_pod_completion_time). But there is no standard metric emitted by the various CRI runtime implementations for the pod status update times. It would be ideal to have standard metrics for the container statuses emitted by all the CRI implementations.
- CRI Runtime
- CRI runtimes that are capable of emitting CRI events must be installed and running.
- Impact of its outage on the feature: Kubelet will detect the outage and fall back on the Generic PLEG with the default relisting period to make sure the pod statuses are updated correctly.
- Impact of its degraded performance or high-error rates on the feature:
- Any instability with the CRI runtime events stream that results in an error can be detected by the kubelet. Such an error will result in the kubelet falling back to the Generic PLEG with the default relisting period to make sure the pod statuses are updated in time.
- If the instability only takes the form of degraded performance and does not result in an error, the kubelet will not fall back to the Generic PLEG with the default relisting period and will continue to use the CRI runtime events stream. The changes proposed in the section "Pod Status update in the Cache" should help in handling this scenario.
- Kubelet should emit a metric kube_pod_missed_events when it detects pods changing state between relist periods not caught by an event.
No.
No.
No.
No.
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
No.
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
No.
Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
No.
Since this is a kubelet-specific feature, unavailability of the API server and/or etcd has no effect on it.
- Incorrect container statuses
- Detection: If the user notices that the container statuses reported by the kubelet are not consistent with the container statuses reported by the CRI runtime (e.g. using crictl), then this feature is failing.
- Mitigations: They will have to disable this feature and open an issue for further investigation.
- Diagnostics: CRI Runtime logs (such as cri-o or containerd) may not be consistent with the kubelet logs on container statuses.
- Missed events
- Detection: If there's a bug in the CRI implementation, it may miss events or not send them correctly. Kubelet will see this when the statuses are listed. It should emit a metric kube_pod_missed_events to quantify.
- Mitigations: The feature could be disabled or the relist frequency could be increased until the CRI implementation is fixed.
- Diagnostics: Increasing value of the kube_pod_missed_events metric coming from Kubelet.
Disabling this feature in the kubelet will revert to the existing relisting PLEG.
- Alpha(1.25)
- Beta(default false, 1.27)
- kubernetes/kubernetes#115967
- PR for presubmit Node e2e job - kubernetes/test-infra#28366
- PR for periodic Node e2e job - kubernetes/test-infra#28592
- v1.29 bugfix: kubernetes/kubernetes#120942
- Revert to Alpha(1.30): backported to v1.27.9, v1.28.6, v1.29.1, as there are known issues kubernetes/kubernetes#121349 and kubernetes/kubernetes#121003 that cause static pods to fail to start.
- revert PR kubernetes/kubernetes#122697
- v1.30 bugfix: kubernetes/kubernetes#122475
This KEP introduces changes to the kubelet PLEG, which is very core to the kubelet operation.
The Kubelet PLEG can be made to utilize the events from cadvisor as well. But we are trying to reduce the kubelet's dependency on cadvisor so that option is not viable. This is also discussed in the older enhancement in detail.