Skip to content

Latest commit

 

History

History

589-efficient-node-heartbeats

Folders and files

NameName
Last commit message
Last commit date

parent directory

..
 
 
 
 

Efficient Node Heartbeats

Table of Contents

Summary

Node heartbeats are necessary for correct functioning of Kubernetes cluster. This proposal makes them significantly cheaper from both scalability and performance perspective.

Motivation

While running different scalability tests we observed that in big enough clusters (more than 2000 nodes) with non-trivial number of images used by pods on all nodes (10-15), we were hitting etcd limits for its database size. That effectively means that etcd enters "alert mode" and stops accepting all write requests.

The underlying root cause is combination of:

  • etcd keeping both current state and transaction log with copy-on-write
  • node heartbeats being pontetially very large objects (note that images are only one potential problem, the second are volumes and customers want to mount 100+ volumes to a single node) - they may easily exceed 15kB; even though the patch send over network is small, in etcd we store the whole Node object
  • Kubelet sending heartbeats every 10s

This proposal presents a proper solution for that problem.

Note that currently (by default):

  • Lack of NodeStatus update for <node-monitor-grace-period> (default: 40s) results in NodeController marking node as NotReady (pods are no longer scheduled on that node)
  • Lack of NodeStatus updates for <pod-eviction-timeout> (default: 5m) results in NodeController starting pod evictions from that node

We would like to preserve that behavior.

Goals

  • Reduce size of etcd by making node heartbeats cheaper

Non-Goals

The following are nice-to-haves, but not primary goals:

  • Reduce resource usage (cpu/memory) of control plane (e.g. due to processing less and/or smaller objects)
  • Reduce watch-related load on Node objects

Proposal

We propose introducing a new Lease built-in API in the newly create API group coordination.k8s.io. To make it easily reusable for other purposes it will be namespaced. Its schema will be as following:

type Lease struct {
  metav1.TypeMeta `json:",inline"`
  // Standard object's metadata.
  // More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
  // +optional
  ObjectMeta metav1.ObjectMeta `json:"metadata,omitempty"`

  // Specification of the Lease.
  // More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#spec-and-status
  // +optional
  Spec LeaseSpec `json:"spec,omitempty"`
}

type LeaseSpec struct {
  HolderIdentity       string           `json:"holderIdentity"`
  LeaseDurationSeconds int32            `json:"leaseDurationSeconds"`
  AcquireTime          metav1.MicroTime `json:"acquireTime"`
  RenewTime            metav1.MicroTime `json:"renewTime"`
  LeaseTransitions     int32            `json:"leaseTransitions"`
}

The Spec is effectively of already existing (and thus proved) LeaderElectionRecord. The only difference is using MicroTime instead of Time for better precision. That would hopefully allow us go get directly to Beta.

We will use that object to represent node heartbeat - for each Node there will be a corresponding Lease object with Name equal to Node name in a newly created dedicated namespace (we considered using kube-system namespace but decided that it's already too overloaded). That namespace should be created automatically (similarly to "default" and "kube-system", probably by NodeController) and never be deleted (so that nodes don't require permission for it).

We considered using CRD instead of built-in API. However, even though CRDs are the new way for creating new APIs, they don't yet have versioning support and are significantly less performant (due to lack of protobuf support yet). We also don't know whether we could seamlessly transition storage from a CRD to a built-in API if we ran into a performance or any other problems. As a result, we decided to proceed with built-in API.

With this new API in place, we will change Kubelet so that:

  1. Kubelet is periodically computing NodeStatus every 10s (at it is now), but that will be independent from reporting status
  2. Kubelet is reporting NodeStatus if:
    • there was a meaningful change in it (initially we can probably assume that every change is meaningful, including e.g. images on the node)
    • or it didn’t report it over last node-status-update-period seconds
  3. Kubelet creates and periodically updates its own Lease object and frequency of those updates is independent from NodeStatus update frequency.

In the meantime, we will change NodeController to treat both updates of NodeStatus object as well as updates of the new Lease object corresponding to a given node as healthiness signal from a given Kubelet. This will make it work for both old and new Kubelets.

We should also:

  1. audit all other existing core controllers to verify if they also don’t require similar changes in their logic (ttl controller being one of the examples)
  2. change controller manager to auto-register that Lease CRD
  3. ensure that Lease resource is deleted when corresponding node is deleted (probably via owner references)
  4. [out-of-scope] migrate all LeaderElection code to use that CRD

Once all the code changes are done, we will:

  1. start updating Lease object every 10s by default, at the same time reducing frequency of NodeStatus updates initially to 40s by default. We will reduce it further later. Note that it doesn't reduce frequency by which Kubelet sends "meaningful" changes - it only impacts the frequency of "lastHeartbeatTime" changes.
  2. announce that we are going to reduce frequency of NodeStatus updates further and give people 1-2 releases to switch their code to use Lease object (if they relied on frequent NodeStatus changes)
  3. further reduce NodeStatus updates frequency to not less often than once per 1 minute. We can’t stop periodically updating NodeStatus as it would be API breaking change, but it’s fine to reduce its frequency (though we should continue writing it at least once per eviction period).

Based on experiments, unsuccessful attempt to enable Server Side Apply due to performance issues it was causing, in v1.17 release, we will:

  • reduce frequency of NodeStatus update to once every 5 minutes instead of 1 minute (note that if any condition changes, the update will happen within next 10 seconds anyway, so we are only talking about periodic heartbeats)
  • reduce frequency of updating NodeStatuses in NodeProblemDetector also to 5 minutes (to keep those two frequencies in sync) (similarly as Kubelet, if any condition changes, update will happen within next 30s as it is now)

Obviously, in both cases, we are only talking about default values, which may be changed by cluster operators if needed. We don't want to increase it further, as we would like to keep at least one NodeStatus update at least every eviction period (which is defaulted to 5 minutes).

This change will help with enabling server-side-apply for all operation and will give more performance slack in largest supported clusters. As an example, assuming average node object size 20KB (which happens in production clusters), that will jump to ~30KB with server-side-apply feature, in 5k-node clusters, we are reducing etcd write throughput from 150MB/min to 30MB/min. Note that this isn't the only gain, because we also save on sending watch events, in kube-apiserver, and so on.

Additionally, we will modify kubectl describe to show the information from the corresponding Lease object.

With those two changes and positive user-feedback we already got, we will also graduate the feature to GA in 1.17 too.

Other notes:

  1. Additional advantage of using Lease for that purpose would be the ability to exclude it from audit profile and thus reduce the audit logs footprint.

Risks and Mitigations

Increasing default frequency of NodeStatus updates may potentially break clients relying on frequent Node object updates. However, in non-managed solutions, customers will still be able to restore previous behavior by setting appropriate flag values. Thus, changing defaults to what we recommend is the path to go with.

To evaluate the risk of further reduction of NodeStatus updates, I went through couple thousands of files mentioning LastHeartbeatTime and k8s on github to understand the potential of breaking some system. There are couple places that are incompatible with node leases, but they would already be broken by current 1m frequency, e.g.:

That said, it sounds better to update the frequency before going to GA.

Testing Plan

There is a set of dedicated end-to-end tests added for that feature excercising:

  • whether Lease object is being created and update by Kubelet (gce-cos-master-default)
  • whether Kubelet is reducing frequency of node status updates appropriately (gce-cos-master-default)
  • whether Lease object is deleted on node deletion (gce-cos-master-serial)

Additionally, if the feature gate is switched on, all existing test suites are implicitly testing behavior of this feature, as this is then the signal for healthiness of nodes.

Additionally, the main benefit from this feature is obviously performance and scalability. For this purpose, as part of all scalability tests, we are additionally measuring maximum etcd database size.

Graduation Criteria

The API can be immediately promoted to Beta, as the API is effectively a copy of already existing LeaderElectionRecord. It will be promoted to GA once it's gone a sufficient amount of time as Beta with no changes.

The changes in components logic (Kubelet, NodeController) should be done behind a feature gate. We suggest making that enabled by default once the feature is implemented.

Beta:

  • Confirmed scalability/performance gain: decreased of total etcd size by 2x+ on 5k-node clusters and no drop in any other scalability SLIs (in fact we observed decrease in API call latencies by up to 20-30% for some resources). Verified on both real clusters and Kubemark.

GA:

  • Enabled by default for a release with no complaints.
  • Frequency of NodeStatus defaulted to 5m
  • kubectl describe node showing information from corresponding Lease object

Implementation History

  • v1.11: KEP Summary, Motivation and Proposal merged
  • v1.13: Feature launched to Alpha (default: off)
  • v1.14: Feature launched to Beta (default: on)
  • v1.16: Minor improvements based on user feedback (e.g. 80429, 81174)

Alternatives

We considered a number of alternatives, most important mentioned below.

Dedicated “heartbeat” object instead of “leader election” one

Instead of introducing and using “lease” object, we considered introducing a dedicated “heartbeat” object for that purpose. Apart from that, all the details about the solution remain pretty much the same.

Pros:

  • Conceptually easier to understand what the object is for

Cons:

  • Introduces a new, narrow-purpose API. Lease is already used by other components, implemented using annotations on Endpoints and ConfigMaps.

Events instead of dedicated heartbeat object

Instead of introducing a dedicated object, we considered using “Event” object for that purpose. At the high-level the solution looks very similar. The differences from the initial proposal are:

  • we use existing “Event” api instead of introducing a new API
  • we create a dedicated namespace; events that should be treated as healthiness signal by NodeController will be written by Kubelets (unconditionally) to that namespace
  • NodeController will be watching only Events from that namespace to avoid processing all events in the system (the volume of all events will be huge)
  • dedicated namespace also helps with security - we can give access to write to that namespace only to Kubelets

Pros:

  • No need to introduce new API
    • We can use that approach much earlier due to that.
  • We already need to optimize event throughput - separate etcd instance we have for them may help with tuning
  • Low-risk roll-forward/roll-back: no new objects is involved (node controller starts watching events, kubelet just reduces the frequency of heartbeats)

Cons:

  • Events are conceptually “best-effort” in the system:
    • they may be silently dropped in case of problems in the system (the event recorder library doesn’t retry on errors, e.g. to not make things worse when control-plane is starved)
    • currently, components reporting events don’t even know if it succeeded or not (the library is built in a way that you throw the event into it and are not notified if that was successfully submitted or not). Kubelet sending any other update has full control on how/if retry errors.
    • lack of fairness mechanisms means that even when some events are being successfully send, there is no guarantee that any event from a given Kubelet will be submitted over a given time period So this would require a different mechanism of reporting those “heartbeat” events.
  • Once we have “request priority” concept, I think events should have the lowest one. Even though no particular heartbeat is important, guarantee that some heartbeats will be successfully send it crucial (not delivering any of them will result in unnecessary evictions or not-scheduling to a given node). So heartbeats should be of the highest priority. OTOH, node heartbeats are one of the most important things in the system (not delivering them may result in unnecessary evictions), so they should have the highest priority.
  • No core component in the system is currently watching events
    • it would make system’s operation harder to explain
  • Users watch Node objects for heartbeats (even though we didn’t recommend it). Introducing a new object for the purpose of heartbeat will allow those users to migrate, while using events for that purpose breaks that ability. (Watching events may put us in tough situation also from performance reasons.)
  • Deleting all events (e.g. event etcd failure + playbook response) should continue to not cause a catastrophic failure and the design will need to account for this.

Reuse the Component Registration mechanisms

Kubelet is one of control-place components (shared controller). Some time ago, Component Registration proposal converged into three parts:

  • Introducing an API for registering non-pod endpoints, including readiness information: #18610
  • Changing endpoints controller to also watch those endpoints
  • Identifying some of those endpoints as “components”

We could reuse that mechanism to represent Kubelets as non-pod endpoint API.

Pros:

  • Utilizes desired API

Cons:

  • Requires introducing that new API
  • Stabilizing the API would take some time
  • Implementing that API requires multiple changes in different components

Split Node object into two parts at etcd level

We may stick to existing Node API and solve the problem at storage layer. At the high level, this means splitting the Node object into two parts in etcd (frequently modified one and the rest).

Pros:

  • No need to introduce new API
  • No need to change any components other than kube-apiserver

Cons:

  • Very complicated to support watch
  • Not very generic (e.g. splitting Spec and Status doesn’t help, it needs to be just heartbeat part)
  • [minor] Doesn’t reduce amount of data that should be processed in the system (writes, reads, watches, …)

Delta compression in etcd

An alternative for the above can be solving this completely at the etcd layer. To achieve that, instead of storing full updates in etcd transaction log, we will just store “deltas” and snapshot the whole object only every X seconds/minutes.

Pros:

  • Doesn’t require any changes to any Kubernetes components

Cons:

  • Computing delta is tricky (etcd doesn’t understand Kubernetes data model, and delta between two protobuf-encoded objects is not necessary small)
  • May require a major rewrite of etcd code and not even be accepted by its maintainers
  • More expensive computationally to get an object in a given resource version (which is what e.g. watch is doing)

Replace etcd with other database

Instead of using etcd, we may also consider using some other open-source solution.

Pros:

  • Doesn’t require new API

Cons:

  • We don’t even know if there exists solution that solves our problems and can be used.
  • Migration will take us years.