Communication between Kubernetes components #6363
+1 to anything that will make the reasoning simple. The current state, where multiple components can write exactly the same thing, makes reasoning about system semantics very hard (to put it lightly). |
cc @liggitt
|
And types of communication: To node:
From node:
|
Thanks for filing the issue. Give me two seconds before you start discussing it, so that I can actually write it up. :) |
cc @zmerlynn, and I suspect @roberthbailey for SSLness |
|
[Note: I am likely to update this text as I think of more things, so please read this issue through the web, not email.]

I think we need to talk about how we structure communication between components in Kubernetes. I'm not necessarily proposing a change pre-1.0; this is more of a longer-term idea.

Currently, all communication between components is through objects that are (1) persistent, and (2) considered to be part of the external API, which is stable/slow-evolving, versioned with long deprecation periods, and subject to multiple levels of review, debate, and approval for changes. I believe this one-size-fits-all model is bad for performance/scalability and bad for the project's velocity.

I'm proposing an additional, lighter-weight inter-component communication mechanism, which has neither property (1) nor (2), to be used in situations where the existing model is deemed inappropriate. (Exactly what those circumstances are is TBD, but it would only be used for some subset of the internal communication between Kubernetes components.) This model is a very small tweak to the existing model. The only difference is that the objects it uses would not be persisted. We would exploit this fact to allow APIs that use these objects to evolve more quickly. But everything would still go through PUT/POST/GET to the API server.

Let me give you a concrete example of how it would work. Let's say we move NodeStatus to this model. Let's also assume the API server is sharded, to alleviate concerns that this approach doesn't work with sharding. A node would POST/PUT a NodeStatus object to its API server shard. The API server (or NodeController) would do whatever business-logic processing it wants, and if there is some state that clients need to know, and/or that all of the API server (or NodeController) replicas need to know, and/or that is needed for crash recovery, it would write a new object with just that information to etcd. For example, we might have a UserVisibleNodeStatus object that is the one external clients read; it contains a processed subset of the information from NodeStatus (and perhaps from other objects as well) and can be updated less frequently. (Please don't dwell too much on my use of NodeStatus as the example. I think this approach applies to inter-component communication in Kubernetes more generally.)

Here are the advantages I see:
The only gotcha I can see is that we'd need to implement some kind of replacement for etcd watch, to allow components to watch these ephemeral objects. I think we wouldn't need something like ResourceVersion if we assume each of these ephemeral objects only has one writer. |
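To make the idea concrete, here is a minimal Go sketch of what a non-persisted object plus an in-memory watch could look like. The names (ephemeralStore, the NodeStatus/UserVisibleNodeStatus fields, persistIfChanged) are invented for illustration, not anything that exists in the codebase: the kubelet's full NodeStatus lives only in apiserver memory, watchers are fed from channels instead of etcd watch, and only a processed subset is ever written to etcd.

```go
package ephemeral

import (
	"sync"
	"time"
)

// Ephemeral object reported by the kubelet; never persisted.
type NodeStatus struct {
	NodeName   string
	Ready      bool
	ReportedAt time.Time
	// ... fast-changing fields omitted
}

// Processed, slow-changing subset that external clients read; persisted to etcd.
type UserVisibleNodeStatus struct {
	NodeName string
	Ready    bool
}

// ephemeralStore keeps the latest NodeStatus per node in memory and fans
// updates out to watchers, standing in for etcd watch on these objects.
type ephemeralStore struct {
	mu       sync.Mutex
	latest   map[string]NodeStatus
	watchers []chan NodeStatus
}

func newEphemeralStore() *ephemeralStore {
	return &ephemeralStore{latest: map[string]NodeStatus{}}
}

// Put records the newest status and notifies watchers. With a single writer
// per node, last-write-wins is enough; no ResourceVersion-style concurrency
// control is sketched here.
func (s *ephemeralStore) Put(st NodeStatus) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.latest[st.NodeName] = st
	for _, w := range s.watchers {
		select {
		case w <- st: // deliver if the watcher is keeping up
		default: // otherwise drop; watchers can re-list from s.latest
		}
	}
}

// Watch returns a channel of updates, replacing etcd watch for objects that
// are never written to etcd.
func (s *ephemeralStore) Watch() <-chan NodeStatus {
	ch := make(chan NodeStatus, 16)
	s.mu.Lock()
	s.watchers = append(s.watchers, ch)
	s.mu.Unlock()
	return ch
}

// persistIfChanged is where the NodeController (or apiserver) would write the
// user-visible subset to etcd, but only when it actually changed.
func persistIfChanged(prev *UserVisibleNodeStatus, st NodeStatus, persist func(UserVisibleNodeStatus)) *UserVisibleNodeStatus {
	next := UserVisibleNodeStatus{NodeName: st.NodeName, Ready: st.Ready}
	if prev == nil || *prev != next {
		persist(next)
	}
	return &next
}
```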
I don't see an issue with evolving internal APIs at a different pace and with different guarantees. Virtual resources make a lot of sense when we want to specialize the use case.
|
I am not sure I see the point if that is the only difference. All the heavyweight process is also for maintaining API compatibility, which you still need-- clusters have to upgrade. |
When someone says "heavyweight process" w.r.t. API design, I mentally substitute "discipline" :)
|
It might be replicated, but that's not sharding. I would propose that the storage layer (etcd today) is the place where we do sharding, should that be necessary. Otherwise every client needs to be aware of the sharding details, or we have to put in another layer. |
This is an argument for changing the characteristics of our storage backend, IMO.
Not true. See our binding object, which is not (directly) persisted.
I guess I just don't see how that would actually cash out in a simplified compatibility matrix. |
[Note: copied from internal email with some modification] @davidopp I agree with you that requiring every message to be persisted is an issue for our performance and scalability (#5953). On the kubelet side, to cope with that rule and avoid too many writes, the kubelet only posts PodStatus on change (which is good anyway). But we use NodeStatus as a ping message today to indicate whether a node is alive, which is quite expensive since it has to be posted at high frequency. Related to today's issue caused by NodeStatus and NodeController, there are several ways we could potentially solve it if we loosened requirement (1), that all communication between components is through persistent objects:
|
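For illustration only, here is a rough Go sketch (hypothetical types and helpers, not the kubelet's actual code) of the difference between the report-on-change pattern described above and a fixed-interval heartbeat, which forces a write every period even when nothing changed.

```go
package statusreport

import (
	"reflect"
	"time"
)

// Stand-in for a status object reported by the kubelet.
type PodStatus struct {
	Phase   string
	Message string
}

// reportOnChange sends a status upstream only when it differs from the last
// one sent, so steady state generates no writes.
func reportOnChange(poll func() PodStatus, send func(PodStatus), interval time.Duration) {
	var last *PodStatus
	for {
		cur := poll()
		if last == nil || !reflect.DeepEqual(*last, cur) {
			send(cur)
			last = &cur
		}
		time.Sleep(interval)
	}
}

// heartbeat, by contrast, sends on every tick regardless of change; if each
// send is persisted to etcd, the cost grows with cluster size and frequency.
func heartbeat(poll func() PodStatus, send func(PodStatus), interval time.Duration) {
	for {
		send(poll())
		time.Sleep(interval)
	}
}
```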
If we don't want the master to initiate contact with the Kubelet and we move to this model, I'd suggest adding a control channel where the master can respond to a ping message with a request for the full NodeStatus. This would give the master the ability to proactively get current state without needing to change anything in the Kubelet. |
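A sketch of that control channel, with invented message types (Ping, PingReply, FullStatus), might look like the following: the node sends a cheap ping on a schedule, and the master's reply can set a flag asking for a full NodeStatus.

```go
package controlchan

import "time"

// Lightweight liveness message sent by the node.
type Ping struct {
	NodeName string
	SentAt   time.Time
}

// Master's reply to a ping.
type PingReply struct {
	// When true, the node should follow up with a full NodeStatus report.
	RequestFullStatus bool
}

// Expensive full report, sent only on demand or on change.
type FullStatus struct {
	NodeName string
	// ... full node state omitted
}

// heartbeatLoop pings the master on a schedule and sends the expensive full
// status only when explicitly asked for.
func heartbeatLoop(name string, interval time.Duration,
	ping func(Ping) PingReply, sendFull func(FullStatus)) {
	for {
		reply := ping(Ping{NodeName: name, SentAt: time.Now()})
		if reply.RequestFullStatus {
			sendFull(FullStatus{NodeName: name})
		}
		time.Sleep(interval)
	}
}
```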
I'd prefer that we discuss details of our current NodeStatus problems elsewhere. This proposal is intended to be generic.

I don't think "changing the characteristics of our storage backend" is going to solve the problem that we force every message between Kubernetes components to hit disk. Sure, we can run our key-value store on a huge distributed SSD-based cluster to improve performance, but why? Some data really is ephemeral and is just a message between two components, not meant for public scrutiny or long-term storage. Why build a crazy-complicated storage system to store things that have no reason to be stored in the first place, instead of just identifying the things that don't need to be stored, and not storing them? And why require a full API sign-off for changing the way two components communicate with each other, if it's not meant for public consumption?

I do think API evolution is related. If you have something that is just an ephemeral message between two components and you can guarantee that the components will be updated atomically, then you don't have to worry about API evolution because you can just restart them together. Even if you can't guarantee atomic update, the fact that they don't use persistent objects for communication means you at least don't have to worry about rewriting on-disk data formats for the objects they use for communicating. And because they're internal components, even in the non-atomic case you only have to support one version of forward/backward compatibility--that is, knowing that two components will be upgraded "approximately together" makes things simpler than having to support multiple versions on each end.

I think this proposal kills two birds with one stone: it addresses performance problems we know we're going to have, and by distinguishing internal APIs from external APIs it gives us more flexibility in evolving the former (I admit that the benefit is perhaps even more psychological than technical). |
BTW see #3247 for an example of where someone was going to use events as an inter-component communication mechanism. This issue is going to come up many times and I think we should try to support it in a reasonable way rather than forcing everything to be a full persisted user-facing API object. |
For example, you could say that version N of component A will only talk to version N of component B (i.e. no inter-version compatibility). Even if they don't run on the same machine, as long as you upgrade them approximately together, you can have whichever one comes up first just block until it's talking to a compatible peer. So the total downtime is limited to the time skew between upgrading the components on the two machines. This doesn't work when you have to do rolling upgrade (e.g. you can't have NodeController just block until all Kubelets have been upgraded to a new version) but it does work in the case of master components where there is only a single logical instance of each component. |
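As a sketch of that "block until compatible" behavior under the assumed version-N-to-version-N rule (the names and the peer-version lookup are illustrative, not an existing Kubernetes API):

```go
package versiongate

import (
	"fmt"
	"time"
)

const myVersion = "N" // stand-in for a build-time version string

// waitForCompatiblePeer polls the peer's advertised version and refuses to
// proceed until it matches ours, bounding downtime to the skew between the
// two components' upgrades.
func waitForCompatiblePeer(peerVersion func() (string, error), retry time.Duration) error {
	for {
		v, err := peerVersion()
		if err == nil && v == myVersion {
			return nil
		}
		if err == nil {
			fmt.Printf("peer is at %q, want %q; waiting\n", v, myVersion)
		}
		time.Sleep(retry)
	}
}
```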
Replying piecemeal: I agree we will want some API resources that are stored in separate etcd instances, or aren't necessarily stored in etcd, or potentially not persisted to disk at all. I've insisted that resource usage collected from nodes be kept out of core API objects for that exact reason: |
I am one of the people who thinks we shouldn't hide "internal" APIs: But I do think we should group APIs into multiple API prefixes that can be independently versioned: #3806, #635 |
In Borg, we started with a lot of state not stored persistently and then persisted more and more of it over time, as an inter-component communication medium, as an archival medium, and for stability across master restarts and elections. In Omega, I enforced a single-writer rule for each object. This meant no true write conflicts, but had other consequences, such as more objects. Joins were never fully realized, which is one reason why Kubernetes started with unified objects containing both spec and status. |
If we had API plugins #991, we could potentially create API endpoints that more directly dispatched to controllers. |
Nowhere. The principle is sound though. We can also expose bulk endpoints if need be when the time comes.
|
I don't object to bulk endpoints, but what efficiency do we hope to gain by it? We definitely need to profile/trace before we do that. I don't believe that it will help, and it could even hurt by serializing operations that could otherwise be performed in parallel. In order to be of benefit, batching would need to reduce the amount of data sent over the wire and/or reduce decoding and other work in the apiserver. If etcd supported batching, that could help amortize transaction overheads, but we could send mutations in batches without changing our API. |
One benefit would be using speedy on larger chunks - we'd probably see much higher compression (keys and common values) for bulk updates than for singletons.
|
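As a rough illustration of that compression argument (not from the thread), the snippet below compares compressing 100 status updates individually versus as one bulk payload. It uses github.com/golang/snappy purely as a stand-in for whatever compressor "speedy" refers to; the point is that repeated JSON keys and common values only pay their cost once per batch.

```go
package main

import (
	"bytes"
	"fmt"

	"github.com/golang/snappy"
)

func main() {
	item := []byte(`{"kind":"NodeStatus","ready":true,"capacity":{"cpu":"4","memory":"16Gi"}}`)

	// Compress each update on its own.
	singles := 0
	for i := 0; i < 100; i++ {
		singles += len(snappy.Encode(nil, item))
	}

	// Compress the same 100 updates as one bulk payload.
	var batch bytes.Buffer
	for i := 0; i < 100; i++ {
		batch.Write(item)
	}
	bulk := len(snappy.Encode(nil, batch.Bytes()))

	fmt.Printf("100 singletons: %d bytes, one batch: %d bytes\n", singles, bulk)
}
```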
What do we want to do with this? It seems that some of the things brought up are already implemented, but not all issues mentioned in the first entry are solved. |
Let's keep this issue open for future discussion. |
+1 |
Update:
Unresolved issues remain (in no particular order):
|
It's also the case that unstable, non-backward-compatible APIs are hostile to extensibility. That's tantamount to declaring that nobody else in the entire Kubernetes community should be able to extend or replace the client or server without replacing/forking both. That's contradictory to our goals and design principles. To me, that suggests the communication point and/or API abstraction were not chosen correctly. |
To be clear, what I'm proposing is that we be able to create components that offer only an "internal-only" API. A component that exports an "internal-only" API would be part of the system that people aren't expected to extend or replace independently. (Of course, they could submit PRs that we would upstream into these components.) We can put restrictions on these components like: these API objects are never persisted, and the component must only communicate with a client if the client is of an expected version. I agree that you absolutely would have to upgrade the component and its client component(s) together; this is what relieves you from having to worry about forward/backward compatibility. The upgrade doesn't have to be atomic, but the component would refuse to talk to a client until the client is upgraded.

You could argue that wanting to be able to quickly evolve an API in incompatible ways is a sign that "the communication point and/or API abstraction were not chosen correctly." But it's often hard to choose these kinds of things correctly up-front; it's useful to have a balance between "prototype and iterate" and making sure you get it right the first time.

I think this is basically a philosophical argument about whether, in an open-source, extensible toolkit, the notion of "private" APIs is questionable. Clearly there are private interfaces within components, for example the interface between the controller manager and controllers, or the way you write a plugin (for our various types of plugins). I'm just suggesting that we also be able to have private interfaces between components, in some limited situations. |
@bgrant0607 and I are going to discuss this further offline. |
But to clarify one thing I said earlier -- the mechanism I'm describing isn't really for "prototype and iterate." Having alpha API versions and the experimental API prefix already gives you that capability. It's about components that you want to have a private/internal API "forever." |
Every time we create such interdependencies, we also make Kubernetes harder to deploy and upgrade. Many people don't use our /cluster implementation. We also discussed offline that we could do what we've suggested with the scheduler and its proposed extension API: that when we broke compatibility, WE would fork both sides of the API, so that we wouldn't break users who were dependent on one side or the other. |
I'll also point out that while our external dependencies (Heapster, InfluxDB, etcd, Docker, cAdvisor) have different API conventions, they maintain backward compatibility. |
Closing due to age. |
Conversation started by @davidopp due to complexities arising from node-to-master communication (e.g., #6193, #6285, #6077, #6063, #6052, #5953).
Overall communication issues:
I'll wait for @davidopp to fill in his proposal.
cc @erictune @lavalamp @thockin @dchen1107 @smarterclayton @gmarek