Configuring scheduler via json configuration file #4674
Conversation
Thanks for your pull request. It looks like this may be your first contribution to a Google open source project, in which case you'll need to sign a Contributor License Agreement (CLA) at https://cla.developers.google.com/. If you've already signed a CLA, it's possible we don't have your GitHub username or you're using a different email address. Check the information on your CLA or see this help article on setting the email on your git commits. Once you've done that, please reply here to let us know. If you signed the CLA as a corporation, please let us know the company's name.
@bgrant0607 @davidopp @mikedanese Please review
Assigning to @bgrant0607, feel free to re-assign.
@smarterclayton PTAL - this incorporates the feedback that you provided.
@abhgupta can you rebase? It looks like this doesn't merge cleanly anymore.
Force-pushed from 21f4c71 to 2ddb069.
I have some configuration examples written up that I'll submit once this PR goes through some feedback/review. Sample JSON configuration:
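The original sample was not captured in this extract; below is a minimal sketch of what a policy file in this format might look like, assuming the JSON schema introduced by this PR and predicate/priority names taken from the default provider of that era (treat both the schema and the names as assumptions rather than the author's actual example):

```json
{
  "kind": "Policy",
  "apiVersion": "v1",
  "predicates": [
    {"name": "PodFitsPorts"},
    {"name": "PodFitsResources"},
    {"name": "NoDiskConflict"},
    {"name": "MatchNodeSelector"},
    {"name": "HostName"}
  ],
  "priorities": [
    {"name": "LeastRequestedPriority", "weight": 1},
    {"name": "ServiceSpreadingPriority", "weight": 1}
  ]
}
```

The intent is that the listed predicates must all pass for a node to be considered, while the priorities are weighted scoring functions applied to the nodes that remain.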
```diff
@@ -37,9 +37,9 @@ func affinityPredicates() util.StringSet {
 		"PodFitsResources",
 		"NoDiskConflict",
 		// Ensures that all pods within the same service are hosted on minions within the same region as defined by the "region" label
-		factory.RegisterFitPredicate("ServiceAffinity", algorithm.NewServiceAffinityPredicate(factory.PodLister, factory.ServiceLister, factory.MinionLister, []string{"region"})),
+		factory.RegisterFitPredicate("RegionAffinity", algorithm.NewServiceAffinityPredicate(factory.PodLister, factory.ServiceLister, factory.MinionLister, []string{"region"})),
```
What do "Region" and "Zone" mean in this context? We do not advocate running a single Kubernetes cluster across multiple availability zones.
The current solution is flexible and uses labels to define topological levels. The levels do not themselves have any meaning within the system, and the user is free to ascribe any meaning to them. The labels could be regions/zones/racks or building/server-room/rack.
Given that cross-region deployments are not advocated, I am leaning towards simply removing the Affinity provider completely and just providing configuration examples in the documentation to educate users on the possible use cases.
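For illustration only, a documentation example along the following lines could express label-driven affinity. The "serviceAffinity" argument shape and the "rack" label key are hypothetical placeholders for this sketch, not something defined by this PR:

```json
{
  "kind": "Policy",
  "apiVersion": "v1",
  "predicates": [
    {"name": "PodFitsResources"},
    {"name": "NoDiskConflict"},
    {
      "name": "RackAffinity",
      "argument": {"serviceAffinity": {"labels": ["rack"]}}
    }
  ],
  "priorities": [
    {"name": "LeastRequestedPriority", "weight": 1}
  ]
}
```

The same pattern would apply to any label keys the admin chooses, such as building/server-room/rack, since the scheduler attaches no intrinsic meaning to them.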
I'm supportive of spreading by arbitrary labels. A common use case on bare metal is spreading across racks (power and network domain) and bus bars (power domain), for example.
I would just prefer not to use region and zone as examples, nor as defaults.
Another possibility is that we could support a few generic topology levels (level1, level2, level3, ..., level6) by default.
Since generic nested topological levels have been implemented, I would be wary of taking a step back and restricting the number of nested levels.
And I agree: I will remove the affinity provider for now and have affinity/anti-affinity at the different levels defined generically in the documentation.
I didn't mean to imply that we should restrict the number of levels.
@abhgupta That sounds reasonable, please ping me when you've made the change you described.
@danmcp: Do you need anything here?
@bgrant0607 Since you mentioned @danmcp, I would just like to get your quick input on something. If it turns out to be a more involved question, I'll create a separate issue.
Dan had asked me why we used services instead of replication controllers for identifying peer pods for spreading. The argument is that services might not be needed and could be avoided in some cases, whereas a replication controller will almost always be present. Technically, you could have a set of individual pods placed behind a service, or pods from multiple replication controllers placed behind a service, but what is the most convenient/practical approach?
Thoughts?
@bgrant0607 I would just like to clarify the statements around not running Kubernetes across zones. I believe the statement is that you should run Kubernetes across a small enough area that you can afford to lose (at least the master tier, for some time period) and that has low enough network latency for master <-> node and node <-> node communication for your applications. For some deployments that might mean the cluster can't cross zones. For others, crossing zones would be desirable for higher availability of applications without having to maintain multiple Kubernetes clusters. And the more highly available Kubernetes masters become, the more popular cross-zone clusters are likely to be for smaller deployments.
@bgrant0607 @davidopp The configuration file is something that the admin will place and make accessible to the scheduler. Typically, I would assume this would be handled by configuration management tools like puppet/chef/ansible/salt/etc. Either way, the configuration file is optional and the DefaultProvider comes into play, ensuring that this does not become a gating factor for simple use cases.
Sorry for the delay, taking a look now.
```go
	if err != nil {
		return nil, fmt.Errorf("Unable to read policy config: %v", err)
	}
	//err = json.Unmarshal(configData, &policy)
```
Please remove before submitting.
At a high level I think this PR is great, my main concern is that config files are almost certainly not the way we want to go long-term for component configs. We want something more like #1627, where we use config objects in the store. Unfortunately that effort has fallen victim to a bit of analysis paralysis and I don't see it happening before 1.0 (though I will ping that issue). It would not be hard to adapt what you've done here to use a config source other than file, but I'm worried about people who might start using the file-based config if/when we switch to a different config mechanism.
Force-pushed from a524824 to 2766657.
@bgrant0607 @davidopp I have made the other changes based on the review feedback. Ideally I would like to get this PR merged soon, since the current enhancements to predicates/priorities are not consumable without it. Getting this PR in would allow some much needed testing of the non-default scheduler predicates/priorities. Are we blocking this PR in its current form, or do we agree that it can be merged now and converted to secrets later?
I think we can convert to the new config mechanism later, yes. If we implement the secret-like approach, the data will be populated in a volume, so this mechanism might not even require changes.
@abhgupta @danmcp: On why to use services instead of replication controllers for identifying peer pods for spreading: this is discussed in #1965 and #2312. Replication controllers are associated with deployments. We expect there to be multiple per service: canaries, rolling updates, multiple release tracks, etc. What people want is spreading of the whole service. Custom spreading is discussed in #4301 and #367.
@danmcp: Yes, I agree with your description.
LGTM
Force-pushed from eba9270 to add2868.
@davidopp PTAL - I have added test cases that accept a JSON config for the scheduler. This is now ready to merge.
LGTM |
@davidopp I modified the test case to remove the dependency on the algorithm provider. Importing the "defaults" algorithm provider was creating an import loop and all I needed for the test case was to test the combination of configurable and pre-defined predicates/priorities. |
Configuring scheduler via json configuration file
This PR introduces the capability to configure the scheduler via a JSON file. The API is versioned. This is the implementation for issue #4303.