Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement alpha version of PreferAvoidPods #20699

Merged

Conversation

jiangyaoguo
Copy link
Contributor

@jiangyaoguo jiangyaoguo commented Feb 5, 2016

This is part of #18853


This change is Reviewable

@k8s-github-robot k8s-github-robot added kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Feb 5, 2016
@jiangyaoguo jiangyaoguo force-pushed the add-AvoidPreviousNode branch from c9bc79d to 449d956 Compare February 5, 2016 09:33
@k8s-github-robot
Copy link

The author of this PR is not in the whitelist for merge, can one of the admins add the 'ok-to-merge' label?

@k8s-github-robot
Copy link

PR needs rebase

@k8s-github-robot k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 5, 2016
@thockin thockin assigned davidopp and unassigned thockin Feb 5, 2016
@jiangyaoguo jiangyaoguo force-pushed the add-AvoidPreviousNode branch from 449d956 to 25e45b4 Compare February 6, 2016 01:29
@k8s-github-robot k8s-github-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 6, 2016
@k8s-github-robot
Copy link

PR needs rebase

@k8s-github-robot k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 6, 2016
@davidopp
Copy link
Member

davidopp commented Feb 7, 2016

@@ -2229,6 +2229,11 @@ type DeleteOptions struct {
// specified type will be used.
// Defaults to a per object value if not specified. zero means delete immediately.
GracePeriodSeconds *int64 `json:"gracePeriodSeconds"`

// If true, when deleting the pod managed by a replication controller, scheduler will try to put the replacement
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Slightly different wording that might be a bit clearer: "If the object to be deleted is a pod managed by a controller, setting to true requests that the replacement pod that the controller creates (if the controller decides to create such a pod) be scheduled to a different node than the one where it is currently running. Otherwise has no effect."

@smarterclayton
Copy link
Contributor

smarterclayton commented Feb 7, 2016 via email

@davidopp
Copy link
Member

davidopp commented Feb 7, 2016

Here are the two comments I think you're referring to:

#18853 (comment)
#18853 (comment)

I interpreted these comments to mean that anything beyond what @jiangyaoguo was a "nice to have."

Do we foresee other Pod-specific delete options? It doesn't seem so bad to just put in a comment "has no effect when applied to non-Pod" unless there are going to be multiple other scenarios like this.

cc/ @lavalamp

@smarterclayton
Copy link
Contributor

We can always change the schema in the future to take away the flag since
it would have no effect, but we wouldn't be able to be stricter in
validating incoming requests, because the field would be accepted by
version 1.2 (even if ignored). So it paints us into a corner a little bit
more than today.

On Sun, Feb 7, 2016 at 2:55 PM, David Oppenheimer notifications@github.com
wrote:

Here are the two comments I think you're referring to:

#18853 (comment)
#18853 (comment)
#18853 (comment)
#18853 (comment)

I interpreted these comments to mean that anything beyond what
@jiangyaoguo https://github.com/jiangyaoguo was a "nice to have."

Do we foresee other Pod-specific delete options? It doesn't seem so bad to
just put in a comment "has no effect when applied to non-Pod" unless there
are going to be multiple other scenarios like this.

cc/ @lavalamp https://github.com/lavalamp


Reply to this email directly or view it on GitHub
#20699 (comment)
.

@jiangyaoguo jiangyaoguo force-pushed the add-AvoidPreviousNode branch from 25e45b4 to 162c832 Compare February 8, 2016 03:06
@k8s-github-robot k8s-github-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 8, 2016
@k8s-github-robot
Copy link

@jiangyaoguo
You must link to the test flake issue which caused you to request this manual re-test.
Re-test requests should be in the form of: k8s-bot test this issue: #<number>
Here is the list of open test flakes.

@k8s-github-robot
Copy link

PR needs rebase

@k8s-github-robot k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 8, 2016
@lavalamp
Copy link
Member

lavalamp commented Feb 9, 2016

I would really prefer to make a specific PodDeletionOptions object. How urgent is this?

@davidopp davidopp changed the title Add AvoidPreviousNodeto api.DeleteOptions Add AvoidPreviousNode to api.DeleteOptions Feb 10, 2016
@davidopp
Copy link
Member

Quick notes from conversation with @lavalamp today:

  • The fact that AvoidPreviousNode was requested in the api.DeleteOptions needs to be persisted somewhere by the API server, presumably an annotation on the pod that it then deletes. An alternative would be to just have the client who requests the delete add the annotation before doing the delete, and not bother with adding this option to DeleteOptions at all.
  • More controversial alternatives
    • have a new API resource that records the information; it would be available to all controllers (though it's not clear that other controllers than the one that created the pod need the information), and could eventually generalize to a RateLimiter-type object that records pod-machine pairs to avoid
    • similar to previous sub-bullet but put the information in the Node object rather than a new object

I'm pretty opposed to the "more controversial alternatives." It seems overly complicated, at least in the sense that associating the "avoid previous machine" information with the pod that is being deleted avoids any concern about garbage collecting the information, as compared to storing it in some global structure (as in either of the two "more controversial alternatives").

I'm on the fence about the first bullet.

Urgency: This is not super-urgent; it blocks deployment of rescheduler, but we haven't even started implementing rescheduler.

@jiangyaoguo jiangyaoguo force-pushed the add-AvoidPreviousNode branch from 162c832 to 8dd3bbd Compare February 15, 2016 06:27
@k8s-github-robot k8s-github-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 15, 2016
@jiangyaoguo jiangyaoguo force-pushed the add-AvoidPreviousNode branch from d409168 to 1dacdd3 Compare July 4, 2016 12:53
@k8s-github-robot k8s-github-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 4, 2016
@jiangyaoguo
Copy link
Contributor Author

@davidopp rebased.

@k8s-github-robot k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 8, 2016
@davidopp
Copy link
Member

davidopp commented Jul 9, 2016

LGTM

Very sorry for the delay in reviewing. Please rebase one more time, and then I will add LGTM label.

Thanks for your work on this!

k8s-github-robot pushed a commit that referenced this pull request Jul 10, 2016
Automatic merge from submit-queue

[WIP/RFC] Rescheduling in Kubernetes design proposal

Proposal by @bgrant0607 and @davidopp (and inspired by years of discussion and experience from folks who worked on Borg and Omega).

This doc is a proposal for a set of inter-related concepts related to "rescheduling" -- that is, "moving" an already-running pod to a new node in order to improve where it is running. (Specific concepts discussed are priority, preemption, disruption budget, quota, `/evict` subresource, and rescheduler.)

Feedback on the proposal is very welcome. For now, please stick to comments about the design, not spelling, punctuation, grammar, broken links, etc., so we can keep the doc uncluttered enough to make it easy for folks to comment on the more important things. 

ref/ #22054 #18724 #19080 #12611 #20699 #17393 #12140 #22212

@HaiyangDING @mqliang @derekwaynecarr @kubernetes/sig-scheduling @kubernetes/huawei @timothysc @mml @dchen1107
@jiangyaoguo jiangyaoguo force-pushed the add-AvoidPreviousNode branch from 1dacdd3 to abd777f Compare July 12, 2016 05:43
@davidopp
Copy link
Member

Please squash the commits to one commit.

@jiangyaoguo jiangyaoguo force-pushed the add-AvoidPreviousNode branch from abd777f to 658ed48 Compare July 13, 2016 08:16
@k8s-github-robot k8s-github-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 13, 2016
@davidopp
Copy link
Member

@jiangyaoguo your new commit has a compiler error:

# k8s.io/kubernetes/plugin/pkg/scheduler/algorithm/priorities
plugin/pkg/scheduler/algorithm/priorities/priorities.go:307: nodes.Items undefined (type []*"k8s.io/kubernetes/pkg/api".Node has no field or method Items)
plugin/pkg/scheduler/algorithm/priorities/priorities.go:314: nodes.Items undefined (type []*"k8s.io/kubernetes/pkg/api".Node has no field or method Items)

I didn't look carefully, but it might be that you need to update your code to incorporate the change made in #28781

@jiangyaoguo
Copy link
Contributor Author

@davidopp I encounter some network problem now. Will update soon.

…fic node.

1. define PreferAvoidPods annotation
2. add PreferAvoidPodsPriority
3. validate AvoidPods in node annotations
@jiangyaoguo jiangyaoguo force-pushed the add-AvoidPreviousNode branch from 658ed48 to 4e91166 Compare July 13, 2016 09:13
@k8s-bot
Copy link

k8s-bot commented Jul 13, 2016

GCE e2e build/test passed for commit 4e91166.

@jiangyaoguo
Copy link
Contributor Author

@davidopp

return nodePreferAvoid.CalculateNodePreferAvoidPodsPriority
}

func (npa *NodePreferAvoidPod) CalculateNodePreferAvoidPodsPriority(pod *api.Pod, nodeNameToInfo map[string]*schedulercache.NodeInfo, nodeLister algorithm.NodeLister) (schedulerapi.HostPriorityList, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@davidopp @jiangyaoguo - do we know the preformance impact of it? Since you are switching it on by default, it would be good to know...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, we haven't tested performance impact. What would you think of the following:

  1. for now, at the beginning of the function put a loop over all nodes, that checks to see if
api.GetAvoidPodsFromNodeAnnotations(node.Annotations)

has a value for any of the nodes. If it is not set for any nodes, the function immediately returns a HostPiorityList with all nodes having same priority. This way the significant performance impact only happens if you actually use the feature, and we can warn users about it.

  1. hopefully ownerReference will be implemented for ReplicaSet and ReplicationController by the time of code freeze for 1.4 (RC is already in the final stages of review, will merge very soon), and then we can use it instead of using rcs, err := npa.controllerLister.GetPodControllers(pod) (and similar one for ReplicaSets).

Does that sound reasonable?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So let's do this way:

  1. let's merge as it is now (basically in selector spreading we are still calling all the functions that we have here, so it shouldn't be worse)
  2. Once this is merged, I will do a pass over it and fix some issues that I already fixed for other priorities/predicates
  3. once ControllerRef is ready, I will switch to that (together with selector_spreading)

[Generally, the asymptotical complexity of it is the same as selector spreading, We should probably solve both of them in the same way in the future.]

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM. Thanks!

@davidopp davidopp added release-note Denotes a PR that will be considered when it comes time to generate release notes. lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Jul 14, 2016
@davidopp
Copy link
Member

LGTM

@k8s-github-robot
Copy link

@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]

@k8s-github-robot
Copy link

@k8s-bot test this issue: #IGNORE

Tests have been pending for 24 hours

@k8s-bot
Copy link

k8s-bot commented Jul 14, 2016

GCE e2e build/test passed for commit 4e91166.

@k8s-github-robot
Copy link

Automatic merge from submit-queue

@k8s-github-robot k8s-github-robot merged commit 06939c5 into kubernetes:master Jul 14, 2016
@k8s-bot
Copy link

k8s-bot commented Jul 14, 2016

GCE e2e build/test passed for commit 4e91166.

@vishh
Copy link
Contributor

vishh commented Jun 21, 2017

@davidopp Is this feature documented anywhere?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.