
Add support to modify precomputed predicate metadata upon adding/removal of a pod #50805

Merged: 2 commits merged into kubernetes:master on Aug 28, 2017

Conversation

@bsalamat (Member):

What this PR does / why we need it: This PR adds the capability to change precomputed predicate metadata, letting us add/remove pods to/from the precomputed metadata efficiently without recomputing everything upon the addition/removal of a pod. This PR is needed as a part of adding preemption logic to the scheduler.
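For context, a rough sketch of the surface this PR adds. AddPod's signature is taken from the diffs reviewed below; RemovePod's exact signature and the interface grouping are assumptions for illustration:

// Sketch only. AddPod's signature matches the diff reviewed below;
// RemovePod's signature and the interface grouping are assumed.
type predicateMetadataModifier interface {
  // AddPod updates the precomputed metadata as if addedPod were added
  // to the node described by nodeInfo.
  AddPod(addedPod *v1.Pod, nodeInfo *schedulercache.NodeInfo) error
  // RemovePod updates the precomputed metadata as if deletedPod were
  // removed from the system.
  RemovePod(deletedPod *v1.Pod) error
}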

Which issue this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close that issue when PR gets merged): fixes #

Special notes for your reviewer:
To make the review process a bit easier, there are three commits. The cleanup commit is only moving code and renaming some functions, without logic changes.

Release note:

NONE

ref/ #47604
ref/ #48646

/assign @wojtek-t

@kubernetes/sig-scheduling-pr-reviews @davidopp

@k8s-ci-robot k8s-ci-robot added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Aug 16, 2017
@k8s-github-robot k8s-github-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. release-note-none Denotes a PR that doesn't merit a release note. labels Aug 16, 2017
@bsalamat bsalamat force-pushed the preemption_metacompute branch 2 times, most recently from a8b6194 to 3c59233 on August 17, 2017 02:27
@wojtek-t (Member) left a comment:

I skimmed through the PR and it seems reasonable to me.
I will try to review it in the upcoming days.

podBestEffort bool
podRequest *schedulercache.Resource
podPorts map[int]bool
matchingAntiAffinityTerms map[types.UID][]matchingPodAntiAffinityTerm //key is a pod UID with the anti-affinity rule.
Member:

nit: please avoid such long lines

podRequest *schedulercache.Resource
podPorts map[int]bool
matchingAntiAffinityTerms map[types.UID][]matchingPodAntiAffinityTerm //key is a pod UID with the anti-affinity rule.
serviceAffinityInUse bool
Member:

Why do we need this bool?
If the two slices below are nulls, it should have the same effect as setting this to false. In that case, I would prefer to remove this to avoid growing this structure infinitely - it's already pretty big.

Member Author:

This bool tells us whether the service affinity predicate is actually enabled. The affinity predicate may be enabled, but the two slices may still be empty because the pod matches no services and no other pods with the same labels exist.

deletedPodIndex := -1
for i, pod := range meta.serviceAffinityMatchingPodList {
if pod.GetUID() == deletedPod.GetUID() {
deletedPodIndex = i
Member:

This is overcomplicated. You should simply move the body of the if below up here and get rid of "deletedPodIndex".

Member Author:

Done. I was trying to avoid interfering with the for loop iterator, but it doesn't matter, as we break right after removing the element.
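The simplified loop (shown as it reads after the later switch from pod UID to pod full name) becomes roughly:

for i, pod := range meta.serviceAffinityMatchingPodList {
  if schedutil.GetPodFullName(pod) == deletedPodFullName {
    // Remove the element in place; mutating the slice is safe here
    // because we break immediately afterwards.
    meta.serviceAffinityMatchingPodList = append(
      meta.serviceAffinityMatchingPodList[:i],
      meta.serviceAffinityMatchingPodList[i+1:]...)
    break
  }
}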


// AddPod changes predicateMetadata assuming that `newPod` is added to the
// system.
func (meta *predicateMetadata) AddPod(addedPod *v1.Pod, nodeInfo *schedulercache.NodeInfo) error {
Member:

Do we really need this method?

My personal feeling is that RemovePod is the only thing you need to implement preemption. Why do you want to add pods?

Member Author:

We do need this. Preemption first removes all the lower-priority pods and, after adding the pending pod, tries to reprieve as many of the removed pods as possible. This method is useful when adding the pending pod and when adding those removed pods back.
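A rough sketch of the flow being described; only AddPod and RemovePod are from this PR, while the surrounding helper and its name are hypothetical:

// simulatePreemption is a hypothetical illustration of the described flow.
func simulatePreemption(meta *predicateMetadata, pending *v1.Pod,
  nodeInfo *schedulercache.NodeInfo, victims []*v1.Pod) error {
  // Remove all lower-priority victims from the precomputed metadata.
  for _, p := range victims {
    if err := meta.RemovePod(p); err != nil {
      return err
    }
  }
  // Add the pending pod.
  if err := meta.AddPod(pending, nodeInfo); err != nil {
    return err
  }
  // Then try to reprieve victims one by one, re-running the predicates
  // against the incrementally updated metadata and calling AddPod for
  // each pod that can stay.
  return nil
}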

// Add matching anti-affinity terms of the addedPod to the map.
if podMatchingTerms, err := getMatchingAntiAffinityTermsOfExistingPod(meta.pod, addedPod, nodeInfo.Node()); err == nil {
if len(podMatchingTerms) > 0 {
meta.matchingAntiAffinityTerms[addedPodUID] = append(meta.matchingAntiAffinityTerms[addedPodUID], podMatchingTerms...)
Member:

This shouldn't be append in my opinion - there should be exactly one pod with a given UID - if there is already something for the uid, this seems like a bug.

Member Author:

There is going to be only one pod, but that one pod may have a few matching anti-affinity terms.

return false
}
}
return true
Member:

This only works for lists without duplicates. It will return true for [1, 1, 2] and [1, 2, 2] (which may or may not be a problem in this case, but it definitely deserves a comment).

Member Author:

Deleted the function. It was unused.

// predicateMetadataEquivalent returns true if the two metadata are equivalent.
// Note: this function does not compare podRequest.
func predicateMetadataEquivalent(meta1, meta2 *predicateMetadata) error {
if meta1.pod != meta2.pod {
Member:

Should we really compare pointers to pods here?

Member:

Shouldn't it be DeepEqual?

Member Author:

In this test environment where we pass the same pod pointer to the metadata producers, this is fine. I changed it to DeepEqual anyway to avoid confusion.

@bsalamat (Member Author) left a comment:

Thanks a lot @wojtek-t for your prompt review.

@bsalamat bsalamat force-pushed the preemption_metacompute branch 5 times, most recently from d466832 to 29a8d2e on August 18, 2017 03:23
@bsalamat (Member Author):

@wojtek-t Other than addressing the comments, I also changed the key of the anti-affinity terms from pod UID to pod fullname, in a separate commit. Pod UID was not consistent with the rest of the codebase and made writing tests a bit more difficult.
Could you please take another look?
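For reference, the scheduler's pod full name is simply the pod name and namespace joined into one string; a sketch of the helper:

// GetPodFullName returns a name that uniquely identifies a pod.
// Underscore is a safe delimiter because it cannot appear in a pod name.
func GetPodFullName(pod *v1.Pod) string {
  return pod.Name + "_" + pod.Namespace
}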

@bsalamat (Member Author):

ping @wojtek-t. Could you please take another look at your earliest convenience? All of my other PRs to add preemption depend on this one.

@wojtek-t (Member):

Yes, sorry. Will take a look at it today.

@wojtek-t (Member) left a comment:

I added some comments.

Also please squash commits - I'm reviewing the whole PR anyway, because I never remember what I did before :P

podPorts map[int]bool
//key is a pod full name with the anti-affinity rules.
matchingAntiAffinityTerms map[string][]matchingPodAntiAffinityTerm
serviceAffinityInUse bool
Member:

This bool tells us whether the service affinity predicate is actually enabled. The affinity predicate may be enabled, but the two slices may still be empty because the pod matches no services and no other pods with the same labels exist.

Correct. But in that case, if both slices are empty, the results will be exactly the same whether this bool is true or false.
So basically, what I'm saying is that removing it doesn't change anything and will make the code a bit simpler :)

Member Author:

Without this bool, we would still need to process a new pod in AddPod even when the two slices are empty; the bool helps avoid that.
Since service affinity is not one of the default predicates and is not enabled in most deployments, we would like to avoid wasting CPU cycles processing the rules unnecessarily.

Member:

OK - the argument with AddPod is reasonable.

SGTM

// So, if the namespace of the first one is not the same as the namespace of the
// deletedPod, we don't need to check the list, as deletedPod isn't in the list.
if meta.serviceAffinityInUse &&
len(meta.serviceAffinityMatchingPodList) > 0 &&
Member:

This condition is not needed - if the length is zero, the for inside the if will simply be a no-op.
Let's remove it.

Member:

[In Go you can safely iterate over nil slices].

Member Author:

The check is there to protect against indexing an empty slice in the same "if" condition.

deletedPod.Namespace == meta.serviceAffinityMatchingPodList[0].Namespace {
for i, pod := range meta.serviceAffinityMatchingPodList {
if schedutil.GetPodFullName(pod) == deletedPodFullName {
meta.serviceAffinityMatchingPodList = append(meta.serviceAffinityMatchingPodList[:i], meta.serviceAffinityMatchingPodList[i+1:]...)
Member:

nit: please split this line - it's too long now.

Member Author:

Done

// Add matching anti-affinity terms of the addedPod to the map.
if podMatchingTerms, err := getMatchingAntiAffinityTermsOfExistingPod(meta.pod, addedPod, nodeInfo.Node()); err == nil {
if len(podMatchingTerms) > 0 {
meta.matchingAntiAffinityTerms[addedPodFullName] = append(meta.matchingAntiAffinityTerms[addedPodFullName], podMatchingTerms...)
Member:

nit: please split this line

Member Author:

Done

}
// Add matching anti-affinity terms of the addedPod to the map.
if podMatchingTerms, err := getMatchingAntiAffinityTermsOfExistingPod(meta.pod, addedPod, nodeInfo.Node()); err == nil {
if len(podMatchingTerms) > 0 {
Member:

Please remove this if - it's not needed; you can do it unconditionally:

package main

import (
	"fmt"
)

func main() {
	a := []string{"aa"}
	var b []string
	c := []string{}
	d := []string{"ee"}
	
	x := append(a, b...)
	x = append(x, c...)
	x = append(x, d...)
	fmt.Printf("%#v", x)
}

Member Author:

Done

Member Author:

When running the tests, the entry may not exist in the map, which is why I had added that initial check. I changed the code to make it clearer.

return fmt.Errorf("podPorts are not equal.")
}
}
for k1, v1 := range meta1.matchingAntiAffinityTerms {
Member:

This is a test - we don't need to be super efficient here; simpler code is more important.

So instead of that, we can compare two slices by:

  • sorting both of them
  • and then calling DeepEqual

[Yes - I know you need to write a comparator for sorting, but this method will be so much simpler then.]

The same below for the service-affinity things.

Member Author:

It might be, but the sort interface needs three functions, and it would have to be done for the service affinity as well. We should also be careful that the comparison function almost never reports elements as equal: the order of equal elements in the expected list may not match what the tests generate, which could cause false positives. Given all this, sorting and doing DeepEqual may well need even more code.

allPodLister := schedulertesting.FakePodLister(append(test.existingPods, test.addedPod))
// getMeta creates predicate meta data given the list of pods.
getMeta := func(lister schedulertesting.FakePodLister) (*predicateMetadata, map[string]*schedulercache.NodeInfo) {
nodeInfoMap := map[string]*schedulercache.NodeInfo{}
Member:

Please split this into a separate utility function (in this file).

Member:

I mean splitting out the part computing nodeInfoMap.

I think that we used to have such a function in the past, but I can't find it now...

Member Author:

I found the function and used it.

@@ -1105,7 +1077,8 @@ func getMatchingAntiAffinityTerms(pod *v1.Pod, nodeInfoMap map[string]*scheduler
return
}
if priorityutil.PodMatchesTermsNamespaceAndSelector(pod, namespaces, selector) {
nodeResult = append(nodeResult, matchingPodAntiAffinityTerm{term: &term, node: node})
existingPodFullName := schedutil.GetPodFullName(existingPod)
nodeResult[existingPodFullName] = append(nodeResult[existingPodFullName], matchingPodAntiAffinityTerm{term: &term, node: node})
Member:

nit: please split the line

Member Author:

Done

if predicateMeta, ok := meta.(*predicateMetadata); ok {
matchingTerms = predicateMeta.matchingAntiAffinityTerms
} else {
allPods, err := c.podLister.List(labels.Everything())
filteredPods, err := c.podLister.FilteredList(nodeInfo, labels.Everything())
Member:

those actually are "allPods" - labels.Everything won't filter anything, right?

Member Author:

They are filtered by nodeInfo. Only the pods that exist in the nodeInfo are returned here.

// PodFilter is a simple interface that helps filter out pods.
// This is here only to avoid circular dependency between schedulercache and
// algorithm. Ideally, it should have been in algorithm/types.go.
type PodFilter interface {
Member:

I'm not a fan of this approach. I would prefer defining:

type PodFilter func(*v1.Pod) bool

and passing nodeInfo.Filter explicitly to FilteredList.

Or do you have any strong reason not to do that?

Member Author:

No strong reason. Done as you suggested.
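The resulting shape, per the suggestion (a sketch; the FilteredList call matches the diff further down):

// PodFilter is a function that returns true if the pod should be considered.
type PodFilter func(*v1.Pod) bool

// Callers then pass a concrete filter explicitly, e.g.:
//   filteredPods, err := c.podLister.FilteredList(nodeInfo.Filter, labels.Everything())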

@bsalamat bsalamat force-pushed the preemption_metacompute branch from 29a8d2e to 0b5fcaf on August 22, 2017 19:16
@bsalamat (Member Author) left a comment:

Thanks a lot, @wojtek-t!
Please take another look.

@bsalamat bsalamat force-pushed the preemption_metacompute branch from 0b5fcaf to c818b69 on August 22, 2017 21:02
@bsalamat (Member Author):

/retest

@wojtek-t (Member) left a comment:

I added another portion of comments. Most of them are relatively minor; there is no bigger one.

podPorts map[int]bool
//key is a pod full name with the anti-affinity rules.
matchingAntiAffinityTerms map[string][]matchingPodAntiAffinityTerm
serviceAffinityInUse bool
Member:

OK - the argument with AddPod is reasonable.

SGTM

if found {
existingTerms = append(existingTerms, podMatchingTerms...)
} else {
if len(podMatchingTerms) > 0 {
Member:

This should be checked in the outer if - there is no point in doing all those operations if this is empty.

if len(podMatchingTerms) > 0 {
  if existingTerms, found := meta.matchingAntiAffinityTerms[addedPodFullName]; found {
    existingTerms = append(existingTerms, podMatchingTerms...)
  } else {
    meta.matchingAntiAffinityTerms[addedPodFullName] = podMatchingTerms
  }
}

Member Author:

Done

return fmt.Errorf("Invalid node in nodeInfo.")
}
// Add matching anti-affinity terms of the addedPod to the map.
if podMatchingTerms, err := getMatchingAntiAffinityTermsOfExistingPod(meta.pod, addedPod, nodeInfo.Node()); err == nil {
Member:

would prefer to limit the number of nested ifs, so that the code is easier to follow (and more standard):

podMatchingTerms, err := getMatchingAntiAffinityTermsOfExistingPod(meta.pod, addedPod, nodeInfo.Node())
if err != nil {
  return err
}
// The rest of the code.

Member Author:

Done

// of matching pods if applicable.
if meta.serviceAffinityInUse && addedPod.Namespace == meta.pod.Namespace {
selector := CreateSelectorFromLabels(meta.pod.Labels)
if selector.Matches(labels.Set(addedPod.Labels)) {
Member:

Yeah - sorry.

This predicate surprises me every time I'm seeing it....

for !reflect.DeepEqual(meta1.podPorts, meta2.podPorts) {
return fmt.Errorf("podPorts are not equal.")
}
for k1, v1 := range meta1.matchingAntiAffinityTerms {
Member:

It might be, but the sort interface needs three functions, and it would have to be done for the service affinity as well. We should also be careful that the comparison function almost never reports elements as equal: the order of equal elements in the expected list may not match what the tests generate, which could cause false positives. Given all this, sorting and doing DeepEqual may well need even more code.

I agree that sort needs 3 functions. But those are super trivial functions and well separated.

And yes - I agree that in total this might be a bit more code, but it will be better structured and easier to follow.
This function now has 70 lines of code and it's hard to follow.
And what if we add a few more fields to predicateMetadata? We may end up with a comparison function hundreds of lines long.

If you really don't want to do the sorting (though I think it's easier and less error prone), at the very least please extract comparing those slices into individual functions, with a TODO to simplify them.

Member Author:

Alright. Done. As expected there is more code now, but I agree that it is probably easier to follow.

func (c *PodAffinityChecker) getMatchingAntiAffinityTerms(pod *v1.Pod, allPods []*v1.Pod) (map[string][]matchingPodAntiAffinityTerm, error) {
result := make(map[string][]matchingPodAntiAffinityTerm)
for _, existingPod := range allPods {
existingPodNode, err := c.info.GetNodeInfo(existingPod.Spec.NodeName)
Member:

This may affect performance: with this change you're computing it for every pod, whereas before we were doing it only for pods with affinity/anti-affinity.

I think we should change it back to how it was.

Member Author:

Good point. Added the check back.

if err != nil {
return nil, err
}
existingPodFullName := schedutil.GetPodFullName(existingPod)
Member:

if len(existingPodMatchingTerms) > 0 {
...
}

Member Author:

Done

podName(pod), node.Name, term.term)
return false
for _, terms := range matchingTerms {
for _, term := range terms {
Member:

This will result in creating a copy of each term.

Instead you should do:

for i := range terms {
  term := &terms[i]
  ...
}

Member Author:

I changed it as you suggested. However, each "term" contains only two pointers, so copying it is more or less as efficient as taking the address; but taking the address is better if the struct grows in the future.
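For reference, a self-contained sketch of the applied pattern; the helper name and callback are illustrative, not from the PR:

// anyTermMatches indexes into the slice so each element's address is
// taken instead of copying the struct on every iteration.
func anyTermMatches(matchingTerms map[string][]matchingPodAntiAffinityTerm,
  matches func(*matchingPodAntiAffinityTerm) bool) bool {
  for _, terms := range matchingTerms {
    for i := range terms {
      if matches(&terms[i]) { // no per-element copy
        return true
      }
    }
  }
  return false
}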

@@ -106,6 +106,20 @@ func (cache *schedulerCache) List(selector labels.Selector) ([]*v1.Pod, error) {
return pods, nil
}

func (cache *schedulerCache) FilteredList(podFilter PodFilter, selector labels.Selector) ([]*v1.Pod, error) {
Member:

Instead of duplicating the code please do:

func (cache *schedulerCache) List(selector labels.Selector) ([]*v1.Pod, error) {
  return cache.filteredList(alwaysTrue, selector)
}

func (cache *schedulerCache) FilteredList(podFilter PodFilter, selector labels.Selector) ([]*v1.Pod, error) {
  return cache.filteredList(podFilter, selector)
}

func (cache *schedulerCache) filteredList(podFilter PodFilter, selector labels.Selector) ([]*v1.Pod, error) {
  // as your FilteredList is now
}

Member Author:

Done, but I didn't add filteredList; I just called FilteredList from List. Please let me know if you think adding filteredList is needed.
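i.e., something like this sketch of what was done (the always-true filter is assumed):

func (cache *schedulerCache) List(selector labels.Selector) ([]*v1.Pod, error) {
  // Delegate to FilteredList with a filter that passes every pod.
  alwaysTrue := func(p *v1.Pod) bool { return true }
  return cache.FilteredList(alwaysTrue, selector)
}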

if predicateMeta, ok := meta.(*predicateMetadata); ok {
matchingTerms = predicateMeta.matchingAntiAffinityTerms
} else {
allPods, err := c.podLister.List(labels.Everything())
filteredPods, err := c.podLister.FilteredList(nodeInfo.Filter, labels.Everything())
Member:

Wait - this is completely changing the logic.

With nodeInfo.Filter you will only get pods from a given node, not all pods. That is not correct in my opinion.

Member Author:

Please note that nodeInfo.Filter "passes" all the pods whose nodeName is not the same as the NodeInfo node name. It only filters out pods that are on this node (according to their nodeName) but are not present in NodeInfo.Pods. I added a comment to clarify that.
Basically, we use this function to simulate removal of pods from the system: we create a copy of the real NodeInfo, remove pods from the copy, and then pass the copy to this function.
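A minimal sketch of the described Filter semantics (field and helper names are assumed):

// Filter returns true for pods that should be considered: pods bound to
// other nodes always pass; pods that claim this node pass only if they
// are still present in this NodeInfo's pod list, so removals made on a
// NodeInfo copy take effect.
func (n *NodeInfo) Filter(pod *v1.Pod) bool {
  if pod.Spec.NodeName != n.node.Name {
    return true
  }
  for _, p := range n.pods {
    if p.Name == pod.Name && p.Namespace == pod.Namespace {
      return true
    }
  }
  return false
}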

@bsalamat (Member Author) left a comment:

Thanks, @wojtek-t! PTAL.

// of matching pods if applicable.
if meta.serviceAffinityInUse && addedPod.Namespace == meta.pod.Namespace {
selector := CreateSelectorFromLabels(meta.pod.Labels)
if selector.Matches(labels.Set(addedPod.Labels)) {
Member Author:

Me too. I was actually telling davidopp the same thing several days ago.

@bsalamat bsalamat force-pushed the preemption_metacompute branch 2 times, most recently from fcf2d0d to dc57ce8 on August 24, 2017 04:59
@wojtek-t (Member) left a comment:

A couple of minor comments. I will most probably LGTM once they are applied. Thanks for bearing with me!

if len(podMatchingTerms) > 0 {
existingTerms, found := meta.matchingAntiAffinityTerms[addedPodFullName]
if found {
meta.matchingAntiAffinityTerms[addedPodFullName] = append(existingTerms, podMatchingTerms...)
Member:

nit: please split the line

Member Author:

Done.

// of matching pods if applicable.
if meta.serviceAffinityInUse && addedPod.Namespace == meta.pod.Namespace {
selector := CreateSelectorFromLabels(meta.pod.Labels)
if selector.Matches(labels.Set(addedPod.Labels)) {
Member:

In fact, I think we should get rid of this predicate. It's unfortunate that we can't, due to backward compatibility reasons...

if meta.serviceAffinityInUse && addedPod.Namespace == meta.pod.Namespace {
selector := CreateSelectorFromLabels(meta.pod.Labels)
if selector.Matches(labels.Set(addedPod.Labels)) {
meta.serviceAffinityMatchingPodList = append(meta.serviceAffinityMatchingPodList, addedPod)
Member:

nit: please split the line

Member Author:

Do we have any guidelines on the length of lines? Our code-base is full of lines which are way longer than this. I almost never break lines to be consistent with the rest of the code-base.

)

// sortableAntiAffinityTerms lets us sort anti-affinity terms.
type sortableAntiAffinityTerms struct {
Member:

You can simply define this as:

type sortableAntiAffinityTerms []matchingPodAntiAffinityTerm

Will be simpler. Same for pods below.

Member Author:

Yes, done!
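With the slice-based definition, the three sort.Interface methods become near one-liners; a sketch, with the full Less ordering refined in the threads below:

type sortableAntiAffinityTerms []matchingPodAntiAffinityTerm

func (s sortableAntiAffinityTerms) Len() int      { return len(s) }
func (s sortableAntiAffinityTerms) Swap(i, j int) { s[i], s[j] = s[j], s[i] }
func (s sortableAntiAffinityTerms) Less(i, j int) bool {
  // First sort key only; the remaining tie-breakers are discussed below.
  return s[i].node.Name < s[j].node.Name
}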

}

func (s *sortablePods) Less(i, j int) bool {
if s.Items[i].Name < s.Items[j].Name {
Member:

This isn't necessarily a good sorting function.
You can have two pods with the same name but in different namespaces, and those aren't equal.

Should be:

return s[i].Namespace < s[j].Namespace || (s[i].Namespace == s[j].Namespace && s[i].Name < s[j].Name)

Member Author:

Good point. Done

}

func (s *sortableServices) Less(i, j int) bool {
if s.Items[i].Name < s.Items[j].Name {
Member:

Same here.

Member Author:

Done

func (s *sortableAntiAffinityTerms) Less(i, j int) bool {
t1, t2 := s.Items[i], s.Items[j]
return t1.node.Name < t2.node.Name ||
(t1.node.Name == t2.node.Name && len(t1.term.Namespaces) < len(t2.term.Namespaces)) ||
Member:

This is a bit unfortunate.
To make it a bit more readable, we can try changing it to:

if t1.node.Name != t2.node.Name {
  return t1.node.Name < t2.node.Name
}
// and now compare the rest similarly

It's effectively the same, but simpler to follow.

Member Author:

Done

for !reflect.DeepEqual(meta1.podPorts, meta2.podPorts) {
return fmt.Errorf("podPorts are not equal.")
}
sortAntiAffinityTerms(meta1.matchingAntiAffinityTerms)
Member:

Thanks - it's much easier to follow now!

@bsalamat bsalamat force-pushed the preemption_metacompute branch from dc57ce8 to eb5c179 on August 24, 2017 19:11
@bsalamat (Member Author) left a comment:

Thanks, @wojtek-t! PTAL.

// of matching pods if applicable.
if meta.serviceAffinityInUse && addedPod.Namespace == meta.pod.Namespace {
selector := CreateSelectorFromLabels(meta.pod.Labels)
if selector.Matches(labels.Set(addedPod.Labels)) {
Member Author:

Yes. Luckily it is not enabled by default.

@wojtek-t (Member) left a comment:

The last minor nit.

Please squash the last commit and I will approve.

if t1.node.Name != t2.node.Name {
return t1.node.Name < t2.node.Name
}
if t1.node.Name == t2.node.Name && len(t1.term.Namespaces) != len(t2.term.Namespaces) {
Member:

You don't need to check whether the node.Names are equal - they are, otherwise we would have returned in the above if.

Same for everything below.

Member Author:

Done
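After dropping the redundant equality checks, the comparator reads roughly like this sketch (later tie-breakers elided):

func (s sortableAntiAffinityTerms) Less(i, j int) bool {
  t1, t2 := s[i], s[j]
  if t1.node.Name != t2.node.Name {
    return t1.node.Name < t2.node.Name
  }
  if len(t1.term.Namespaces) != len(t2.term.Namespaces) {
    return len(t1.term.Namespaces) < len(t2.term.Namespaces)
  }
  // ...remaining tie-breakers follow the same pattern...
  return false
}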

@bsalamat bsalamat force-pushed the preemption_metacompute branch 2 times, most recently from 357d669 to 0e777e9 on August 25, 2017 15:26
@bsalamat (Member Author):

Thanks a lot @wojtek-t for taking a lot of time to review this PR.

@bsalamat bsalamat force-pushed the preemption_metacompute branch 4 times, most recently from 06ebd08 to 8dad409 on August 25, 2017 20:07
@davidopp (Member):

Thanks a ton for reviewing this @wojtek-t

@bsalamat bsalamat force-pushed the preemption_metacompute branch from 8dad409 to df3796f on August 28, 2017 07:02
@bsalamat bsalamat force-pushed the preemption_metacompute branch from df3796f to 87d4065 on August 28, 2017 07:13
@wojtek-t (Member):

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 28, 2017
@k8s-github-robot:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bsalamat, wojtek-t

Associated issue: 47604

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@k8s-github-robot k8s-github-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 28, 2017
@fejta-bot:

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to @fejta).

Review the full test history for this PR.

@k8s-github-robot:

/test all [submit-queue is verifying that this PR is safe to merge]

@k8s-github-robot:

Automatic merge from submit-queue

@k8s-github-robot k8s-github-robot merged commit 4ba2b62 into kubernetes:master Aug 28, 2017
@bsalamat bsalamat deleted the preemption_metacompute branch August 29, 2017 00:49
k8s-github-robot pushed a commit that referenced this pull request Sep 8, 2017
Automatic merge from submit-queue

Add pod preemption to the scheduler

**What this PR does / why we need it**:
This is the last of a series of PRs to add priority-based preemption to the scheduler. This PR connects the preemption logic to the scheduler workflow.

**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes #48646

**Special notes for your reviewer**:
This PR includes other PRs which are under review (#50805, #50405, #50190). All the new code is located in 43627af.

**Release note**:

```release-note
Add priority-based preemption to the scheduler.
```

ref/ #47604

/assign @davidopp 

@kubernetes/sig-scheduling-pr-reviews