Figure out and implement custom handling for MatchInterPodAffinity predicate

As part of working on Cluster Autoscaler (CA) performance we ran a large scale-up test with some additional logging, including the call count and total time spent in each predicate. The results are as follows:
```
E0824 09:27:54.815425       8 predicates.go:192] Predicate statistics for MaxAzureDiskVolumeCount: called 59985 times, total time 25.328538ms, mean duration 422ns
E0824 09:27:54.815489       8 predicates.go:192] Predicate statistics for MatchInterPodAffinity: called 59985 times, total time 1m48.73252767s, mean duration 1.812661ms
E0824 09:27:54.815497       8 predicates.go:192] Predicate statistics for CheckNodeCondition: called 59985 times, total time 49.05121ms, mean duration 817ns
E0824 09:27:54.815502       8 predicates.go:192] Predicate statistics for MaxEBSVolumeCount: called 59985 times, total time 24.906838ms, mean duration 415ns
E0824 09:27:54.815508       8 predicates.go:192] Predicate statistics for GeneralPredicates: called 59985 times, total time 114.434325ms, mean duration 1.907µs
E0824 09:27:54.815534       8 predicates.go:192] Predicate statistics for NoDiskConflict: called 59985 times, total time 26.067526ms, mean duration 434ns
E0824 09:27:54.815553       8 predicates.go:192] Predicate statistics for NoVolumeNodeConflict: called 59985 times, total time 38.554035ms, mean duration 642ns
E0824 09:27:54.815559       8 predicates.go:192] Predicate statistics for CheckNodeDiskPressure: called 59985 times, total time 19.062642ms, mean duration 317ns
E0824 09:27:54.815564       8 predicates.go:192] Predicate statistics for PodToleratesNodeTaints: called 59985 times, total time 22.448605ms, mean duration 374ns
E0824 09:27:54.815568       8 predicates.go:192] Predicate statistics for MaxGCEPDVolumeCount: called 59985 times, total time 61.944698ms, mean duration 1.032µs
E0824 09:27:54.815572       8 predicates.go:192] Predicate statistics for NoVolumeZoneConflict: called 59985 times, total time 64.231254ms, mean duration 1.07µs 
E0824 09:27:54.815578       8 predicates.go:192] Predicate statistics for PodFitsResources: called 357952 times, total time 514.838808ms, mean duration 1.438µs
E0824 09:27:54.815584       8 predicates.go:192] Predicate statistics for ready: called 59985 times, total time 50.931691ms, mean duration 849ns
E0824 09:27:54.815588       8 predicates.go:192] Predicate statistics for CheckNodeMemoryPressure: called 59985 times, total time 285.070472ms, mean duration 4.752µs
```
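For reference, statistics like the ones above can be gathered by wrapping each predicate in a timing function. The snippet below is only a minimal sketch of that idea, not the actual logging code used in the test: `predicateFunc`, `statsRecorder` and `wrap` are hypothetical names, and the predicate signature is simplified to two strings.

```go
package main

import (
	"fmt"
	"time"
)

// predicateFunc is a hypothetical, simplified stand-in for a scheduler
// predicate; the real predicates take richer arguments.
type predicateFunc func(pod, node string) bool

// predicateStats accumulates the call count and total time of one predicate.
type predicateStats struct {
	calls int64
	total time.Duration
}

// mean returns the average duration per call, or 0 if there were no calls.
func (s predicateStats) mean() time.Duration {
	if s.calls == 0 {
		return 0
	}
	return s.total / time.Duration(s.calls)
}

// statsRecorder keeps per-predicate statistics keyed by predicate name.
// It is not safe for concurrent use; a real version would need locking.
type statsRecorder struct {
	stats map[string]*predicateStats
}

func newStatsRecorder() *statsRecorder {
	return &statsRecorder{stats: make(map[string]*predicateStats)}
}

// wrap returns a predicate that behaves like p but records its timing.
func (r *statsRecorder) wrap(name string, p predicateFunc) predicateFunc {
	return func(pod, node string) bool {
		start := time.Now()
		ok := p(pod, node)
		elapsed := time.Since(start)

		s, found := r.stats[name]
		if !found {
			s = &predicateStats{}
			r.stats[name] = s
		}
		s.calls++
		s.total += elapsed
		return ok
	}
}

func main() {
	rec := newStatsRecorder()
	alwaysFits := rec.wrap("AlwaysFits", func(pod, node string) bool { return true })
	for i := 0; i < 1000; i++ {
		alwaysFits("some-pod", "some-node")
	}
	for name, s := range rec.stats {
		fmt.Printf("Predicate statistics for %s: called %d times, total time %v, mean duration %v\n",
			name, s.calls, s.total, s.mean())
	}
}
```

The mean is just total time divided by call count, which is how 1m48.73252767s over 59985 calls works out to 1.812661ms for MatchInterPodAffinity.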

It turns out that the MatchInterPodAffinity predicate is **3 orders of magnitude** slower than the other predicates: about 1.8ms per call on average, versus roughly 0.3µs to 5µs for the rest. This is likely because, unlike the scheduler, we don't do any precomputation for it and we don't maintain a predicateMeta object.
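To illustrate what precomputation could buy us, here is a rough sketch. It is not the scheduler's predicateMeta and not a proposed implementation: all types (`pod`, `node`, `antiAffinityTerm`, `affinityMeta`) are simplified, hypothetical stand-ins, selectors are reduced to a single label key/value pair, and only the anti-affinity of existing pods against the incoming pod is considered. The point is that the scan over all pods happens once, and each later check only looks at the precomputed set of blocked (selector, topology domain) pairs.

```go
package main

import "fmt"

// antiAffinityTerm is a simplified, hypothetical anti-affinity rule: keep pods
// labelled labelKey=labelValue out of this pod's topology domain.
type antiAffinityTerm struct {
	labelKey, labelValue string
	topologyKey          string // node label defining the domain, e.g. "zone"
}

type pod struct {
	name         string
	labels       map[string]string
	antiAffinity []antiAffinityTerm
}

type node struct {
	name   string
	labels map[string]string
	pods   []pod
}

// blockKey identifies pods matching a selector being banned from one domain.
type blockKey struct {
	labelKey, labelValue, topologyKey, topologyValue string
}

// affinityMeta is the precomputed result of one pass over all pods.
type affinityMeta struct {
	blocked map[blockKey]bool
}

// precompute walks every pod on every node once and records which
// (selector, topology domain) combinations are blocked.
func precompute(nodes []node) affinityMeta {
	meta := affinityMeta{blocked: make(map[blockKey]bool)}
	for _, n := range nodes {
		for _, p := range n.pods {
			for _, t := range p.antiAffinity {
				domain, ok := n.labels[t.topologyKey]
				if !ok {
					continue // node is not part of any domain for this key
				}
				meta.blocked[blockKey{t.labelKey, t.labelValue, t.topologyKey, domain}] = true
			}
		}
	}
	return meta
}

// fits reports whether no existing pod's anti-affinity rejects p on n.
// Its cost depends on the number of distinct blocked entries, not on the
// total number of pods in the cluster.
func (m affinityMeta) fits(p pod, n node) bool {
	for k := range m.blocked {
		v, hasLabel := p.labels[k.labelKey]
		d, hasDomain := n.labels[k.topologyKey]
		if hasLabel && v == k.labelValue && hasDomain && d == k.topologyValue {
			return false
		}
	}
	return true
}

func main() {
	nodes := []node{
		{name: "n1", labels: map[string]string{"zone": "a"}, pods: []pod{
			{name: "db-0", antiAffinity: []antiAffinityTerm{{"app", "web", "zone"}}},
		}},
		{name: "n2", labels: map[string]string{"zone": "b"}},
	}
	meta := precompute(nodes) // done once per simulation pass
	web := pod{name: "web-0", labels: map[string]string{"app": "web"}}
	for _, n := range nodes {
		fmt.Printf("%s fits on %s: %v\n", web.name, n.name, meta.fits(web, n))
	}
}
```

The precomputed data would have to be rebuilt or updated whenever a pod is added to the simulated state, which is part of what "custom handling" would need to figure out.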

After a quick glance at the predicate code this makes sense: it needs to iterate over all existing pods to check whether any of them has pod anti-affinity against the pod we're running predicates for. This brings up another problem: how does it get all the pods and nodes? We only provide the NodeInfo for a single node; the rest comes from an informer. That means it reflects the real state of the cluster, not our simulated state. If we've already placed a pod with zone-level anti-affinity on a simulated node, that won't prevent adding pods to other simulated nodes in the same zone.
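The difference between checking against the real cluster state and checking against the simulated state can be shown with a toy model. This is only an illustration under heavy assumptions: `pod`, `node` and `fits` are hypothetical, and anti-affinity is reduced to a flag meaning "refuse to share a zone with another pod of the same app".

```go
package main

import "fmt"

// pod is a hypothetical, simplified pod. zoneAntiAffinity means it must not
// share a zone with any other pod of the same app.
type pod struct {
	name, app        string
	zoneAntiAffinity bool
}

// node is a hypothetical, simplified node with the pods placed on it.
type node struct {
	name, zone string
	pods       []pod
}

// fits checks zone-level anti-affinity for p on target, using only the pods
// visible in view. Whatever is missing from view cannot block the placement.
func fits(p pod, target node, view []node) bool {
	for _, n := range view {
		if n.zone != target.zone {
			continue
		}
		for _, existing := range n.pods {
			if existing.app == p.app && (existing.zoneAntiAffinity || p.zoneAntiAffinity) {
				return false
			}
		}
	}
	return true
}

func main() {
	// Real cluster state, as an informer would report it.
	realNodes := []node{{name: "existing-1", zone: "a"}}

	// Simulated scale-up: web-1 was already placed on a new node in zone a.
	web1 := pod{name: "web-1", app: "web", zoneAntiAffinity: true}
	web2 := pod{name: "web-2", app: "web", zoneAntiAffinity: true}
	sim1 := node{name: "simulated-1", zone: "a", pods: []pod{web1}}
	sim2 := node{name: "simulated-2", zone: "a"}
	simulatedNodes := append(append([]node{}, realNodes...), sim1, sim2)

	fmt.Println("checked against real state:     ", fits(web2, sim2, realNodes))
	fmt.Println("checked against simulated state:", fits(web2, sim2, simulatedNodes))
}
```

Against the real state the check passes, because web-1 only exists in the simulation; against the simulated state it is rejected, which is the behaviour we would want.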

The bottom line is that using zone-level anti-affinity can cause CA to "overshoot", creating some nodes for pods that won't be able to schedule on them anyway. Fortunately, this is a pretty unlikely edge case, and we will scale down the unnecessary nodes without any problem.

Figure out and implement custom handling for MatchInterPodAffinity predicate #257
