Figure out and implement custom handling for MatchInterPodAffinity predicate #257
Description
As part of working on CA performance we run a large scale-up test with some additional logging including call count and total duration spent in each predicate. The results are as follows:
E0824 09:27:54.815425 8 predicates.go:192] Predicate statistics for MaxAzureDiskVolumeCount: called 59985 times, total time 25.328538ms, mean duration 422ns
E0824 09:27:54.815489 8 predicates.go:192] Predicate statistics for MatchInterPodAffinity: called 59985 times, total time 1m48.73252767s, mean duration 1.812661ms
E0824 09:27:54.815497 8 predicates.go:192] Predicate statistics for CheckNodeCondition: called 59985 times, total time 49.05121ms, mean duration 817ns
E0824 09:27:54.815502 8 predicates.go:192] Predicate statistics for MaxEBSVolumeCount: called 59985 times, total time 24.906838ms, mean duration 415ns
E0824 09:27:54.815508 8 predicates.go:192] Predicate statistics for GeneralPredicates: called 59985 times, total time 114.434325ms, mean duration 1.907µs
E0824 09:27:54.815534 8 predicates.go:192] Predicate statistics for NoDiskConflict: called 59985 times, total time 26.067526ms, mean duration 434ns
E0824 09:27:54.815553 8 predicates.go:192] Predicate statistics for NoVolumeNodeConflict: called 59985 times, total time 38.554035ms, mean duration 642ns
E0824 09:27:54.815559 8 predicates.go:192] Predicate statistics for CheckNodeDiskPressure: called 59985 times, total time 19.062642ms, mean duration 317ns
E0824 09:27:54.815564 8 predicates.go:192] Predicate statistics for PodToleratesNodeTaints: called 59985 times, total time 22.448605ms, mean duration 374ns
E0824 09:27:54.815568 8 predicates.go:192] Predicate statistics for MaxGCEPDVolumeCount: called 59985 times, total time 61.944698ms, mean duration 1.032µs
E0824 09:27:54.815572 8 predicates.go:192] Predicate statistics for NoVolumeZoneConflict: called 59985 times, total time 64.231254ms, mean duration 1.07µs
E0824 09:27:54.815578 8 predicates.go:192] Predicate statistics for PodFitsResources: called 357952 times, total time 514.838808ms, mean duration 1.438µs
E0824 09:27:54.815584 8 predicates.go:192] Predicate statistics for ready: called 59985 times, total time 50.931691ms, mean duration 849ns
E0824 09:27:54.815588 8 predicates.go:192] Predicate statistics for CheckNodeMemoryPressure: called 59985 times, total time 285.070472ms, mean duration 4.752µs
It turns out that MatchInterPodAffinity predicate is 3 orders of magnitude slower compared to other predicates. This is likely because contrary to scheduler we don't do any precomputation for it and we don't maintain predicateMeta object.
After a quick glance at predicate code it makes sense - it needs to iterate over all existing pods to check if any of them has pod antiaffinity on the pod we're running predicates for. This brings up another problem - how does it get all pods and nodes? We only provide NodeInfo for a single node, the rest comes out of informer. However, that means it reflects the real state of the cluster, not our simulated state. If we've already placed a pod with zone-level antiaffinity on a simulated node it won't prevent adding pods to other simulated nodes in the same zone.
Bottom line is that using zone-level antiaffinity can cause CA to "overshoot" creating some nodes for pods that won't be able to schedule on them anyway. Fortunately, this is a pretty unlikely edge case and we will scale-down the unnecessary nodes without any problem.