[Part 1] Implementation of equivalence pod #31605
Conversation
Can one of the admins verify that this patch is reasonable to test? If so, please reply "ok to test". This message will repeat several times in short succession due to jenkinsci/ghprb-plugin#292. Sorry.
    return &equivalencePod
}

type EquivalencePod struct {
As I wrote in the previous PR - this is not enough for the pod to actually be equivalent to the other one.
We should probably open an issue for this particular problem and discuss it there.
A quick note - equivalence also depends on at least:
- namespace
- labels
- some annotations (PodAffinity, and others?)
How do we enforce that when someone adds a new predicate/priority, this will be mirrored here...
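A purely illustrative sketch of the concern above (none of these names are in the PR): a fuller equivalence key would have to cover namespace, labels, and the scheduling-relevant annotations, and every new predicate/priority that reads another pod field would have to extend it.

package main

import (
    "fmt"
    "reflect"
)

// equivalenceKey is a hypothetical struct covering the fields listed above
// that two pods would have to share to be treated as equivalent.
type equivalenceKey struct {
    Namespace   string
    Labels      map[string]string
    Annotations map[string]string // only the scheduling-relevant ones, e.g. pod affinity
}

// equivalent compares two keys field by field.
func equivalent(a, b equivalenceKey) bool {
    return a.Namespace == b.Namespace &&
        reflect.DeepEqual(a.Labels, b.Labels) &&
        reflect.DeepEqual(a.Annotations, b.Annotations)
}

func main() {
    a := equivalenceKey{Namespace: "default", Labels: map[string]string{"app": "web"}}
    b := equivalenceKey{Namespace: "default", Labels: map[string]string{"app": "web"}}
    fmt.Println(equivalent(a, b)) // true
}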
GCE e2e build/test passed for commit 3f49fd1.
Jenkins GCE e2e failed for commit 15404d320121707143fe6ea6efb130c214dd7021. The magic incantation to run this job again is
Commits 98bac83 to f4b8f00
@resouer - I'm OOO this week - will take a look early next week.
I took a brief look - I will do a more careful review tomorrow. But I added some high-level comments.
    if !allExpired {
        ec.expireLock.Lock()
        defer ec.expireLock.Unlock()
        ec.invalidAlgorithmCacheList.Insert(nodeName)
For now, I don't see a reason for making it asynchronous. Why can't we just clear the cache for this node (instead of just marking it as invalid)?
@resouer - any thoughts?
Yes, you are right, this can make the cache part much simpler; refactoring.
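A minimal sketch of the synchronous approach being agreed on here, using the field names visible in the diff hunks (algorithmCache, expireLock); the method name and the AlgorithmCache stand-in are illustrative, not from the PR.

package equivalence

import "sync"

// AlgorithmCache is a stand-in for the per-node cache defined in equivalence_cache.go.
type AlgorithmCache map[string]bool

type EquivalenceCache struct {
    expireLock     *sync.RWMutex
    algorithmCache map[string]AlgorithmCache
}

// invalidateCachedItemForNode drops the node's entry directly, replacing the
// "mark invalid now, clean up later" flow.
func (ec *EquivalenceCache) invalidateCachedItemForNode(nodeName string) {
    ec.expireLock.Lock()
    defer ec.expireLock.Unlock()
    delete(ec.algorithmCache, nodeName)
}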
    if !allExpired {
        ec.expireLock.Lock()
        ec.allCacheExpired = true
Same here.
@resouer - thanks; let me know when it's ready for another look.
    // to be equivalent
    if len(pod.OwnerReferences) != 0 {
        for _, ref := range pod.OwnerReferences {
            if *ref.Controller && ref.Kind != "PetSet" {
Can you please make this "PetSet" a constant?
Or even better, this should probably be a slice of values (and for now it will only contain PetSet).
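A hedged sketch of that suggestion: keep the unsupported kinds in a named constant, or in a slice so more kinds can be added later. PetSetKind shows up later in the thread; excludedControllerKinds is an illustrative name.

package equivalence

const PetSetKind = "PetSet"

// excludedControllerKinds lists controller kinds whose pods we do not treat
// as equivalent; for now it only contains PetSet.
var excludedControllerKinds = []string{PetSetKind}

// isValidControllerKind reports whether pods owned by this controller kind
// can use the equivalence-pod optimization.
func isValidControllerKind(kind string) bool {
    for _, excluded := range excludedControllerKinds {
        if kind == excluded {
            return false
        }
    }
    return true
}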
"sync" | ||
) | ||
|
||
const maxCacheEntries = 4096 |
4096 by default might be too many (e.g. in small clusters). Can you please add a TODO to figure out what this value should be.
// SendInvalidAlgorithmCacheReq marks AlgorithmCache item as invalid
func (ec *EquivalenceCache) SendInvalidAlgorithmCacheReq(nodeName string) {
    ec.expireLock.RLock()
    allExpired := ec.allCacheExpired
Hmm - I'm wondering what's the purpose of "allCacheExpired".
Why can't we just do:
    ec.expireLock.Lock()
    defer ec.expireLock.Unlock()
    delete(ec.algorithmCache, nodeName)
What is the use case you had in mind for having this "allCacheExpired"?
Yes, in my mind, in most cases we should invalidate the cache for all Nodes, e.g. when a Controller or Service changed, while in some cases, e.g. when a Pod is deleted, we only need to invalidate the node it was bound to. Discussion of this logic belongs to Part 2, so I just kept it here.
And I just fixed the other nits.
    getEquivalencePod algorithm.GetEquivalencePodFunc
    algorithmCache    map[string]AlgorithmCache
    allCacheExpired   bool
    expireLock        *sync.RWMutex
nit: I would suggest changing to:
type EquivalenceCache struct {
    sync.RWMutex
    getEquivalencePod algorithm.GetEquivalencePodFunc
    algorithmCache    map[string]AlgorithmCache
    allCacheExpired   bool
}
Then you can simply do "ec.Lock()".
// SendInvalidAlgorithmCacheReq marks AlgorithmCache item as invalid
func (ec *EquivalenceCache) SendInvalidAlgorithmCacheReq(nodeName string) {
    ec.RLock()
    allExpired := ec.allCacheExpired
The previous comment disappeared, so let me comment here.
OK - I buy your argument that there are situations in which we want to invalidate the whole cache.
However, I think that this particular variable is not needed. Basically, when we need to invalidate the whole cache, I would simply delete all entries from it. Having this variable doesn't make much sense to me (if the whole cache is invalidated, we can also call delete(ec.algorithmCache, nodeName) and that will still be correct, and not visibly slower).
Hmm, makes sense - the flag doesn't actually save us time; removed it.
@k8s-bot kubemark e2e test this
Jenkins Kubemark GCE e2e failed for commit 497b691. Full PR test history. The magic incantation to run this job again is
// isValidControllerKind checks if a given controller's kind can be applied to equivalence pod algorithm.
func isValidControllerKind(kind string) bool {
    switch kind {
    case
nit: just about formatting, please change to:
switch kind {
// list of kinds that we cannot handle
case PetSetKind:
    return false
default:
    return true
}
    for nodeName, _ := range ec.algorithmCache {
        delete(ec.algorithmCache, nodeName)
    }
    ec.Unlock()
please change to "defer ec.Unlock()" and move it to the top
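What the loop above might look like with that change applied - a sketch under assumed names (the embedded-mutex layout suggested earlier, an AlgorithmCache stand-in, and an illustrative method name).

package equivalence

import "sync"

// AlgorithmCache is again a stand-in for the real per-node cache.
type AlgorithmCache map[string]bool

// EquivalenceCache here follows the embedded-mutex layout suggested above.
type EquivalenceCache struct {
    sync.RWMutex
    algorithmCache map[string]AlgorithmCache
}

// invalidateAllCachedItems takes the lock at the top, defers the Unlock,
// then drops every node entry.
func (ec *EquivalenceCache) invalidateAllCachedItems() {
    ec.Lock()
    defer ec.Unlock()
    for nodeName := range ec.algorithmCache {
        delete(ec.algorithmCache, nodeName)
    }
}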
@@ -244,6 +248,11 @@ func RegisterCustomPriorityFunction(policy schedulerapi.PriorityPolicy) string {
    return RegisterPriorityConfigFactory(policy.Name, *pcf)
}

func RegisterGetEquivalencePodFunction(equivalenceFunc algorithm.GetEquivalencePodFunc) {
    glog.Info("Register getEquivalencePodFunc.")
nit: I would suggest removing this comment
@resouer - kubemark failure is not your fault. We had some internal issues, which hopefully should be fixed soon. I added a few more, but very minor, comments - once applied, this LGTM.
Jenkins verification failed for commit 0db7497f0e2442f1ccd7ed3612407578af8e11ee. Full PR test history. The magic incantation to run this job again is
@resouer - can you please fix gofmt:
Also - please squash commits, and I will lgtm it then.
Update equivalent class & remove priority
Use controller ref
Directly clear the cache
Jenkins GCI GCE e2e failed for commit 204dbe7. Full PR test history. The magic incantation to run this job again is
LGTM - thanks a lot for this PR!
@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]
Automatic merge from submit-queue
Part 1 of #30844

This PR:
- Refactors GetResourceRequest into predicate.go, so that GetResourceRequest can be used in other places, such as GetEquivalencePod.
- Adds equivalence_cache.go to deal with all the information we need to calculate an equivalent pod.
- Adds RegisterGetEquivalencePodFunction.

Work in next PR: equivalence_cache.go

I think we can begin from the equivalence_cache.go? Thanks. cc @wojtek-t @davidopp

If I missed any other necessary part, please point it out.
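To tie the fragments in this thread together, here is a hedged, self-contained sketch of the GetEquivalencePod idea as it appears in the diff hunks above; the Pod and OwnerReference types are simplified stand-ins for the real API types, and the function bodies are reconstructed for illustration, not copied from the PR.

package main

import "fmt"

type OwnerReference struct {
    Kind       string
    Controller *bool
}

type Pod struct {
    Name            string
    OwnerReferences []OwnerReference
}

// EquivalencePod groups pods by their controlling owner reference.
type EquivalencePod struct {
    ControllerRef OwnerReference
}

const PetSetKind = "PetSet"

// isValidControllerKind mirrors the switch suggested in the review above.
func isValidControllerKind(kind string) bool {
    switch kind {
    // list of kinds that we cannot handle
    case PetSetKind:
        return false
    default:
        return true
    }
}

// getEquivalencePod returns an equivalence key for pods managed by a
// supported controller, or nil when no such key applies.
func getEquivalencePod(pod *Pod) *EquivalencePod {
    for _, ref := range pod.OwnerReferences {
        if ref.Controller != nil && *ref.Controller && isValidControllerKind(ref.Kind) {
            return &EquivalencePod{ControllerRef: ref}
        }
    }
    return nil
}

func main() {
    isController := true
    pod := &Pod{
        Name:            "web-1",
        OwnerReferences: []OwnerReference{{Kind: "ReplicaSet", Controller: &isController}},
    }
    fmt.Printf("%+v\n", getEquivalencePod(pod))
}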