Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Part 1] Implementation of equivalence pod #31605

Merged
merged 1 commit into from
Oct 10, 2016

Conversation

resouer
Copy link
Contributor

@resouer resouer commented Aug 29, 2016

Part 1 of #30844

This PR:

  • Refactored predicate.go, so that GetResourceRequest can be used in other places to GetEquivalencePod.
  • Implement a equivalence_cache.go to deal with all information we need to calculate an equivalent pod.
  • Define and register a RegisterGetEquivalencePodFunction.

Work in next PR:

  • The update of equivalence_cache.go
  • Unit test
  • Integration/e2e test

I think we can begin from the equivalence_cache.go? Thanks. cc @wojtek-t @davidopp

If I missed any other necessary part, please point it out.


This change is Reviewable

@k8s-github-robot k8s-github-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. release-note-label-needed labels Aug 29, 2016
@k8s-bot
Copy link

k8s-bot commented Aug 29, 2016

Can one of the admins verify that this patch is reasonable to test? If so, please reply "ok to test".
(Note: "add to whitelist" is no longer supported. Please update configurations in kubernetes/test-infra/jenkins/job-configs/kubernetes-jenkins-pull instead.)

This message will repeat several times in short succession due to jenkinsci/ghprb-plugin#292. Sorry.

@wojtek-t wojtek-t self-assigned this Aug 31, 2016
return &equivalencePod
}

type EquivalencePod struct {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I wrote in the previous PR - this is not enough for the pod to actually be equivalent to the other one.
We should probably open an issue for this particular problem and discuss it there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A quick note:

namespace
labels
some annotations (PodAffinity, and others?)

How to enforce that when someone will add a new predicate/priority this will be mirrored here...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wojtek-t Issue fired: #32024

@k8s-github-robot k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 14, 2016
@k8s-github-robot k8s-github-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 19, 2016
@k8s-bot
Copy link

k8s-bot commented Sep 19, 2016

GCE e2e build/test passed for commit 3f49fd1.

@k8s-ci-robot
Copy link
Contributor

Jenkins GCE e2e failed for commit 15404d320121707143fe6ea6efb130c214dd7021.

The magic incantation to run this job again is @k8s-bot gce e2e test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

@resouer resouer force-pushed the eclass-1 branch 2 times, most recently from 98bac83 to f4b8f00 Compare September 21, 2016 04:32
@resouer
Copy link
Contributor Author

resouer commented Sep 21, 2016

@wojtek-t Added a commit to use ControllerRef, PTAL

ref #32024

@wojtek-t
Copy link
Member

@resouer - I'm OOO this week - will take a look early next week.

Copy link
Member

@wojtek-t wojtek-t left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took a brief look - I will do a more careful review tomorrow. But added some high-level comment.

if !allExpired {
ec.expireLock.Lock()
defer ec.expireLock.Unlock()
ec.invalidAlgorithmCacheList.Insert(nodeName)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now, I don't see a reason for making it asynchronous. Why can't we just clear the cache for this node (instead of just marking it as invalid).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@resouer - any thoughts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you are right, this can make the cache part much simpler, refactoring


if !allExpired {
ec.expireLock.Lock()
ec.allCacheExpired = true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here.

@wojtek-t wojtek-t added release-note-none Denotes a PR that doesn't merit a release note. and removed release-note-label-needed labels Sep 27, 2016
@k8s-github-robot k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 30, 2016
@wojtek-t
Copy link
Member

wojtek-t commented Oct 6, 2016

@resouer - thanks; let me know when it will be ready for another look

@k8s-github-robot k8s-github-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 7, 2016
// to be equivalent
if len(pod.OwnerReferences) != 0 {
for _, ref := range pod.OwnerReferences {
if *ref.Controller && ref.Kind != "PetSet" {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please make this "PetSet" a constant?
Or even better, this should probably by slice if values (and for now it will only contain PetSet).

"sync"
)

const maxCacheEntries = 4096
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

4096 by default might be to many (e.g. in small clusters). Can you please add a TODO to figure out what this value should be.

// SendInvalidAlgorithmCacheReq marks AlgorithmCache item as invalid
func (ec *EquivalenceCache) SendInvalidAlgorithmCacheReq(nodeName string) {
ec.expireLock.RLock()
allExpired := ec.allCacheExpired
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm - I'm wondering what's the purpose if "allCacheExpired".
Why can't we just do:

ec.expireLock().Lock()
defer ec.exprireLock.Unlock()
delete(ec.algorithmCache, nodeName)

What is the usecase you had in mind for having this "allCacheExpired"?

Copy link
Contributor Author

@resouer resouer Oct 7, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, in my mind, in most cases, we should invalid cache for all Nodes, e.g. Controller or Service changed, while in some cases, e.g. Pod deleted, we only need toinvalid the node bound with it. Discussion of this logic belongs to Part 2, so I just keep them here.

Copy link
Contributor Author

@resouer resouer Oct 7, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And just fixed other nits

getEquivalencePod algorithm.GetEquivalencePodFunc
algorithmCache map[string]AlgorithmCache
allCacheExpired bool
expireLock *sync.RWMutex
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I would suggest changing to:

type EquivalenceCache struct {
    sync.RWMutex
    getEquivalencePod algorithm.GetEquivalencePodFunc
    algorithmCache    map[string]AlgorithmCache
    allCacheExpired   bool
}

Then you can simply do "ec.Lock()".

// SendInvalidAlgorithmCacheReq marks AlgorithmCache item as invalid
func (ec *EquivalenceCache) SendInvalidAlgorithmCacheReq(nodeName string) {
ec.RLock()
allExpired := ec.allCacheExpired
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The previous commend disappeared, so let me comment here.

OK - I buy your argument that there are situations in which we want to invalidate the whole cache.
However, I think that this particular variable is not needed. Basically, what I would do when we need to invalidate all cache, I would simply delete all entries from it. And having this variable doesn't make much sense to me (if we have the whole cache invalidated, we can also call delete(ec.aglrotihmCache, nodeName) and that will be correct, an not visibly slower.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Em, make sense, the flag doesn't save us time actually, removed it.

@resouer
Copy link
Contributor Author

resouer commented Oct 8, 2016

@k8s-bot kubemark e2e test this

@k8s-ci-robot
Copy link
Contributor

Jenkins Kubemark GCE e2e failed for commit 497b691. Full PR test history.

The magic incantation to run this job again is @k8s-bot kubemark e2e test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

// isValidControllerKind checks if a given controller's kind can be applied to equivalence pod algorithm.
func isValidControllerKind(kind string) bool {
switch kind {
case
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: just about formatting, please change to:

switch kind {
// list of kinds that we cannot handle
case PetSetKind:
  return false;
default:
  return true;
}

for nodeName, _ := range ec.algorithmCache {
delete(ec.algorithmCache, nodeName)
}
ec.Unlock()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please change to "defer ec.Unlock()" and move to the top

@@ -244,6 +248,11 @@ func RegisterCustomPriorityFunction(policy schedulerapi.PriorityPolicy) string {
return RegisterPriorityConfigFactory(policy.Name, *pcf)
}

func RegisterGetEquivalencePodFunction(equivalenceFunc algorithm.GetEquivalencePodFunc) {
glog.Info("Register getEquivalencePodFunc.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I would suggest removing this comment

@wojtek-t
Copy link
Member

wojtek-t commented Oct 8, 2016

@resouer - kubemark failure is not your fault. We had some internal issues, which hopefully should be fixed soon.

I added few more, but very minor comments - once applied, this LGTM

@k8s-ci-robot
Copy link
Contributor

Jenkins verification failed for commit 0db7497f0e2442f1ccd7ed3612407578af8e11ee. Full PR test history.

The magic incantation to run this job again is @k8s-bot verify test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

@wojtek-t
Copy link
Member

wojtek-t commented Oct 9, 2016

@resouer - can you please fix gofmt:

Verifying hack/make-rules/../../hack/verify-gofmt.sh
!!! 'gofmt -s -w' needs to be run on the following files: 
./plugin/pkg/scheduler/equivalence_cache.go

Also - please squash commits, and I will lgtm it then.

Update equivalent class & remove priority

Use controller ref

Directly clear the cache
@k8s-ci-robot
Copy link
Contributor

Jenkins GCI GCE e2e failed for commit 204dbe7. Full PR test history.

The magic incantation to run this job again is @k8s-bot gci gce e2e test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

@wojtek-t
Copy link
Member

LGTM - thanks a lot for this PR!

@wojtek-t wojtek-t added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 10, 2016
@k8s-github-robot
Copy link

@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]

@k8s-github-robot
Copy link

Automatic merge from submit-queue

@k8s-github-robot k8s-github-robot merged commit 5509e50 into kubernetes:master Oct 10, 2016
@resouer resouer deleted the eclass-1 branch November 4, 2016 14:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants