-
Notifications
You must be signed in to change notification settings - Fork 40k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix expectations in Deployment. #21963
Conversation
Labelling this PR as size/L |
GCE e2e test build/test passed for commit 70df516d5311d509d981a44b70a2ab51b207f547. |
GCE e2e test build/test passed for commit 70d5ad43fd26d7631044f22fd52aec5ebf217b7b. |
GCE e2e test build/test passed for commit 225a477805dad4b713162f94da93f21cb5f7f5ff. |
GCE e2e build/test failed for commit e995d58286046e711a7065b67c1e88c7df58a7f6. |
All tests passed except for:
|
The paused test fails to delete all of the Deployment's pods at the very end. |
Ok. I changed Deployment to do straightforward refcounting of pod additions and deletions. This made all deployment tests pass for me, except for pause. I need someone else (or multiple someones) to look at pause. And I would like someone to change the ExpirationCache to permit explicit timeout instead of expiration on Get. The timestamp should be reset every time syncDeployment is able to make progress (i.e., stores are synced and expectations are met) -- this should be when SetExpectations is called. Without vectors of logical clocks, this is not going to be bulletproof, but it's at least not blatantly wrong. |
GCE e2e test build/test passed for commit 02565d84f4b0686424d21e089b18ee46d3c782a4. |
dc.podExpectations.RaiseExpectations(dKey, newScale-rs.Spec.Replicas, 0) | ||
} else { | ||
scalingOperation = "down" | ||
dc.podExpectations.LowerExpectations(dKey, 0, rs.Spec.Replicas-newScale) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be:
dc.podExpectations.RaiseExpectations(dKey, 0, rs.Spec.Replicas-newScale)
or
dc.podExpectations.LowerExpectations(dKey, 0, newScale-rs.Spec.Replicas)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch.
LGTM, other than one minor comment. Please feel free to apply the label after making that change. |
Addressed feedback and did some more cleanup (conversion of Set/Delete to Raise/Lower). Enabled flaky tests in PR builder. |
func (e *ControlleeExpectations) Seen(add, del int64) { | ||
atomic.AddInt64(&e.add, -add) | ||
atomic.AddInt64(&e.del, -del) | ||
// Seen increments the add and del counters. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
change the comment to Add
GCE e2e build/test failed for commit 3ecc59850ffd48cafbf5eb4b5d0bfe74d13e9d77. Please reference the list of currently known flakes when examining this failure. |
GCE e2e build/test failed for commit b90f631a6bb67ff84ab55e526d93e6e9499edca7. Please reference the list of currently known flakes when examining this failure. If you request a re-test, you must reference the issue describing the flake. |
Tests failed due to timeout:
Can't find the controller manger log in the bucket. |
@janetkuo Thanks. I'll take a look. Those are all the same tests that have been failing, so looks like there's still a problem. |
GCE e2e build/test failed for commit 6fd1c532b634914a8884177e55ba87bb0f97ceba. Please reference the list of currently known flakes when examining this failure. If you request a re-test, you must reference the issue describing the flake. |
Tests seem to be still failing and the kube-controller-manager.log is huge (13G) so I didn't try to open it. Tailing the file all I see is this repeating error message:
|
Added expiration back so I can (hopefully) see what expectations aren't being satisfied. |
GCE e2e build/test passed for commit c9a1ce66d07fc3b1445d5414fc2423951f1937bd. |
GCE e2e build/test passed for commit e1362e384a2c4874122e495e9d04e93095da2d29. |
Controller manager log was huge because I was always re-enqueueing on failure, only to fail again. Pods from ReplicaSets created with non-zero replicas weren't accounted for. It also appears that not all expected deletions happen. Could be faulty error handling.
|
GCE e2e build/test passed for commit 31f95ae64381a6f33f526f4bb018312d3647ad01. |
It appears to be pretty common for deletions to be observed after expectations are determined to be satisfied and are reset. |
GCE e2e build/test failed for commit 5837987cf7ed2ff918641bae499abe0ef5098539. Please reference the list of currently known flakes when examining this failure. If you request a re-test, you must reference the issue describing the flake. |
GCE e2e build/test passed for commit 554eee3babb02e1c273e7a81898688c02a0b92a9. |
Something fishy is going on with deletes, but I don't have time to investigate more. This seems to make all the tests pass. PTAL |
BTW, this also makes the controller manager log comparable in size to the other components. |
GCE e2e build/test passed for commit 39f0edc. |
func (e *ControlleeExpectations) Seen(add, del int64) { | ||
atomic.AddInt64(&e.add, -add) | ||
atomic.AddInt64(&e.del, -del) | ||
// Add increments the add and del counters. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is much more readable and easier to comprehend than the Seen()
language. Thanks!
LGTM. I will let @janetkuo to add the label after taking a look. |
LGTM |
Merging to fix tests. |
Fix expectations in Deployment.
Working on a fix for #19299.
Probably still want eventual expiration, but if e2e doesn't pass without that, there's got to be a problem.
cc @janetkuo @Kargakis @mqliang @nikhiljindal @ironcladlou