-
Notifications
You must be signed in to change notification settings - Fork 40.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improved event generation in volume mount,attach, and extend operations #56872
Conversation
/assign jingxu97 |
/sig storage |
@@ -279,7 +279,7 @@ func (og *operationGenerator) GenerateAttachVolumeFunc( | |||
return nil, attachableVolumePlugin.GetPluginName(), volumeToAttach.GenerateErrorDetailed("AttachVolume.NewAttacher failed", newAttacherErr) | |||
} | |||
|
|||
return func() error { | |||
attachVolumeFunc := func() (error, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: It'd be helpful to have named return values
I noticed there's a SuccessfulMountVolume event but no other success events. IMO there should also be one for attach, detach, and unmount. While logs for these actions already exist, these events could help us bisect the problem even before going through logs. |
Instead of modifying the returned functions to add generic error handling (wrapping funcs in funcs, which gets messy and hard to test), how about modifying https://github.com/kubernetes/kubernetes/blob/master/pkg/volume/util/nestedpendingoperations/nestedpendingoperations.go instead? That will maintain a cleaner abstractions. Your call. |
@saad-ali I tried out your suggestion and decided it was a better path; however, in the process noticed some parts were getting too complex and had too many return variables (as well as "misplaced" code) so I did some refactoring around the area too. PTAL. Feel free to ask me any questions about the refactor. The events generated will be the same as the first version. |
/retest |
/kind technical-debt |
/retest |
0365acd
to
7b3dffa
Compare
/retest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Largely LGTM. Will do another pass
eventRecorderFunc := func(err *error) { | ||
if *err != nil { | ||
for _, pod := range volumeToAttach.ScheduledPods { | ||
og.recorder.Eventf(pod, v1.EventTypeWarning, kevents.FailedAttachVolume, (*err).Error()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you verify if the value of volumeToAttach.ScheduledPods
is populated correctly when the operation is actually executed? Want to make sure there is no funkiness in referencing a variable that is no longer in scope.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, the closure will keep the variable around. There is one nuance that we need to work through though is the fact that each of the "pods" in Scheduled Pods is a reference, so if the pod itself changes between the eventRecorderFunc being generated and called, it will use the new version of the pod. I think that is actually intended behavior if it is the same pod and has just been mutated slightly.
It will not, however, use any other version of "ScheduledPods" as it is passed by value so we will not have pods being added or removed in between generation and execution of this function.
I believe both of these behaviors are working as intended. lmk if you think otherwise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like reasonable assumptions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not seeing benefit of these eventrecorder functions... The way they are being used is arbitrary too and there is a reason it is arbitrary. Depending on error each operation may want to emit one more more events (or more than one type of event) - this generalization that each operation will return a GeneratedOperations
and that in turn will have one EventRecorder
function simply breaks down in multiple places.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The EventRecorder function can have arbitrary functionality to emit 0,1, or more events depending on necessity. This refactor just automates all the "plumbing" required to emit these events in one codepath. Therefore anyone wanting to emit new events or change existing events just has to add/modify the "eventRecorderFunc" of the GeneratedOperation they wish to modify.
The EventRecorder function to actual event emitted does not have to be 1 to 1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussed further with @gnufied offline. Seems like the issue is that this abstraction only has the ability to capture "Error" type events that are returned from the generated function. The abstraction "breaks down" when we try to use it for other types of events such as a Warning or Success.
The intended design of the abstraction was only for Error events, while warnings and success events would continue to use the direct og.recorder.Eventf method. This can be documented by changing the EventRecorderFunc to ErrorEventRecorderFunc so that it is no longer a misnomer.
@gnufied please confirm or correct these statements per your understanding.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm - I am not sure about net benefit of having this recorder in the first place. I have a use case - where I want to reduce severity of an event depending on error raised. I still want the whole operation to fail in general (because a non-failing operation will not respect exponential backoff) but in certain cases i want the error to be a warning (because operation will be retried anyways).
|
||
return volumetypes.GeneratedOperations{ | ||
OperationFunc: volumesAreAttachedFunc, | ||
CompleteFunc: util.OperationCompleteHook("<n/a>", "verify_volumes_are_attached_per_node"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Explicitly add EventRecorderFunc: nil, // reason why nil
so that folks reading this code in the future know this was left nil on purpose and why.
7b3dffa
to
9988fd8
Compare
For recording purposes: I am not planning on adding success events so that we do not provide redundant information in the events. A successful mount is the most important event for the user, it also indicates that all preceding operations have succeeded. |
…erate events and event generation is more consistent. refactored operation generator and operation executor to use more general generated functions for operations, completions, and events.
9988fd8
to
e3f8f64
Compare
/lgtm |
/approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: davidz627, saad-ali Associated issue: #57217 The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these OWNERS Files:
You can indicate your approval by writing |
Automatic merge from submit-queue (batch tested with PRs 56375, 56872, 57053, 57165, 57218). If you want to cherry-pick this change to another branch, please follow the instructions here. |
requestTime := time.Now() | ||
opComplete := func(err error) { | ||
opComplete := func(err *error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was there any particular reason to change this to address of an interface?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gnufied pass error as value, will cause the passed err to operationcomplete
is always nil
, and metric storageOperationErrorMetric
doesn't work
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep. @davidz627 explaind that below.
// Handle panic, if any, from operationFunc() | ||
defer k8sRuntime.RecoverFromPanic(&err) | ||
return operationFunc() | ||
defer k8sRuntime.RecoverFromPanic(&detailedErr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as before - I do not see why we are passing address of an interface in those places. @davidz627 can you explain?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was added to address a bug I found while writing this PR. Previously the CompleteFuncs were not logging error at all.
In the previous version of the function take a look at the line:
defer operationCompleteFunc(err)
Because Go evaluates "defer"s variables at runtime, we see that the value of "err" will always be nil in this version.
The fixed version gives this operationCompleteFunction a pointer to err. Therefore when the OperationFunc completes execution, the RecoverFromPanic function will actually write a value to the location of the err pointer in case of error. Then the CompleteFunc will be able to use this error for logging or its other functions by inspecting the value written at the pointer location.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hope that helps! Let me know if there is any flaw in my reasoning.
I also did some testing to confirm the errors were not being logged previously, and are now being logged successfully after this change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ack.
Which issue(s) this PR fixes (optional, in
fixes #<issue number>(, fixes #<issue_number>, ...)
format, will close the issue(s) when PR gets merged):Fixes #57217
refactored mount, attach, extend operation's so that all failures generate events and event generation is more consistent.