Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improved event generation in volume mount,attach, and extend operations #56872

Merged
merged 1 commit into from
Dec 17, 2017

Conversation

davidz627
Copy link
Contributor

@davidz627 davidz627 commented Dec 6, 2017

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #57217

refactored mount, attach, extend operation's so that all failures generate events and event generation is more consistent.

Improved event generation in volume mount, attach, and extend operations

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Dec 6, 2017
@davidz627
Copy link
Contributor Author

/assign jingxu97

@davidz627
Copy link
Contributor Author

/sig storage
/kind cleanup

@k8s-ci-robot k8s-ci-robot added sig/storage Categorizes an issue or PR as relevant to SIG Storage. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. labels Dec 6, 2017
@@ -279,7 +279,7 @@ func (og *operationGenerator) GenerateAttachVolumeFunc(
return nil, attachableVolumePlugin.GetPluginName(), volumeToAttach.GenerateErrorDetailed("AttachVolume.NewAttacher failed", newAttacherErr)
}

return func() error {
attachVolumeFunc := func() (error, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: It'd be helpful to have named return values

@verult
Copy link
Contributor

verult commented Dec 6, 2017

I noticed there's a SuccessfulMountVolume event but no other success events. IMO there should also be one for attach, detach, and unmount. While logs for these actions already exist, these events could help us bisect the problem even before going through logs.

@saad-ali
Copy link
Member

Instead of modifying the returned functions to add generic error handling (wrapping funcs in funcs, which gets messy and hard to test), how about modifying https://github.com/kubernetes/kubernetes/blob/master/pkg/volume/util/nestedpendingoperations/nestedpendingoperations.go instead? That will maintain a cleaner abstractions. NestedPendingOperations (and it's sister GoRoutineMap), are responsible for actually executing these operations and implement any common error handling (like exponential backoff). So you could extend NestedPendingOperations(and it's sisterGoRoutineMap) to trigger events on error and success. We probably want to be able to select which operations to trigger events for, so you'd also have pass in bools indicating whether each type of event was required for that operation which could get a little more complicated.

Your call.

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Dec 12, 2017
@davidz627
Copy link
Contributor Author

@saad-ali I tried out your suggestion and decided it was a better path; however, in the process noticed some parts were getting too complex and had too many return variables (as well as "misplaced" code) so I did some refactoring around the area too.

PTAL. Feel free to ask me any questions about the refactor. The events generated will be the same as the first version.

@davidz627
Copy link
Contributor Author

/retest

@davidz627
Copy link
Contributor Author

/kind technical-debt

@davidz627
Copy link
Contributor Author

/retest

@davidz627
Copy link
Contributor Author

/retest

Copy link
Member

@saad-ali saad-ali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Largely LGTM. Will do another pass

eventRecorderFunc := func(err *error) {
if *err != nil {
for _, pod := range volumeToAttach.ScheduledPods {
og.recorder.Eventf(pod, v1.EventTypeWarning, kevents.FailedAttachVolume, (*err).Error())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you verify if the value of volumeToAttach.ScheduledPods is populated correctly when the operation is actually executed? Want to make sure there is no funkiness in referencing a variable that is no longer in scope.

Copy link
Contributor Author

@davidz627 davidz627 Dec 14, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, the closure will keep the variable around. There is one nuance that we need to work through though is the fact that each of the "pods" in Scheduled Pods is a reference, so if the pod itself changes between the eventRecorderFunc being generated and called, it will use the new version of the pod. I think that is actually intended behavior if it is the same pod and has just been mutated slightly.

It will not, however, use any other version of "ScheduledPods" as it is passed by value so we will not have pods being added or removed in between generation and execution of this function.

I believe both of these behaviors are working as intended. lmk if you think otherwise.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like reasonable assumptions.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not seeing benefit of these eventrecorder functions... The way they are being used is arbitrary too and there is a reason it is arbitrary. Depending on error each operation may want to emit one more more events (or more than one type of event) - this generalization that each operation will return a GeneratedOperations and that in turn will have one EventRecorder function simply breaks down in multiple places.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The EventRecorder function can have arbitrary functionality to emit 0,1, or more events depending on necessity. This refactor just automates all the "plumbing" required to emit these events in one codepath. Therefore anyone wanting to emit new events or change existing events just has to add/modify the "eventRecorderFunc" of the GeneratedOperation they wish to modify.

The EventRecorder function to actual event emitted does not have to be 1 to 1.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed further with @gnufied offline. Seems like the issue is that this abstraction only has the ability to capture "Error" type events that are returned from the generated function. The abstraction "breaks down" when we try to use it for other types of events such as a Warning or Success.

The intended design of the abstraction was only for Error events, while warnings and success events would continue to use the direct og.recorder.Eventf method. This can be documented by changing the EventRecorderFunc to ErrorEventRecorderFunc so that it is no longer a misnomer.

@gnufied please confirm or correct these statements per your understanding.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm - I am not sure about net benefit of having this recorder in the first place. I have a use case - where I want to reduce severity of an event depending on error raised. I still want the whole operation to fail in general (because a non-failing operation will not respect exponential backoff) but in certain cases i want the error to be a warning (because operation will be retried anyways).


return volumetypes.GeneratedOperations{
OperationFunc: volumesAreAttachedFunc,
CompleteFunc: util.OperationCompleteHook("<n/a>", "verify_volumes_are_attached_per_node"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Explicitly add EventRecorderFunc: nil, // reason why nil so that folks reading this code in the future know this was left nil on purpose and why.

@davidz627
Copy link
Contributor Author

davidz627 commented Dec 14, 2017

For recording purposes: I am not planning on adding success events so that we do not provide redundant information in the events. A successful mount is the most important event for the user, it also indicates that all preceding operations have succeeded.

…erate events and event generation is more consistent.

refactored operation generator and operation executor to use more general generated functions for operations, completions, and events.
@saad-ali
Copy link
Member

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 14, 2017
@saad-ali
Copy link
Member

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: davidz627, saad-ali

Associated issue: #57217

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 15, 2017
@k8s-github-robot
Copy link

Automatic merge from submit-queue (batch tested with PRs 56375, 56872, 57053, 57165, 57218). If you want to cherry-pick this change to another branch, please follow the instructions here.

requestTime := time.Now()
opComplete := func(err error) {
opComplete := func(err *error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was there any particular reason to change this to address of an interface?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gnufied pass error as value, will cause the passed err to operationcomplete is always nil, and metric storageOperationErrorMetric doesn't work

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep. @davidz627 explaind that below.

// Handle panic, if any, from operationFunc()
defer k8sRuntime.RecoverFromPanic(&err)
return operationFunc()
defer k8sRuntime.RecoverFromPanic(&detailedErr)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as before - I do not see why we are passing address of an interface in those places. @davidz627 can you explain?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was added to address a bug I found while writing this PR. Previously the CompleteFuncs were not logging error at all.

In the previous version of the function take a look at the line:
defer operationCompleteFunc(err)
Because Go evaluates "defer"s variables at runtime, we see that the value of "err" will always be nil in this version.

The fixed version gives this operationCompleteFunction a pointer to err. Therefore when the OperationFunc completes execution, the RecoverFromPanic function will actually write a value to the location of the err pointer in case of error. Then the CompleteFunc will be able to use this error for logging or its other functions by inspecting the value written at the pointer location.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hope that helps! Let me know if there is any flaw in my reasoning.

I also did some testing to confirm the errors were not being logged previously, and are now being logged successfully after this change.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/storage Categorizes an issue or PR as relevant to SIG Storage. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Review and improve provision/attach/mount/etc logs/events for ease of debuggability
9 participants