-
Notifications
You must be signed in to change notification settings - Fork 40.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
apimachinery runtime: support contextual logging #121970
apimachinery runtime: support contextual logging #121970
Conversation
// It expects to be called as <caller> -> HandleError[InContext] -> logError. | ||
func logError(ctx context.Context, err error) { | ||
logger := klog.FromContext(ctx).WithCallDepth(2) | ||
logger.Error(err, "Unhandled error") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm torn between calling this "Unhandled error" and "Handled error". I chose "Unhandled error" because logging an error is not the same as handling it in code - the whole point of logging at error level is that something unexpected occurred that needs attention.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if !strings.Contains(capturedOutput, `timeout or abort while handling: method=GET URI="/" audit-ID=""`) { |
Before the log entry was unstructured:
E1122 08:24:37.387820 9070 wrap.go:54] timeout or abort while handling: method=GET URI="/" audit-ID=""
Now it is structured:
E1121 21:17:48.088946 61801 wrap.go:54] "Unhandled error" err="timeout or abort while handling: method=GET URI=\"/\" audit-ID=\"\""
Did we intentionally or accidentally make this output part of the apiserver API? In other words, can this output be changed?
Would it make sense to change this further? There are plenty of cases where HandleError
is called for an error that gets created via fmt.Errorf
- this is unstructured printf-style logging! A HandleErrorInContext(ctx, err, <key/value pairs>)
would lead to better output.
/cc @tkashem
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like UnhandledError
as this is what these usually are. The code variant of 🤷♂️, i.e. we cannot sensibly handle that error in the context, but want to log it for visibility that something is wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've pushed an update. When passing message and key/value pairs as separate parameters, the "Unhandled error" has to be added via WithName
because the message string is controlled by the caller. The output in the timeout test now is:
"Timeout or abort while handling" logger="UnhandledError" method="GET" URI="/" auditID=""
hack/golangci-hints.yaml
Outdated
msg: should consider HandleErrorInContext in new or modified code | ||
- p: \.HandleCrash | ||
pkg: ^k8s\.io/apimachinery/pkg/util/runtime | ||
msg: should consider HandleCrashInContext in new or modified code |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was reluctant to declare HandleError
and HandlePanic
as deprecated. That leads to confusion and unnecessary work in downstream projects which are fine with the existing functions.
Instead, hints that only show up in pull-kubernetes-linter-hints
get defined here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have the same problem whenever we introduce a new API which should be used instead of some old, still supported one. Even if we were to deprecate the old one, we still wouldn't know exactly where in the existing code base we should replace it.
Let me propose something else:
- Instead of
// Deprecated: Use HandleCrashWithContext
forHandleCrash
, we add//logcheck:context // Use HandleCrashWithContext instead in code which supports contextual logging.
. - The logcheck linter gets extended to recognize this special comment for functions and methods.
- If, and only if, some code is expected to be "contextual" (defined in
hack/logcheck.conf
) and uses such an API, then logcheck will report it.
This is conceptually similar to how it warns about some known klog API calls that shouldn't be used anymore, except that the knowledge about those calls is not baked into the linter.
@sttts how does this sound?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is one difference compared to the hints: as soon as we mark an API like this, we have to fix code which currently is supposed to be contextual (kube-controller-manager, kube-scheduler, etc.). I think that's doable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implemented in kubernetes-sigs/logtools#25.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no strong opinion. But yes, sounds reasonable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm doing the conversion for kube-controller now to verify that it's doable. One observation: there are a lot of HandleError
calls which should never be reached (checked type conversions, getting key for object, accessing local cache). I'm not going to convert those because introducing a context in the call chain just for dead code isn't worth the effort and overhead. Instead, I'll add nolint:logcheck
with comment to indicate that the call has been checked and is okay in those places.
8b2a4f6
to
2af8486
Compare
// E.g., you can provide one or more additional handlers for something like shutting down go routines gracefully. | ||
// | ||
// The context is used to determine how to log. Should be used instead of HandleCrash. | ||
func HandleCrashInContext(ctx context.Context, additionalHandlers ...func(context.Context, interface{})) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WithContext is common, not InContext
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True. My thinking was that "handle the error in the context in which it occurred" is a better description of what the function does. Having said that, HandleWithContext
is also okay.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed.
2af8486
to
700db5b
Compare
700db5b
to
8a48038
Compare
8a48038
to
80e7c3d
Compare
/assign @logicalhan |
/cc |
/remove lifecycle-rotten The logcheck update which handles @sttts: can you perhaps take another look? |
This PR may require API review. If so, when the changes are ready, complete the pre-review checklist and request an API review. Status of requested reviews is tracked in the API Review project. |
@@ -35,18 +36,20 @@ var ( | |||
) | |||
|
|||
// PanicHandlers is a list of functions which will be invoked when a panic happens. | |||
var PanicHandlers = []func(interface{}){logPanic} | |||
var PanicHandlers = []func(context.Context, interface{}){logPanic} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How many projects set this manually? My expecation is nearly nobody. Can you confirm?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find four places in controller-runtime, Cilium, Gardener: https://grep.app/search?q=utilruntime.PanicHandlers&filter[path][0]=pkg/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This assumes that they use the utilruntime
import name. Might be a bit more which don't use it like that.
|
||
// HandleCrash simply catches a crash and logs an error. Meant to be called via | ||
// defer. Additional context-specific handlers can be provided, and will be | ||
// called in case of panic. HandleCrash actually crashes, after calling the | ||
// handlers and logging the panic message. | ||
// | ||
// E.g., you can provide one or more additional handlers for something like shutting down go routines gracefully. | ||
// | ||
// TODO(pohly): logcheck:context // HandleCrashWithContext should be used instead of HandleCrash in code which supports contextual logging. | ||
func HandleCrash(additionalHandlers ...func(interface{})) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
deprecate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we express HandleCrash through HandleCrashWithContext?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
deprecate?
https://grep.app/search?q=utilruntime.HandleCrash shows that HandleCrash
is used quite a bit outside of Kubernetes. Marking it as deprecated IMHO is too strong: it implies that we expect all of those downstream consumers of utilruntime
to change their code although it works perfectly fine for them.
It's only code which cares about contextual logging which should use HandleErrorWithContext
.
Can we express HandleCrash through HandleCrashWithContext?
That would interfere with stack unwinding in callbacks, for example logError
: as it stands now, they know that they need to skip HandleCrash or HandleCrashWithContext. If HandleCrash calls HandleCrashWithContext, that adds another level and the callback won't know that.
} | ||
} | ||
|
||
// ErrorHandlers is a list of functions which will be invoked when a nonreturnable | ||
// error occurs. | ||
// TODO(lavalamp): for testability, this and the below HandleError function | ||
// should be packaged up into a testable and reusable object. | ||
var ErrorHandlers = []func(error){ | ||
var ErrorHandlers = []ErrorHandler{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same question as above: what does a github search tell about projects using this variable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is used a bit more, I find 16 places in 8 projects: https://grep.app/search?q=utilruntime.ErrorHandlers
Same caveat about using that import name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could add ErrorHandlersWithContext
. Would be easy enough. No super strong opinion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That might have worked, except for one particular usage:
https://github.com/siderolabs/talos/blob/b752a86183c990848f05d86fa1b41942b4f1610c/internal/app/machined/pkg/controllers/k8s/k8s.go#L15
If we add "our" default error handler to such a new ErrorHandlersWithContext
, then that code above compiles, but then crashes at runtime. I think I prefer breaking the API at compile time - but no strong opinion either.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice one. Sure that will break whatever we do.
// HandlerError is a method to invoke when a non-user facing piece of code cannot | ||
// return an error and needs to indicate it has been ignored. Invoking this method | ||
// is preferable to logging the error - the default behavior is to log but the | ||
// errors may be sent to a remote server for analysis. | ||
// | ||
// TODO(pohly): logcheck:context // HandleErrorWithContext should be used instead of HandleError in code which supports contextual logging. | ||
func HandleError(err error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as above: express HandleError though HandleErrorWithContext. No need to duplicate code.
And deprecate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We discussed that before in #121970 (review). I would still prefer to deprecate, but not strong opinion and definitely no blocker.
828fe2b
to
a71cb8d
Compare
handleError(context.Background(), err, "Unhandled Error") | ||
} | ||
|
||
// HandlerError is a method to invoke when a non-user facing piece of code cannot |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: HandleErrorWithContext
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
/lgtm |
LGTM label has been added. Git tree hash: 5896b3f5e5f149d2c0baf7936958c5d4e976b727
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: pohly, sttts The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
In contrast to the original HandleError and HandleCrash, the new HandleErrorWithContext and HandleCrashWithContext functions properly do contextual logging, so if a problem occurs while e.g. dealing with a certain request and WithValues was used for that request, then the error log entry will also contain information about it. The output changes from unstructured to structured, which might be a breaking change for users who grep for panics. Care was taken to format panics as similar as possible to the original output. For errors, a message string gets added. There was none before, which made it impossible to find all error output coming from HandleError. Keeping HandleError and HandleCrash around without deprecating while changing the signature of callbacks is a compromise between not breaking existing code and not adding too many special cases that need to be supported. There is some code which uses PanicHandlers or ErrorHandlers, but less than code that uses the Handle* calls. In Kubernetes, we want to replace the calls. logcheck warns about them in code which is supposed to be contextual. The steps towards that are: - add TODO remarks as reminder (this commit) - locally remove " TODO(pohly): " to enable the check with `//logcheck:context`, merge fixes for linter warnings - once there are none, remove the TODO to enable the check permanently
a71cb8d
to
5a130d2
Compare
I prefer to fix typos - done, please re-add LGTM. /hold cancel |
/lgtm |
LGTM label has been added. Git tree hash: fdf757d98e43a5bb133fc3bb22c187f5774a9e3e
|
References: kubernetes/kubernetes#121970 kubernetes/apimachinery@126f5ce Signed-off-by: Cem Mergenci <cmergenci@gmail.com>
What type of PR is this?
/kind cleanup
/kind api-change
What this PR does / why we need it:
In contrast to the original HandleError and HandleCrash, the new HandleErrorInContext and HandleCrashInContext functions properly do contextual logging, so if a problem occurs while e.g. dealing with a certain request and WithValues was used for that request, then the error log entry will also contain information about it.
The output changes from unstructured to structured, which might be a breaking change for users who grep for panics. Care was taken to format panics as similar as possible to the original output.
For errors, a message string gets added. There was none before, which made it impossible to find all error output coming from HandleError.
Special notes for your reviewer:
Keeping HandleError and HandleCrash around without deprecating while changing the signature of callbacks is a compromise between not breaking existing code and not adding too many special cases that need to be supported. There is some code which uses PanicHandlers or ErrorHandlers, but less than code that uses the Handle* calls.
In Kubernetes, we want to replace the calls. logcheck v0.8.0 warns about them in code which is supposed to be ontextual. The steps towards that are:
Does this PR introduce a user-facing change?
/wg structured-logging
/sig instrumentation