Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

apimachinery runtime: support contextual logging #121970

Merged
merged 1 commit into from
Apr 18, 2024

Conversation

pohly
Copy link
Contributor

@pohly pohly commented Nov 20, 2023

What type of PR is this?

/kind cleanup
/kind api-change

What this PR does / why we need it:

In contrast to the original HandleError and HandleCrash, the new HandleErrorInContext and HandleCrashInContext functions properly do contextual logging, so if a problem occurs while e.g. dealing with a certain request and WithValues was used for that request, then the error log entry will also contain information about it.

The output changes from unstructured to structured, which might be a breaking change for users who grep for panics. Care was taken to format panics as similar as possible to the original output.

For errors, a message string gets added. There was none before, which made it impossible to find all error output coming from HandleError.

Special notes for your reviewer:

Keeping HandleError and HandleCrash around without deprecating while changing the signature of callbacks is a compromise between not breaking existing code and not adding too many special cases that need to be supported. There is some code which uses PanicHandlers or ErrorHandlers, but less than code that uses the Handle* calls.

In Kubernetes, we want to replace the calls. logcheck v0.8.0 warns about them in code which is supposed to be ontextual. The steps towards that are:

Does this PR introduce a user-facing change?

k8s.io/apimachinery/pkg/util/runtime: new calls support handling panics and errors in the context where they occur. `PanicHandlers` and `ErrorHandlers` now must accept a context parameter for that. Log output is structured instead of unstructured.

/wg structured-logging
/sig instrumentation

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API wg/structured-logging Categorizes an issue or PR as relevant to WG Structured Logging. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Nov 20, 2023
@k8s-ci-robot k8s-ci-robot added area/apiserver sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. labels Nov 20, 2023
// It expects to be called as <caller> -> HandleError[InContext] -> logError.
func logError(ctx context.Context, err error) {
logger := klog.FromContext(ctx).WithCallDepth(2)
logger.Error(err, "Unhandled error")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm torn between calling this "Unhandled error" and "Handled error". I chose "Unhandled error" because logging an error is not the same as handling it in code - the whole point of logging at error level is that something unexpected occurred that needs attention.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if !strings.Contains(capturedOutput, `timeout or abort while handling: method=GET URI="/" audit-ID=""`) {
has a test for this output.

Before the log entry was unstructured:

E1122 08:24:37.387820    9070 wrap.go:54] timeout or abort while handling: method=GET URI="/" audit-ID=""

Now it is structured:

E1121 21:17:48.088946   61801 wrap.go:54] "Unhandled error" err="timeout or abort while handling: method=GET URI=\"/\" audit-ID=\"\""

Did we intentionally or accidentally make this output part of the apiserver API? In other words, can this output be changed?

Would it make sense to change this further? There are plenty of cases where HandleError is called for an error that gets created via fmt.Errorf - this is unstructured printf-style logging! A HandleErrorInContext(ctx, err, <key/value pairs>) would lead to better output.

/cc @tkashem

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like UnhandledError as this is what these usually are. The code variant of 🤷‍♂️, i.e. we cannot sensibly handle that error in the context, but want to log it for visibility that something is wrong.

Copy link
Contributor Author

@pohly pohly Nov 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've pushed an update. When passing message and key/value pairs as separate parameters, the "Unhandled error" has to be added via WithName because the message string is controlled by the caller. The output in the timeout test now is:

"Timeout or abort while handling" logger="UnhandledError" method="GET" URI="/" auditID=""

msg: should consider HandleErrorInContext in new or modified code
- p: \.HandleCrash
pkg: ^k8s\.io/apimachinery/pkg/util/runtime
msg: should consider HandleCrashInContext in new or modified code
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was reluctant to declare HandleError and HandlePanic as deprecated. That leads to confusion and unnecessary work in downstream projects which are fine with the existing functions.

Instead, hints that only show up in pull-kubernetes-linter-hints get defined here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have the same problem whenever we introduce a new API which should be used instead of some old, still supported one. Even if we were to deprecate the old one, we still wouldn't know exactly where in the existing code base we should replace it.

Let me propose something else:

  • Instead of // Deprecated: Use HandleCrashWithContext for HandleCrash, we add //logcheck:context // Use HandleCrashWithContext instead in code which supports contextual logging..
  • The logcheck linter gets extended to recognize this special comment for functions and methods.
  • If, and only if, some code is expected to be "contextual" (defined in hack/logcheck.conf) and uses such an API, then logcheck will report it.

This is conceptually similar to how it warns about some known klog API calls that shouldn't be used anymore, except that the knowledge about those calls is not baked into the linter.

@sttts how does this sound?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is one difference compared to the hints: as soon as we mark an API like this, we have to fix code which currently is supposed to be contextual (kube-controller-manager, kube-scheduler, etc.). I think that's doable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implemented in kubernetes-sigs/logtools#25.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no strong opinion. But yes, sounds reasonable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm doing the conversion for kube-controller now to verify that it's doable. One observation: there are a lot of HandleError calls which should never be reached (checked type conversions, getting key for object, accessing local cache). I'm not going to convert those because introducing a context in the call chain just for dead code isn't worth the effort and overhead. Instead, I'll add nolint:logcheck with comment to indicate that the call has been checked and is okay in those places.

@pohly pohly force-pushed the log-apimachinery-runtime branch from 8b2a4f6 to 2af8486 Compare November 21, 2023 20:43
@k8s-ci-robot k8s-ci-robot requested a review from tkashem November 22, 2023 07:41
@pohly pohly changed the title apimachinery runtime: support contextual logging RFC: apimachinery runtime: support contextual logging Nov 22, 2023
// E.g., you can provide one or more additional handlers for something like shutting down go routines gracefully.
//
// The context is used to determine how to log. Should be used instead of HandleCrash.
func HandleCrashInContext(ctx context.Context, additionalHandlers ...func(context.Context, interface{})) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WithContext is common, not InContext

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True. My thinking was that "handle the error in the context in which it occurred" is a better description of what the function does. Having said that, HandleWithContext is also okay.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed.

@pohly pohly force-pushed the log-apimachinery-runtime branch from 2af8486 to 700db5b Compare November 22, 2023 13:50
@pohly pohly force-pushed the log-apimachinery-runtime branch from 700db5b to 8a48038 Compare November 22, 2023 15:32
@pohly pohly changed the title RFC: apimachinery runtime: support contextual logging WIP: apimachinery runtime: support contextual logging Nov 23, 2023
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 23, 2023
@pohly pohly force-pushed the log-apimachinery-runtime branch from 8a48038 to 80e7c3d Compare November 27, 2023 16:37
@dgrisonnet
Copy link
Member

/assign @logicalhan

@dgrisonnet
Copy link
Member

/cc

@pohly
Copy link
Contributor Author

pohly commented Mar 8, 2024

/remove lifecycle-rotten

The logcheck update which handles //logcheck:context got merged. We can proceed now with this PR as explained in the commit message (merge the PR, update existing code, the removed the TODO with currently disables enforcement).

@sttts: can you perhaps take another look?

@k8s-triage-robot
Copy link

This PR may require API review.

If so, when the changes are ready, complete the pre-review checklist and request an API review.

Status of requested reviews is tracked in the API Review project.

@@ -35,18 +36,20 @@ var (
)

// PanicHandlers is a list of functions which will be invoked when a panic happens.
var PanicHandlers = []func(interface{}){logPanic}
var PanicHandlers = []func(context.Context, interface{}){logPanic}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How many projects set this manually? My expecation is nearly nobody. Can you confirm?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find four places in controller-runtime, Cilium, Gardener: https://grep.app/search?q=utilruntime.PanicHandlers&filter[path][0]=pkg/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assumes that they use the utilruntime import name. Might be a bit more which don't use it like that.


// HandleCrash simply catches a crash and logs an error. Meant to be called via
// defer. Additional context-specific handlers can be provided, and will be
// called in case of panic. HandleCrash actually crashes, after calling the
// handlers and logging the panic message.
//
// E.g., you can provide one or more additional handlers for something like shutting down go routines gracefully.
//
// TODO(pohly): logcheck:context // HandleCrashWithContext should be used instead of HandleCrash in code which supports contextual logging.
func HandleCrash(additionalHandlers ...func(interface{})) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

deprecate?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we express HandleCrash through HandleCrashWithContext?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

deprecate?

https://grep.app/search?q=utilruntime.HandleCrash shows that HandleCrash is used quite a bit outside of Kubernetes. Marking it as deprecated IMHO is too strong: it implies that we expect all of those downstream consumers of utilruntime to change their code although it works perfectly fine for them.

It's only code which cares about contextual logging which should use HandleErrorWithContext.

Can we express HandleCrash through HandleCrashWithContext?

That would interfere with stack unwinding in callbacks, for example logError: as it stands now, they know that they need to skip HandleCrash or HandleCrashWithContext. If HandleCrash calls HandleCrashWithContext, that adds another level and the callback won't know that.

}
}

// ErrorHandlers is a list of functions which will be invoked when a nonreturnable
// error occurs.
// TODO(lavalamp): for testability, this and the below HandleError function
// should be packaged up into a testable and reusable object.
var ErrorHandlers = []func(error){
var ErrorHandlers = []ErrorHandler{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same question as above: what does a github search tell about projects using this variable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is used a bit more, I find 16 places in 8 projects: https://grep.app/search?q=utilruntime.ErrorHandlers

Same caveat about using that import name.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could add ErrorHandlersWithContext. Would be easy enough. No super strong opinion.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That might have worked, except for one particular usage:
https://github.com/siderolabs/talos/blob/b752a86183c990848f05d86fa1b41942b4f1610c/internal/app/machined/pkg/controllers/k8s/k8s.go#L15

If we add "our" default error handler to such a new ErrorHandlersWithContext, then that code above compiles, but then crashes at runtime. I think I prefer breaking the API at compile time - but no strong opinion either.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice one. Sure that will break whatever we do.

// HandlerError is a method to invoke when a non-user facing piece of code cannot
// return an error and needs to indicate it has been ignored. Invoking this method
// is preferable to logging the error - the default behavior is to log but the
// errors may be sent to a remote server for analysis.
//
// TODO(pohly): logcheck:context // HandleErrorWithContext should be used instead of HandleError in code which supports contextual logging.
func HandleError(err error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as above: express HandleError though HandleErrorWithContext. No need to duplicate code.

And deprecate.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We discussed that before in #121970 (review). I would still prefer to deprecate, but not strong opinion and definitely no blocker.

@pohly pohly force-pushed the log-apimachinery-runtime branch from 828fe2b to a71cb8d Compare March 26, 2024 14:07
handleError(context.Background(), err, "Unhandled Error")
}

// HandlerError is a method to invoke when a non-user facing piece of code cannot
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: HandleErrorWithContext

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@sttts
Copy link
Contributor

sttts commented Mar 26, 2024

/lgtm
/approve
/hold
if you want to fix the typo in the go doc.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 26, 2024
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 26, 2024
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 5896b3f5e5f149d2c0baf7936958c5d4e976b727

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: pohly, sttts

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 26, 2024
In contrast to the original HandleError and HandleCrash, the new
HandleErrorWithContext and HandleCrashWithContext functions properly do contextual
logging, so if a problem occurs while e.g. dealing with a certain request and
WithValues was used for that request, then the error log entry will also
contain information about it.

The output changes from unstructured to structured, which might be a breaking
change for users who grep for panics. Care was taken to format panics
as similar as possible to the original output.

For errors, a message string gets added. There was none before, which made it
impossible to find all error output coming from HandleError.

Keeping HandleError and HandleCrash around without deprecating while changing
the signature of callbacks is a compromise between not breaking existing code
and not adding too many special cases that need to be supported. There is some
code which uses PanicHandlers or ErrorHandlers, but less than code that uses
the Handle* calls.

In Kubernetes, we want to replace the calls. logcheck warns about them in code
which is supposed to be contextual. The steps towards that are:
- add TODO remarks as reminder (this commit)
- locally remove " TODO(pohly): " to enable the check with `//logcheck:context`,
  merge fixes for linter warnings
- once there are none, remove the TODO to enable the check permanently
@pohly pohly force-pushed the log-apimachinery-runtime branch from a71cb8d to 5a130d2 Compare March 26, 2024 16:28
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 26, 2024
@pohly
Copy link
Contributor Author

pohly commented Mar 26, 2024

I prefer to fix typos - done, please re-add LGTM.

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 26, 2024
@sttts
Copy link
Contributor

sttts commented Mar 27, 2024

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 27, 2024
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: fdf757d98e43a5bb133fc3bb22c187f5774a9e3e

@k8s-ci-robot k8s-ci-robot merged commit d35ba36 into kubernetes:master Apr 18, 2024
14 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.31 milestone Apr 18, 2024
mergenci added a commit to mergenci/upjet that referenced this pull request Jan 8, 2025
References:
kubernetes/kubernetes#121970
kubernetes/apimachinery@126f5ce

Signed-off-by: Cem Mergenci <cmergenci@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/apiserver area/dependency Issues or PRs related to dependency changes cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. lgtm "Looks good to me", indicates that a PR is ready to be merged. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on. wg/structured-logging Categorizes an issue or PR as relevant to WG Structured Logging.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants