Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tracing and log correlation without otel #43718

Merged
merged 12 commits into from
Jan 14, 2025

Conversation

lmolkova
Copy link
Member

@lmolkova lmolkova commented Jan 7, 2025

The next PR in the series of client core tracing+logging improvements.

In this episode, we:

  • Implement fallback tracing which:
    • generates trace context - it will be used to correlate logs between each other
    • records spans as logs (if users enable corresponding logger)
  • Introduce InstrumentationContext - strongly typed container for trace context that's used to correlate logs with or without otel
  • Merge HttpInstrumentationPolicy with HttpLoggingPolicy. This is not strictly necessary, but allows to
    • merge similar configuration like URL sanitization and allowed headers (not yet done in this PR)
    • optimize code (e.g. do sanitization once)
    • keep things consistent (e.g. span attributes and logging keys)
    • it will be even more useful when we add metrics - we'd reuse the same duration and other things that we log today
  • Improves logging in HttpRetryPolicy and HttpRedirectPolicy
    • makes it similar to HTTP requests
    • consistent attributes following OTel semantic conventions (where possible)
    • enables correlation
  • Moves logging into instrumentation package

In next episodes:

  • Relationship between HttpLogOptions and InstrumetnationOptions?
  • HTTP spans: time-to-last-byte
  • Metrics

@lmolkova lmolkova changed the title Log based tracing without otel Tracing and log correlation without otel Jan 7, 2025
@lmolkova
Copy link
Member Author

lmolkova commented Jan 7, 2025

Fallback tracing benchmarks

FallbackTracingBenchmarks.fallbackTracerDisabled         avgt   15     0.499 ±  0.005  ns/op
FallbackTracingBenchmarks.fallbackTracerEnabledNoLogs    avgt   15    47.193 ±  2.478  ns/op
FallbackTracingBenchmarks.fallbackTracerEnabledWithLogs  avgt   15  1349.738 ± 46.289  ns/op

So with fallback tracing (no logs) enabled by default, we're going to create ~2 spans for client SDK code in common case (1 public API-surface and 1 HTTP), adding ~90ns to each client call along with some allocations which seems reasonable.

This time is essentially spent on generating random Ids and we'd probably get a comparable impact from x-ms-client-id thing.

@azure-sdk
Copy link
Collaborator

azure-sdk commented Jan 7, 2025

API change check

API changes are not detected in this pull request.

@lmolkova
Copy link
Member Author

lmolkova commented Jan 7, 2025

Champion scenario:

  var httpPipeline = new HttpPipelineBuilder()
          .policies(new HttpRetryPolicy(), new HttpRedirectPolicy(), new HttpInstrumentationPolicy(null, null))
          .build();

  // this would effectively be done by codegen in client libs (when they create a span)
  // but users can also provide context explicitly and we'll use it.
  // we also support creating this context from OTel spans
  RequestOptions requestOptions = new RequestOptions().setInstrumentationContext(createRandomContext());

  HttpRequest request = new HttpRequest(HttpMethod.GET, "http://www.microsoft.com")
          .setRequestOptions(requestOptions);

  httpPipeline.send(request);

logs with simulated error (formatted, verbose)

2025-01-06 19:04:44.865 [main] [DEBUG] io.clientcore.core.http.pipeline.HttpInstrumentationPolicy - 
{
  "trace.id": "a4d931fb2eb04346b9589424370887ad",
  "http.request.method": "GET",
  "http.request.body.size": 0,
  "span.id": "f913333fbc212845",
  "http.request.resend_count": 0,
  "url.full": "http://www.microsoft.com",
  "event.name": "http.request"
}

2025-01-06 19:04:44.866 [main] [DEBUG] core.tracing - 
{
  "library.name": "core",
  "library.version": "1.0.0-beta.2",
  "span.name": "GET",
  "span.duration": 5.2217,
  "span.id": "f913333fbc212845",
  "server.address": "www.microsoft.com",
  "trace.id": "a4d931fb2eb04346b9589424370887ad",
  "error.type": "java.net.UnknownHostException",
  "http.request.method": "GET",
  "span.kind": "CLIENT",
  "server.port": 80,
  "span.parent.id": "72d954337db64062",
  "url.full": "http://www.microsoft.com",
  "event.name": "span.ended"
}

2025-01-06 19:04:44.866 [main] [WARN] io.clientcore.core.http.pipeline.HttpInstrumentationPolicy - 
{
  "trace.id": "a4d931fb2eb04346b9589424370887ad",
  "http.request.method": "GET",
  "http.request.duration": 7.4057,
  "exception.type": "java.net.UnknownHostException",
  "exception.message": null,
  "http.request.body.size": 0,
  "span.id": "f913333fbc212845",
  "http.request.resend_count": 0,
  "url.full": "http://www.microsoft.com",
  "exception.stacktrace": "java.net.UnknownHostException\r\n\tat com.test/com.test.App.lambda$main$0(App.java:32)\r\n\tat io.clientcore.core@1.0.0-beta.2/io.clientcore.core.http.pipeline.HttpPipelineNextPolicy...",
  "event.name": "http.response"
}

2025-01-06 19:04:44.868 [main] [DEBUG] io.clientcore.core.http.pipeline.HttpRetryPolicy - 
{
  "retry.was_last_attempt": false,
  "trace.id": "a4d931fb2eb04346b9589424370887ad",
  "retry.delay": 824,
  "exception.type": "java.io.UncheckedIOException",
  "exception.message": "java.net.UnknownHostException",
  "retry.max_attempt_count": 3,
  "span.id": "72d954337db64062",
  "http.request.resend_count": 0,
  "exception.stacktrace": "java.io.UncheckedIOException: java.net.UnknownHostException\r\n\tat com.test/com.test.App.lambda$main$0(App.java:32)\r\n\tat ....",
  "event.name": "http.retry"
}

2025-01-06 19:04:45.694 [main] [DEBUG] io.clientcore.core.http.pipeline.HttpInstrumentationPolicy - 
{
  "trace.id": "a4d931fb2eb04346b9589424370887ad",
  "http.request.method": "GET",
  "http.request.body.size": 0,
  "span.id": "7f622788d28f645f",
  "http.request.resend_count": 1,
  "url.full": "http://www.microsoft.com",
  "event.name": "http.request"
}

2025-01-06 19:04:45.803 [main] [INFO] io.clientcore.core.http.pipeline.HttpInstrumentationPolicy - {
  "http.request.time_to_response": 108.7054,
  "http.response.body.size": 201253,
  "trace.id": "a4d931fb2eb04346b9589424370887ad",
  "http.request.method": "GET",
  "http.request.duration": 108.7196,
  "http.request.body.size": 0,
  "span.id": "7f622788d28f645f",
  "http.request.resend_count": 1,
  "url.full": "http://www.microsoft.com",
  "http.response.status_code": 200,
  "event.name": "http.response"
}

2025-01-06 19:04:45.803 [main] [DEBUG] core.tracing - {
  "library.name": "core",
  "library.version": "1.0.0-beta.2",
  "server.address": "www.microsoft.com",
  "span.name": "GET",
  "trace.id": "a4d931fb2eb04346b9589424370887ad",
  "span.duration": 108.9267,
  "http.request.method": "GET",
  "span.kind": "CLIENT",
  "server.port": 80,
  "span.parent.id": "f913333fbc212845",
  "span.id": "7f622788d28f645f",
  "http.request.resend_count": 1,
  "url.full": "http://www.microsoft.com",
  "http.response.status_code": 200,
  "event.name": "span.ended"
}

or with simulated error and redirect (formatted, info)

2025-01-06 19:21:08.735 [main] [WARN] io.clientcore.core.http.pipeline.HttpInstrumentationPolicy - 
{
  "trace.id": "dfc10b5460554c87ba9dc6c5a1fdda1c",
  "http.request.method": "GET",
  "http.request.duration": 6.5101,
  "exception.type": "java.net.UnknownHostException",
  "exception.message": null,
  "http.request.body.size": 0,
  "span.id": "58fea390ed94a20b",
  "http.request.resend_count": 0,
  "url.full": "http://microsoft.com",
  "event.name": "http.response"
}

2025-01-06 19:21:09.648 [main] [INFO] io.clientcore.core.http.pipeline.HttpInstrumentationPolicy - 
{
  "http.request.time_to_response": 121.9558,
  "http.response.body.size": 0,
  "trace.id": "dfc10b5460554c87ba9dc6c5a1fdda1c",
  "http.request.method": "GET",
  "http.request.duration": 121.9866,
  "http.request.body.size": 0,
  "span.id": "0ef927d10d392086",
  "http.request.resend_count": 1,
  "url.full": "http://microsoft.com",
  "http.response.status_code": 301,
  "event.name": "http.response"
}

2025-01-06 19:21:09.885 [main] [INFO] io.clientcore.core.http.pipeline.HttpInstrumentationPolicy - 
{
  "http.request.time_to_response": 236.4778,
  "http.response.body.size": 201253,
  "trace.id": "dfc10b5460554c87ba9dc6c5a1fdda1c",
  "http.request.method": "GET",
  "http.request.duration": 236.4934,
  "http.request.body.size": 0,
  "span.id": "3016d2e453fc9874",
  "http.request.resend_count": 1,
  "url.full": "https://www.microsoft.com/",
  "http.response.status_code": 200,
  "event.name": "http.response"
}

So new RequestOptions().setInstrumentationContext(createRandomContext()); brings correlation to all HTTP logs (including retries and redirects) in this pipeline.
It works with and without otel.

We can easily extend it to support implicit context propagation (which already works with otel).

@lmolkova lmolkova force-pushed the log-based-tracing-without-otel branch from bcc6c9c to 4db1075 Compare January 11, 2025 02:37
@lmolkova lmolkova marked this pull request as ready for review January 11, 2025 02:40
@lmolkova lmolkova force-pushed the log-based-tracing-without-otel branch from 4db1075 to 3464c9b Compare January 13, 2025 23:31
@lmolkova lmolkova merged commit 70e1ebe into Azure:main Jan 14, 2025
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants