feat: update otel dependency and add tail sampling processing #18134
Conversation
Force-pushed fcad218 to 472d9c8
/build
Cool. Haven't tried it out, but +1
	OpenTelemetryInsecure              bool          `yaml:"opentelemetryinsecure,omitempty"`
	OpenTelemetryStackTraces           bool          `yaml:"opentelemetrystacktraces,omitempty"`
	OpenTelemetrySampleRatio           string        `yaml:"opentelemetrysampleratio,omitempty"`
	OpenTelemetryTailSamplingThreshold time.Duration `yaml:"opentelemetrytailsamplingthreshold,omitempty"`
So when are we starting format-3.0.go? 🤮
I regret having multiple values in the config. What I want is a json/yaml file with the options. The only problem there is that it makes the modelling a lot weaker. Although potentially this should be driven by the controller charm.
@@ -450,6 +454,10 @@ const (
	// By default we only want to trace 10% of the requests.
	DefaultOpenTelemetrySampleRatio = 0.1

	// DefaultOpenTelemetryTailSamplingThreshold is the default value for the
	// tail sampling threshold for open telemetry.
	DefaultOpenTelemetryTailSamplingThreshold = 1 * time.Microsecond
Does this pretty much mean sample everything? That's fine, just trying to understand the defaults (as we might need to update any "running juju in production" docs).
100% we should re-address this before production. I've picked the defaults I wanted now, but they are not necessarily for production.
@@ -129,10 +138,19 @@ func NewClient(ctx context.Context, namespace coretrace.TaggedTracerNamespace, e
		return nil, nil, nil, errors.Trace(err)
	}

	bsp := sdktrace.NewBatchSpanProcessor(exporter,
		sdktrace.WithMaxExportBatchSize(512),
Any reasoning behind these magic numbers?
Looks like these are both exported and defaults... https://github.com/open-telemetry/opentelemetry-go/blob/main/sdk/trace/batch_span_processor.go
I was playing around with these, because I want to make these configurable.
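For reference, a minimal sketch of what making those knobs explicit could look like, using the exported option functions and defaults from opentelemetry-go's sdk/trace package; the chosen values are illustrative, not what this PR settles on:

```go
// Sketch only: the batch span processor knobs surfaced as options instead of
// magic numbers. The option functions and exported defaults come from
// go.opentelemetry.io/otel/sdk/trace; the values here are illustrative.
package trace

import (
	"time"

	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func newBatchProcessor(exporter sdktrace.SpanExporter) sdktrace.SpanProcessor {
	return sdktrace.NewBatchSpanProcessor(exporter,
		sdktrace.WithMaxExportBatchSize(sdktrace.DefaultMaxExportBatchSize), // 512
		sdktrace.WithMaxQueueSize(sdktrace.DefaultMaxQueueSize),             // 2048
		sdktrace.WithBatchTimeout(5*time.Second),                            // matches DefaultScheduleDelay
	)
}
```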
	threshold time.Duration
}

// OnStart is called when a span is started. It is called synchronously and
Because the batch span processor has BlockOnQueueFull: false?
Oh looks like OnStart with a batch span processor is a no-op.
Yeah, but I can't guarantee that in the future, so just pass through, even if we know it's currently a no-op.
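To make that concrete, a minimal sketch of a wrapper that just delegates the non-sampling hooks to the batch processor it wraps; type and field names are illustrative, not the exact code in this PR:

```go
// Sketch only: a tail sampling processor that wraps the existing batch span
// processor. OnStart, ForceFlush and Shutdown simply delegate, so the wrapper
// keeps working even if the batch processor's currently no-op OnStart changes.
package trace

import (
	"context"
	"time"

	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

type tailSamplingProcessor struct {
	next      sdktrace.SpanProcessor
	threshold time.Duration
}

func newTailSamplingProcessor(next sdktrace.SpanProcessor, threshold time.Duration) *tailSamplingProcessor {
	return &tailSamplingProcessor{next: next, threshold: threshold}
}

// OnStart passes through even though the batch span processor's OnStart is
// currently a no-op.
func (p *tailSamplingProcessor) OnStart(ctx context.Context, s sdktrace.ReadWriteSpan) {
	p.next.OnStart(ctx, s)
}

func (p *tailSamplingProcessor) ForceFlush(ctx context.Context) error { return p.next.ForceFlush(ctx) }
func (p *tailSamplingProcessor) Shutdown(ctx context.Context) error   { return p.next.Shutdown(ctx) }
```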
		return
	}

	// If the span duration is less than the threshold, we want to drop it.
OK, so setting a threshold of 0s is how you sample everything. I guess we might at some point want more configuration options to control the batch span processor to make it bigger or block when the queue is full.
100%, but we need to start somewhere.
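For illustration, a sketch of the OnEnd decision on the hypothetical wrapper above: with a 0s threshold every span is forwarded, otherwise only errored or slow spans reach the batch processor. Again, names are illustrative, not the PR's exact code.

```go
// Sketch only: continues the tailSamplingProcessor sketch above and also needs
// the "go.opentelemetry.io/otel/codes" import for the error status check.
func (p *tailSamplingProcessor) OnEnd(s sdktrace.ReadOnlySpan) {
	interesting := s.Status().Code == codes.Error ||
		s.EndTime().Sub(s.StartTime()) >= p.threshold
	if !interesting {
		// Not interesting: drop the span before it reaches the exporter queue.
		return
	}
	p.next.OnEnd(s)
}
```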
Force-pushed 8cea830 to 9b30a73
Force-pushed 9b30a73 to d9f8e0b
Force-pushed d9f8e0b to 5dc75f5
/merge
We're falling behind in terms of the latest library release. This moves from 1.21 to 1.30.
This implements tail sampling processing without bringing in a LOT of dependencies. We essentially just want to sample spans that are interesting to us.
Force-pushed 5dc75f5 to ba610d3
/merge
Implements tail sampling processing on spans. This wraps the batch processor that we already have for spans. By wrapping the batch processor, we can prevent spans from being sent to the processor if they're not "interesting". Interesting is determined by whether the span has an error attached or exceeds a specified duration threshold.

With the combination of the sampling ratio and the tail sampling processing, it's possible to identify issues with controllers. Setting the ratio to near 1.0 floods Juju and the Tempo server with too many spans, causing Juju to become slow and unresponsive. Adding tail sampling into the mix allows the recording of spans that could be interesting: slow requests, database queries taking too long, too many errors, etc.
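For readers following the description, a minimal sketch of how such a wrapper could sit next to the head sampling ratio when building the tracer provider; the constructor name and the values are illustrative, not the exact code in this PR:

```go
// Sketch only: head sampling (ratio) plus a tail sampling wrapper around the
// batch span processor. newTailSamplingProcessor is the hypothetical
// constructor sketched earlier in this conversation.
package trace

import (
	"context"
	"time"

	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func newTracerProvider(ctx context.Context) (*sdktrace.TracerProvider, error) {
	exporter, err := otlptracegrpc.New(ctx) // endpoint/TLS options omitted
	if err != nil {
		return nil, err
	}
	bsp := sdktrace.NewBatchSpanProcessor(exporter,
		sdktrace.WithMaxExportBatchSize(512),
	)
	return sdktrace.NewTracerProvider(
		// Head sampling: keep roughly 10% of traces at the root.
		sdktrace.WithSampler(sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.1))),
		// Tail sampling: only forward spans that errored or exceeded the threshold.
		sdktrace.WithSpanProcessor(newTailSamplingProcessor(bsp, time.Microsecond)),
	), nil
}
```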
Checklist
QA steps
Setting up Tempo
It assumes that Docker (docker compose) is correctly installed:
$ git clone https://github.com/grafana/tempo.git
$ cd tempo/example/docker-compose/local
$ docker compose up -d
Juju
Replace <IP ADDRESS> with your host machine (not localhost) address (lxc info | yq ".environment | .addresses" is a good place to start looking).
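One way to point a fresh controller at that collector for QA (a sketch; it assumes the controller config keys follow the existing open-telemetry-* naming and that the local Tempo stack exposes OTLP gRPC on port 4317):

$ juju bootstrap lxd otel-test \
    --config open-telemetry-enabled=true \
    --config open-telemetry-endpoint=<IP ADDRESS>:4317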
Tempo Explore
Open the following Tempo Explore Dashboard