
feat: update otel dependency and add tail sampling processing #18134

Merged

4 commits merged into juju:main from update-opentelemetry-dep on Oct 17, 2024

Conversation

@SimonRichardson (Member) commented on Sep 23, 2024

Implements tail sampling processing on spans. This wraps the batch processor that we already have for spans; by wrapping it, we can prevent spans from being sent to the batch processor if they're not "interesting". A span is considered interesting if it recorded an error or if its duration exceeds a specified threshold.

Combining the sampling ratio with tail sampling processing makes it possible to identify issues with controllers. Setting the ratio to near 1.0 floods juju and the tempo server with too many spans, causing juju to become slow and unresponsive. Adding tail sampling into the mix allows recording only the spans that could be interesting: slow requests, database queries taking too long, too many errors, etc.
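
For context, a minimal sketch of the shape described above, written against the OpenTelemetry Go SDK. The names (tailSamplingProcessor, newTailSamplingProcessor) and the package are illustrative only, not necessarily what this PR uses:

package trace

import (
	"time"

	"go.opentelemetry.io/otel/codes"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// tailSamplingProcessor wraps the batch span processor and only forwards
// "interesting" spans to it: spans that recorded an error, or spans whose
// duration is over the configured threshold.
type tailSamplingProcessor struct {
	// Embedding the wrapped processor gives pass-through OnStart, Shutdown
	// and ForceFlush; OnEnd below does the filtering.
	sdktrace.SpanProcessor

	threshold time.Duration
}

// newTailSamplingProcessor wraps a batch span processor for the given
// exporter with a tail sampling threshold.
func newTailSamplingProcessor(exporter sdktrace.SpanExporter, threshold time.Duration) sdktrace.SpanProcessor {
	return &tailSamplingProcessor{
		SpanProcessor: sdktrace.NewBatchSpanProcessor(exporter),
		threshold:     threshold,
	}
}

// OnEnd hands a completed span to the batch processor only if it looks
// interesting; everything else is dropped here and never queued.
func (p *tailSamplingProcessor) OnEnd(s sdktrace.ReadOnlySpan) {
	// Always keep spans that recorded an error.
	if s.Status().Code == codes.Error {
		p.SpanProcessor.OnEnd(s)
		return
	}
	// Keep slow spans; drop anything that finished under the threshold.
	if s.EndTime().Sub(s.StartTime()) >= p.threshold {
		p.SpanProcessor.OnEnd(s)
	}
}

The batch processor's queueing and export behaviour are untouched; the wrapper only decides which spans ever reach its queue.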

Checklist

  • Code style: imports ordered, good names, simple structure, etc
  • Comments saying why design decisions were made
  • Go unit tests, with comments saying what you're testing

QA steps

Setting up Tempo

This assumes that Docker (with docker compose) is correctly installed:

$ git clone https://github.com/grafana/tempo.git
$ cd tempo/example/docker-compose/local
$ docker compose up -d

Juju

Replace <IP ADDRESS> with your host machine's address (not localhost); lxc info | yq ".environment | .addresses" is a good place to start looking.

$ juju bootstrap lxd test --build-agent --config="open-telemetry-enabled=true" --config="open-telemetry-insecure=true" --config="open-telemetry-endpoint=<IP ADDRESS>:4317" --config="open-telemetry-sample-ratio=1.0" --config "open-telemetry-tail-sampling-threshold=20ms"
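
As a side note (not one of the QA steps): the threshold value in the bootstrap command above is a Go duration string. Assuming the config key is parsed with the standard library's time.ParseDuration, a quick check of the format looks like this:

package main

import (
	"fmt"
	"time"
)

func main() {
	// "20ms" is the value passed via open-telemetry-tail-sampling-threshold.
	threshold, err := time.ParseDuration("20ms")
	if err != nil {
		panic(err)
	}
	fmt.Println(threshold) // prints: 20ms
}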

Tempo Explore

Open the Tempo Explore dashboard.

@SimonRichardson changed the title from "Update opentelemetry dep" to "feat: update otel dependency and add tail sampling processing" on Sep 23, 2024
@hpidcock added the 4.0 label on Sep 23, 2024
@SimonRichardson force-pushed the update-opentelemetry-dep branch 4 times, most recently from fcad218 to 472d9c8 on September 25, 2024 at 11:39
@SimonRichardson self-assigned this on Sep 25, 2024
@SimonRichardson marked this pull request as ready for review on September 25, 2024 at 11:50
@SimonRichardson (Member, Author):

/build

@hpidcock (Member) left a comment:

Cool. Haven't tried it out, but +1

OpenTelemetryInsecure bool `yaml:"opentelemetryinsecure,omitempty"`
OpenTelemetryStackTraces bool `yaml:"opentelemetrystacktraces,omitempty"`
OpenTelemetrySampleRatio string `yaml:"opentelemetrysampleratio,omitempty"`
OpenTelemetryTailSamplingThreshold time.Duration `yaml:"opentelemetrytailsamplingthreshold,omitempty"`
Reviewer (Member):

So when are we starting format-3.0.go? 🤮

@SimonRichardson (Member, Author):

I regret having multiple values in the config. What I want is a json/yaml file with the options. The only problem there is that it makes the modelling a lot weaker. Although potentially this should be driven by the controller charm.

@@ -450,6 +454,10 @@ const (
// By default we only want to trace 10% of the requests.
DefaultOpenTelemetrySampleRatio = 0.1

// DefaultOpenTelemetryTailSamplingThreshold is the default value for the
// tail sampling threshold for open telemetry.
DefaultOpenTelemetryTailSamplingThreshold = 1 * time.Microsecond
Reviewer (Member):

Does this pretty much mean sample everything? That's fine, just trying to understand the defaults (as we might need to update any "running juju in production" docs).

@SimonRichardson (Member, Author):

100%, we should re-address this before production. I've picked the defaults I want for now, but they are not necessarily suitable for production.

@@ -129,10 +138,19 @@ func NewClient(ctx context.Context, namespace coretrace.TaggedTracerNamespace, e
return nil, nil, nil, errors.Trace(err)
}

bsp := sdktrace.NewBatchSpanProcessor(exporter,
sdktrace.WithMaxExportBatchSize(512),
Reviewer (Member):

Any reasoning behind these magic numbers?

@SimonRichardson (Member, Author):

I was playing around with these, because I want to make these configurable.
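
For what it's worth, a sketch of what making these configurable could look like, using options the SDK already exposes; the BatchConfig type and its field names are hypothetical, not from this PR:

package trace

import (
	"time"

	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// BatchConfig gathers the batch span processor knobs that are currently
// hard-coded as literals in NewClient.
type BatchConfig struct {
	MaxExportBatchSize int
	MaxQueueSize       int
	BatchTimeout       time.Duration
}

// newBatchProcessor builds the batch span processor from explicit config
// rather than magic numbers.
func newBatchProcessor(exporter sdktrace.SpanExporter, cfg BatchConfig) sdktrace.SpanProcessor {
	return sdktrace.NewBatchSpanProcessor(exporter,
		sdktrace.WithMaxExportBatchSize(cfg.MaxExportBatchSize),
		sdktrace.WithMaxQueueSize(cfg.MaxQueueSize),
		sdktrace.WithBatchTimeout(cfg.BatchTimeout),
	)
}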

threshold time.Duration
}

// OnStart is called when a span is started. It is called synchronously and
Reviewer (Member):

Because the batch span processor has BlockOnQueueFull: false?

Reviewer (Member):

Oh looks like OnStart with a batch span processor is a no-op.

@SimonRichardson (Member, Author):

Yeah, but I can't guarantee that in the future, so just pass through, even if we know it's currently a no-op.

return
}

// If the span duration is less than the threshold, we want to drop it.
Reviewer (Member):

OK, so setting a threshold of 0s is how you sample everything. I guess we might at some point want more configuration options to control the batch span processor to make it bigger or block when the queue is full.

@SimonRichardson (Member, Author):

100%, but we need to start somewhere.

@SimonRichardson (Member, Author):

/merge

We're falling behind in terms of the latest library release. This moves from 1.21 to 1.30.

This implements tail sampling processing without bringing in a LOT of dependencies. We essentially just want to sample spans that are interesting to us.
@SimonRichardson (Member, Author):

/merge

@jujubot merged commit 6ee57ec into juju:main on Oct 17, 2024
19 of 20 checks passed
4 participants