
feat: update otel dependency and add tail sampling processing #18134

Merged

4 commits merged into juju:main from update-opentelemetry-dep on Oct 17, 2024

Conversation

@SimonRichardson (Member) commented on Sep 23, 2024

Implements tail sampling processing on spans. This wraps the batch processor that we already have for spans; by wrapping it, we can prevent spans from being sent to the batch processor if they're not "interesting". A span is considered interesting if it recorded an error or if its duration exceeds a specified threshold.

Combining the sampling ratio with tail sampling processing makes it possible to identify issues with controllers. Setting the ratio to near 1.0 floods juju and the tempo server with too many spans, causing juju to become slow and unresponsive. Adding tail sampling into the mix allows recording only the spans that could be interesting: slow requests, database queries taking too long, too many errors, etc.
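
For context, a minimal sketch of the shape described above, written against the OpenTelemetry Go SDK. The names (tailSamplingProcessor, newTailSamplingProcessor) and the package are illustrative only, not necessarily what this PR uses:

package trace

import (
	"time"

	"go.opentelemetry.io/otel/codes"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// tailSamplingProcessor wraps the batch span processor and only forwards
// "interesting" spans to it: spans that recorded an error, or spans whose
// duration is over the configured threshold.
type tailSamplingProcessor struct {
	// Embedding the wrapped processor gives pass-through OnStart, Shutdown
	// and ForceFlush; OnEnd below does the filtering.
	sdktrace.SpanProcessor

	threshold time.Duration
}

// newTailSamplingProcessor wraps a batch span processor for the given
// exporter with a tail sampling threshold.
func newTailSamplingProcessor(exporter sdktrace.SpanExporter, threshold time.Duration) sdktrace.SpanProcessor {
	return &tailSamplingProcessor{
		SpanProcessor: sdktrace.NewBatchSpanProcessor(exporter),
		threshold:     threshold,
	}
}

// OnEnd hands a completed span to the batch processor only if it looks
// interesting; everything else is dropped here and never queued.
func (p *tailSamplingProcessor) OnEnd(s sdktrace.ReadOnlySpan) {
	// Always keep spans that recorded an error.
	if s.Status().Code == codes.Error {
		p.SpanProcessor.OnEnd(s)
		return
	}
	// Keep slow spans; drop anything that finished under the threshold.
	if s.EndTime().Sub(s.StartTime()) >= p.threshold {
		p.SpanProcessor.OnEnd(s)
	}
}

The batch processor's queueing and export behaviour are untouched; the wrapper only decides which spans ever reach its queue.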

Checklist

  • Code style: imports ordered, good names, simple structure, etc
  • Comments saying why design decisions were made
  • Go unit tests, with comments saying what you're testing

QA steps

Setting up Tempo

This assumes that Docker (with docker compose) is correctly installed:

$ git clone https://github.com/grafana/tempo.git
$ cd tempo/example/docker-compose/local
$ docker compose up -d

Juju

Replace <IP ADDRESS> with your host machine's address (not localhost); lxc info | yq ".environment | .addresses" is a good place to start looking.

$ juju bootstrap lxd test --build-agent --config="open-telemetry-enabled=true" --config="open-telemetry-insecure=true" --config="open-telemetry-endpoint=<IP ADDRESS>:4317" --config="open-telemetry-sample-ratio=1.0" --config "open-telemetry-tail-sampling-threshold=20ms"
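
As a side note (not one of the QA steps): the threshold value in the bootstrap command above is a Go duration string. Assuming the config key is parsed with the standard library's time.ParseDuration, a quick check of the format looks like this:

package main

import (
	"fmt"
	"time"
)

func main() {
	// "20ms" is the value passed via open-telemetry-tail-sampling-threshold.
	threshold, err := time.ParseDuration("20ms")
	if err != nil {
		panic(err)
	}
	fmt.Println(threshold) // prints: 20ms
}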

Tempo Explore

Open the Tempo Explore dashboard.

@SimonRichardson changed the title from "Update opentelemetry dep" to "feat: update otel dependency and add tail sampling processing" on Sep 23, 2024
@hpidcock added the 4.0 label on Sep 23, 2024
@SimonRichardson force-pushed the update-opentelemetry-dep branch 4 times, most recently from fcad218 to 472d9c8 on September 25, 2024 at 11:39
@SimonRichardson self-assigned this on Sep 25, 2024
@SimonRichardson marked this pull request as ready for review on September 25, 2024 at 11:50
@SimonRichardson (Member, Author):

/build

@hpidcock (Member) left a comment:

Cool. Haven't tried it out, but +1

OpenTelemetryInsecure bool `yaml:"opentelemetryinsecure,omitempty"`
OpenTelemetryStackTraces bool `yaml:"opentelemetrystacktraces,omitempty"`
OpenTelemetrySampleRatio string `yaml:"opentelemetrysampleratio,omitempty"`
OpenTelemetryTailSamplingThreshold time.Duration `yaml:"opentelemetrytailsamplingthreshold,omitempty"`
Reviewer (Member):

So when are we starting format-3.0.go? 🤮

@SimonRichardson (Member, Author):

I regret having multiple values in the config. What I want is a json/yaml file with the options. The only problem there is that it makes the modelling a lot weaker. Although potentially this should be driven by the controller charm.

@@ -450,6 +454,10 @@ const (
// By default we only want to trace 10% of the requests.
DefaultOpenTelemetrySampleRatio = 0.1

// DefaultOpenTelemetryTailSamplingThreshold is the default value for the
// tail sampling threshold for open telemetry.
DefaultOpenTelemetryTailSamplingThreshold = 1 * time.Microsecond
Reviewer (Member):

Does this pretty much mean sample everything? That's fine, just trying to understand the defaults (as we might need to update any "running juju in production" docs).

@SimonRichardson (Member, Author):

100%, we should re-address this before production. I've picked the defaults I want for now, but they are not necessarily suitable for production.

@@ -129,10 +138,19 @@ func NewClient(ctx context.Context, namespace coretrace.TaggedTracerNamespace, e
return nil, nil, nil, errors.Trace(err)
}

bsp := sdktrace.NewBatchSpanProcessor(exporter,
sdktrace.WithMaxExportBatchSize(512),
Reviewer (Member):

Any reasoning behind these magic numbers?

@SimonRichardson (Member, Author):

I was playing around with these, because I want to make these configurable.
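
For what it's worth, a sketch of what making these configurable could look like, using options the SDK already exposes; the BatchConfig type and its field names are hypothetical, not from this PR:

package trace

import (
	"time"

	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// BatchConfig gathers the batch span processor knobs that are currently
// hard-coded as literals in NewClient.
type BatchConfig struct {
	MaxExportBatchSize int
	MaxQueueSize       int
	BatchTimeout       time.Duration
}

// newBatchProcessor builds the batch span processor from explicit config
// rather than magic numbers.
func newBatchProcessor(exporter sdktrace.SpanExporter, cfg BatchConfig) sdktrace.SpanProcessor {
	return sdktrace.NewBatchSpanProcessor(exporter,
		sdktrace.WithMaxExportBatchSize(cfg.MaxExportBatchSize),
		sdktrace.WithMaxQueueSize(cfg.MaxQueueSize),
		sdktrace.WithBatchTimeout(cfg.BatchTimeout),
	)
}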

threshold time.Duration
}

// OnStart is called when a span is started. It is called synchronously and
Reviewer (Member):

Because the batch span processor has BlockOnQueueFull: false?

Reviewer (Member):

Oh looks like OnStart with a batch span processor is a no-op.

@SimonRichardson (Member, Author):

Yeah, but I can't guarantee that in the future, so just pass through, even if we know it's currently a no-op.

return
}

// If the span duration is less than the threshold, we want to drop it.
Reviewer (Member):

OK, so setting a threshold of 0s is how you sample everything. I guess we might at some point want more configuration options to control the batch span processor to make it bigger or block when the queue is full.

@SimonRichardson (Member, Author):

100%, but we need to start somewhere.

@SimonRichardson (Member, Author):

/merge

We're falling behind in terms of the latest library release. This moves from 1.21 to 1.30.

This implements tail sampling processing without bringing in a LOT of dependencies. We essentially just want to sample spans that are interesting to us.
@SimonRichardson (Member, Author):

/merge

@jujubot merged commit 6ee57ec into juju:main on Oct 17, 2024
19 of 20 checks passed
4 participants