Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add source metadata RFC #15

Open
wants to merge 5 commits into
base: master
Choose a base branch
from
Open

Conversation

adleong
Copy link
Member

@adleong adleong commented Apr 15, 2020

When a meshed pod sends outgoing HTTP requests the Linkerd proxy records metrics
such as latency and request counters and scopes those metrics by the
destination. This is done by setting a dst_X label on the Prometheus metrics
where X is the destination workload kind. On the other hand, when a meshed
pod receives incoming HTTP requests, there is no equivalent scoping of the
metrics by source. In other words, there is no src_X Prometheus label, making
it impossible to break down metrics for incoming traffic by source. This RFC
proposes adding src_X Prometheus labels for incoming HTTP traffic.

Signed-off-by: Alex Leong alex@buoyant.io

Signed-off-by: Alex Leong <alex@buoyant.io>
@siggy
Copy link
Member

siggy commented Apr 15, 2020

To confirm my own understanding, this proposes to modify all request/response metrics exported via the linkerd-proxy's /metrics endpoint to include the following label keys:

  • direction
  • src_workload_kind
  • src_workload_name
  • dst_workload_kind
  • dst_workload_name

This implies three changes to the proxy /metrics endpoint:

  • adding src_* labels to inbound request/response metrics
  • adding dst_* labels to inbound request/response metrics
  • adding src_* labels to outbound request/response metrics

Context: linkerd/linkerd2#4101, linkerd/linkerd2#4102

@olix0r
Copy link
Member

olix0r commented Apr 15, 2020

  • Are src labels added to outbound metrics?
  • Are dst labels added to inbound metrics?

@siggy note that it won't be all metrics. it will be metrics for all meshed communication. (We can't have dst metrics for services that bypass discovery).

Copy link
Member

@siggy siggy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments but overall this is awesome! My reading of this is that it will solve both linkerd/linkerd2#4101 and linkerd/linkerd2#4102.

design/0003-source-metadata.md Outdated Show resolved Hide resolved
design/0003-source-metadata.md Outdated Show resolved Hide resolved
design/0003-source-metadata.md Show resolved Hide resolved
design/0003-source-metadata.md Show resolved Hide resolved
@adleong
Copy link
Member Author

adleong commented Apr 15, 2020

@siggy @olix0r Yes, both inbound and outbound will have both dst and src metrics. For inbound, the dst_X labels will be set to the pod's own workload (ie the labels loaded from the local file) and for outbound the src_X labels will be set to the pod's own workload.

I tried to make this explicit in the text of the design proposal and in the examples contained therein. If you have any suggestions for how I can make this more clear, please let me know.

design/0003-source-metadata.md Outdated Show resolved Hide resolved
design/0003-source-metadata.md Show resolved Hide resolved
design/0003-source-metadata.md Show resolved Hide resolved
design/0003-source-metadata.md Show resolved Hide resolved
design/0003-source-metadata.md Show resolved Hide resolved
design/0003-source-metadata.md Outdated Show resolved Hide resolved
There is also precedent in Linkerd for proxies to communicate metadata to one
another using the `l5d-*` headers. For example, the fully qualified authority
is communicated between proxies using the `l5d-dst-canonical` header.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The non-mixer istio actually does this differently still. Have you taken a peek at how they're doing it now?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't. I'll check it out.

Signed-off-by: Alex Leong <alex@buoyant.io>

[problem-statement]: #problem-statement

Linkerd is not able to add `src_X` labels today because it simply has no
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, I understand this limitation. But can we be clearer about the problem here? What specific features / use cases are we not able to satisfy without these metrics?

Especially because this proposal only includes this metadata on meshed connections (so we're guaranteed to have the client metrics).

What do these labels enable, concretely, that we can't do otherwise? Or if we do not add them, what risks do we incur?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The next paragraph talks about the limitations we face from not having these metrics.

Comment on lines 32 to 38
This asymmetry in metadata can be very limiting when doing queries. It is
impossible to determine who the clients of a resource are by looking at that
resource's metrics alone. Instead, we need to query the outbound metrics of all
other resource to find a client with the appropriate `dst_X` label. Not only
does this make the query awkward, it also means that resource-to-resource
metrics can only be observed on the client side, never on the server side. This
limits our ability to measure network latency.
Copy link
Member

@olix0r olix0r Apr 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So? Is the problem "this is awkward?" or are there fundamental limitations that we incur here?

I want to be careful about looking at this benefit versus the cost of the implementation...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added some explicit examples of questions that we can't answer without source metadata.

Signed-off-by: Alex Leong <alex@buoyant.io>
Comment on lines 51 to 57
* For traffic from unmeshed sources, the problem is even worse. In this case
we don't have client-side metrics at all and can't display any metrics for
this traffic. Introducing source metadata would allow us to distinguish
between meshed and unmeshed traffic on the server side. While we would not
be able to distinguish between different unmeshed sources, we would at least
be able to show metrics for traffic from all unmeshed sources aggregated
together.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that this RFC actually addresses this issue... We already have client IDs on server metrics.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, though we don't use that data for this purpose today. We could use it, but it would be more straightforward to use source metadata. While this would not impact consumers of the API, I still think that having a cleaner implementation is of some value.

However, I have removed this point for now.

Comment on lines 46 to 50
* It is difficult to present traffic metrics in a consistent way: top line
resource metrics are measured on the server-side by default, but in order to
view a breakdown of these metrics by source, the absesnse of source metadata
means that we have to switch to displaying client-side metrics. This is
confusing at best and misleading at worst.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still a bit squishy... It's difficult, but we do it. I.e. for consumers of our API, there's no overhead. Right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do not present traffic metrics in a consistent way today. We're forced to either hide the distinction between server-side and client-side metrics, which is deceitful, or be explicit that certain queries are measured on one side while other queries are measured on the other, which is confusing. Today we split the difference by being vague. So I see this a problem for consumers of the API.

Comment on lines 43 to 45
* We cannot compare client-side and server-side metrics for the same traffic to
identify latency or errors introduced between the two Linkerd proxies (e.g. by
the network or by other intermediary proxies)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To put a finer point on this... there's a specific API call for which we return incorrect data, right? What is that API Call?

I think this may be the only problem we're trying to address with this proposal. I'd try to take out much of the other context about labels and asymmetry and focus very concretely on this issue.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that's quite right. I don't think any of our API calls return incorrect data, their behavior is just confusing and non-uniform.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes it sounds like its our metrics API that is the problem: that we have no way to distinguish these concepts at all.

To put it another way: if we made the changes proposed here, what impact would that have on our metrics API? Would we just subtly change the semantics of an existing call? Would we be unable to represent this call?

I view Linkerd's prometheus data as diagnostics... like log lines, basically. The format & details may change between releases as long as we satisfy our metrics API, which powers consumers within Linkerd. So, a problem statement phrased in terms of our prometheus data isn't all that compelling.

However, it absolutely does sound like a problem if there are fundamental questions that our API cannot answer about mesh traffic. How do we improve Linkerd's metrics API so that it can properly differentiate client and serverside metrics? If this requires source labels, great, I'm all in.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reworked the RFC to be API focused.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love the updated framing.

If it's alright with you, I'm going to suggest some further edits via PR to contextualize how I see this all fitting together.

adleong added 2 commits April 23, 2020 14:17
Signed-off-by: Alex Leong <alex@buoyant.io>
Signed-off-by: Alex Leong <alex@buoyant.io>
@olix0r olix0r closed this Sep 10, 2020
@olix0r
Copy link
Member

olix0r commented Sep 10, 2020

This was accidentally closed when we changed the default branch to main. The base should probably be updated...

@olix0r olix0r reopened this Sep 10, 2020
@siggy
Copy link
Member

siggy commented Dec 9, 2020

Noting that as we rethink our metrics and labeling, a nice goal would be to conform to OpenMetrics. One notable requirement is that all metrics with the same name (a MetricsFamily) have the same set of label names:
https://github.com/OpenObservability/OpenMetrics/blob/defac23507e1e78871f8de2648bd809427a1063d/specification/OpenMetrics.md#metric
https://github.com/prometheus/docs/blob/be5550b3759db4d134a9eea9b1a91de092dcc5a8/content/docs/instrumenting/writing_clientlibs.md#labels

Another goal should be to avoid missing metrics. If we start counting requests_total{workload_name="foo"}, we should initialize failed_requests_total{workload_name="foo"}:
https://github.com/prometheus/docs/blob/be5550b3759db4d134a9eea9b1a91de092dcc5a8/content/docs/practices/instrumentation.md#avoid-missing-metrics

More best practices:
https://prometheus.io/docs/instrumenting/writing_exporters/

@sdhoward
Copy link

sdhoward commented Jun 3, 2021

I'm just here to remind that people are still interested in this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants