-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add trafficsplit metrics to CLI #3176
Conversation
Integration test results for a4b8790: fail 😕 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there CLI changes that should be included in this PR?
As I mention in a comment, I think that we should be looking at "outbound" stats here.
I think that the logic for getting the traffic split metrics is different enough from stat_summary logic that it doesn't make sense to try to put these in the same methods. Instead, I'd try creating new methods for fetching the traffic split metics, but using/sharing the same underlying prometheus methods. This follows the same pattern as top_routes, which is distinct from stat_summary but shares the prometheus code.
It's also worth noting that there are a few different layers here and at each layer we can decide if we want to piggyback on stat
or not:
- The CLI subcommand: we can use
linkerd stat trafficsplit
or create a new CLI subcommandlinkerd trafficsplit
- The API: we can enhance the
rpc StatSummary(StatSummaryRequest)
API to allow for traffic split queries and responses, or create a new API - The implementation: upon receiving the API request, the public-api server could try to plumb trafficsplit requests through the existing methods, or send the request to new methods.
Based on how this looks so far, I think reusing the stat
infrastructure for the CLI subcommand and API probably makes sense. I think the implementation is different enough that probably don't want to share the implementation.
From a product perspective, I would like to either have this be
+1, the concept of TrafficSplit requires it to be measured on the outbound. |
a4b8790
to
2e6723f
Compare
Integration test results for 2e6723f: fail 😕 |
When querying Prometheus for For now, you can test this PR with your own trafficsplit, using @adleong's proxy branch: If you need a trafficsplit setup you can use mine. Note: This PR adds trafficsplit tests to |
Integration test results for e2ed14e: success 🎉 |
Integration test results for 410c462: success 🎉 |
Integration test results for f7e19a0: fail 😕 |
controller/api/public/prometheus.go
Outdated
// insert a regex-match check into a LabelSet for labels that match the provided | ||
// string. this is modeled on generateLabelStringWithExclusion(). | ||
// the regex will match the first section of the string before a period, ex. | ||
// `dst="authors.default.svc.cluster.local:7001"` will return true for a stringToMatch of `authors` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we probably need to match the second section against the service namespace as well. In fact, we can construct the entire authority exactly except for the port. So I think the regex can just be: authority=~"%s.%s.svc.%s:\d+"
or something like that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In an authority authors.default.svc.cluster.local:7001
we have authors
from the apex service name and default
from namespace. svc
is a given. will an authority always have cluster
and local
as part of the address?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cluster.local
is harded as the cluster domain in a number of places but there is currently ongoing work to replace that hardcoded value with the cluster domain from the Linkerd configmap.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the context. Given this ongoing work, should I avoid hardcoding cluster.local
into the authority address, and if so, what's the best way to extract the cluster domain so I can ensure I'm checking for the right string?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm now matching to authority=~"^authors.default.svc.+"
using the apex service and the service namespace.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you look at controller/cmd/destination/main.go
you can see how the cluster domain is loaded from the global config at startup, and then passed into destination.newServer
. We could do the same thing in the public-api main and pass the cluster domain into NewServer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great tip, thanks. To keep this PR trafficsplit-specific, I'd prefer to do that in a separate PR after this is merged, and leave the current regex matching as is. WDYT?
Integration test results for 7f74c3b: success 🎉 |
Integration test results for 6f13c20: success 🎉 |
Integration test results for a4e41f6: fail 😕 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice job with the refactor, I think putting the traffic split logic into its own function makes things a lot clearer.
labels = labels.Merge(promDirectionLabels("outbound")) | ||
} | ||
|
||
labelNames[1] = model.LabelName("dst_service") // replacing "trafficsplit" with "dst_service" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit confused on what this is doing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I refactored this, let me know if it's more clear. buildTrafficSplitRequestLabels
handles four different query scenarios, and needs to output different labels for each:
stat ts -A
(namespace is not specified)
stat ts -n foo
stat ts --from resource/name
stat ts --to resource/name
@scottcarol did you mean to have this still be a draft PR? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking really good!
Integration test results for 4dc704c: success 🎉 |
Integration test results for 1e85337: fail 😕 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
⭐ Nice work! Code looks great to me! I will test on the CLI a bit also...
@grampelberg have you gotten a chance to test this out? |
On it! |
(edit: disregard this. user error.) |
a7944f1
to
8c75e30
Compare
Integration test results for 8c75e30: fail 😕 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks really great. I just had some minor feedback and a few organizational comments, but the implementation is solid. Nice work!
Type: req.GetSelector().GetResource().GetType(), | ||
}, | ||
TimeWindow: req.TimeWindow, | ||
Stats: tsBasicStats[currentLeaf], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok this is super subtle, but I don't think we want this API to return rows for traffic splits that don't have any corresponding metrics in prometheus, when a --to
or --from
flag is specified.
By way of example, in my default
namespace, I'm running the books app, with an additional books service (books-v2
). When I query for authorities, I see:
$ bin/linkerd stat au
NAME MESHED SUCCESS RPS LATENCY_P50 LATENCY_P95 LATENCY_P99 TCP_CONN
- 100.00% 0.5rps 3ms 9ms 10ms -
authors.default.svc.cluster.local:7001 - 77.09% 6.0rps 7ms 35ms 70ms -
books-v2.default.svc.cluster.local:7002 - 75.00% 1.2rps 13ms 125ms 185ms -
books.default.svc.cluster.local:7002 - 81.70% 6.5rps 9ms 86ms 97ms -
webapp.default.svc.cluster.local:7000 - 78.70% 6.4rps 32ms 95ms 171ms -
If I then add a --from
flag, that filters the rows returned to only those that are called by the from resource. For instance:
$ bin/linkerd stat au --from deploy/webapp
NAME MESHED SUCCESS RPS LATENCY_P50 LATENCY_P95 LATENCY_P99 TCP_CONN
authors.default.svc.cluster.local:7001 - 100.00% 3.1rps 12ms 47ms 86ms -
books.default.svc.cluster.local:7002 - 77.96% 7.0rps 12ms 83ms 97ms -
Or:
$ bin/linkerd stat au --from deploy/books
NAME MESHED SUCCESS RPS LATENCY_P50 LATENCY_P95 LATENCY_P99 TCP_CONN
authors.default.svc.cluster.local:7001 - 46.60% 3.2rps 5ms 11ms 18ms -
But that's not the behavior that we're seeing with traffic splits. If I query for all traffic splits in my default namespace, I see:
$ bin/linkerd stat ts
NAME APEX LEAF WEIGHT SUCCESS RPS LATENCY_P50 LATENCY_P95 LATENCY_P99
books-split books books 800m 79.63% 6.4rps 12ms 75ms 120ms
books-split books books-v2 200m 77.00% 1.7rps 20ms 33ms 39ms
If I add a --to
flag, however, to filter to only those traffic splits that are hitting svc/books-v2
, I see:
$ bin/linkerd stat ts --to svc/books-v2
NAME APEX LEAF WEIGHT SUCCESS RPS LATENCY_P50 LATENCY_P95 LATENCY_P99
books-split books books 800m - - - - -
books-split books books-v2 200m 77.11% 1.4rps 10ms 62ms 92ms
Notice how the books
row still shows up, even though it doesn't have any stats. For parity with how stat works for other resources, I think we should drop that row.
And the good news is that there's a very straightforward fix for this, by changing this func as follows:
diff --git a/controller/api/public/stat_summary.go b/controller/api/public/stat_summary.go
index 3763c2b9..5f0e900d 100644
--- a/controller/api/public/stat_summary.go
+++ b/controller/api/public/stat_summary.go
@@ -310,6 +310,13 @@ func (s *grpcServer) trafficSplitResourceQuery(ctx context.Context, req *pb.Stat
Apex: tsStats.apex,
Leaf: name,
}
+ stats, ok := tsBasicStats[currentLeaf]
+ if !ok && !req.SkipStats {
+ switch req.Outbound.(type) {
+ case *pb.StatSummaryRequest_ToResource, *pb.StatSummaryRequest_FromResource:
+ continue
+ }
+ }
trafficSplitStats := &pb.TrafficSplitStats{
Apex: tsStats.apex,
@@ -324,7 +331,7 @@ func (s *grpcServer) trafficSplitResourceQuery(ctx context.Context, req *pb.Stat
Type: req.GetSelector().GetResource().GetType(),
},
TimeWindow: req.TimeWindow,
- Stats: tsBasicStats[currentLeaf],
+ Stats: stats,
TsStats: trafficSplitStats,
}
rows = append(rows, &row)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for testing this out and thinking about this. You make a good point that there's not exact parity here between ts
and other resources.
My preference is to continue showing a row for each leaf within a trafficsplit even if there are no metrics returned, because in the case of traffic splitting, the absence of traffic is important information to validate that the split is working.
Also, if you are running watch linkerd stat ts
as you apply weight changes within a split, showing a row for each leaf maintains a consistent display even if a leaf service you are phasing out goes down to 0 (or up from 0), which is more readable than a row popping in or out of view. WDYT?
You're right that we'll need to be specific in the descriptions about what --from and --to means in the case of linkerd stat ts
. This is a very wordy attempt that needs to be more concise:
linkerd stat ts —from deploy/authors
Show all trafficsplit leaf services in the default namespace, and metrics for any traffic coming to those services from the authors deployment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally, I also think it's important to show the traffic split branches with no metrics even though this is inconsistent with how other resources behave.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, this all sounds good to me. I just wanted to clarify that if a traffic split has ever received any traffic that matches the query pattern, then it will still be displayed, even if it currently is not receiving any traffic. My proposed change would only exclude rows that have never seen a single request matching the query pattern.
But I agree that the nature of trafficsplits is different enough from other resources to warrant showing all branches in the stat output.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've played with this a bunch and it's all behaving as I would expect. I also tried running the a stable-2.4 CLI against a cluster with this change and vice versa and didn't see anything unexpected.
👍
Integration test results for 1507ba5: fail 😕 |
LGTM! 🚢 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
⭐️ 📈 Thanks for addressing my feedback! This looks great to me.
Now that Linkerd supports trafficsplit, we would like to return traffic split metrics in the UI so a user can easily get metrics about their trafficsplits and leaf services. See #3106 for CLI mocks.
Ideally we could use the
stat
API endpoint to return trafficsplit metrics if the user typeslinkerd stat ts
. This is consistent with the current user experience and utilizes the complex and powerfulstat
endpoint. The downside is that the information we want to return (a list of trafficsplits, their apexes and leaf services, with metrics for each leaf service) is different than the information we currently return withstat
, and we risk making thestat
endpoint confusing with multipleif
statements.The alternative is to create a completely new command for traffic split metrics. The positives of a new command would be that we could build on this over time and tailor the endpoint to trafficsplits. The negatives are that users would have to learn a new command, and we would be recreating a lot of the logic of the
stat
endpoint but not closely enough to re-use the actual code.This draft PR is a proof of concept way to get
trafficsplit
information from thestat
API endpoint. It is meant to get high-level feedback for:api/public/stat_summary.go
could deal with a request fromlinkerd stat ts
How to Run
default
namespace -- files and directions here if neededkubectl -n linkerd port-forward $(kubectl --namespace=linkerd get po --selector=linkerd.io/control-plane-component=prometheus -o jsonpath='{.items[*].metadata.name}') 9090:9090