Add trafficsplit metrics to CLI #3176

scottcarol · 2019-07-31T21:45:22Z

Now that Linkerd supports trafficsplit, we would like to return traffic split metrics in the UI so a user can easily get metrics about their trafficsplits and leaf services. See #3106 for CLI mocks.

Ideally we could use the stat API endpoint to return trafficsplit metrics if the user types linkerd stat ts. This is consistent with the current user experience and utilizes the complex and powerful stat endpoint. The downside is that the information we want to return (a list of trafficsplits, their apexes and leaf services, with metrics for each leaf service) is different than the information we currently return with stat, and we risk making the stat endpoint confusing with multiple if statements.

The alternative is to create a completely new command for traffic split metrics. The positives of a new command would be that we could build on this over time and tailor the endpoint to trafficsplits. The negatives are that users would have to learn a new command, and we would be recreating a lot of the logic of the stat endpoint but not closely enough to re-use the actual code.

This draft PR is a proof of concept way to get trafficsplit information from the stat API endpoint. It is meant to get high-level feedback for:

how api/public/stat_summary.go could deal with a request from linkerd stat ts
protobuf changes
using regex to match the leaf service name to authority name
architecture of the data returned

How to Run

Create a trafficsplit for an application in your default namespace -- files and directions here if needed
Confirm trafficsplit was created

kubectl get ts

Port forward Prometheus:

kubectl -n linkerd port-forward $(kubectl --namespace=linkerd get po --selector=linkerd.io/control-plane-component=prometheus -o jsonpath='{.items[*].metadata.name}') 9090:9090

Run the go server for public-api

~/linkerd2 $ go run controller/cmd/public-api/main.go --kubeconfig ~/.kube/config --prometheus-url http://localhost:9090 --log-level debug

Save these sample files to your linkerd2 directory: stat-ts.go (queries stat endpoint for trafficsplit info) and stat-deploy.go (queries stat endpoint for deployment info)
Run the files and compare

$ go run stat-ts.go
$ go run stat-deploy.go

l5d-bot · 2019-07-31T22:53:55Z

Integration test results for a4b8790: fail 😕
Log output: https://gist.github.com/fc90764de58e4cd0a0f109889e04ca67

adleong

Are there CLI changes that should be included in this PR?

As I mention in a comment, I think that we should be looking at "outbound" stats here.

I think that the logic for getting the traffic split metrics is different enough from stat_summary logic that it doesn't make sense to try to put these in the same methods. Instead, I'd try creating new methods for fetching the traffic split metics, but using/sharing the same underlying prometheus methods. This follows the same pattern as top_routes, which is distinct from stat_summary but shares the prometheus code.

It's also worth noting that there are a few different layers here and at each layer we can decide if we want to piggyback on stat or not:

The CLI subcommand: we can use linkerd stat trafficsplit or create a new CLI subcommand linkerd trafficsplit
The API: we can enhance the rpc StatSummary(StatSummaryRequest) API to allow for traffic split queries and responses, or create a new API
The implementation: upon receiving the API request, the public-api server could try to plumb trafficsplit requests through the existing methods, or send the request to new methods.

Based on how this looks so far, I think reusing the stat infrastructure for the CLI subcommand and API probably makes sense. I think the implementation is different enough that probably don't want to share the implementation.

controller/api/public/stat_summary.go

grampelberg · 2019-08-01T14:42:12Z

The CLI subcommand: we can use linkerd stat trafficsplit or create a new CLI subcommand linkerd trafficsplit

From a product perspective, I would like to either have this be linkerd stat ts or linkerd describe ts. linkerd trafficsplit was the command I started out avoiding in the first place.

As I mention in a comment, I think that we should be looking at "outbound" stats here.

+1, the concept of TrafficSplit requires it to be measured on the outbound.

l5d-bot · 2019-08-04T23:26:21Z

Integration test results for 2e6723f: fail 😕
Log output: https://gist.github.com/18b4ae38922273530cc4c1f9b476cf1d

scottcarol · 2019-08-06T21:41:18Z

When querying Prometheus for trafficsplit stats, we will utilize the proxy changes from proxy PR 296. The details of that PR are not finalized and will likely result in some changes to the Prometheus queries in this PR.

For now, you can test this PR with your own trafficsplit, using @adleong's proxy branch:
linkerd upgrade --proxy-version alex-dst-metrics-3 (be sure to delete all pods in the namespace with the trafficsplit before testing)

If you need a trafficsplit setup you can use mine.

Note: This PR adds trafficsplit tests to /cli/cmd/stat_test.go. I am adding trafficsplit tests to /cli/cmd/stat_test.go in a separate PR.

l5d-bot · 2019-08-06T22:25:25Z

Integration test results for e2ed14e: success 🎉
Log output: https://gist.github.com/ba0da47d6f5f24a751f8d07084118022

l5d-bot · 2019-08-07T00:29:12Z

Integration test results for 410c462: success 🎉
Log output: https://gist.github.com/68c7bfa16af501e6d3f5a8b344bee265

l5d-bot · 2019-08-07T21:51:05Z

Integration test results for f7e19a0: fail 😕
Log output: https://gist.github.com/994be61ffb888f26f771513def6ded93

cli/cmd/stat.go

adleong · 2019-08-07T20:18:38Z

controller/api/public/prometheus.go

+// insert a regex-match check into a LabelSet for labels that match the provided
+// string. this is modeled on generateLabelStringWithExclusion().
+// the regex will match the first section of the string before a period, ex.
+// `dst="authors.default.svc.cluster.local:7001"` will return true for a stringToMatch of `authors`


I think we probably need to match the second section against the service namespace as well. In fact, we can construct the entire authority exactly except for the port. So I think the regex can just be: authority=~"%s.%s.svc.%s:\d+" or something like that

In an authority authors.default.svc.cluster.local:7001 we have authors from the apex service name and default from namespace. svc is a given. will an authority always have cluster and local as part of the address?

cluster.local is harded as the cluster domain in a number of places but there is currently ongoing work to replace that hardcoded value with the cluster domain from the Linkerd configmap.

Thanks for the context. Given this ongoing work, should I avoid hardcoding cluster.local into the authority address, and if so, what's the best way to extract the cluster domain so I can ensure I'm checking for the right string?

I'm now matching to authority=~"^authors.default.svc.+" using the apex service and the service namespace.

if you look at controller/cmd/destination/main.go you can see how the cluster domain is loaded from the global config at startup, and then passed into destination.newServer. We could do the same thing in the public-api main and pass the cluster domain into NewServer.

Great tip, thanks. To keep this PR trafficsplit-specific, I'd prefer to do that in a separate PR after this is merged, and leave the current regex matching as is. WDYT?

controller/api/public/stat_summary.go

controller/k8s/api.go

l5d-bot · 2019-08-10T02:31:38Z

Integration test results for 7f74c3b: success 🎉
Log output: https://gist.github.com/5071dcec8ec4c41ce24e08b197b30ff3

l5d-bot · 2019-08-12T03:14:57Z

Integration test results for 6f13c20: success 🎉
Log output: https://gist.github.com/178950711631ba3e87f8e97733149a9a

l5d-bot · 2019-08-12T17:36:00Z

Integration test results for a4e41f6: fail 😕
Log output: https://gist.github.com/bb41bd4f2eebe008eab7320d652945d5

adleong

Nice job with the refactor, I think putting the traffic split logic into its own function makes things a lot clearer.

cli/cmd/stat.go

controller/api/public/stat_summary.go

adleong · 2019-08-12T19:42:40Z

controller/api/public/stat_summary.go

+		labels = labels.Merge(promDirectionLabels("outbound"))
+	}
+
+	labelNames[1] = model.LabelName("dst_service") // replacing "trafficsplit" with "dst_service"


I'm a bit confused on what this is doing.

I refactored this, let me know if it's more clear. buildTrafficSplitRequestLabels handles four different query scenarios, and needs to output different labels for each:

stat ts -A (namespace is not specified)
stat ts -n foo
stat ts --from resource/name
stat ts --to resource/name

controller/api/public/stat_summary.go

controller/api/public/prometheus.go

grampelberg · 2019-08-12T21:08:20Z

@scottcarol did you mean to have this still be a draft PR?

adleong

This is looking really good!

cli/cmd/stat.go

cli/cmd/testdata/stat_one_ts_output_json.golden

controller/api/public/stat_summary.go

cli/cmd/stat.go

l5d-bot · 2019-08-13T18:46:18Z

Integration test results for 4dc704c: success 🎉
Log output: https://gist.github.com/be81c2f75cb139194a1337ed715a98c4

l5d-bot · 2019-08-13T21:23:09Z

Integration test results for 1e85337: fail 😕
Log output: https://gist.github.com/de08b0be5fe400970f01f57073f71053

adleong

⭐ Nice work! Code looks great to me! I will test on the CLI a bit also...

scottcarol · 2019-08-13T21:27:55Z

@grampelberg have you gotten a chance to test this out?

grampelberg · 2019-08-13T21:30:22Z

On it!

adleong · 2019-08-13T21:36:22Z

(edit: disregard this. user error.)

l5d-bot · 2019-08-14T00:48:57Z

Integration test results for 8c75e30: fail 😕
Log output: https://gist.github.com/059426bdecd14a679121dd60b389771a

klingerf

Looks really great. I just had some minor feedback and a few organizational comments, but the implementation is solid. Nice work!

controller/api/public/stat_summary.go

cli/cmd/stat.go

cli/cmd/testdata/stat_one_ts_output_json.golden

klingerf · 2019-08-14T02:27:39Z

controller/api/public/stat_summary.go

+					Type:      req.GetSelector().GetResource().GetType(),
+				},
+				TimeWindow: req.TimeWindow,
+				Stats:      tsBasicStats[currentLeaf],


Ok this is super subtle, but I don't think we want this API to return rows for traffic splits that don't have any corresponding metrics in prometheus, when a --to or --from flag is specified.

By way of example, in my default namespace, I'm running the books app, with an additional books service (books-v2). When I query for authorities, I see:

$ bin/linkerd stat au NAME MESHED SUCCESS RPS LATENCY_P50 LATENCY_P95 LATENCY_P99 TCP_CONN - 100.00% 0.5rps 3ms 9ms 10ms - authors.default.svc.cluster.local:7001 - 77.09% 6.0rps 7ms 35ms 70ms - books-v2.default.svc.cluster.local:7002 - 75.00% 1.2rps 13ms 125ms 185ms - books.default.svc.cluster.local:7002 - 81.70% 6.5rps 9ms 86ms 97ms - webapp.default.svc.cluster.local:7000 - 78.70% 6.4rps 32ms 95ms 171ms -

If I then add a --from flag, that filters the rows returned to only those that are called by the from resource. For instance:

$ bin/linkerd stat au --from deploy/webapp NAME MESHED SUCCESS RPS LATENCY_P50 LATENCY_P95 LATENCY_P99 TCP_CONN authors.default.svc.cluster.local:7001 - 100.00% 3.1rps 12ms 47ms 86ms - books.default.svc.cluster.local:7002 - 77.96% 7.0rps 12ms 83ms 97ms -

Or:

$ bin/linkerd stat au --from deploy/books NAME MESHED SUCCESS RPS LATENCY_P50 LATENCY_P95 LATENCY_P99 TCP_CONN authors.default.svc.cluster.local:7001 - 46.60% 3.2rps 5ms 11ms 18ms -

But that's not the behavior that we're seeing with traffic splits. If I query for all traffic splits in my default namespace, I see:

$ bin/linkerd stat ts NAME APEX LEAF WEIGHT SUCCESS RPS LATENCY_P50 LATENCY_P95 LATENCY_P99 books-split books books 800m 79.63% 6.4rps 12ms 75ms 120ms books-split books books-v2 200m 77.00% 1.7rps 20ms 33ms 39ms

If I add a --to flag, however, to filter to only those traffic splits that are hitting svc/books-v2, I see:

$ bin/linkerd stat ts --to svc/books-v2 NAME APEX LEAF WEIGHT SUCCESS RPS LATENCY_P50 LATENCY_P95 LATENCY_P99 books-split books books 800m - - - - - books-split books books-v2 200m 77.11% 1.4rps 10ms 62ms 92ms

Notice how the books row still shows up, even though it doesn't have any stats. For parity with how stat works for other resources, I think we should drop that row.

And the good news is that there's a very straightforward fix for this, by changing this func as follows:

diff --git a/controller/api/public/stat_summary.go b/controller/api/public/stat_summary.go index 3763c2b9..5f0e900d 100644 --- a/controller/api/public/stat_summary.go +++ b/controller/api/public/stat_summary.go @@ -310,6 +310,13 @@ func (s *grpcServer) trafficSplitResourceQuery(ctx context.Context, req *pb.Stat Apex: tsStats.apex, Leaf: name, } + stats, ok := tsBasicStats[currentLeaf] + if !ok && !req.SkipStats { + switch req.Outbound.(type) { + case *pb.StatSummaryRequest_ToResource, *pb.StatSummaryRequest_FromResource: + continue + } + } trafficSplitStats := &pb.TrafficSplitStats{ Apex: tsStats.apex, @@ -324,7 +331,7 @@ func (s *grpcServer) trafficSplitResourceQuery(ctx context.Context, req *pb.Stat Type: req.GetSelector().GetResource().GetType(), }, TimeWindow: req.TimeWindow, - Stats: tsBasicStats[currentLeaf], + Stats: stats, TsStats: trafficSplitStats, } rows = append(rows, &row)

Thanks for testing this out and thinking about this. You make a good point that there's not exact parity here between ts and other resources.

My preference is to continue showing a row for each leaf within a trafficsplit even if there are no metrics returned, because in the case of traffic splitting, the absence of traffic is important information to validate that the split is working.

Also, if you are running watch linkerd stat ts as you apply weight changes within a split, showing a row for each leaf maintains a consistent display even if a leaf service you are phasing out goes down to 0 (or up from 0), which is more readable than a row popping in or out of view. WDYT?

You're right that we'll need to be specific in the descriptions about what --from and --to means in the case of linkerd stat ts. This is a very wordy attempt that needs to be more concise:

linkerd stat ts —from deploy/authors Show all trafficsplit leaf services in the default namespace, and metrics for any traffic coming to those services from the authors deployment.

Personally, I also think it's important to show the traffic split branches with no metrics even though this is inconsistent with how other resources behave.

Ok, this all sounds good to me. I just wanted to clarify that if a traffic split has ever received any traffic that matches the query pattern, then it will still be displayed, even if it currently is not receiving any traffic. My proposed change would only exclude rows that have never seen a single request matching the query pattern.

But I agree that the nature of trafficsplits is different enough from other resources to warrant showing all branches in the stat output.

adleong

I've played with this a bunch and it's all behaving as I would expect. I also tried running the a stable-2.4 CLI against a cluster with this change and vice versa and didn't see anything unexpected.
👍

l5d-bot · 2019-08-14T04:48:27Z

Integration test results for 1507ba5: fail 😕
Log output: https://gist.github.com/7f24c75d03e82b54d33baf3f59bf000d

grampelberg · 2019-08-14T16:13:41Z

LGTM! 🚢

klingerf

⭐️ 📈 Thanks for addressing my feedback! This looks great to me.

scottcarol requested a review from adleong July 31, 2019 21:45

scottcarol changed the title ~~Cscott/demo ts metrics~~ Add trafficsplit metrics to CLI Jul 31, 2019

adleong reviewed Aug 1, 2019

View reviewed changes

scottcarol force-pushed the cscott/demo-ts-metrics branch from a4b8790 to 2e6723f Compare August 4, 2019 23:10

adleong reviewed Aug 7, 2019

View reviewed changes

scottcarol requested a review from klingerf August 12, 2019 17:02

scottcarol self-assigned this Aug 12, 2019

scottcarol added the area/cli label Aug 12, 2019

adleong reviewed Aug 12, 2019

View reviewed changes

scottcarol marked this pull request as ready for review August 12, 2019 21:46

adleong reviewed Aug 13, 2019

View reviewed changes

adleong approved these changes Aug 13, 2019

View reviewed changes

Carol Scott added 15 commits August 13, 2019 17:15

removing unnecessary TCP connections column

6ee9c19

Unpacking traffic split info with *v1alpha1.TrafficSplit

a22b174

Incorporating Alex feedback

51461de

Incorporating stat.go feedback

a4844dd

Creating buildTrafficSplitRequestLabels()

6a1076f

More review feedback

da1bdf6

removing TS type check

f29257a

Checking for both apex and namespace in Prometheus query

fd58c63

Removing unused comment

c3e4695

fix lint error

6e7a979

saving so far

228f65a

Simplifying

1ffcb1f

Removing extra line breaks

e1eafac

Restoring original line breaks to remove review noise

4c2d9a7

rebasing

8c75e30

scottcarol force-pushed the cscott/demo-ts-metrics branch from a7944f1 to 8c75e30 Compare August 14, 2019 00:15

klingerf reviewed Aug 14, 2019

View reviewed changes

Incorporating KL feedback and adding help text

1507ba5

adleong approved these changes Aug 14, 2019

View reviewed changes

klingerf approved these changes Aug 14, 2019

View reviewed changes

scottcarol merged commit 0043770 into master Aug 14, 2019

scottcarol deleted the cscott/demo-ts-metrics branch August 14, 2019 17:31

This was referenced Aug 14, 2019

Add traffic split info to CLI #3106

Closed

Modify stat API endpoint for Traffic Split view #3105

Closed

Add Trafficsplit stats to dashboard #3264

Closed

scottcarol mentioned this pull request Aug 21, 2019

TrafficSplit dashboard integration #3261

Closed

Add trafficsplit metrics to CLI #3176

Add trafficsplit metrics to CLI #3176

Conversation

scottcarol commented Jul 31, 2019

l5d-bot commented Jul 31, 2019

adleong left a comment

Choose a reason for hiding this comment

grampelberg commented Aug 1, 2019

l5d-bot commented Aug 4, 2019

scottcarol commented Aug 6, 2019

l5d-bot commented Aug 6, 2019

l5d-bot commented Aug 7, 2019

l5d-bot commented Aug 7, 2019

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

l5d-bot commented Aug 10, 2019

l5d-bot commented Aug 12, 2019

l5d-bot commented Aug 12, 2019

adleong left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

grampelberg commented Aug 12, 2019

adleong left a comment

Choose a reason for hiding this comment

l5d-bot commented Aug 13, 2019

l5d-bot commented Aug 13, 2019

adleong left a comment

Choose a reason for hiding this comment

scottcarol commented Aug 13, 2019

grampelberg commented Aug 13, 2019

adleong commented Aug 13, 2019 • edited Loading

l5d-bot commented Aug 14, 2019

klingerf left a comment

Choose a reason for hiding this comment

klingerf Aug 14, 2019 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

adleong left a comment

Choose a reason for hiding this comment

l5d-bot commented Aug 14, 2019

grampelberg commented Aug 14, 2019

klingerf left a comment

Choose a reason for hiding this comment

adleong commented Aug 13, 2019 •

edited

Loading

klingerf Aug 14, 2019 •

edited

Loading