Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add trafficsplit metrics to CLI #3176

Merged
merged 21 commits into from
Aug 14, 2019
Merged

Add trafficsplit metrics to CLI #3176

merged 21 commits into from
Aug 14, 2019

Conversation

scottcarol
Copy link
Contributor

Now that Linkerd supports trafficsplit, we would like to return traffic split metrics in the UI so a user can easily get metrics about their trafficsplits and leaf services. See #3106 for CLI mocks.

Ideally we could use the stat API endpoint to return trafficsplit metrics if the user types linkerd stat ts. This is consistent with the current user experience and utilizes the complex and powerful stat endpoint. The downside is that the information we want to return (a list of trafficsplits, their apexes and leaf services, with metrics for each leaf service) is different than the information we currently return with stat, and we risk making the stat endpoint confusing with multiple if statements.

The alternative is to create a completely new command for traffic split metrics. The positives of a new command would be that we could build on this over time and tailor the endpoint to trafficsplits. The negatives are that users would have to learn a new command, and we would be recreating a lot of the logic of the stat endpoint but not closely enough to re-use the actual code.

This draft PR is a proof of concept way to get trafficsplit information from the stat API endpoint. It is meant to get high-level feedback for:

  • how api/public/stat_summary.go could deal with a request from linkerd stat ts
  • protobuf changes
  • using regex to match the leaf service name to authority name
  • architecture of the data returned

How to Run

  1. Create a trafficsplit for an application in your default namespace -- files and directions here if needed
  2. Confirm trafficsplit was created
kubectl get ts
  1. Port forward Prometheus:
kubectl -n linkerd port-forward $(kubectl --namespace=linkerd get po --selector=linkerd.io/control-plane-component=prometheus -o jsonpath='{.items[*].metadata.name}') 9090:9090
  1. Run the go server for public-api
~/linkerd2 $ go run controller/cmd/public-api/main.go --kubeconfig ~/.kube/config --prometheus-url http://localhost:9090 --log-level debug
  1. Save these sample files to your linkerd2 directory: stat-ts.go (queries stat endpoint for trafficsplit info) and stat-deploy.go (queries stat endpoint for deployment info)
  2. Run the files and compare
$ go run stat-ts.go
$ go run stat-deploy.go

@scottcarol scottcarol requested a review from adleong July 31, 2019 21:45
@l5d-bot
Copy link
Collaborator

l5d-bot commented Jul 31, 2019

Integration test results for a4b8790: fail 😕
Log output: https://gist.github.com/fc90764de58e4cd0a0f109889e04ca67

@scottcarol scottcarol changed the title Cscott/demo ts metrics Add trafficsplit metrics to CLI Jul 31, 2019
Copy link
Member

@adleong adleong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there CLI changes that should be included in this PR?

As I mention in a comment, I think that we should be looking at "outbound" stats here.

I think that the logic for getting the traffic split metrics is different enough from stat_summary logic that it doesn't make sense to try to put these in the same methods. Instead, I'd try creating new methods for fetching the traffic split metics, but using/sharing the same underlying prometheus methods. This follows the same pattern as top_routes, which is distinct from stat_summary but shares the prometheus code.

It's also worth noting that there are a few different layers here and at each layer we can decide if we want to piggyback on stat or not:

  • The CLI subcommand: we can use linkerd stat trafficsplit or create a new CLI subcommand linkerd trafficsplit
  • The API: we can enhance the rpc StatSummary(StatSummaryRequest) API to allow for traffic split queries and responses, or create a new API
  • The implementation: upon receiving the API request, the public-api server could try to plumb trafficsplit requests through the existing methods, or send the request to new methods.

Based on how this looks so far, I think reusing the stat infrastructure for the CLI subcommand and API probably makes sense. I think the implementation is different enough that probably don't want to share the implementation.

controller/api/public/stat_summary.go Outdated Show resolved Hide resolved
controller/api/public/stat_summary.go Outdated Show resolved Hide resolved
controller/api/public/stat_summary.go Outdated Show resolved Hide resolved
controller/api/public/stat_summary.go Outdated Show resolved Hide resolved
@grampelberg
Copy link
Contributor

The CLI subcommand: we can use linkerd stat trafficsplit or create a new CLI subcommand linkerd trafficsplit

From a product perspective, I would like to either have this be linkerd stat ts or linkerd describe ts. linkerd trafficsplit was the command I started out avoiding in the first place.

As I mention in a comment, I think that we should be looking at "outbound" stats here.

+1, the concept of TrafficSplit requires it to be measured on the outbound.

@scottcarol scottcarol force-pushed the cscott/demo-ts-metrics branch from a4b8790 to 2e6723f Compare August 4, 2019 23:10
@l5d-bot
Copy link
Collaborator

l5d-bot commented Aug 4, 2019

Integration test results for 2e6723f: fail 😕
Log output: https://gist.github.com/18b4ae38922273530cc4c1f9b476cf1d

@scottcarol
Copy link
Contributor Author

When querying Prometheus for trafficsplit stats, we will utilize the proxy changes from proxy PR 296. The details of that PR are not finalized and will likely result in some changes to the Prometheus queries in this PR.

For now, you can test this PR with your own trafficsplit, using @adleong's proxy branch:
linkerd upgrade --proxy-version alex-dst-metrics-3 (be sure to delete all pods in the namespace with the trafficsplit before testing)

If you need a trafficsplit setup you can use mine.

Note: This PR adds trafficsplit tests to /cli/cmd/stat_test.go. I am adding trafficsplit tests to /cli/cmd/stat_test.go in a separate PR.

@l5d-bot
Copy link
Collaborator

l5d-bot commented Aug 6, 2019

Integration test results for e2ed14e: success 🎉
Log output: https://gist.github.com/ba0da47d6f5f24a751f8d07084118022

@l5d-bot
Copy link
Collaborator

l5d-bot commented Aug 7, 2019

Integration test results for 410c462: success 🎉
Log output: https://gist.github.com/68c7bfa16af501e6d3f5a8b344bee265

@l5d-bot
Copy link
Collaborator

l5d-bot commented Aug 7, 2019

Integration test results for f7e19a0: fail 😕
Log output: https://gist.github.com/994be61ffb888f26f771513def6ded93

cli/cmd/stat.go Outdated Show resolved Hide resolved
cli/cmd/stat.go Outdated Show resolved Hide resolved
cli/cmd/stat.go Outdated Show resolved Hide resolved
// insert a regex-match check into a LabelSet for labels that match the provided
// string. this is modeled on generateLabelStringWithExclusion().
// the regex will match the first section of the string before a period, ex.
// `dst="authors.default.svc.cluster.local:7001"` will return true for a stringToMatch of `authors`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we probably need to match the second section against the service namespace as well. In fact, we can construct the entire authority exactly except for the port. So I think the regex can just be: authority=~"%s.%s.svc.%s:\d+" or something like that

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In an authority authors.default.svc.cluster.local:7001 we have authors from the apex service name and default from namespace. svc is a given. will an authority always have cluster and local as part of the address?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cluster.local is harded as the cluster domain in a number of places but there is currently ongoing work to replace that hardcoded value with the cluster domain from the Linkerd configmap.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the context. Given this ongoing work, should I avoid hardcoding cluster.local into the authority address, and if so, what's the best way to extract the cluster domain so I can ensure I'm checking for the right string?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm now matching to authority=~"^authors.default.svc.+" using the apex service and the service namespace.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you look at controller/cmd/destination/main.go you can see how the cluster domain is loaded from the global config at startup, and then passed into destination.newServer. We could do the same thing in the public-api main and pass the cluster domain into NewServer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great tip, thanks. To keep this PR trafficsplit-specific, I'd prefer to do that in a separate PR after this is merged, and leave the current regex matching as is. WDYT?

controller/api/public/stat_summary.go Outdated Show resolved Hide resolved
controller/api/public/stat_summary.go Outdated Show resolved Hide resolved
controller/api/public/stat_summary.go Outdated Show resolved Hide resolved
controller/api/public/stat_summary.go Outdated Show resolved Hide resolved
controller/api/public/stat_summary.go Outdated Show resolved Hide resolved
controller/k8s/api.go Outdated Show resolved Hide resolved
@l5d-bot
Copy link
Collaborator

l5d-bot commented Aug 10, 2019

Integration test results for 7f74c3b: success 🎉
Log output: https://gist.github.com/5071dcec8ec4c41ce24e08b197b30ff3

@l5d-bot
Copy link
Collaborator

l5d-bot commented Aug 12, 2019

Integration test results for 6f13c20: success 🎉
Log output: https://gist.github.com/178950711631ba3e87f8e97733149a9a

@scottcarol scottcarol requested a review from klingerf August 12, 2019 17:02
@scottcarol scottcarol self-assigned this Aug 12, 2019
@l5d-bot
Copy link
Collaborator

l5d-bot commented Aug 12, 2019

Integration test results for a4e41f6: fail 😕
Log output: https://gist.github.com/bb41bd4f2eebe008eab7320d652945d5

Copy link
Member

@adleong adleong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice job with the refactor, I think putting the traffic split logic into its own function makes things a lot clearer.

cli/cmd/stat.go Show resolved Hide resolved
cli/cmd/stat.go Outdated Show resolved Hide resolved
cli/cmd/stat.go Outdated Show resolved Hide resolved
cli/cmd/stat.go Outdated Show resolved Hide resolved
cli/cmd/stat.go Outdated Show resolved Hide resolved
controller/api/public/stat_summary.go Outdated Show resolved Hide resolved
controller/api/public/stat_summary.go Outdated Show resolved Hide resolved
labels = labels.Merge(promDirectionLabels("outbound"))
}

labelNames[1] = model.LabelName("dst_service") // replacing "trafficsplit" with "dst_service"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit confused on what this is doing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I refactored this, let me know if it's more clear. buildTrafficSplitRequestLabels handles four different query scenarios, and needs to output different labels for each:

stat ts -A (namespace is not specified)
stat ts -n foo
stat ts --from resource/name
stat ts --to resource/name

controller/api/public/stat_summary.go Outdated Show resolved Hide resolved
controller/api/public/prometheus.go Outdated Show resolved Hide resolved
@grampelberg
Copy link
Contributor

@scottcarol did you mean to have this still be a draft PR?

@scottcarol scottcarol marked this pull request as ready for review August 12, 2019 21:46
Copy link
Member

@adleong adleong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking really good!

cli/cmd/stat.go Outdated Show resolved Hide resolved
cli/cmd/testdata/stat_one_ts_output_json.golden Outdated Show resolved Hide resolved
controller/api/public/stat_summary.go Outdated Show resolved Hide resolved
controller/api/public/stat_summary.go Outdated Show resolved Hide resolved
controller/api/public/stat_summary.go Show resolved Hide resolved
controller/api/public/stat_summary.go Show resolved Hide resolved
controller/api/public/stat_summary.go Outdated Show resolved Hide resolved
cli/cmd/stat.go Outdated Show resolved Hide resolved
@l5d-bot
Copy link
Collaborator

l5d-bot commented Aug 13, 2019

Integration test results for 4dc704c: success 🎉
Log output: https://gist.github.com/be81c2f75cb139194a1337ed715a98c4

@l5d-bot
Copy link
Collaborator

l5d-bot commented Aug 13, 2019

Integration test results for 1e85337: fail 😕
Log output: https://gist.github.com/de08b0be5fe400970f01f57073f71053

Copy link
Member

@adleong adleong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⭐ Nice work! Code looks great to me! I will test on the CLI a bit also...

@scottcarol
Copy link
Contributor Author

@grampelberg have you gotten a chance to test this out?

@grampelberg
Copy link
Contributor

On it!

@adleong
Copy link
Member

adleong commented Aug 13, 2019

(edit: disregard this. user error.)

@scottcarol scottcarol force-pushed the cscott/demo-ts-metrics branch from a7944f1 to 8c75e30 Compare August 14, 2019 00:15
@l5d-bot
Copy link
Collaborator

l5d-bot commented Aug 14, 2019

Integration test results for 8c75e30: fail 😕
Log output: https://gist.github.com/059426bdecd14a679121dd60b389771a

Copy link
Contributor

@klingerf klingerf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks really great. I just had some minor feedback and a few organizational comments, but the implementation is solid. Nice work!

controller/api/public/stat_summary.go Outdated Show resolved Hide resolved
controller/api/public/stat_summary.go Outdated Show resolved Hide resolved
controller/api/public/stat_summary.go Outdated Show resolved Hide resolved
controller/api/public/stat_summary.go Outdated Show resolved Hide resolved
cli/cmd/stat.go Outdated Show resolved Hide resolved
cli/cmd/testdata/stat_one_ts_output_json.golden Outdated Show resolved Hide resolved
Type: req.GetSelector().GetResource().GetType(),
},
TimeWindow: req.TimeWindow,
Stats: tsBasicStats[currentLeaf],
Copy link
Contributor

@klingerf klingerf Aug 14, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok this is super subtle, but I don't think we want this API to return rows for traffic splits that don't have any corresponding metrics in prometheus, when a --to or --from flag is specified.

By way of example, in my default namespace, I'm running the books app, with an additional books service (books-v2). When I query for authorities, I see:

$ bin/linkerd stat au
NAME                                      MESHED   SUCCESS      RPS   LATENCY_P50   LATENCY_P95   LATENCY_P99   TCP_CONN
                                               -   100.00%   0.5rps           3ms           9ms          10ms          -
authors.default.svc.cluster.local:7001         -    77.09%   6.0rps           7ms          35ms          70ms          -
books-v2.default.svc.cluster.local:7002        -    75.00%   1.2rps          13ms         125ms         185ms          -
books.default.svc.cluster.local:7002           -    81.70%   6.5rps           9ms          86ms          97ms          -
webapp.default.svc.cluster.local:7000          -    78.70%   6.4rps          32ms          95ms         171ms          -

If I then add a --from flag, that filters the rows returned to only those that are called by the from resource. For instance:

$ bin/linkerd stat au --from deploy/webapp
NAME                                     MESHED   SUCCESS      RPS   LATENCY_P50   LATENCY_P95   LATENCY_P99   TCP_CONN
authors.default.svc.cluster.local:7001        -   100.00%   3.1rps          12ms          47ms          86ms          -
books.default.svc.cluster.local:7002          -    77.96%   7.0rps          12ms          83ms          97ms          -

Or:

$ bin/linkerd stat au --from deploy/books
NAME                                     MESHED   SUCCESS      RPS   LATENCY_P50   LATENCY_P95   LATENCY_P99   TCP_CONN
authors.default.svc.cluster.local:7001        -    46.60%   3.2rps           5ms          11ms          18ms          -

But that's not the behavior that we're seeing with traffic splits. If I query for all traffic splits in my default namespace, I see:

$ bin/linkerd stat ts
NAME          APEX    LEAF       WEIGHT   SUCCESS      RPS   LATENCY_P50   LATENCY_P95   LATENCY_P99
books-split   books   books        800m    79.63%   6.4rps          12ms          75ms         120ms
books-split   books   books-v2     200m    77.00%   1.7rps          20ms          33ms          39ms

If I add a --to flag, however, to filter to only those traffic splits that are hitting svc/books-v2, I see:

$ bin/linkerd stat ts --to svc/books-v2
NAME          APEX    LEAF       WEIGHT   SUCCESS      RPS   LATENCY_P50   LATENCY_P95   LATENCY_P99
books-split   books   books        800m         -        -             -             -             -
books-split   books   books-v2     200m    77.11%   1.4rps          10ms          62ms          92ms

Notice how the books row still shows up, even though it doesn't have any stats. For parity with how stat works for other resources, I think we should drop that row.

And the good news is that there's a very straightforward fix for this, by changing this func as follows:

diff --git a/controller/api/public/stat_summary.go b/controller/api/public/stat_summary.go
index 3763c2b9..5f0e900d 100644
--- a/controller/api/public/stat_summary.go
+++ b/controller/api/public/stat_summary.go
@@ -310,6 +310,13 @@ func (s *grpcServer) trafficSplitResourceQuery(ctx context.Context, req *pb.Stat
 				Apex:      tsStats.apex,
 				Leaf:      name,
 			}
+			stats, ok := tsBasicStats[currentLeaf]
+			if !ok && !req.SkipStats {
+				switch req.Outbound.(type) {
+				case *pb.StatSummaryRequest_ToResource, *pb.StatSummaryRequest_FromResource:
+					continue
+				}
+			}
 
 			trafficSplitStats := &pb.TrafficSplitStats{
 				Apex:   tsStats.apex,
@@ -324,7 +331,7 @@ func (s *grpcServer) trafficSplitResourceQuery(ctx context.Context, req *pb.Stat
 					Type:      req.GetSelector().GetResource().GetType(),
 				},
 				TimeWindow: req.TimeWindow,
-				Stats:      tsBasicStats[currentLeaf],
+				Stats:      stats,
 				TsStats:    trafficSplitStats,
 			}
 			rows = append(rows, &row)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for testing this out and thinking about this. You make a good point that there's not exact parity here between ts and other resources.

My preference is to continue showing a row for each leaf within a trafficsplit even if there are no metrics returned, because in the case of traffic splitting, the absence of traffic is important information to validate that the split is working.

Also, if you are running watch linkerd stat ts as you apply weight changes within a split, showing a row for each leaf maintains a consistent display even if a leaf service you are phasing out goes down to 0 (or up from 0), which is more readable than a row popping in or out of view. WDYT?

You're right that we'll need to be specific in the descriptions about what --from and --to means in the case of linkerd stat ts. This is a very wordy attempt that needs to be more concise:

linkerd stat ts —from deploy/authors
Show all trafficsplit leaf services in the default namespace, and metrics for any traffic coming to those services from the authors deployment.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally, I also think it's important to show the traffic split branches with no metrics even though this is inconsistent with how other resources behave.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, this all sounds good to me. I just wanted to clarify that if a traffic split has ever received any traffic that matches the query pattern, then it will still be displayed, even if it currently is not receiving any traffic. My proposed change would only exclude rows that have never seen a single request matching the query pattern.

But I agree that the nature of trafficsplits is different enough from other resources to warrant showing all branches in the stat output.

Copy link
Member

@adleong adleong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've played with this a bunch and it's all behaving as I would expect. I also tried running the a stable-2.4 CLI against a cluster with this change and vice versa and didn't see anything unexpected.
👍

@l5d-bot
Copy link
Collaborator

l5d-bot commented Aug 14, 2019

Integration test results for 1507ba5: fail 😕
Log output: https://gist.github.com/7f24c75d03e82b54d33baf3f59bf000d

@grampelberg
Copy link
Contributor

LGTM! 🚢

Copy link
Contributor

@klingerf klingerf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⭐️ 📈 Thanks for addressing my feedback! This looks great to me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants