[CONTP-545] notify leader election subscribers on leadership state change #32323
Conversation
Regression Detector Results
Metrics dashboard
Baseline: 188825c
Optimization Goals: ✅ No significant changes detected

perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links
---|---|---|---|---|---|---
➖ | tcp_syslog_to_blackhole | ingress throughput | +2.10 | [+2.02, +2.17] | 1 | Logs |
➖ | quality_gate_logs | % cpu utilization | +1.10 | [-1.87, +4.06] | 1 | Logs |
➖ | otel_to_otel_logs | ingress throughput | +0.98 | [+0.26, +1.69] | 1 | Logs |
➖ | file_to_blackhole_0ms_latency_http2 | egress throughput | +0.10 | [-0.68, +0.89] | 1 | Logs |
➖ | file_to_blackhole_1000ms_latency_linear_load | egress throughput | +0.08 | [-0.39, +0.54] | 1 | Logs |
➖ | quality_gate_idle | memory utilization | +0.03 | [-0.01, +0.08] | 1 | Logs bounds checks dashboard |
➖ | file_to_blackhole_100ms_latency | egress throughput | +0.03 | [-0.67, +0.74] | 1 | Logs |
➖ | uds_dogstatsd_to_api | ingress throughput | +0.01 | [-0.07, +0.10] | 1 | Logs |
➖ | tcp_dd_logs_filter_exclude | ingress throughput | +0.00 | [-0.01, +0.01] | 1 | Logs |
➖ | file_to_blackhole_0ms_latency_http1 | egress throughput | -0.01 | [-0.86, +0.84] | 1 | Logs |
➖ | file_to_blackhole_300ms_latency | egress throughput | -0.01 | [-0.64, +0.62] | 1 | Logs |
➖ | file_to_blackhole_500ms_latency | egress throughput | -0.01 | [-0.78, +0.76] | 1 | Logs |
➖ | file_to_blackhole_0ms_latency | egress throughput | -0.05 | [-0.93, +0.84] | 1 | Logs |
➖ | file_to_blackhole_1000ms_latency | egress throughput | -0.08 | [-0.88, +0.72] | 1 | Logs |
➖ | file_tree | memory utilization | -0.20 | [-0.35, -0.06] | 1 | Logs |
➖ | uds_dogstatsd_to_api_cpu | % cpu utilization | -0.72 | [-1.44, +0.00] | 1 | Logs |
➖ | quality_gate_idle_all_features | memory utilization | -0.94 | [-1.07, -0.80] | 1 | Logs bounds checks dashboard |
Bounds Checks: ✅ Passed
perf | experiment | bounds_check_name | replicates_passed | links |
---|---|---|---|---|
✅ | file_to_blackhole_0ms_latency | lost_bytes | 10/10 | |
✅ | file_to_blackhole_0ms_latency | memory_usage | 10/10 | |
✅ | file_to_blackhole_0ms_latency_http1 | lost_bytes | 10/10 | |
✅ | file_to_blackhole_0ms_latency_http1 | memory_usage | 10/10 | |
✅ | file_to_blackhole_0ms_latency_http2 | lost_bytes | 10/10 | |
✅ | file_to_blackhole_0ms_latency_http2 | memory_usage | 10/10 | |
✅ | file_to_blackhole_1000ms_latency | memory_usage | 10/10 | |
✅ | file_to_blackhole_1000ms_latency_linear_load | memory_usage | 10/10 | |
✅ | file_to_blackhole_100ms_latency | lost_bytes | 10/10 | |
✅ | file_to_blackhole_100ms_latency | memory_usage | 10/10 | |
✅ | file_to_blackhole_300ms_latency | lost_bytes | 10/10 | |
✅ | file_to_blackhole_300ms_latency | memory_usage | 10/10 | |
✅ | file_to_blackhole_500ms_latency | lost_bytes | 10/10 | |
✅ | file_to_blackhole_500ms_latency | memory_usage | 10/10 | |
✅ | quality_gate_idle | memory_usage | 10/10 | bounds checks dashboard |
✅ | quality_gate_idle_all_features | memory_usage | 10/10 | bounds checks dashboard |
✅ | quality_gate_logs | lost_bytes | 10/10 | |
✅ | quality_gate_logs | memory_usage | 10/10 | |
Explanation
Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%
Performance changes are noted in the perf column of each table:
- ✅ = significantly better comparison variant performance
- ❌ = significantly worse comparison variant performance
- ➖ = no significant change in performance
A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".
For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:
- Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
- Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
- Its configuration does not mark it "erratic".
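As a worked example against the first table above: tcp_syslog_to_blackhole has Δ mean % = +2.10 with a 90.00% CI of [+2.02, +2.17]; the interval excludes zero, but |Δ mean %| is below the 5.00% effect size tolerance, so the change is not flagged as a regression.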
CI Pass/Fail Decision
✅ Passed. All Quality Gates passed.
- quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check lost_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.
Test changes on VM
Use this command from test-infra-definitions to manually test this PR's changes on a VM: inv aws.create-vm --pipeline-id=51468540 --os-family=ubuntu
Note: This applies to commit 290d4ba
Uncompressed package size comparison
Comparison with ancestor
Diff per package
Decision
case <-c.leadershipStateNotif:
    if c.isLeaderFunc() {
General question: Are we certain that the `isLeaderFunc()` return value is always updated before receiving the `leadershipStateNotif` notification?
What I'd like to confirm is that there's no potential race condition between these two lines. For example, the notification might be sent because the DCA instance became the leader, but this information hasn't yet propagated to the `c.isLeaderFunc()` output. This could result in skipping the execution of `c.triggerReconciliation()`.
Would it make sense to add a log in the else case, stating: "Received a notification but not recognized as leader yet"?
The notification is never sent before the leadership status is fully updated and propagated to `isLeader`.
Here is where `notify` is called. You can see that we synchronously call `updateLeaderIdentity` before we notify subscribers.
Finally, `IsLeader` has a very simple implementation. Once we call `updateLeaderIdentity`, the change is instantly visible in the `isLeader` func result.
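To make that ordering concrete, here is a minimal Go sketch of the flow described above — not the actual datadog-agent code. The `LeaderEngine` fields and the `onLeaderChange` callback are illustrative stand-ins; only the call order (update the leader identity, then notify) and the simple shape of `IsLeader` follow the discussion.

```go
package leaderelection

import "sync"

// Pared-down stand-in for the real LeaderEngine; fields are illustrative.
type LeaderEngine struct {
	HolderIdentity string

	mu             sync.RWMutex
	leaderIdentity string
}

// updateLeaderIdentity records the new leader before anyone is notified.
func (le *LeaderEngine) updateLeaderIdentity(name string) {
	le.mu.Lock()
	defer le.mu.Unlock()
	le.leaderIdentity = name
}

// IsLeader is intentionally trivial: it compares the stored leader identity
// with this instance's own identity, so any completed updateLeaderIdentity
// call is immediately visible here.
func (le *LeaderEngine) IsLeader() bool {
	le.mu.RLock()
	defer le.mu.RUnlock()
	return le.leaderIdentity == le.HolderIdentity
}

// onLeaderChange is a hypothetical callback showing the ordering: the state
// is updated synchronously first, and only then are subscribers notified.
func (le *LeaderEngine) onLeaderChange(newLeader string) {
	le.updateLeaderIdentity(newLeader)
	le.notify() // wake up subscribers; notify is discussed further down
}

// notify is stubbed here; its real behavior is covered later in this thread.
func (le *LeaderEngine) notify() {}
```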
> Would it make sense to add a log in the else case, stating: "Received a notification but not recognized as leader yet"?

To be clear, the notification does not tell the subscriber that the instance has become the leader; it tells it that a change has occurred in the leadership state.
The subscriber is responsible for handling the logic based on its use case, and `isLeaderFunc` is provided to determine the live state.
To add, I already added unit tests for the subscription feature which validate that the change has already been propagated to `isLeaderFunc` when the notification is sent.
datadog-agent/pkg/util/kubernetes/apiserver/leaderelection/leaderelection_test.go
Lines 237 to 239 in 290d4ba
case <-notif1:
    counter1++
    require.Truef(t, le.IsLeader(), "Expected to be leader")
// Calling Subscribe is optional, use IsLeader if the client doesn't need an event-based approach.
func (le *LeaderEngine) Subscribe() <-chan struct{} {
    c := make(chan struct{}, 5) // buffered channel to avoid blocking in case of stuck subscriber
Q: How did this leadership election refactor address the possible blocking in the case of stuck subscribers? Do we still need to make the channel's buffer larger?
Check the implementation of the notify function.
If the buffer is already full, no new notifications are placed on the channel.
We only notify on the subscriber's channel if the channel is already empty, so we will never get stuck when writing to the channel.
basically this
Gotcha, that makes sense. I'm trying to think of edge cases in leadership election (such as when leadership may be gained, lost, and then regained quickly) and how this might affect the overall state.
In such a case where:
- Cluster Agent X is elected leader and sends a notification to its subscribers via `notify()`.
- The subscriber does not get CPU time to process the notification.
- Cluster Agent X is then removed as leader and attempts to send a notification to its subscribers via `notify()`.
- However, in this case, since the channel is full (buffer len = 1), it will not send a second notification.
- The subscriber is finally scheduled on the CPU, only receives the first leader notification, and doesn't know about the lost-leadership notification.

The buffer length of 5 provided some resiliency in this type of situation, I believe, but it also isn't perfect itself.
> then regained quickly

The leadership lease duration is 60 seconds, so there is no such "quick" back-and-forth.
Regardless, it is still OK even in the case you explained:

> and doesn't know about the lost leadership notification.

This is not quite accurate. There is no difference between the notification sent in case of acquisition of leadership and the one sent in case of loss of leadership. In both cases, the notification is exactly the same: an empty struct `struct{}{}`. A notification sent on the channel does not tell the subscriber whether or not the current instance is the leader. It only tells the subscriber that there has been a change to the leadership state of the current instance. The subscriber can then determine the live leadership state via `IsLeaderFunc`, which is also returned in the subscribe method.
In other words, these notifications are a sort of alert telling subscribers that they might need to check the new leadership state because there has been some change. So, if the subscriber hasn't yet consumed the previous alert, it is pointless to send another identical alert. We can just keep the old alert and the subscriber will pick it up at some point in time when it has enough CPU.
This is made clear in the method comment.
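To illustrate this contract, here is a minimal subscriber sketch. It mirrors the `leadershipStateNotif` / `isLeaderFunc` snippet quoted at the top of this thread, but the `controller` type, its `stopCh`, and the `triggerReconciliation` body are hypothetical, not taken from the PR.

```go
package leaderelection

// Hypothetical subscriber: it holds the channel obtained from Subscribe and
// the isLeaderFunc used to read the live leadership state.
type controller struct {
	leadershipStateNotif <-chan struct{}
	isLeaderFunc         func() bool
	stopCh               chan struct{}
}

func (c *controller) run() {
	for {
		select {
		case <-c.leadershipStateNotif:
			// The notification only means "the leadership state changed";
			// the current state is always read from isLeaderFunc.
			if c.isLeaderFunc() {
				c.triggerReconciliation()
			}
		case <-c.stopCh:
			return
		}
	}
}

// triggerReconciliation is a placeholder for whatever leader-only work the
// subscriber performs.
func (c *controller) triggerReconciliation() {}
```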
Thanks for the explanation, Adel, about tracking changes.
So it seems that in your implementation, if the leader Cluster Agent first sends a notification when it loses leadership, the subscriber doesn't have time to read it, and the agent then sends another notification when it regains leadership, the subscriber may later get unblocked and process the notification even though there has been no perceived change in leader. Previously it would have been aware of the two changes in leadership that occurred, via the buffer. I don't know if this slight change in behavior is relevant for consistency, but it seems fine.
Overall it looks good to me if this isn't a concern.
Previously, the notification was sent only when the instance became the leader; no notification was sent in case of leadership loss.
Therefore, in the scenario you described, in both implementations only one notification is sent.
Perhaps what you meant is that if the instance acquires leadership, loses it, then acquires leadership again, the old implementation would have produced two notifications regardless of whether the subscriber consumed the first one or not. The new implementation will ideally send at most three notifications, assuming the subscriber consumes each notification immediately. However, if the subscriber doesn't consume the first notification at all, the second and third notifications won't be sent. If the subscriber is later able to consume the notification on the channel, it will then be able to check the `IsLeader` function.
It is important to note that the subscriber doesn't care about the history of changes in the leadership state; in practice this information is not useful in most cases. Subscribers are only interested in being notified when the state has changed. From the subscriber's point of view, "the leadership state changed once" and "the leadership state changed N times" are essentially equivalent, because the subscriber cares not about the history of changes but about the most recent state.
If at some point in the future we have a use case where the subscriber is interested in knowing the number of changes that occurred in the past, we can modify the code to support that. For now, this use case doesn't exist and is not expected to appear any time soon.
I hope this makes sense.
func (le *LeaderEngine) notify() {
    le.m.Lock()
    defer le.m.Unlock()

    for _, s := range le.subscribers {
        if len(s) > 0 {
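For readability, below is a hedged completion of the truncated excerpt above. Only the skip-when-pending behavior is taken from the earlier explanation ("we only notify on the subscriber's channel if the channel is already empty"); the reduced struct fields and the comments are a sketch, not the exact diff.

```go
package leaderelection

import "sync"

// Reduced to the fields the method needs; the real LeaderEngine has more.
type LeaderEngine struct {
	m           sync.Mutex
	subscribers []chan struct{} // channels handed out by Subscribe are buffered
}

func (le *LeaderEngine) notify() {
	le.m.Lock()
	defer le.m.Unlock()

	for _, s := range le.subscribers {
		if len(s) > 0 {
			// A notification is already pending on this subscriber's channel;
			// another identical struct{}{} would add no information, so skip
			// it instead of risking a blocking write.
			continue
		}
		s <- struct{}{}
	}
}
```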
We should perhaps add a comment saying this code is racy, BUT there is no way to lose a state: even if the reader reads after the `len(s) > 0` check, it's still going to use the `isLeaderFunc` and retrieve the last accurate state.
/merge
Devflow running:
What does this PR do?
The leader election engine used to elect the cluster agent leader already provides subscription functionality in order to notify subscribers when the current process becomes the leader.
This PR extends the existing behaviour so that:
Motivation
Improve code reusability:
The refactor done in this PR aims at unifying how all these use cases handle changes in the current process's leadership state, without needing to manually implement (almost) the same logic every time.
Describe how you validated your changes
Unit tests and e2e tests are already in place:
Teams owning the concerned components (like remote config and the admission controller webhook) might want to perform some manual QA.
Otherwise, this can be marked as done based on the unit tests and E2E tests.
Possible Drawbacks / Trade-offs
Additional Notes