-
Notifications
You must be signed in to change notification settings - Fork 2.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
plugin/kubernetes: Fix dns programming duration metric #4255
Conversation
Signed-off-by: Chris O'Haver <cohaver@infoblox.com>
Signed-off-by: Chris O'Haver <cohaver@infoblox.com>
Signed-off-by: Chris O'Haver <cohaver@infoblox.com>
Signed-off-by: Chris O'Haver <cohaver@infoblox.com>
Signed-off-by: Chris O'Haver <cohaver@infoblox.com>
Signed-off-by: Chris O'Haver <cohaver@infoblox.com>
... a data race. I'm guessing this might have been the specific "problem" the original author was averting with the "skipCleanup" bool. I wonder if this is really just an issue with "fakeClient" or if setting the objects to empty structs is inherently unsafe. If we can confidently know that it's just an issue with the "fakeClient" we can exclude this test from race testing. But seems reasonable that it could be an issue with a real client. |
I think the issue is that we are adding the object ourselves in the test. If we ever had CoreDNS actually add a service pod or endpoint, this might be a real issue, but we do not. If thats correct, I think we can safely skip race detection here. |
Signed-off-by: Chris O'Haver <cohaver@infoblox.com>
I've refactored the latency unit test to not use the fakeClient. |
Codecov Report
@@ Coverage Diff @@
## master #4255 +/- ##
==========================================
- Coverage 55.85% 55.00% -0.86%
==========================================
Files 222 223 +1
Lines 9875 9909 +34
==========================================
- Hits 5516 5450 -66
- Misses 3895 4014 +119
+ Partials 464 445 -19
Continue to review full report at Codecov.
|
plugin/kubernetes/object/informer.go
Outdated
} | ||
return nil | ||
} | ||
} | ||
} | ||
|
||
func cleanObj(i interface{}) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find it way more obvious if the endpoint cleanup happens in the To functions, this is too hidden
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it happens in the To functions, the data is erased before it is needed when calculating latency after processing is done. Currently, this happens, and it causes the latency metric to never report anything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no. cleaning up in the To functions is where this code belong, now it becomes hidden and we can only hope the switch will catch all different types
plugin/kubernetes/object/service.go
Outdated
return nil, fmt.Errorf("unexpected object %v", obj) | ||
} | ||
return toService(skipCleanup, svc), nil | ||
func ToService(obj interface{}) (interface{}, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should not facilitate the use of interface{}
any more than needed. It with my previous comment to keep the nil
ing this can be kept as-is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If (even) the reworked tests, still needs tweaks in the code used for talking to k8s it should just go
The tweaks in the production code in this PR are only done to remove skipCleanup boolean, and to fix the latency feature, which currently doesn't work in reality (because it relies on skipCleanup = true ).
I really don't care about this metrics tests. I care about clean(er) code in this plugin. If (even) the reworked tests, still needs tweaks in the code used for talking to k8s it should just go (as I did in my PR). Moving to circle-ci seems to be the best option. |
Apologies, I really didn't make it clear at all that I was primarily fixing the latency reporting metric in this PR, and then adapting the tests to be able to work with the fix. The 3 things done in this PR...
|
nacking this. |
OK. How about we just fix the metric in this PR, and drop all the metric related unit tests. |
…calling toFuncs Signed-off-by: Chris O'Haver <cohaver@infoblox.com>
OK, I have refactored things to move the cleanup step back into the toFuncs. I replaced the latencyFunc with a LatencyRecorder, which has an init separate from a record function, to allow the Processor to get the trigger timestamp and parent services before calling the toFuncs, and write the metric after calling toFuncs and updating index. I left the unit tests in for now. Since they test production code, I think they have value, but I will remove them if you insist. None of the code changes here are made to assist or enable the unit tests. The tests are unit tests, i.e. they don't do a full e2e test, so adding e2e tests are still useful from a e2e standpoint. |
Signed-off-by: Chris O'Haver <cohaver@infoblox.com>
@miekg, PTAL. I've taken your advice and moved the cleanup back into the ToFuncs, which has resulted in easier to read code. |
The DNS programming latency metric tests in coredns/ci pass with this PR. |
Signed-off-by: Chris O'Haver <cohaver@infoblox.com>
Signed-off-by: Chris O'Haver <cohaver@infoblox.com>
Signed-off-by: Chris O'Haver <cohaver@infoblox.com>
@miekg, PTAL. Per your request, I have ...
|
pinging plugin/kubernetes owners: @bradbeam @johnbelamaric @miekg @rajansandeep @yongtang @zouyee |
1. Why is this pull request needed and what does it do?
Currently, the DNSProgrammingLatency metric does not produce any data. This is because it relies on information which has been cleared by the time we calculate the latency and record the metric. The unit tests for the metric pass, because they dubiously disable the clearing of the data during the test (skipCleanup bool).
This PR does the following
2. Which issues (if any) are related?
Closes: #4244 #4253
3. Which documentation changes (if any) need to be made?
4. Does this introduce a backward incompatible change or deprecation?