Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

plugin/kubernetes: Fix dns programming duration metric #4255

Merged
merged 13 commits into from
Dec 1, 2020

Conversation

chrisohaver
Copy link
Member

@chrisohaver chrisohaver commented Nov 3, 2020

1. Why is this pull request needed and what does it do?

Currently, the DNSProgrammingLatency metric does not produce any data. This is because it relies on information which has been cleared by the time we calculate the latency and record the metric. The unit tests for the metric pass, because they dubiously disable the clearing of the data during the test (skipCleanup bool).

This PR does the following

  1. Fix the DNSProgrammingLatency metric by collecting required info from original object before clearing it out.
  2. Remove skipCleanup boolean to simplify code
  3. Correct the tests that relied on skipCleanup

2. Which issues (if any) are related?

Closes: #4244 #4253

3. Which documentation changes (if any) need to be made?

4. Does this introduce a backward incompatible change or deprecation?

Signed-off-by: Chris O'Haver <cohaver@infoblox.com>
Signed-off-by: Chris O'Haver <cohaver@infoblox.com>
Signed-off-by: Chris O'Haver <cohaver@infoblox.com>
Signed-off-by: Chris O'Haver <cohaver@infoblox.com>
Signed-off-by: Chris O'Haver <cohaver@infoblox.com>
plugin/kubernetes/controller.go Outdated Show resolved Hide resolved
Signed-off-by: Chris O'Haver <cohaver@infoblox.com>
@chrisohaver
Copy link
Member Author

chrisohaver commented Nov 3, 2020

... a data race. I'm guessing this might have been the specific "problem" the original author was averting with the "skipCleanup" bool. I wonder if this is really just an issue with "fakeClient" or if setting the objects to empty structs is inherently unsafe. If we can confidently know that it's just an issue with the "fakeClient" we can exclude this test from race testing. But seems reasonable that it could be an issue with a real client.

@chrisohaver
Copy link
Member Author

But seems reasonable that it could be an issue with a real client.

I think the issue is that we are adding the object ourselves in the test. If we ever had CoreDNS actually add a service pod or endpoint, this might be a real issue, but we do not. If thats correct, I think we can safely skip race detection here.

Signed-off-by: Chris O'Haver <cohaver@infoblox.com>
Signed-off-by: Chris O'Haver <cohaver@infoblox.com>
@chrisohaver
Copy link
Member Author

I've refactored the latency unit test to not use the fakeClient.

@codecov-io
Copy link

codecov-io commented Nov 6, 2020

Codecov Report

Merging #4255 (69d4fe9) into master (f286a24) will decrease coverage by 0.85%.
The diff coverage is 0.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #4255      +/-   ##
==========================================
- Coverage   55.85%   55.00%   -0.86%     
==========================================
  Files         222      223       +1     
  Lines        9875     9909      +34     
==========================================
- Hits         5516     5450      -66     
- Misses       3895     4014     +119     
+ Partials      464      445      -19     
Impacted Files Coverage Δ
plugin/kubernetes/controller.go 0.00% <0.00%> (-38.12%) ⬇️
plugin/trace/trace.go 67.30% <0.00%> (-3.29%) ⬇️
plugin/trace/setup.go 66.21% <0.00%> (-2.28%) ⬇️
plugin/pkg/tls/tls.go 70.21% <0.00%> (-1.22%) ⬇️
plugin/pkg/doh/doh.go 60.78% <0.00%> (ø)
plugin/dnstap/writer.go 61.90% <0.00%> (ø)
plugin/dnstap/handler.go 100.00% <0.00%> (ø)
plugin/dnstap/dnstapio/io.go
plugin/dnstap/dnstapio/dnstap_encoder.go
... and 6 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f286a24...1f70e50. Read the comment docs.

}
return nil
}
}
}

func cleanObj(i interface{}) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find it way more obvious if the endpoint cleanup happens in the To functions, this is too hidden

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it happens in the To functions, the data is erased before it is needed when calculating latency after processing is done. Currently, this happens, and it causes the latency metric to never report anything.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no. cleaning up in the To functions is where this code belong, now it becomes hidden and we can only hope the switch will catch all different types

return nil, fmt.Errorf("unexpected object %v", obj)
}
return toService(skipCleanup, svc), nil
func ToService(obj interface{}) (interface{}, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should not facilitate the use of interface{} any more than needed. It with my previous comment to keep the niling this can be kept as-is.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If (even) the reworked tests, still needs tweaks in the code used for talking to k8s it should just go

The tweaks in the production code in this PR are only done to remove skipCleanup boolean, and to fix the latency feature, which currently doesn't work in reality (because it relies on skipCleanup = true ).

@miekg
Copy link
Member

miekg commented Nov 6, 2020

I really don't care about this metrics tests. I care about clean(er) code in this plugin. If (even) the reworked tests, still needs tweaks in the code used for talking to k8s it should just go (as I did in my PR). Moving to circle-ci seems to be the best option.

@chrisohaver
Copy link
Member Author

Apologies, I really didn't make it clear at all that I was primarily fixing the latency reporting metric in this PR, and then adapting the tests to be able to work with the fix.

The 3 things done in this PR...

  1. Fix endpoint object latency metrics (previously the metric was not reported) - this entailed moving the cleanup to after the latency calculation.
  2. remove the skipCleanup boolean - resulting in always cleaning up, but this breaks the unit test, which relies on skip cleanup...
  3. refactor unit test to work without skip cleanup. Ultimately i had to ditch fakeclient, because the object clean up causes a race condition when the test is acting as both api client and server.

@chrisohaver chrisohaver changed the title plugin/kubernetes: refactor dns programming duration metric plugin/kubernetes: Fix dns programming duration metric Nov 6, 2020
@miekg
Copy link
Member

miekg commented Nov 9, 2020

nacking this.
This should be tested in some e2e fashion, not but again making code less readable esp, in an area where we need to jump to a lot of hoops to get where we are.

@chrisohaver
Copy link
Member Author

OK. How about we just fix the metric in this PR, and drop all the metric related unit tests.

…calling toFuncs

Signed-off-by: Chris O'Haver <cohaver@infoblox.com>
@chrisohaver
Copy link
Member Author

OK, I have refactored things to move the cleanup step back into the toFuncs.

I replaced the latencyFunc with a LatencyRecorder, which has an init separate from a record function, to allow the Processor to get the trigger timestamp and parent services before calling the toFuncs, and write the metric after calling toFuncs and updating index.

I left the unit tests in for now. Since they test production code, I think they have value, but I will remove them if you insist. None of the code changes here are made to assist or enable the unit tests. The tests are unit tests, i.e. they don't do a full e2e test, so adding e2e tests are still useful from a e2e standpoint.

Signed-off-by: Chris O'Haver <cohaver@infoblox.com>
@chrisohaver
Copy link
Member Author

@miekg, PTAL. I've taken your advice and moved the cleanup back into the ToFuncs, which has resulted in easier to read code.

@chrisohaver
Copy link
Member Author

The DNS programming latency metric tests in coredns/ci pass with this PR.

Signed-off-by: Chris O'Haver <cohaver@infoblox.com>
Signed-off-by: Chris O'Haver <cohaver@infoblox.com>
Signed-off-by: Chris O'Haver <cohaver@infoblox.com>
@chrisohaver
Copy link
Member Author

@miekg, PTAL. Per your request, I have ...

  • removed the latency metric unit tests
  • changed the ToFuncs and ToFunc type to not use empty interfaces

@chrisohaver
Copy link
Member Author

pinging plugin/kubernetes owners: @bradbeam @johnbelamaric @miekg @rajansandeep @yongtang @zouyee

@chrisohaver chrisohaver merged commit 9121e78 into coredns:master Dec 1, 2020
@chrisohaver chrisohaver deleted the fix-metric branch January 9, 2021 14:45
@miekg miekg mentioned this pull request Jan 12, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants