plugin/kubernetes: Fix dns programming duration metric #4255

chrisohaver · 2020-11-03T15:20:39Z

1. Why is this pull request needed and what does it do?

Currently, the DNSProgrammingLatency metric does not produce any data. This is because it relies on information which has been cleared by the time we calculate the latency and record the metric. The unit tests for the metric pass, because they dubiously disable the clearing of the data during the test (skipCleanup bool).

This PR does the following:

- Fix the DNSProgrammingLatency metric by collecting required info from original object before clearing it out.
- Remove skipCleanup boolean to simplify code
- Correct the tests that relied on skipCleanup

2. Which issues (if any) are related?

Closes: #4244 #4253

3. Which documentation changes (if any) need to be made?

4. Does this introduce a backward incompatible change or deprecation?

Signed-off-by: Chris O'Haver <cohaver@infoblox.com>

plugin/kubernetes/object/service.go

Signed-off-by: Chris O'Haver <cohaver@infoblox.com>

plugin/kubernetes/controller.go

Signed-off-by: Chris O'Haver <cohaver@infoblox.com>

chrisohaver · 2020-11-03T22:23:31Z

... a data race. I'm guessing this might have been the specific "problem" the original author was averting with the "skipCleanup" bool. I wonder if this is really just an issue with "fakeClient" or if setting the objects to empty structs is inherently unsafe. If we can confidently know that it's just an issue with the "fakeClient" we can exclude this test from race testing. But seems reasonable that it could be an issue with a real client.

chrisohaver · 2020-11-03T22:57:14Z

> But seems reasonable that it could be an issue with a real client.

I think the issue is that we are adding the object ourselves in the test. If we ever had CoreDNS actually add a service, pod, or endpoint, this might be a real issue, but we do not. If that's correct, I think we can safely skip race detection here.

Signed-off-by: Chris O'Haver <cohaver@infoblox.com>

chrisohaver · 2020-11-06T17:33:35Z

I've refactored the latency unit test to not use the fakeClient.

codecov-io · 2020-11-06T17:41:49Z

Codecov Report

Merging #4255 (69d4fe9) into master (f286a24) will decrease coverage by 0.85%.
The diff coverage is 0.00%.

@@            Coverage Diff             @@
##           master    #4255      +/-   ##
==========================================
- Coverage   55.85%   55.00%   -0.86%     
==========================================
  Files         222      223       +1     
  Lines        9875     9909      +34     
==========================================
- Hits         5516     5450      -66     
- Misses       3895     4014     +119     
+ Partials      464      445      -19

| Impacted Files | Coverage Δ | |
|---|---|---|
| plugin/kubernetes/controller.go | `0.00% <0.00%> (-38.12%)` | ⬇️ |
| plugin/trace/trace.go | `67.30% <0.00%> (-3.29%)` | ⬇️ |
| plugin/trace/setup.go | `66.21% <0.00%> (-2.28%)` | ⬇️ |
| plugin/pkg/tls/tls.go | `70.21% <0.00%> (-1.22%)` | ⬇️ |
| plugin/pkg/doh/doh.go | `60.78% <0.00%> (ø)` | |
| plugin/dnstap/writer.go | `61.90% <0.00%> (ø)` | |
| plugin/dnstap/handler.go | `100.00% <0.00%> (ø)` | |
| plugin/dnstap/dnstapio/io.go | | |
| plugin/dnstap/dnstapio/dnstap_encoder.go | | |

... and 6 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update f286a24...1f70e50.

miekg · 2020-11-06T17:42:22Z

plugin/kubernetes/object/informer.go

 			}
 			return nil
 		}
 	}
 }

+func cleanObj(i interface{}) {


I find it way more obvious if the endpoint cleanup happens in the To functions, this is too hidden

If it happens in the To functions, the data is erased before it is needed when calculating latency after processing is done. Currently, this happens, and it causes the latency metric to never report anything.

No. Cleaning up in the To functions is where this code belongs. Now it becomes hidden, and we can only hope the switch will catch all the different types.

miekg · 2020-11-06T17:44:20Z

plugin/kubernetes/object/service.go

-			return nil, fmt.Errorf("unexpected object %v", obj)
-		}
-		return toService(skipCleanup, svc), nil
+func ToService(obj interface{}) (interface{}, error) {


We should not facilitate the use of interface{} any more than needed. With my previous comment (keeping the nil-ing in the To functions), this can be kept as-is.

> If even the reworked tests still need tweaks in the code used for talking to k8s, they should just go

The tweaks to the production code in this PR are only made to remove the skipCleanup boolean and to fix the latency feature, which currently doesn't work in reality (because it relies on skipCleanup = true).

miekg · 2020-11-06T17:46:00Z

I really don't care about these metric tests. I care about clean(er) code in this plugin. If even the reworked tests still need tweaks in the code used for talking to k8s, they should just go (as I did in my PR). Moving to circle-ci seems to be the best option.

chrisohaver · 2020-11-06T18:21:37Z

Apologies, I really didn't make it clear at all that I was primarily fixing the latency reporting metric in this PR, and then adapting the tests to be able to work with the fix.

The 3 things done in this PR:

1. Fix endpoint object latency metrics (previously the metric was not reported). This entailed moving the cleanup to after the latency calculation.
2. Remove the skipCleanup boolean, resulting in always cleaning up. This breaks the unit test, which relies on skipping cleanup.
3. Refactor the unit test to work without skipCleanup. Ultimately I had to ditch fakeClient, because the object cleanup causes a race condition when the test is acting as both API client and server.

miekg · 2020-11-09T11:45:16Z

Nacking this.
This should be tested in some e2e fashion, not by again making code less readable, esp. in an area where we need to jump through a lot of hoops to get where we are.

chrisohaver · 2020-11-09T12:27:30Z

OK. How about we just fix the metric in this PR, and drop all the metric-related unit tests?

chrisohaver · 2020-11-09T14:48:03Z

OK, I have refactored things to move the cleanup step back into the toFuncs.

I replaced the latencyFunc with a LatencyRecorder, which has an init separate from a record function, to allow the Processor to get the trigger timestamp and parent services before calling the toFuncs, and write the metric after calling toFuncs and updating index.

I left the unit tests in for now. Since they test production code, I think they have value, but I will remove them if you insist. None of the code changes here are made to assist or enable the unit tests. The tests are unit tests, i.e. they don't do a full e2e test, so adding e2e tests is still useful from an e2e standpoint.

Signed-off-by: Chris O'Haver <cohaver@infoblox.com>

chrisohaver · 2020-11-11T13:22:06Z

@miekg, PTAL. I've taken your advice and moved the cleanup back into the ToFuncs, which has resulted in easier to read code.

chrisohaver · 2020-11-17T14:12:28Z

The DNS programming latency metric tests in coredns/ci pass with this PR.

Signed-off-by: Chris O'Haver <cohaver@infoblox.com>

chrisohaver · 2020-11-19T19:23:01Z

@miekg, PTAL. Per your request, I have:

- removed the latency metric unit tests
- changed the ToFuncs and ToFunc type to not use empty interfaces
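As a rough illustration of the second item, here is a sketch of a conversion function typed against a small interface instead of `interface{}`. The `object` interface below is a local stand-in for the Kubernetes `meta.Object` interface named in the commit message, and `rawService`, `service`, `toFunc`, and `toService` are hypothetical names, not the PR's actual signatures.

```go
package main

import "fmt"

// object is a local stand-in for a metadata interface; only two accessors
// are modeled here.
type object interface {
	GetName() string
	GetNamespace() string
}

// rawService is a hypothetical stand-in for the full API object.
type rawService struct {
	Name, Namespace, ClusterIP string
	Annotations                map[string]string
}

func (s *rawService) GetName() string      { return s.Name }
func (s *rawService) GetNamespace() string { return s.Namespace }

// service is the trimmed-down object kept in the cache.
type service struct {
	Name, Namespace, ClusterIP string
}

func (s *service) GetName() string      { return s.Name }
func (s *service) GetNamespace() string { return s.Namespace }

// toFunc converts one object into another; values that do not satisfy
// object are rejected at compile time rather than at run time.
type toFunc func(object) (object, error)

// toService copies the fields it needs and clears the original in place.
func toService(obj object) (object, error) {
	raw, ok := obj.(*rawService)
	if !ok {
		return nil, fmt.Errorf("unexpected object %v", obj)
	}
	s := &service{Name: raw.Name, Namespace: raw.Namespace, ClusterIP: raw.ClusterIP}
	*raw = rawService{}
	return s, nil
}

func main() {
	var convert toFunc = toService
	raw := &rawService{Name: "svc1", Namespace: "default", ClusterIP: "10.0.0.1"}
	out, err := convert(raw)
	fmt.Println(out.GetName(), err, raw.Name == "")

	// Passing an already converted object is still caught, but at run time.
	_, err = convert(out)
	fmt.Println(err)
}
```

A type assertion to the concrete type is still needed inside the function, so the interface narrows what callers can pass without removing the run-time check.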

chrisohaver · 2020-11-30T14:35:14Z

pinging plugin/kubernetes owners: @bradbeam @johnbelamaric @miekg @rajansandeep @yongtang @zouyee

chrisohaver added 3 commits November 3, 2020 09:58

always clean up, but after latency check

c4ecf5f

Signed-off-by: Chris O'Haver <cohaver@infoblox.com>

refactor out unnecessary toFunc wrappers

777272e

Signed-off-by: Chris O'Haver <cohaver@infoblox.com>

fix var names

d7582ac

Signed-off-by: Chris O'Haver <cohaver@infoblox.com>

chrisohaver requested review from bradbeam, johnbelamaric, miekg, rajansandeep, yongtang and zouyee as code owners November 3, 2020 15:20

stickler-ci reviewed Nov 3, 2020

plugin/kubernetes/object/service.go (outdated, resolved)

restore func comment

948f690

Signed-off-by: Chris O'Haver <cohaver@infoblox.com>

chrisohaver mentioned this pull request Nov 3, 2020

Remove boolean parameter from object.To() functions #4253

Closed

remove debug

e7267cb

Signed-off-by: Chris O'Haver <cohaver@infoblox.com>

bradbeam approved these changes Nov 3, 2020

plugin/kubernetes/controller.go (outdated, resolved)

use a generic cleanObj func

df1d277

Signed-off-by: Chris O'Haver <cohaver@infoblox.com>

chrisohaver added 2 commits November 5, 2020 12:33

return

eb3db8b

Signed-off-by: Chris O'Haver <cohaver@infoblox.com>

dont use fakeclient

af8126b

Signed-off-by: Chris O'Haver <cohaver@infoblox.com>

miekg reviewed Nov 6, 2020

chrisohaver changed the title ~~plugin/kubernetes: refactor dns programming duration metric~~ plugin/kubernetes: Fix dns programming duration metric Nov 6, 2020

move cleanup back to toFuncs; get data reqd to record latency before …

259dc4b

…calling toFuncs Signed-off-by: Chris O'Haver <cohaver@infoblox.com>

fix import ordering

3b82b40

Signed-off-by: Chris O'Haver <cohaver@infoblox.com>

chrisohaver mentioned this pull request Nov 11, 2020

Add e2e test for dns programming latency metric coredns/ci#137

Merged

chrisohaver added 3 commits November 17, 2020 10:54

alter ToFuncs to work with meta.Object instead of empty interfaces

8c1ecc7

Signed-off-by: Chris O'Haver <cohaver@infoblox.com>

remove latency metric unit tests

69d4fe9

Signed-off-by: Chris O'Haver <cohaver@infoblox.com>

fix comments

1f70e50

Signed-off-by: Chris O'Haver <cohaver@infoblox.com>

zouyee approved these changes Dec 1, 2020

chrisohaver merged commit 9121e78 into coredns:master Dec 1, 2020

chrisohaver mentioned this pull request Dec 18, 2020

Inconsistent behaviour with forward + cache #4189

Closed

chrisohaver deleted the fix-metric branch January 9, 2021 14:45

miekg mentioned this pull request Jan 12, 2021

1.8.1 notes: sort PR #4373

Merged
