-
Notifications
You must be signed in to change notification settings - Fork 376
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rule based networkpolicystats #1780
Conversation
Codecov Report
@@ Coverage Diff @@
## main #1780 +/- ##
==========================================
+ Coverage 64.47% 66.80% +2.32%
==========================================
Files 193 193
Lines 16893 16967 +74
==========================================
+ Hits 10892 11334 +442
+ Misses 4837 4461 -376
- Partials 1164 1172 +8
Flags with carried forward coverage won't be shown. Click here to find out more.
|
777afd5
to
2162a14
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the work.
da50f9a
to
9d7016c
Compare
This PR is out for review. Thanks |
1b50a5a
to
f61c407
Compare
f61c407
to
79dd502
Compare
29fb5b7
to
a6a6aff
Compare
a51ad2d
to
b1e99f3
Compare
There's a slim chance(around 10%) that the e2e test
After discussion with @tnqn, from the packet we dumped, the failed test is caused by slow reply of first FYI, here is the the setup of our packet analyzing : In kind cluster, two pods are deploy on separate nodes( kind_worder_tcpdump.log |
@ceclinux I just found the packet numbers were not consistent, the egress stats was 64 packets while the ingress stats was 42 packets, and I counted the actual packets you pasted, there were 42 packets in the request direction, so there was an issue with ingress stats that it didn't count reply packets. Could you check if it's introduced by this PR or it's in main branch too? If latter, I'm fine with merging this PR first and hope you can help fix it with another PR. |
@ceclinux could we increase the 1s timeout to make the test more robust? |
I checked the log and found both Meanwhile, I found the failed So I think the possible bugs we dicussed are not induced by this PR. |
@ceclinux thanks for verifying it. |
I had a debug session with @ceclinux, it seemed increasing timeout here didn't help, because the request was not dropped but was delayed. It finally received the reply but was after it retransmitted. However, it didn't explain why only egress count is 11, ingress count is not. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll let @tnqn have the final word on this since I didn't review it in details. I believe we should document the stats API in https://github.com/vmware-tanzu/antrea/blob/main/docs/antrea-network-policy.md, in a separate PR.
pkg/controller/stats/aggregator.go
Outdated
for _, v := range incMap { | ||
addUp(ruleSumStats, v) | ||
} | ||
// accumulate the rule traffic stats if the rule has already 'existed' in the ruleStats |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if the rule already exists in the ruleStats
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ceclinux and this one?
@tnqn @ceclinux for the retransmit issue, is it possible you are just running into that well-known issue: https://mail.openvswitch.org/pipermail/ovs-dev/2020-March/368577.html. When using the userspace OVS datapath and tunneling, the first IP packet sent on a tunnel is always dropped because of a missing ARP entry. The only workaround is to "warm-up" the tunnel or add the ARP entry manually. This doesn't happen with the Linux Kernel, which typically buffers the IP packet while waiting for the ARP resolution. |
@antoninbas Thanks for the information! It looks like the symptom of the same root cause. We only captured TCP packets, I imagine that the first reply packet was dropped because of missing ARP entry and we actually got the second reply packet, which could explain the delay. @ceclinux could you add a "warm-up" step between the two test Pods and see if the test becomes stable? |
f0aa4ef
to
8aa142c
Compare
@antoninbas @tnqn After adding the "warm-up" step , the extra session count problem is resolved (I checked by running 100 rounds |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The changes in the test code LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, LGTM
/test-all |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did a test, the result is not correct.
- The name was not present in ruleTrafficStats.
- All stats were accumulated in one ruleTrafficStats when there were 4 rules.
Could you check multiple rules case and rule name?
My rule:
spec:
appliedTo:
- namespaceSelector:
matchLabels:
env: prod
egress:
- action: Allow
enableLogging: false
name: egress-allow-prod
to:
- namespaceSelector:
matchLabels:
env: prod
- action: Drop
enableLogging: false
name: egress-deny-all
ingress:
- action: Allow
enableLogging: false
from:
- namespaceSelector:
matchLabels:
env: prod
name: ingress-allow-prod
- action: Drop
enableLogging: false
name: ingress-deny-all
priority: 5
tier: platform
The stats:
kind: AntreaClusterNetworkPolicyStats
metadata:
creationTimestamp: "2021-03-12T02:38:34Z"
name: deny-all-except-prod
selfLink: /apis/stats.antrea.tanzu.vmware.com/v1alpha1/antreaclusternetworkpolicystats/deny-all-except-prod
uid: c2036533-82d9-4960-8428-c844edfbb1c2
ruleTrafficStats:
- trafficStats:
bytes: 3082
packets: 28
sessions: 9
trafficStats:
bytes: 3082
packets: 28
sessions: 9
pkg/controller/stats/aggregator.go
Outdated
for _, v := range incMap { | ||
addUp(ruleSumStats, v) | ||
} | ||
// accumulate the rule traffic stats if the rule has already 'existed' in the ruleStats |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ceclinux and this one?
I found where the problem is. The PR only added rule name for AntreaNetworkPolicy, not AntreaClusterNetworkPolicy. So the latter got no name in agent. We should add an e2e test for AntreaClusterNetworkPolicy stats too. |
8aa142c
to
a4715ad
Compare
@tnqn I fixed the |
b7ccf72
to
158b95a
Compare
This PR supports collecting and querying the NetworkPolicy statistics for Antrea Networkpolicies. Native k8s Networkpolicies are not supported
158b95a
to
e215c1d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
/test-all |
/test-e2e |
/test-e2e |
This PR supports collecting and querying the NetworkPolicy statistics for Antrea Networkpolicies. Native k8s Networkpolicies are not supported
This PR supplements #1172 by adding rule based NetworkPolicy stats. It does the following:
AntreaNetworkPolicy
andAntreaClusterNetworkPolicy
(note native k8s networkpolicy is not supported)Name
in multiple structs, in order to get rule name to rule stats mapping at Antrea agentscloses #1669