Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rule based networkpolicystats #1780

Merged
merged 1 commit into from
Mar 15, 2021

Conversation

ceclinux
Copy link
Contributor

@ceclinux ceclinux commented Jan 25, 2021

This PR supplements #1172 by adding rule based NetworkPolicy stats. It does the following:

  1. Add rule based network policy stats for AntreaNetworkPolicy and AntreaClusterNetworkPolicy (note native k8s networkpolicy is not supported)
  2. add Name in multiple structs, in order to get rule name to rule stats mapping at Antrea agents

closes #1669

@codecov-io
Copy link

codecov-io commented Jan 25, 2021

Codecov Report

Merging #1780 (e215c1d) into main (d77f8ff) will increase coverage by 2.32%.
The diff coverage is 96.90%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1780      +/-   ##
==========================================
+ Coverage   64.47%   66.80%   +2.32%     
==========================================
  Files         193      193              
  Lines       16893    16967      +74     
==========================================
+ Hits        10892    11334     +442     
+ Misses       4837     4461     -376     
- Partials     1164     1172       +8     
Flag Coverage Δ
e2e-tests 45.42% <24.74%> (?)
kind-e2e-tests 55.62% <64.94%> (+0.24%) ⬆️
unit-tests 41.66% <89.69%> (+0.28%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
pkg/agent/types/networkpolicy.go 83.33% <ø> (ø)
pkg/querier/querier.go 57.14% <ø> (ø)
...ntroller/networkpolicy/networkpolicy_controller.go 71.71% <60.00%> (-0.29%) ⬇️
pkg/agent/controller/networkpolicy/reconciler.go 76.81% <66.66%> (-0.08%) ⬇️
pkg/agent/controller/networkpolicy/cache.go 85.29% <100.00%> (+0.04%) ⬆️
pkg/agent/stats/collector.go 98.46% <100.00%> (+0.73%) ⬆️
...kg/controller/networkpolicy/antreanetworkpolicy.go 85.00% <100.00%> (+0.30%) ⬆️
...g/controller/networkpolicy/clusternetworkpolicy.go 87.14% <100.00%> (+0.18%) ⬆️
pkg/controller/stats/aggregator.go 78.64% <100.00%> (+2.48%) ⬆️
pkg/agent/openflow/network_policy.go 76.31% <0.00%> (+0.59%) ⬆️
... and 22 more

@ceclinux ceclinux force-pushed the rule_based_networkpolicystats branch 7 times, most recently from 777afd5 to 2162a14 Compare January 25, 2021 15:29
Base automatically changed from master to main January 26, 2021 00:00
Copy link
Member

@tnqn tnqn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the work.

pkg/agent/controller/networkpolicy/cache.go Outdated Show resolved Hide resolved
pkg/apis/controlplane/types.go Outdated Show resolved Hide resolved
pkg/apis/stats/types.go Show resolved Hide resolved
pkg/apis/stats/types.go Show resolved Hide resolved
pkg/apis/stats/v1alpha1/types.go Outdated Show resolved Hide resolved
pkg/controller/stats/aggregator.go Outdated Show resolved Hide resolved
pkg/controller/stats/aggregator.go Outdated Show resolved Hide resolved
pkg/agent/stats/collector.go Outdated Show resolved Hide resolved
@ceclinux ceclinux force-pushed the rule_based_networkpolicystats branch from da50f9a to 9d7016c Compare January 29, 2021 02:12
@ceclinux
Copy link
Contributor Author

ceclinux commented Jan 29, 2021

This PR is out for review. Thanks

@ceclinux ceclinux force-pushed the rule_based_networkpolicystats branch 5 times, most recently from 1b50a5a to f61c407 Compare February 2, 2021 01:21
@ceclinux ceclinux changed the title WIP: Rule based networkpolicystats Rule based networkpolicystats Feb 2, 2021
@ceclinux ceclinux force-pushed the rule_based_networkpolicystats branch from f61c407 to 79dd502 Compare February 2, 2021 03:26
pkg/agent/stats/collector.go Outdated Show resolved Hide resolved
pkg/agent/stats/collector.go Outdated Show resolved Hide resolved
pkg/apis/controlplane/types.go Outdated Show resolved Hide resolved
pkg/apis/controlplane/types.go Outdated Show resolved Hide resolved
pkg/apis/stats/types.go Outdated Show resolved Hide resolved
pkg/controller/stats/aggregator.go Outdated Show resolved Hide resolved
pkg/controller/stats/aggregator.go Outdated Show resolved Hide resolved
pkg/controller/stats/aggregator.go Outdated Show resolved Hide resolved
pkg/controller/stats/aggregator.go Outdated Show resolved Hide resolved
pkg/controller/stats/aggregator.go Outdated Show resolved Hide resolved
@ceclinux ceclinux force-pushed the rule_based_networkpolicystats branch 4 times, most recently from 29fb5b7 to a6a6aff Compare February 8, 2021 10:23
pkg/agent/stats/collector.go Outdated Show resolved Hide resolved
pkg/agent/stats/collector.go Outdated Show resolved Hide resolved
pkg/apis/controlplane/types.go Outdated Show resolved Hide resolved
@ceclinux ceclinux force-pushed the rule_based_networkpolicystats branch 4 times, most recently from a51ad2d to b1e99f3 Compare February 10, 2021 03:40
@ceclinux
Copy link
Contributor Author

ceclinux commented Mar 5, 2021

There's a slim chance(around 10%) that the e2e test TestNetworkPolicyStats would fail. The error output is listed as follows

    networkpolicy_test.go:141: Got NetworkPolicy stats: &NetworkPolicyStats{ObjectMeta:{test-networkpolicy-ingress  antrea-test /apis/stats.antrea.tanzu.vmware.com/v1alpha1/namespaces/antrea-test/networkpolicystats/test-networkpolicy-ingress 6618f9af-1627-482d-987a-e24c14ca8954  0 2021-03-05 16:36:10 +0800 CST <nil> <nil> map[] map[] [] []  []},TrafficStats:TrafficStats{Packets:42,Bytes:2860,Sessions:10,},}
    networkpolicy_test.go:141: Got NetworkPolicy stats: &NetworkPolicyStats{ObjectMeta:{test-networkpolicy-egress  antrea-test /apis/stats.antrea.tanzu.vmware.com/v1alpha1/namespaces/antrea-test/networkpolicystats/test-networkpolicy-egress d4c8289c-1f84-4d59-9f1d-16bdcdaf7ede  0 2021-03-05 16:36:10 +0800 CST <nil> <nil> map[] map[] [] []  []},TrafficStats:TrafficStats{Packets:64,Bytes:4400,Sessions:11,},}
    networkpolicy_test.go:141: Got NetworkPolicy stats: &NetworkPolicyStats{ObjectMeta:{test-networkpolicy-ingress  antrea-test /apis/stats.antrea.tanzu.vmware.com/v1alpha1/namespaces/antrea-test/networkpolicystats/test-networkpolicy-ingress 6618f9af-1627-482d-987a-e24c14ca8954  0 2021-03-05 16:36:10 +0800 CST <nil> <nil> map[] map[] [] []  []},TrafficStats:TrafficStats{Packets:42,Bytes:2860,Sessions:10,},}
    networkpolicy_test.go:141: Got NetworkPolicy stats: &NetworkPolicyStats{ObjectMeta:{test-networkpolicy-egress  antrea-test /apis/stats.antrea.tanzu.vmware.com/v1alpha1/namespaces/antrea-test/networkpolicystats/test-networkpolicy-egress d4c8289c-1f84-4d59-9f1d-16bdcdaf7ede  0 2021-03-05 16:36:10 +0800 CST <nil> <nil> map[] map[] [] []  []},TrafficStats:TrafficStats{Packets:64,Bytes:4400,Sessions:11,},}

After discussion with @tnqn, from the packet we dumped, the failed test is caused by slow reply of first SYN packet from nginx server. Therefore, the busybox client has to resent SYN packet after 1s timeout, causing an extra session count on egress policy stats. Although the test is reported failed in this case, the calculation of stats still match our design and is majorly contributed to performance degradation of kind cluster. So I suggest here we can leave this test unchanged.

FYI, here is the the setup of our packet analyzing :

In kind cluster, two pods are deploy on separate nodes(busybox pod on kind-control-plane and nginx pod on kind-worker).Then we ran TestNetworkPolicyStats test on this kind cluster. Meanwhile, two tcpdump -i eth0 -n port 6081 -w eth0.pcap sessions was opened on both nodes to capture packets. Attached are packets we inspected and result of ovs-ofctl dump-flows br-int running inside the pod of antrea-agent on the kind-worker node.

kind_worder_tcpdump.log
kind_control_plane_tcpdump.log
kind-worker-antrea-agent-ovs.log

@tnqn
Copy link
Member

tnqn commented Mar 5, 2021

@ceclinux I just found the packet numbers were not consistent, the egress stats was 64 packets while the ingress stats was 42 packets, and I counted the actual packets you pasted, there were 42 packets in the request direction, so there was an issue with ingress stats that it didn't count reply packets. Could you check if it's introduced by this PR or it's in main branch too? If latter, I'm fine with merging this PR first and hope you can help fix it with another PR.

@antoninbas
Copy link
Contributor

@ceclinux could we increase the 1s timeout to make the test more robust?

@ceclinux
Copy link
Contributor Author

ceclinux commented Mar 8, 2021

@ceclinux I just found the packet numbers were not consistent, the egress stats was 64 packets while the ingress stats was 42 packets, and I counted the actual packets you pasted, there were 42 packets in the request direction, so there was an issue with ingress stats that it didn't count reply packets. Could you check if it's introduced by this PR or it's in main branch too? If latter, I'm fine with merging this PR first and hope you can help fix it with another PR.

I checked the log and found both tcpdump results give 64 packets. So I think what you meant is there were 64 packets in the request direction. Anyway, I rerun the packet capturing session on the main branch and the miscalculation bug still exists. You can also find similar case here https://github.com/vmware-tanzu/antrea/runs/1959122332?check_suite_focus=true

Meanwhile, I found the failed TestNetworkPolicyStats case on the main branch CI https://github.com/vmware-tanzu/antrea/runs/2025627761?check_suite_focus=true

So I think the possible bugs we dicussed are not induced by this PR.

@ceclinux
Copy link
Contributor Author

ceclinux commented Mar 8, 2021

@ceclinux could we increase the 1s timeout to make the test more robust?

In my option, The test is not comprehensive and can give false positive error sometimes. I think we need to fix miscalculation bug metioned by @tnqn first and rewrite this test after that.

@tnqn
Copy link
Member

tnqn commented Mar 11, 2021

@ceclinux thanks for verifying it.
@antoninbas will you give another review to it? It currently LGTM. And as the flakiness and the miscalculation is not introduced by this PR, I would agree we merge it first and have separate PR to fix them.

@tnqn
Copy link
Member

tnqn commented Mar 11, 2021

@ceclinux could we increase the 1s timeout to make the test more robust?

I had a debug session with @ceclinux, it seemed increasing timeout here didn't help, because the request was not dropped but was delayed. It finally received the reply but was after it retransmitted. However, it didn't explain why only egress count is 11, ingress count is not.

Copy link
Contributor

@antoninbas antoninbas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll let @tnqn have the final word on this since I didn't review it in details. I believe we should document the stats API in https://github.com/vmware-tanzu/antrea/blob/main/docs/antrea-network-policy.md, in a separate PR.

pkg/agent/stats/collector.go Outdated Show resolved Hide resolved
for _, v := range incMap {
addUp(ruleSumStats, v)
}
// accumulate the rule traffic stats if the rule has already 'existed' in the ruleStats
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the rule already exists in the ruleStats

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ceclinux and this one?

@antoninbas
Copy link
Contributor

@tnqn @ceclinux for the retransmit issue, is it possible you are just running into that well-known issue: https://mail.openvswitch.org/pipermail/ovs-dev/2020-March/368577.html. When using the userspace OVS datapath and tunneling, the first IP packet sent on a tunnel is always dropped because of a missing ARP entry. The only workaround is to "warm-up" the tunnel or add the ARP entry manually. This doesn't happen with the Linux Kernel, which typically buffers the IP packet while waiting for the ARP resolution.

@tnqn
Copy link
Member

tnqn commented Mar 11, 2021

@tnqn @ceclinux for the retransmit issue, is it possible you are just running into that well-known issue: https://mail.openvswitch.org/pipermail/ovs-dev/2020-March/368577.html. When using the userspace OVS datapath and tunneling, the first IP packet sent on a tunnel is always dropped because of a missing ARP entry. The only workaround is to "warm-up" the tunnel or add the ARP entry manually. This doesn't happen with the Linux Kernel, which typically buffers the IP packet while waiting for the ARP resolution.

@antoninbas Thanks for the information! It looks like the symptom of the same root cause. We only captured TCP packets, I imagine that the first reply packet was dropped because of missing ARP entry and we actually got the second reply packet, which could explain the delay.

@ceclinux could you add a "warm-up" step between the two test Pods and see if the test becomes stable?

@ceclinux ceclinux force-pushed the rule_based_networkpolicystats branch 2 times, most recently from f0aa4ef to 8aa142c Compare March 11, 2021 11:39
@ceclinux
Copy link
Contributor Author

@antoninbas @tnqn After adding the "warm-up" step , the extra session count problem is resolved (I checked by running 100 rounds TestNetworkPolicyStats on my local machine). You can check the updated TestNetworkPolicyStats test in this PR.

Copy link
Contributor

@antoninbas antoninbas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes in the test code LGTM

Copy link
Member

@tnqn tnqn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, LGTM

@tnqn
Copy link
Member

tnqn commented Mar 12, 2021

/test-all

Copy link
Member

@tnqn tnqn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a test, the result is not correct.

  1. The name was not present in ruleTrafficStats.
  2. All stats were accumulated in one ruleTrafficStats when there were 4 rules.

Could you check multiple rules case and rule name?

My rule:

spec:
  appliedTo:
  - namespaceSelector:
      matchLabels:
        env: prod
  egress:
  - action: Allow
    enableLogging: false
    name: egress-allow-prod
    to:
    - namespaceSelector:
        matchLabels:
          env: prod
  - action: Drop
    enableLogging: false
    name: egress-deny-all
  ingress:
  - action: Allow
    enableLogging: false
    from:
    - namespaceSelector:
        matchLabels:
          env: prod
    name: ingress-allow-prod
  - action: Drop
    enableLogging: false
    name: ingress-deny-all
  priority: 5
  tier: platform

The stats:

kind: AntreaClusterNetworkPolicyStats
metadata:
  creationTimestamp: "2021-03-12T02:38:34Z"
  name: deny-all-except-prod
  selfLink: /apis/stats.antrea.tanzu.vmware.com/v1alpha1/antreaclusternetworkpolicystats/deny-all-except-prod
  uid: c2036533-82d9-4960-8428-c844edfbb1c2
ruleTrafficStats:
- trafficStats:
    bytes: 3082
    packets: 28
    sessions: 9
trafficStats:
  bytes: 3082
  packets: 28
  sessions: 9

for _, v := range incMap {
addUp(ruleSumStats, v)
}
// accumulate the rule traffic stats if the rule has already 'existed' in the ruleStats
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ceclinux and this one?

@tnqn
Copy link
Member

tnqn commented Mar 12, 2021

I found where the problem is. The PR only added rule name for AntreaNetworkPolicy, not AntreaClusterNetworkPolicy. So the latter got no name in agent. We should add an e2e test for AntreaClusterNetworkPolicy stats too.

@ceclinux ceclinux force-pushed the rule_based_networkpolicystats branch from 8aa142c to a4715ad Compare March 12, 2021 05:50
@ceclinux
Copy link
Contributor Author

@tnqn I fixed the ClusterNetworkPolicy RuleTrafficStats and has added e2e test for it, thanks for checking

@ceclinux ceclinux force-pushed the rule_based_networkpolicystats branch 2 times, most recently from b7ccf72 to 158b95a Compare March 12, 2021 08:27
This PR supports collecting and querying the NetworkPolicy statistics for
Antrea Networkpolicies. Native k8s Networkpolicies are not supported
@ceclinux ceclinux force-pushed the rule_based_networkpolicystats branch from 158b95a to e215c1d Compare March 12, 2021 09:01
Copy link
Member

@tnqn tnqn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@tnqn
Copy link
Member

tnqn commented Mar 12, 2021

/test-all

@tnqn
Copy link
Member

tnqn commented Mar 15, 2021

/test-e2e
/test-all-features-conformance

@tnqn
Copy link
Member

tnqn commented Mar 15, 2021

/test-e2e

@tnqn tnqn merged commit e80ab3b into antrea-io:main Mar 15, 2021
tnqn pushed a commit to tnqn/antrea that referenced this pull request Mar 17, 2021
This PR supports collecting and querying the NetworkPolicy statistics for
Antrea Networkpolicies. Native k8s Networkpolicies are not supported
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support rule stats in NetworkPolicy(Also in ANP, ACNP)
5 participants