-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
multus_tests, sriov: add vlan=0 to NAD #5093
Conversation
Thank you! Could you please open an issue on the SR-IOV CNI and link it here? BTW, in classic networking, there is a difference between a non VLAN and a VLAN=0. No VLAN means that it is not a trunk (no tagged traffic) while VLAN 0 means that the ethernet header includes a VLAN just for the priority bits. So I find this strange that they are considered the same. |
Done: openshift/sriov-cni#25
Thanks for the explanation :) |
/retest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice
is there a way to tell it to totally "untag" instead of VLAN ID 0 ?
in case 0 removes the vlan header its the same,
but there are basically two options that will allow traffic:
- dont have vlan header (which is better than option 2)
- have vlan header with vlan tag 0 (and then for example you can keep the 802.1p priority fields)
Edit:
i see now Edy said the same
Thanks @oshoval . |
5d7882e
to
937945c
Compare
/retest |
2 similar comments
/retest |
/retest |
/sig network |
937945c
to
1824766
Compare
Rebased on master |
/cc @EdDev |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Commit message comments:
multus_tests, sriov: add vlan=0 to NAD
tests, sriov: Explicitly set vlan=0 to NADs with no VLAN defined
There is a difference between a network(defined in networkAttachmentDefenition)
with vlan=0 attribute and a network with no vlan attribute at all.
If the VF has some vlan tag different than 0,
then when it is assigned to a pod (or a vm/i)
which is connected to a network with vlan=0 the vlan tag will be removed.
When the pod is connected to a network with no vlan attribute,
the vlan tag will remain as it is.This behaviour is intended and is implemented this way in SRIOV-CNI.
This is where the vlan is assigned :
https://github.com/k8snetworkplumbingwg/sriov-cni/blob/master/pkg/sriov/sriov.go#L287
The fact that it appears in the code does not mean it is intended.
You can just mention that the SRIOV-CNI is implementing this logic.
And this is where it is released:
https://github.com/k8snetworkplumbingwg/sriov-cni/blob/master/pkg/sriov/sriov.go#L386
I would not ref code, but if you do, ref a permanent link (the current one is not permanent and will not be relevant as soon as the code changes).
As can be seen there, When a VF is released,
the original attributes that were before the assignment are restored.
I think that the source of the ping flakiness is that in our tests,
If you are unsure, it is problematic. You can do something like this: "The investigation found that ...".
the network does not have a vlan defined attribute,
so the original attribute does not get restored,
what can lead to one VF keep it’s vlan tag from the ping-over-sriov+vlan test,
and then on the ping-over-sriov-no-vlan test, one of the VFs have a vlan tag,
and the other one doesn’t , so the ping fails.Another interesting thing is that when the net-attach-def is created by the SRIOV-operator,
after creating a sriovNEtwork object, it will give the net-attach-def a value of 0,
no matter if the sriovNetwork object has a vlan attribute or not.By creating the net-attach-def with a vlan=0 attribute,
the test seems to pass consistently on my local env.
The only mystery left is why this is happening only on the ConnectX-5 machine,
and it will require further investigation.
Please drop on my local env
.
Regarding the mystery, I think we know the answer already. It seems to occur only on the ConnectX-5 SR-IOV NIC, where on numvfs change/reset, the VF config is preserved. This has not been seen on other NIC/s in the CI setups.
Also, please reference the issue you opened on the CNI. |
1824766
to
52d1537
Compare
Done.
Done. If it is intended or not will reveal in the issue I opened in SRIOV-CNI.
Changed to perm links.
Dropped both "local env" and "mystery" part.
I refed it here, but I will add it to the PR |
I am re-reading this and something is wrong here. If a test uses a VLAN, it means it was probably annotated on the pod and eventually together with the NAD, it will reach the CNI with a VLAN, at which time it will configure the VF and on release it will be restored. When you now set VLAN=0 to be the default if no VLAN is specified in the pod annotation, then any value that was there before will get overwritten and solve the problem. Therefore, the open question about what left a VLAN set there is still relevant. I would bet it is the CNI plugin version, but you will need to confirm this. |
52d1537
to
3bc0a9b
Compare
Thanks Eddy for noticing that. My explanation indeed wasn't right. I changed the commit message and the PR description, and I think it is more accurate now. |
3bc0a9b
to
a048a93
Compare
There is a difference between a network (defined in networkAttachmentDefenition)with vlan=0 attribute, and a network with no vlan attribute at all. If the VF has some vlan tag different than 0, then when it is assigned to a pod (or a vmi) which is connected to a network with vlan=0, the vlan tag will be removed. When the pod is connected to a network with no vlan attribute, the vlan tag will remain as it is. This behavior is implemented in SRIOV-CNI. This is where the vlan is assigned : https://github.com/k8snetworkplumbingwg/sriov-cni/blob/b18123d809f1c010cae19018e3395ed5981e76f7/pkg/sriov/sriov.go#L287 The investigation found that the source of the ping flakiness is that in our tests, for some reason, the vlan tag remains from the ping-over-sriov+vlan test. Then on the ping-over-sriov-no-vlan test, because the network does not have a vlan defined attribute, it will not remove the existing vlan tag upon VF assignment. Then, if unfortunately one VMI will be assigned with a tagged VF, it will not remove the tag, and if another VMI assigned with a VF with no vlan tag ,it will lead to a ping failure. Another interesting thing is that when the net-attach-def is created by the SRIOV-operator, after creating a sriovNetwork object, it will give the net-attach-def a value of 0, no matter if the sriovNetwork object has a vlan attribute or not. By creating the net-attach-def with a vlan=0 attribute, the vlan tag is removed from the VF the test seems to pass consistently. Signed-off-by: alonsadan <asadan@redhat.com>
Thank you, there are only two points left:
|
a048a93
to
4b59795
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you!
Thanks Eddy :) @phoracek What do you think? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/approve
Nice catch!
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: phoracek The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/retest |
3 similar comments
/retest |
/retest |
/retest |
/retest |
The issue: kubevirt#4642 is resolved by: kubevirt#5093 All the details are in the PR. Since The PR got merged, the flakes were not seen even after dozens of runs. Therefore there is no need for the xfail anymore, and this commit removes it from the SRIOV suit. Signed-off-by: alonsadan <asadan@redhat.com>
The issue: kubevirt#4642 is resolved by: kubevirt#5093 All the details are in the PR. Since The PR got merged, the flakes were not seen even after dozens of runs. Therefore there is no need for the xfail anymore, and this commit removes it from the SRIOV suit. Signed-off-by: alonsadan <asadan@redhat.com>
What this PR does / why we need it:
There is a difference between a network(defined in networkAttachmentDefenition)
with vlan=0 attribute and a network with no vlan attribute at all.
If the VF has some vlan tag different than 0,
then when it is assigned to a pod (or a vm/i)
which is connected to a network with vlan=0 the vlan tag will be removed.
When the pod is connected to a network with no vlan attribute,
the vlan tag will remain as it is.
This behavior is implemented in SRIOV-CNI.
This is where the vlan is assigned :
https://github.com/k8snetworkplumbingwg/sriov-cni/blob/b18123d809f1c010cae19018e3395ed5981e76f7/pkg/sriov/sriov.go#L287
The investigation found that the source of the ping flakiness is that in our tests,
for some reason, the vlan tag remains from the ping-over-sriov+vlan test.
Then on the ping-over-sriov-no-vlan test, because the network does not have a vlan defined attribute,
it will not remove the existing vlan tag upon VF assignment.
Then, if unfortunately one VM/I will be assigned with a tagged VF it will not remove it, and if another VM/I assigned with
a VF with no vlan tag ,it will lead to a ping failure.
There is an open issue regarding this behavior on SRIOV-CNI:
openshift/sriov-cni#25
Another interesting thing is that when the net-attach-def is created by the SRIOV-operator,
after creating a sriovNetwork object, it will give the net-attach-def a value of 0,
no matter if the sriovNetwork object has a vlan attribute or not.
By creating the net-attach-def with a vlan=0 attribute,
the vlan tag is removed from the VF the test seems to pass consistently.
Signed-off-by: alonsadan asadan@redhat.com
Which issue(s) this PR fixes (optional, in
fixes #<issue number>(, fixes #<issue_number>, ...)
format, will close the issue(s) when PR gets merged):Fixes #4642
Special notes for your reviewer:
Release note: