[v1.16] ctmap/gc: don't clamp conntrack scan timeout in CI #37380
Currently, the agent panics if the initial ctmap/nat GC scan does not complete within a hard-coded timeout of 30 seconds. Additionally, this timeout is further reduced to 5 seconds in the Ginkgo tests. There, we quite often observe failures in the datapath-misc matrix entry, caused by agent crashes when this timeout is hit (most likely due to resource contention in the CI environment). Let's try not clamping this value in CI, and see whether flakiness decreases or there's a more serious issue under the hood.

Note that this issue does not affect v1.17 and later, thanks to the refactoring in fea19ec ("ctmap/gc: do not terminate agent fatal on ctmap gc init after timeout"). Nor does it affect v1.14, as the time wrapper had not been introduced there yet.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
/test
makes sense, thank you!
@joestringer fyi, as it's "fallout" from #27253.
Curious do we have any theories why this tends to hit
I didn't check the details too closely, but one explanation could simply be higher resource contention, which makes everything significantly slower. AFAIR that test entry creates a significant number of pods on the test cluster.