[v1.16] ctmap/gc: don't clamp conntrack scan timeout in CI #37380

Merged 1 commit on Feb 10, 2025

Conversation

giorio94
Member

Currently, the agent panics if the initial ctmap/nat GC scan does not complete within a hard-coded timeout of 30 seconds. Additionally, this timeout is further reduced to 5 seconds in the Ginkgo tests. There, we quite often observe failures in the datapath-misc matrix entry, due to agent crashes caused by this timeout being hit (most likely due to resource contention in the CI environment). Let's try not clamping this value in CI, and see whether flakiness decreases or there's a more serious issue under the hood.

Note that this issue does not affect v1.17 and later, where this logic was refactored by fea19ec ("ctmap/gc: do not terminate agent fatal on ctmap gc init after timeout"). Nor does it affect v1.14, as the time wrapper had not yet been introduced there.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
@giorio94 giorio94 added release-note/ci This PR makes changes to the CI. needs-backport/1.15 This PR / issue needs backporting to the v1.15 branch labels Jan 31, 2025
@maintainer-s-little-helper maintainer-s-little-helper bot added backport/1.16 This PR represents a backport for Cilium 1.16.x of a PR that was merged to main. kind/backports This PR provides functionality previously merged into master. labels Jan 31, 2025
@giorio94 giorio94 changed the title ctmap/gc: don't clamp conntrack scan timeout in CI [v1.16] ctmap/gc: don't clamp conntrack scan timeout in CI Jan 31, 2025
@giorio94
Member Author

/test

@giorio94 giorio94 marked this pull request as ready for review January 31, 2025 10:36
@giorio94 giorio94 requested a review from a team as a code owner January 31, 2025 10:36
@julianwiedmann julianwiedmann requested review from a team and jibi and removed request for a team January 31, 2025 10:46
Member

@julianwiedmann julianwiedmann left a comment

makes sense, thank you!

@joestringer fyi, as it's "fallout" from #27253.

@julianwiedmann julianwiedmann added this pull request to the merge queue Feb 10, 2025
Merged via the queue into cilium:v1.16 with commit a0a5571 Feb 10, 2025
67 checks passed
@joestringer
Member

Curious do we have any theories why this tends to hit datapath-misc more frequently? Does that test populate the CT map more or something?

@giorio94
Member Author

> Curious do we have any theories why this tends to hit datapath-misc more frequently? Does that test populate the CT map more or something?

I didn't check the details too closely, but one explanation could simply be higher resource contention, which makes everything significantly slower. AFAIR, that test entry creates a significant number of pods on the test cluster.

@nbusseneau nbusseneau mentioned this pull request Feb 14, 2025
2 tasks
@nbusseneau nbusseneau added backport-pending/1.15 The backport for Cilium 1.15.x for this PR is in progress. and removed needs-backport/1.15 This PR / issue needs backporting to the v1.15 branch labels Feb 14, 2025
@github-actions github-actions bot added backport-done/1.15 The backport for Cilium 1.15.x for this PR is done. and removed backport-pending/1.15 The backport for Cilium 1.15.x for this PR is in progress. labels Feb 18, 2025
Labels
backport/1.16 This PR represents a backport for Cilium 1.16.x of a PR that was merged to main. backport-done/1.15 The backport for Cilium 1.15.x for this PR is done. feature/conntrack kind/backports This PR provides functionality previously merged into master. release-note/ci This PR makes changes to the CI.
4 participants