[v1.16] ctmap/gc: don't clamp conntrack scan timeout in CI #37380

giorio94 · 2025-01-31T10:14:53Z

Currently, the agent panics if the initial ctmap/nat GC scan does not complete within a hard-coded timeout of 30 seconds. Additionally, this timeout is further reduced to 5 seconds in the Ginkgo tests. There, we frequently observe failures in the datapath-misc matrix entry, due to agent crashes caused by this timeout being hit (most likely because of resource contention in the CI environment). Let's try not clamping this value in CI, and see whether flakiness decreases or there's a more serious issue under the hood.

Note that this issue does not affect v1.17 and later, as the logic was refactored by fea19ec ("ctmap/gc: do not terminate agent fatal on ctmap gc init after timeout"). Nor does it affect v1.14, as the time wrapper had not been introduced there yet.
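
For readers unfamiliar with the mechanism, here is a minimal Go sketch of the behavior described above. It is not the actual Cilium code: every identifier is hypothetical, only the 30-second and 5-second values come from the description, and the sketch returns an error where the agent would terminate.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Values taken from the description above; the identifiers are hypothetical.
const (
	defaultScanTimeout = 30 * time.Second
	ciScanTimeout      = 5 * time.Second
)

// errScanTimeout is a hypothetical sentinel error for a scan that overran.
var errScanTimeout = errors.New("initial GC scan did not complete in time")

// effectiveTimeout returns the deadline for the initial scan. With clampInCI
// set (the behavior before this change), CI runs get the shorter value.
func effectiveTimeout(inCI, clampInCI bool) time.Duration {
	if inCI && clampInCI {
		return ciScanTimeout
	}
	return defaultScanTimeout
}

// runWithTimeout runs scan in a goroutine and waits at most timeout for it.
func runWithTimeout(timeout time.Duration, scan func(ctx context.Context) error) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	done := make(chan error, 1) // buffered so the goroutine never blocks on send
	go func() { done <- scan(ctx) }()

	select {
	case err := <-done:
		return err
	case <-ctx.Done():
		return errScanTimeout
	}
}

// demo runs a fast and a slow fake scan. The slow one deliberately ignores
// the context to imitate a scan stalled by resource contention.
func demo() (fastErr, slowErr error) {
	fast := func(_ context.Context) error { return nil }
	slow := func(_ context.Context) error {
		time.Sleep(500 * time.Millisecond)
		return nil
	}
	return runWithTimeout(time.Second, fast), runWithTimeout(20*time.Millisecond, slow)
}

func main() {
	fmt.Println("clamped CI timeout:  ", effectiveTimeout(true, true))
	fmt.Println("unclamped CI timeout:", effectiveTimeout(true, false))
	fastErr, slowErr := demo()
	fmt.Println("fast scan:", fastErr)
	fmt.Println("slow scan:", slowErr)
}
```

In this sketch, dropping the clamp corresponds to calling `effectiveTimeout` with `clampInCI` set to false, so CI runs get the same deadline as any other environment.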

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>

giorio94 · 2025-01-31T10:24:25Z

/test

julianwiedmann

makes sense, thank you!

@joestringer fyi, as it's "fallout" from #27253.

joestringer · 2025-02-12T19:31:22Z

Curious do we have any theories why this tends to hit datapath-misc more frequently? Does that test populate the CT map more or something?

giorio94 · 2025-02-13T08:10:49Z

> Curious do we have any theories why this tends to hit datapath-misc more frequently? Does that test populate the CT map more or something?

I didn't check the details too closely, but one explanation could simply be higher resource contention, which makes everything significantly slower. AFAIR, that test entry creates a significant number of pods on the test cluster.

giorio94 added release-note/ci needs-backport/1.15 labels Jan 31, 2025

maintainer-s-little-helper bot added backport/1.16 kind/backports labels Jan 31, 2025

giorio94 changed the title ~~ctmap/gc: don't clamp conntrack scan timeout in CI~~ [v1.16] ctmap/gc: don't clamp conntrack scan timeout in CI Jan 31, 2025

giorio94 marked this pull request as ready for review January 31, 2025 10:36

giorio94 requested a review from a team as a code owner January 31, 2025 10:36

julianwiedmann added the feature/conntrack label Jan 31, 2025

julianwiedmann requested review from a team and jibi and removed request for a team January 31, 2025 10:46

julianwiedmann approved these changes Feb 10, 2025

julianwiedmann added this pull request to the merge queue Feb 10, 2025

Merged via the queue into cilium:v1.16 with commit a0a5571 Feb 10, 2025
67 checks passed

nbusseneau mentioned this pull request Feb 14, 2025

v1.15 Backports 2025-02-14 #37646

Merged

nbusseneau added backport-pending/1.15 and removed needs-backport/1.15 labels Feb 14, 2025

github-actions bot added backport-done/1.15 and removed backport-pending/1.15 labels Feb 18, 2025
