[performance] - Remove bulk and streams use cases from UO and TO. Scalability test added. #10138
Conversation
/packit test --labels performance-topic-operator-capacity
@strimzi-ci run tests --cluster-type=ocp --cluster-version=4.15 --install-type=bundle --profile=performance --testcase=TopicOperatorPerformance#testCapacityCreateAndUpdateTopics --env=STRIMZI_USE_KRAFT_IN_TESTS=true

❌ Test Summary ❌ TEST_PROFILE: performance ❗ Test Failures ❗
Re-run command:
@strimzi-ci run tests --cluster-type=ocp --cluster-version=4.15 --install-type=bundle --profile=performance --testcase=TopicOperatorPerformance#testCapacityCreateAndUpdateTopics --env=STRIMZI_USE_KRAFT_IN_TESTS=true
Hi @see-quick, thanks for working on this.
In order to simulate a busy shared cluster and possibly catch some edge cases, I think we should try to include all 3 kinds of topic events (creations, updates, and deletes) and run them in parallel.
In my custom test, I'm taking the number of events I want to test as input, then dividing it by 3 to get the number of tasks I have to run in parallel (you would have 1-2 spare events that you can simply consume as no-ops, that's fine). Each task executes topic creation, update (partition increase and config change), and deletion serially, as sketched below. Wdyt?
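A minimal Java sketch of that proposal, assuming hypothetical helpers (createTopic, updateTopic, deleteTopic) in place of the real systemtest utilities:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ParallelTopicEvents {

    public static void run(int totalEvents) throws InterruptedException {
        // each task fires 3 events serially: create, update, delete;
        // the 1-2 spare events are simply dropped as no-ops
        int tasks = totalEvents / 3;

        ExecutorService executor = Executors.newFixedThreadPool(Math.max(1, Math.min(tasks, 100)));
        List<Callable<Void>> workload = new ArrayList<>();

        for (int i = 0; i < tasks; i++) {
            final String topicName = "perf-topic-" + i;
            workload.add(() -> {
                createTopic(topicName);   // creation event
                updateTopic(topicName);   // partition increase + config change
                deleteTopic(topicName);   // deletion event
                return null;
            });
        }

        executor.invokeAll(workload);     // tasks run in parallel, events within a task serially
        executor.shutdown();
    }

    // Hypothetical stand-ins for the real KafkaTopic helpers
    private static void createTopic(String name) { /* apply a KafkaTopic CR */ }
    private static void updateTopic(String name) { /* patch partitions and config */ }
    private static void deleteTopic(String name) { /* delete the KafkaTopic CR */ }
}
```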
Okay, but that way we would not be able to see the upper bound (i.e., how many KafkaTopics the TO is able to handle during creation and modification). Maybe such information is not so important... And if we perform all three of these operations, what is our termination condition? Do we want to create a specific number of topics (e.g., 1000) and see how the TO performs under different configurations? Which output metrics are then the most important to check? Also, should we execute these tasks incrementally and divide them into batches (i.e., every 100 KafkaTopics) as we do in the capacity test, or should we run all 1000 topics at once?
I think the objective here is not to see the upper bound, but to assess performance on a fixed number of events. For example, I'm running the test with the following batches of events: 50, 100, 150, ..., 1000. That way you see how it scales, by simply putting the end-to-end reconciliation time (we only care about this one here) on a line graph, and you can compare it with a previous implementation on the very same graph. By e2e reconciliation time in seconds I mean the time from creation/update to Ready, or the deletion duration. This is what an example graph looks like (note: we only need the numbers; you can then generate the graph with whatever tool you prefer):
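For illustration, a rough sketch of how those per-batch data points could be collected (waitForAllTopicsReconciled is a hypothetical stand-in for the real readiness/deletion polling):

```java
import java.time.Duration;
import java.time.Instant;

public class E2eReconciliationTiming {

    public static void main(String[] args) {
        for (int events = 50; events <= 1000; events += 50) {  // 50, 100, 150, ..., 1000
            Instant start = Instant.now();

            // fire the batch of creations/updates/deletions here, then block
            // until every KafkaTopic reports Ready (or is fully deleted)
            waitForAllTopicsReconciled(events);

            Duration e2e = Duration.between(start, Instant.now());
            // one data point per batch size; put these on a line graph to
            // compare implementations
            System.out.printf("%d events -> %d s%n", events, e2e.toSeconds());
        }
    }

    // Hypothetical: polls KafkaTopic statuses until all are Ready/deleted
    private static void waitForAllTopicsReconciled(int expectedEvents) { /* ... */ }
}
```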
@strimzi-ci run tests --cluster-type=ocp --cluster-version=4.15 --install-type=bundle --profile=performance --testcase=TopicOperatorPerformance#testPerformanceInFixedSizeOfEvents --env=STRIMZI_USE_KRAFT_IN_TESTS=true |
❗ Systemtests Failed (no test results are present) ❗

@strimzi-ci run tests --cluster-type=ocp --cluster-version=4.15 --install-type=bundle --profile=performance --testcase=TopicOperatorPerformance#testPerformanceInFixedSizeOfEvents --env=STRIMZI_USE_KRAFT_IN_TESTS=true
systemtest/src/main/java/io/strimzi/systemtest/Environment.java
✔️ Test Summary ✔️ TEST_PROFILE: null
So I have tried 6 configurations here, measured with: a) an internal metric (Strimzi max reconciliation duration) and b) an external metric (the duration of all operations, i.e., create, modify, and delete, plus readiness).
@see-quick nice work.
I left some improvement suggestions, but the base logic is there.
I would also try with BS (batch size) 100 and LMS (linger ms) 10.
systemtest/src/test/java/io/strimzi/systemtest/performance/TopicOperatorPerformance.java
systemtest/src/test/java/io/strimzi/systemtest/performance/TopicOperatorPerformance.java
systemtest/src/test/java/io/strimzi/systemtest/performance/TopicOperatorPerformance.java
...est/src/main/java/io/strimzi/systemtest/performance/utils/TopicOperatorPerformanceUtils.java
...est/src/main/java/io/strimzi/systemtest/performance/utils/TopicOperatorPerformanceUtils.java
...est/src/main/java/io/strimzi/systemtest/performance/utils/TopicOperatorPerformanceUtils.java
@strimzi-ci run tests --cluster-type=ocp --cluster-version=4.15 --install-type=bundle --profile=performance --testcase=TopicOperatorPerformance#testPerformanceInFixedSizeOfEvents --env=STRIMZI_USE_KRAFT_IN_TESTS=true
Signed-off-by: see-quick <maros.orsak159@gmail.com>
Force-pushed from e50f635 to cbe3e47
Signed-off-by: see-quick <maros.orsak159@gmail.com>
/packit test --labels performance
Signed-off-by: see-quick <maros.orsak159@gmail.com>
Signed-off-by: see-quick <maros.orsak159@gmail.com>
/packit test --labels performance
Signed-off-by: see-quick <maros.orsak159@gmail.com>
/packit test --labels performance
Signed-off-by: see-quick <maros.orsak159@gmail.com>
/packit test --labels performance
Signed-off-by: see-quick <maros.orsak159@gmail.com>
/packit test --labels performance
LGTM, thanks
Signed-off-by: see-quick <maros.orsak159@gmail.com>
Signed-off-by: see-quick <maros.orsak159@gmail.com>
LGTM, thanks for the PR!
Just several nits
systemtest/src/main/java/io/strimzi/systemtest/performance/PerformanceConstants.java
...src/main/java/io/strimzi/systemtest/performance/report/TopicOperatorPerformanceReporter.java
...est/src/main/java/io/strimzi/systemtest/performance/utils/TopicOperatorPerformanceUtils.java
...est/src/main/java/io/strimzi/systemtest/performance/utils/TopicOperatorPerformanceUtils.java
Signed-off-by: see-quick <maros.orsak159@gmail.com>
/packit test --labels performance
Type of change
Description
This PR focuses on exploring the impact of different configurations on the efficiency of creating, modifying, and deleting Kafka topics. I've played around with a range of batch sizes and linger durations to see how they affect performance across different scales of topic counts.
Based on this graph (KRaft):
One can see that I have tried multiple configurations, with batch sizes and linger settings stretching from 1 ms to 2000 ms. Moreover, the range of topics I tested was from 50 to 1000, to see whether a given configuration scales well or runs into problems (which can be viewed on each curve). This could help us understand the capabilities of the UTO with various settings and pick the best configuration for scaling.
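For illustration only, a hypothetical sketch of how such a parameter grid could be enumerated; the concrete batch-size values below are assumptions, while the 1 ms to 2000 ms linger range and the 50 to 1000 topic range come from the description above:

```java
import java.util.List;

public class ConfigGrid {

    public static void main(String[] args) {
        List<Integer> batchSizes = List.of(1, 10, 100, 500);        // assumed example values
        List<Integer> lingerMs   = List.of(1, 10, 100, 1000, 2000); // 1 ms .. 2000 ms per the text

        for (int bs : batchSizes) {
            for (int linger : lingerMs) {
                for (int topics = 50; topics <= 1000; topics += 50) {
                    runScenario(bs, linger, topics);
                }
            }
        }
    }

    // Hypothetical: redeploy the Topic Operator with the given batch size and
    // linger, fire the topic events, and record the end-to-end reconciliation time
    private static void runScenario(int maxBatchSize, int maxBatchLingerMs, int topicCount) { /* ... */ }
}
```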
I have also changed the way we create the events. Previously we did it sequentially; now I have modified it to use an ExecutorService to manage and process batches concurrently. More on that in the Javadoc...
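A minimal sketch of that concurrency change, assuming a simple Runnable-per-event model (the actual implementation lives in TopicOperatorPerformanceUtils):

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.IntStream;

public class ConcurrentBatches {

    public static void processAllEvents(List<Runnable> events, int batchSize) {
        ExecutorService executor = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());

        // split the event list into fixed-size batches and submit each batch
        // to the pool; events within a batch still run serially
        CompletableFuture<?>[] futures = IntStream
                .iterate(0, from -> from < events.size(), from -> from + batchSize)
                .mapToObj(from -> events.subList(from, Math.min(from + batchSize, events.size())))
                .map(batch -> CompletableFuture.runAsync(() -> batch.forEach(Runnable::run), executor))
                .toArray(CompletableFuture[]::new);

        CompletableFuture.allOf(futures).join();  // wait for every batch to finish
        executor.shutdown();
    }
}
```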
[1] - #10050 (review)
Update (19.9.2024):
After a few modifications, we have also decided to remove two use cases from the TO and UO (i.e., Alice's bulk and Bob's streaming). We do not think they add much value, so we will stick to the capacity and scalability tests, which are now present in those test suites.
Checklist