Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add thread dump collection to report.sh #10895

Merged
merged 1 commit into from
Dec 3, 2024
Merged

Conversation

fvaleri
Copy link
Contributor

@fvaleri fvaleri commented Dec 2, 2024

Type of change

  • Enhancement / new feature

Description

This patch adds thread dump collection to the report.sh tool. Unlike heap dumps, thread dumps are lightweight (~100ms), but still can help diagnose problems and better optimize application and JVM performance. Tested with both ZK and KRaft based clusters.

Checklist

  • Try your changes from Pod inside your Kubernetes and OpenShift cluster, not just locally

@fvaleri
Copy link
Contributor Author

fvaleri commented Dec 2, 2024

Sample output:
report-02-12-2024_11-23-15.zip

@fvaleri fvaleri requested a review from showuon December 2, 2024 09:33
tools/report.sh Outdated Show resolved Hide resolved
@fvaleri fvaleri requested a review from a team December 2, 2024 09:53
@fvaleri fvaleri changed the title Add Kafka thread dump collection to report.sh Add thread dump collection to report.sh Dec 2, 2024
@fvaleri fvaleri force-pushed the threads-report branch 2 times, most recently from a7b3567 to b68f595 Compare December 2, 2024 10:35
@fvaleri fvaleri requested a review from showuon December 2, 2024 10:36
@fvaleri fvaleri force-pushed the threads-report branch 2 times, most recently from f7c5b71 to e1c354d Compare December 2, 2024 10:41
@scholzj
Copy link
Member

scholzj commented Dec 2, 2024

I think there are two possible challenges:

  • It breaks the formatting of the logs which will likely make any analysis based on complete logs from a logging stack more complicated. This is likely not a blocker as other things might break it as well (e.g. our startup logging is a bit clear as it does not inject in the middle of a log, but it still doe not have the proper format). Similarly enabled GC logging would break this. But something to keep in mind.
  • It will use part of the log buffer and thus it might mean the collected logs will not contain as much information from the regular logs (for a basic Kafka cluster without any real clients this seems to do over 1000 lines in the log output - not sure how it grows). Again, I guess this is not a blocker ... it might matter a lotin some cases and not at all in others 🤷 .

As I said, probably not blockers. But it makes me wonder if this should be an optional thing enabled by some option?

@showuon
Copy link
Member

showuon commented Dec 3, 2024

IMO, the thread dump should be collected by default, not optional, since you never know if this information useful or not. How about we collect logs twice: 1 without thread dump, 2 with thread dump. Even if there are duplicated logs, it should be fine for better investigation. That should also resolve all the concerns above. WDYT?

@scholzj
Copy link
Member

scholzj commented Dec 3, 2024

That partially mitigates the second issue - but only partially because the thread dump will be still in the buffer consuming the space. Doesn't do anything with the first issue.

Have you instead considered some better solution that actually addresses the issues? For example something like this?

kubectl exec <PodName> -ti -- /bin/bash -c 'for proc in $(ls -1 /proc/ | grep [0-9]); do if echo "$(ls -lh /proc/$proc/exe 2>/dev/null || true)" | grep -q java; then jcmd $proc Thread.print; fi; done'

@showuon
Copy link
Member

showuon commented Dec 3, 2024

I have thought about jcmd, but it also output the thread dump in the logs as kill -3 does. Unless we're accepting to install jstack or other troubleshooting tools in pod, we might need to include them in logs. IMO, we don't need to install other tools for this, appending in logs is enough.

@scholzj
Copy link
Member

scholzj commented Dec 3, 2024

No, it prints it in its own stdout (i.e. the one of the kubectl exec). Also, jcmd seems to be present in the Strimzi containers.

@fvaleri
Copy link
Contributor Author

fvaleri commented Dec 3, 2024

@scholzj thanks for the jcmd suggestion. I'm using a different command with jcmd, which I think it is easier to grasp and maintain. I'm also creating a dedicated folder for thread dumps. Sample output attached.
report-03-12-2024_12-02-11.zip

@scholzj scholzj added this to the 0.45.0 milestone Dec 3, 2024
@scholzj scholzj requested a review from ppatierno December 3, 2024 11:14
This patch adds thread dump collection to the report.sh tool.
Unlike heap dumps, thread dumps are lightweight (~100ms), but still can help diagnose problems and better optimize application and JVM performance.
Tested with both ZK and KRaft based clusters.

Signed-off-by: Federico Valeri <fedevaleri@gmail.com>
Copy link
Member

@showuon showuon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome! Thanks @scholzj @fvaleri !

Copy link
Member

@ppatierno ppatierno left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks!

@scholzj scholzj merged commit 5de7726 into strimzi:main Dec 3, 2024
13 checks passed
@fvaleri fvaleri deleted the threads-report branch December 3, 2024 13:22
OwenCorrigan76 pushed a commit to OwenCorrigan76/strimzi-kafka-operator that referenced this pull request Dec 6, 2024
Signed-off-by: Federico Valeri <fedevaleri@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants