Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Core] Add Running Count to instrumented_io_context #17664

Merged
merged 4 commits into from
Aug 12, 2021

Conversation

lixin-wei
Copy link
Contributor

@lixin-wei lixin-wei commented Aug 8, 2021

Why are these changes needed?

I add a running count to instrumented_io_context in this PR.

This is very helpful when debugging with blocked IO services.

Before:

Global stats: 887 total (1 active)
Queueing time: mean = 113.537 ms, max = 25.225 s, min = 4.301 us, total = 100.707 s
Execution time:  mean = 496.666 us, total = 440.543 ms
Handler stats:
        NodeResourceInfoGcsService.GetResources.reply_callback - 583 total (0 active), CPU time: mean = 45.699 us, total = 26.642 ms
        UNKNOWN - 161 total (0 active), CPU time: mean = 7.145 us, total = 1.150 ms
        RedisCallbackManager.DispatchCallback - 102 total (0 active), CPU time: mean = 213.547 us, total = 21.782 ms
        HeartbeatInfoGcsService.ReportHeartbeat.reply_callback - 22 total (1 active), CPU time: mean = 10.835 ms, total = 238.377 ms

After:

Global stats: 5 total (2 active)
Queueing time: mean = 9.250 us, max = 24.803 us, min = 8.519 us, total = 46.248 us
Execution time:  mean = 6.676 ms, total = 33.380 ms
Handler stats:
        UNKNOWN - 2 total (0 active), CPU time: mean = 42.078 us, total = 84.156 us
        NodeInfoGcsService.GetInternalConfig.reply_callback - 1 total (0 active), CPU time: mean = 33.296 ms, total = 33.296 ms
        FrozenNodeGcsService.GetAllFrozenNodes.reply_callback - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
        [1 RUNNING] NodeInfoGcsService.RegisterNode.reply_callback - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@rkooo567
Copy link
Contributor

rkooo567 commented Aug 8, 2021

I've thinking (1 active) means it is running. What are the differences here?

@lixin-wei
Copy link
Contributor Author

I've thinking (1 active) means it is running. What are the differences here?

No, (1 active) means there is totally 1 handler in the IO service, including running and pending.

@rkooo567
Copy link
Contributor

rkooo567 commented Aug 9, 2021

Maybe considering to change the format this way?

        [1 RUNNING] NodeInfoGcsService.RegisterNode.reply_callback - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
-> 
NodeInfoGcsService.RegisterNode.reply_callback - 1 total (1 active, 1 running), CPU time: mean = 0.000 s, total = 0.000 s

@lixin-wei
Copy link
Contributor Author

@rkooo567 Yeah, this format is better. I've committed it, please have a look :)

Copy link
Contributor

@rkooo567 rkooo567 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@rkooo567
Copy link
Contributor

Retrying rllib test. @clarkzinzow it'll be great if you can give us final pass

@rkooo567
Copy link
Contributor

cc @lixin-wei can you handle the merge conflict? I will merge once that's done!

@lixin-wei
Copy link
Contributor Author

@rkooo567 done :)

@lixin-wei
Copy link
Contributor Author

lixin-wei commented Aug 11, 2021

BTW, I've looked quickly through this PR which was in conflict with my PR.

I think it's better not to use macros. It's hard to read and debug.

@rkooo567
Copy link
Contributor

@lixin-wei sorry that PR was reverted, which creates a conflict again... I promise to merge this before that PR is re-merged.

cc @iycheng about his comment #17664 (comment).

IIUC, it is kind of difficult to make it work without macro?

@lixin-wei
Copy link
Contributor Author

never mind :) conflict fixed.

@rkooo567
Copy link
Contributor

rkooo567 commented Aug 12, 2021

Let me merge it when it passes CI...

@ericl ericl merged commit d287fc9 into ray-project:master Aug 12, 2021
@lixin-wei lixin-wei deleted the add_running_count branch August 16, 2021 09:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants