host_callback.call fails on multi-gpu machine (#5577)

Comments
I think that this is not specific to multi-GPU, but can happen even with one GPU (randomly). I think it is related to #4374. There are two fixes possible: fix the implementation of outfeed for XLA:GPU, or replace the implementation mechanism for GPUs to use CustomCall (this is in progress).
Is this any closer to being fixed (or, any ideas for a workaround)?
There are two updates. It turns out that the infeed/outfeed in XLA:GPU is not easy to fix for multi-GPU, so that hope has dimmed. The second update is more positive: we have a new implementation in the works for GPU, using XLA CustomCall. This means that the host callback will be synchronous. This implementation was blocked on GPU by another XLA bug that has since been fixed. The plan is to enable this second implementation mechanism, selectable with an environment variable and a command-line flag. The change involves both Python and C++ and will take at least a couple of weeks to land. Sorry for the delay!
Sorry for the bump, but what's the current status of the second update?
The custom call on GPU has now landed but is not used in host callback quite yet. You can try out the new callback mechanism on GPU with …
@C-J-Cundy @AllanChain can you say more about your intended use case? For example, is it to have a callback for a debugging side effect (like printing), or to perform some functionally pure numerical computation (on the host), or something else? I ask because if it's one of those two applications, we can recommend a replacement API (without having to wait for the HCB API to be ported to the new implementation).
What would be the replacement API in that case? |
It is …
It was the former, but I have figured out my problem and am not waiting for this anymore.
For reference, the "callback for a debugging side-effect (like printing)" is …
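The specific replacement API names are truncated in the thread above, so the following is an illustration only, not necessarily the maintainers' exact recommendation: for the "functionally pure numerical computation on the host" use case, current JAX provides `jax.pure_callback`, which calls a host-side Python function from inside a jitted computation. A minimal sketch:

```python
import numpy as np
import jax
import jax.numpy as jnp

def host_fn(x):
    # Runs on the host and receives a NumPy array.
    return np.sin(x)

@jax.jit
def f(x):
    # The second argument declares the shape/dtype the host function returns,
    # so JAX can trace through the call without executing it.
    return jax.pure_callback(host_fn, jax.ShapeDtypeStruct(x.shape, x.dtype), x)

print(f(jnp.arange(4.0)))
```

Unlike `host_callback.call`, `pure_callback` assumes the host function is pure (no side effects), which lets JAX freely transform, batch, or elide the call.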
Not sure if this is still an issue. |
If I run the following code:

on a 2-GPU machine, then it crashes with the error message

If I run with one GPU (by setting CUDA_VISIBLE_DEVICES=0) it finishes with no errors. Is there something I've missed in the documentation for host_callback about how it should be used on multi-device setups?

I ran both with the full debug information:
CUDA_VISIBLE_DEVICES=0 TF_CPP_MIN_LOG_LEVEL=0 TF_CPP_VMODULE=outfeed_receiver=3,host_callback=3,outfeed_receiver_py=3,outfeed_thunk=3,xfeed_manager=3 python test_2.py --verbosity=2 2> test_output_one_gpu.txt
if that's helpful.

test_output_one_gpu.txt
test_output_two_gpu.txt
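The original code snippet and error message were not preserved in this copy of the report. As a hedged sketch only, the kind of program the report describes — a host-side function invoked from a jitted computation via `jax.experimental.host_callback.call` — looks roughly like this; the host function `host_double` is a hypothetical stand-in, not the reporter's actual code:

```python
import jax
import jax.numpy as jnp
from jax.experimental import host_callback as hcb

def host_double(x):
    # Runs on the host; receives a NumPy array from the device.
    return 2 * x

@jax.jit
def f(x):
    # result_shape declares the shape/dtype the host function returns.
    return hcb.call(host_double, x, result_shape=x)

print(f(jnp.arange(3.0)))
```

On a single device this completes normally; the crash reported above occurs when the same pattern runs on a machine with multiple GPUs, because the outfeed-based implementation of host_callback was unreliable on XLA:GPU.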