Caffe2 operation switches current CUDA stream #40020

Open
@sublee

Description

🐛 Bug

A Caffe2 operation in PyTorch does not respect the current CUDA stream. Instead, it replaces the current CUDA stream with its own. As a result:

  • All CUDA kernels launched after the Caffe2 operation are executed on the switched CUDA stream.
  • A Caffe2 operation may nondeterministically produce a corrupted result if it depends on a tensor computed on the original CUDA stream in PyTorch.

To Reproduce

Steps to reproduce the behavior:

  1. Check the current CUDA stream in PyTorch.
  2. Call any Caffe2 operation in PyTorch.
  3. Check the current CUDA stream in PyTorch again.

>>> import torch
# The current CUDA stream is the default stream.
>>> torch.cuda.current_stream()
<torch.cuda.Stream device=cuda:0 cuda_stream=0x0>

>>> x = torch.rand(1).cuda()
>>> torch.ops._caffe2.AliasWithName(x, 'x')
tensor([0.8422], device='cuda:0')

# After a Caffe2 operation, the current CUDA stream has been changed.
>>> torch.cuda.current_stream()
<torch.cuda.Stream device=cuda:0 cuda_stream=0x561aad064c40>

Expected behavior

The current CUDA stream reported after the Caffe2 operation should be the same as the one reported before it.
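The reproduction steps can be wrapped into a small check. This is only a minimal sketch, not code from the original report: `stream_preserved`, `check_alias_with_name`, and their parameters are hypothetical names chosen for illustration. The stream getter is passed in as a callable so the helper itself can be exercised without a GPU.

```python
def stream_preserved(op, get_stream_handle):
    """Return True if calling `op` leaves the current stream unchanged.

    `op` is a zero-argument callable, for example a Caffe2 operation call.
    `get_stream_handle` returns a comparable handle for the current stream.
    Both parameter names are assumptions made for this sketch.
    """
    before = get_stream_handle()
    op()
    after = get_stream_handle()
    return before == after


def check_alias_with_name():
    """Apply the check to the operation used in the reproduction steps.

    Requires a CUDA build of PyTorch that ships the Caffe2 operators,
    so torch is imported lazily here.
    """
    import torch

    x = torch.rand(1).cuda()
    return stream_preserved(
        lambda: torch.ops._caffe2.AliasWithName(x, 'x'),
        # cuda_stream is the raw stream handle shown in the Stream repr.
        lambda: torch.cuda.current_stream().cuda_stream,
    )
```

Based on the behavior described in this report, `check_alias_with_name()` would presumably return False on an affected build and True once the stream is preserved.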

Environment

I reproduced this issue on both PyTorch 1.5.0 and 1.4.0.

PyTorch version: 1.5.0
Is debug build: No
CUDA used to build PyTorch: 10.1

OS: Ubuntu 18.04.3 LTS
GCC version: (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
CMake version: version 3.10.2

Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 10.1.243
GPU models and configuration: GPU 0: Tesla V100-SXM2-32GB
Nvidia driver version: 418.116.00
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.5

Versions of relevant libraries:
[pip] numpy==1.18.1
[pip] torch==1.5.0
[pip] torchvision==0.6.0a0+82fd1c8
[conda] blas                      1.0                         mkl
[conda] cudatoolkit               10.1.243             h6bb024c_0
[conda] mkl                       2020.0                      166
[conda] mkl-service               2.3.0            py37he904b0f_0
[conda] mkl_fft                   1.0.15           py37ha843d7b_0
[conda] mkl_random                1.1.0            py37hd6b4f25_0
[conda] numpy                     1.18.1           py37h4f9e942_0
[conda] numpy-base                1.18.1           py37hde5b4d6_1
[conda] pytorch                   1.5.0           py3.7_cuda10.1.243_cudnn7.6.3_0    pytorch
[conda] torchvision               0.6.0                py37_cu101    pytorch

Additional context

I found this issue while trying to deploy Faster R-CNN from Detectron2. The first output of the deployed model nondeterministically differs from the later outputs.

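import torch

model = torch.jit.load('model.ts')
# x is the same input for every call.
y1 = model(x)
y2 = model(x)
y3 = model(x)
y1 != y2  # the first output nondeterministically differs from the second
y2 == y3  # later outputs agree with each other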
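The symptom can also be expressed as a reusable check. The following is a rough sketch under assumptions: `first_call_is_outlier` and its `equal` parameter are hypothetical names, and the comparison function is supplied by the caller because the structure of the model output is not specified in this report.

```python
def first_call_is_outlier(model, x, equal, runs=3):
    """Return True if the first output differs from the second
    while all later outputs agree with each other.

    `model` is any callable, `x` its input, and `equal(a, b)` a
    caller-supplied comparison returning a bool. All names are
    assumptions made for this sketch.
    """
    if runs < 3:
        raise ValueError("need at least 3 runs to compare later outputs")
    outputs = [model(x) for _ in range(runs)]
    first_differs = not equal(outputs[0], outputs[1])
    later_agree = all(equal(outputs[1], y) for y in outputs[2:])
    return first_differs and later_agree
```

For a model that returns a single tensor, `torch.equal` could be passed as `equal`. A Detectron2 model likely returns structured outputs, in which case a custom comparison would be needed. Because the difference is nondeterministic, a single False result does not rule out the bug.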

The model calls torch.ops._caffe2.GenerateProposals() internally (code). I profiled this function call and rpn_head(), the function call it depends on. The profile shows the unexpected CUDA stream switch at GenerateProposals:

[Image: profiler output showing the CUDA stream switch at GenerateProposals]

One possible workaround is to call an arbitrary Caffe2 operation at the beginning of the model, so that the stream switch happens before the rest of the model runs:

def forward(self, input):
    torch.ops._caffe2.AliasWithName(input, 'input')
    return self.original_model(input)

cc @ngimel

Activity

Labels added on Jun 15, 2020: module: cuda, triaged
glaringlee (Contributor) commented on Jun 15, 2020

@sublee
Can you switch to PyTorch APIs? Caffe2 is being deprecated and migrated into the PyTorch codebase.

sublee (Contributor, Author) commented on Jun 16, 2020

Thanks for the reply. Because using Caffe2 is Detectron2's choice (I don't understand why), I guess it's hard to switch to proper PyTorch APIs. If this is a Caffe2+PyTorch bug that won't be fixed, I will fix my case with the workaround I described.


Labels: caffe2, module: cuda, triaged
