
[Bug] Manually building RTMDet-Inst tensorRT engine fails with internal error #2236

Closed
mattiasbax opened this issue Jul 3, 2023 · 20 comments

Comments

@mattiasbax

Checklist

  • I have searched related issues but cannot get the expected help.
  • I have read the FAQ documentation but cannot get the expected help.
  • The bug has not been fixed in the latest version.

Describe the bug

I want to parse the RTMDet-Inst model and build the serialized engine from the .onnx file representation myself, in order to control which TensorRT version is used (and not necessarily 8.2, which is the default for mmdeploy).

When I try to load the RTMDet-Inst end2end.onnx model created with mmdeploy into a TensorRT Python script to build the engine, I get the following error:

[TRT] [E] 4: [graphShapeAnalyzer.cpp::nvinfer1::builder::`anonymous-namespace'::ShapeAnalyzerImpl::processCheck::862] Error Code 4: Internal Error (/TopK: K exceeds the maximum value allowed (3840).)

Reproduction

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
trt.init_libnvinfer_plugins(logger, '')
runtime = trt.Runtime(logger)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
success = parser.parse_from_file("end2end.onnx")
for idx in range(parser.num_errors):
    print(parser.get_error(idx))

if not success:
    raise RuntimeError("Failed to parse end2end.onnx")

config = builder.create_builder_config()
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 20)
serialized_engine = builder.build_serialized_network(network, config) # <--- this fails with internal error

Environment

07/03 10:52:46 - mmengine - INFO - **********Environmental information**********
07/03 10:52:49 - mmengine - INFO - sys.platform: win32
07/03 10:52:49 - mmengine - INFO - Python: 3.8.16 (default, Mar  2 2023, 03:18:16) [MSC v.1916 64 bit (AMD64)]
07/03 10:52:49 - mmengine - INFO - CUDA available: True
07/03 10:52:49 - mmengine - INFO - numpy_random_seed: 2147483648
07/03 10:52:49 - mmengine - INFO - GPU 0: NVIDIA GeForce RTX 3060 Laptop GPU
07/03 10:52:49 - mmengine - INFO - CUDA_HOME: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.6        
07/03 10:52:49 - mmengine - INFO - NVCC: Cuda compilation tools, release 11.6, V11.6.124
07/03 10:52:49 - mmengine - INFO - MSVC: Microsoft (R) C/C++ Optimizing Compiler Version 19.34.31937 for x64  
07/03 10:52:49 - mmengine - INFO - GCC: n/a
07/03 10:52:49 - mmengine - INFO - PyTorch: 1.13.1+cu116
07/03 10:52:49 - mmengine - INFO - PyTorch compiling details: PyTorch built with:
  - C++ Version: 199711
  - MSVC 192829337
  - Intel(R) Math Kernel Library Version 2020.0.2 Product Build 20200624 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.6.0 (Git Hash 52b5f107dd9cf10910aaa19cb47f3abf9b349815)
  - OpenMP 2019
  - LAPACK is enabled (usually provided by MKL)        
  - CPU capability usage: AVX2
  - CUDA Runtime 11.6
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
07/03 10:52:49 - mmengine - INFO - OpenCV: 4.7.0
07/03 10:52:49 - mmengine - INFO - MMEngine: 0.7.4
07/03 10:52:49 - mmengine - INFO - MMCV: 2.0.0
07/03 10:52:49 - mmengine - INFO - MMCV Compiler: MSVC 193431937
07/03 10:52:49 - mmengine - INFO - MMCV CUDA Compiler: 11.6
07/03 10:52:49 - mmengine - INFO - MMDeploy: 1.0.0rc3+
07/03 10:52:49 - mmengine - INFO -

07/03 10:52:49 - mmengine - INFO - **********Backend information**********
07/03 10:52:49 - mmengine - INFO - tensorrt:    8.6.1
07/03 10:52:49 - mmengine - INFO - tensorrt custom ops: Available
07/03 10:52:49 - mmengine - INFO - ONNXRuntime: 1.14.1
07/03 10:52:49 - mmengine - INFO - ONNXRuntime-gpu:   1.14.1
07/03 10:52:49 - mmengine - INFO - ONNXRuntime custom ops:     NotAvailable
07/03 10:52:49 - mmengine - INFO - pplnn:       None
07/03 10:52:49 - mmengine - INFO - ncnn:        None
07/03 10:52:49 - mmengine - INFO - snpe:        None
07/03 10:52:49 - mmengine - INFO - openvino:    None
07/03 10:52:49 - mmengine - INFO - torchscript: 1.13.1+cu116
07/03 10:52:49 - mmengine - INFO - torchscript custom ops:     NotAvailable
07/03 10:52:49 - mmengine - INFO - rknn-toolkit:      None
07/03 10:52:49 - mmengine - INFO - rknn-toolkit2:     None
07/03 10:52:49 - mmengine - INFO - ascend:      None
07/03 10:52:49 - mmengine - INFO - coreml:      None
07/03 10:52:49 - mmengine - INFO - tvm: None
07/03 10:52:49 - mmengine - INFO -

07/03 10:52:49 - mmengine - INFO - **********Codebase information**********
07/03 10:52:49 - mmengine - INFO - mmdet:       3.0.0
07/03 10:52:49 - mmengine - INFO - mmseg:       None
07/03 10:52:49 - mmengine - INFO - mmcls:       None
07/03 10:52:49 - mmengine - INFO - mmocr:       None
07/03 10:52:49 - mmengine - INFO - mmedit:      None
07/03 10:52:49 - mmengine - INFO - mmdet3d:     None
07/03 10:52:49 - mmengine - INFO - mmpose:      None
07/03 10:52:49 - mmengine - INFO - mmrotate:    None
07/03 10:52:49 - mmengine - INFO - mmaction:    None

Error traceback

[TRT] [E] 4: [graphShapeAnalyzer.cpp::nvinfer1::builder::`anonymous-namespace'::ShapeAnalyzerImpl::processCheck::862] Error Code 4: Internal Error (/TopK: K exceeds the maximum value allowed (3840).)
@RunningLeon
Collaborator

@mattiasbax Hi, how did you get the onnx file? Could you provide sample code?

@mattiasbax
Author

Hi @RunningLeon ,

The onnx file was produced with the provided deploy.py script, i.e.:

python tools/deploy.py configs/mmdet/instance-seg/instance-seg_rtmdet-ins_onnxruntime_static-640x640.py ../mmdetection/configs/rtmdet/rtmdet-ins_s_8xb32-300e_coco.py ../mmdetection/checkpoints/rtmdet-ins_s_8xb32-300e_coco_20221121_212604-fdc5d7ec.pth ../mmdetection/demo/demo.jpg --work-dir ./work_dir/rtmdet-ins-ort-s --device cuda:0 --dump-info

The onnx file has been loaded into a Python script using onnxruntime, and inference ran successfully on both image and video data.
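
For completeness, the onnxruntime check looked roughly like this (a minimal sketch; the 1x3x640x640 dummy input is an assumption based on the static-640x640 config):

import numpy as np
import onnxruntime as ort

# Load the exported model and query its input name instead of hard-coding it
sess = ort.InferenceSession("end2end.onnx",
                            providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
inp = sess.get_inputs()[0]

# Dummy forward pass; shape assumed from the static-640x640 deploy config
dummy = np.random.rand(1, 3, 640, 640).astype(np.float32)
outputs = sess.run(None, {inp.name: dummy})
print([o.shape for o in outputs])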

@RunningLeon
Collaborator

@mattiasbax mmdeploy performs torch2onnx with awareness of the backend configured in the deploy config. So if you want to deploy to TensorRT, you have to rerun tools/deploy.py with a TensorRT deploy config such as configs/mmdet/instance-seg/instance-seg_tensorrt_static-800x1344.py
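
For example, adapting the command above (model config, checkpoint, and image paths are the ones you already used; only the deploy config and work dir change):

python tools/deploy.py configs/mmdet/instance-seg/instance-seg_tensorrt_static-800x1344.py ../mmdetection/configs/rtmdet/rtmdet-ins_s_8xb32-300e_coco.py ../mmdetection/checkpoints/rtmdet-ins_s_8xb32-300e_coco_20221121_212604-fdc5d7ec.pth ../mmdetection/demo/demo.jpg --work-dir ./work_dir/rtmdet-ins-trt-s --device cuda:0 --dump-info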

@mattiasbax
Author

@RunningLeon So is there a difference in the provided .onnx file if I run deploy.py with "instance-seg_tensorrt_static-800x1344.py" versus "instance-seg_rtmdet-ins_onnxruntime_static-640x640"? The .engine file I got from running deploy.py with tensorrt_static_xyz.py was not compatible with my TensorRT version, which is why I wanted to build the engine file myself.

@RunningLeon
Collaborator

Yes, the onnx files are different. I just tested successfully with torch 1.10.0 + TensorRT 8.4.1.5 + CUDA 11.3.

@mattiasbax
Author

@RunningLeon I see, thanks.

I just tried the above script with the other .onnx file, produced using the tensorrt_static_xyz.py config. I got these errors:

[07/03/2023-11:51:42] [TRT] [E] 3: getPluginCreator could not find plugin: TRTBatchedNMS version: 1
[07/03/2023-11:51:42] [TRT] [E] ModelImporter.cpp:771: While parsing node number 463 [TRTBatchedNMS -> "/TRTBatchedNMS_output_0"]:
[07/03/2023-11:51:42] [TRT] [E] ModelImporter.cpp:772: --- Begin node ---
[07/03/2023-11:51:42] [TRT] [E] ModelImporter.cpp:773: input: "/Unsqueeze_10_output_0"
input: "/Where_output_0"
output: "/TRTBatchedNMS_output_0"
output: "/TRTBatchedNMS_output_1"
output: "/TRTBatchedNMS_output_2"
name: "/TRTBatchedNMS"
op_type: "TRTBatchedNMS"
attribute {
  i: 0
  type: INT
}
attribute {
  name: "iou_threshold"
  f: 0.6
  type: FLOAT
}
attribute {
  name: "is_normalized"
  i: 0
  type: INT
}
attribute {
  name: "keep_topk"
  i: 100
  type: INT
}
attribute {
  name: "num_classes"
  i: 80
  type: INT
}
attribute {
  name: "return_index"
  i: 1
  type: INT
}
attribute {
  name: "score_threshold"
  f: 0.05
  type: FLOAT
}
attribute {
  name: "topk"
  i: 5000
  type: INT
}
domain: "mmdeploy"

[07/03/2023-11:51:42] [TRT] [E] ModelImporter.cpp:774: --- End node ---
[07/03/2023-11:51:42] [TRT] [E] ModelImporter.cpp:777: ERROR: builtin_op_importers.cpp:5404 In function importFallbackPluginImporter:
[8] Assertion failed: creator && "Plugin not found, are the plugin name, version, and namespace correct?"
In node 463 (importFallbackPluginImporter): UNSUPPORTED_NODE: Assertion failed: creator && "Plugin not found, are the plugin name, version, and namespace correct?"

I guess I'm missing some details on how to load the plugins correctly.

@RunningLeon
Collaborator

Hi, please rebuild mmdeploy with TensorRT support, or use the docker image: https://mmdeploy.readthedocs.io/en/latest/01-how-to-build/build_from_docker.html#use-docker-image

@mattiasbax
Author

@RunningLeon Ok, I will try to rebuild mmdeploy. Do I have to build it with TensorRT 8.2.3, or can I build it with later versions?

@mattiasbax
Author

Hi @RunningLeon ,

I rebuilt mmdeploy in accordance with https://mmdeploy.readthedocs.io/en/latest/01-how-to-build/windows.html and successfully built mmdeploy_tensorrt_ops.dll.

I still get the above errors:

[07/04/2023-08:46:44] [TRT] [E] ModelImporter.cpp:779: ERROR: builtin_op_importers.cpp:4870 In function importFallbackPluginImporter:
[8] Assertion failed: creator && "Plugin not found, are the plugin name, version, and namespace correct?"
In node 463 (importFallbackPluginImporter): UNSUPPORTED_NODE: Assertion failed: creator && "Plugin not found, are the plugin name, version, and namespace correct?"

How do I use mmdeploy_tensorrt_ops.dll so that my application can find the custom operations?

@RunningLeon
Collaborator

RunningLeon commented Jul 4, 2023

Hi, you could try adding the DLL's directory to the PATH environment variable.
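
Alternatively, you can preload the plugin library from the Python script itself before parsing, so it does not depend on PATH lookup. A minimal sketch (the DLL path is a placeholder, point it at your build output):

import ctypes
import tensorrt as trt

# Preload the mmdeploy custom-op library; its plugin creators register
# themselves with TensorRT when the DLL is loaded
ctypes.CDLL(r"C:\path\to\mmdeploy_tensorrt_ops.dll")  # placeholder path

logger = trt.Logger(trt.Logger.WARNING)
trt.init_libnvinfer_plugins(logger, "")
# ... continue with builder / parser as in the reproduction script above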

@mattiasbax
Author

@RunningLeon I already did that, unfortunately without any success. I even tried moving it directly into the TensorRT lib folder (which is on PATH).

Is there anything else I need from the built custom ops? I did not build the SDKs, only the TensorRT custom ops.

@RunningLeon
Collaborator

Hi, please check the CUDA lib, cuDNN lib, and TensorRT lib directories and add them all to PATH.

@mattiasbax
Author

@RunningLeon They are all correctly on PATH. Finding CUDA, cuDNN, or TensorRT does not seem to be the problem; it seems the DLL that contains the custom ops for the RTMDet-Inst model is simply not being loaded.

Should I try bumping to a later TensorRT version so that I can use load_library (https://docs.nvidia.com/deeplearning/tensorrt/api/python_api/infer/Plugin/IPluginRegistry.html#tensorrt.IPluginRegistry.load_library) to specify the path to mmdeploy_tensorrt_ops.dll?
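
For reference, the call I mean would look roughly like this (untested; the DLL path is a placeholder):

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
trt.init_libnvinfer_plugins(logger, "")

# TensorRT >= 8.6 exposes load_library on the plugin registry
registry = trt.get_plugin_registry()
handle = registry.load_library(r"C:\path\to\mmdeploy_tensorrt_ops.dll")  # placeholder path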

@mattiasbax
Author

Could you maybe send me your built mmdeploy_tensorrt_ops.dll so I can test whether it's the plugin or the environment that is faulty on my end?

@github-actions

This issue is marked as stale because it has been marked as invalid or awaiting response for 7 days without any further response. It will be closed in 5 days if the stale label is not removed or if there is no further response.

@github-actions github-actions bot added the Stale label Jul 12, 2023
@github-actions

This issue is closed because it has been stale for 5 days. Please open a new issue if you have similar issues or you have any new updates now.

@github-actions github-actions bot closed this as not planned Jul 17, 2023
@Yeah-l

Yeah-l commented Aug 21, 2023

I'm experiencing the same error, looking forward to a solution!

@datinje

datinje commented Aug 27, 2023

Same for me: in my case inference with Python passes, but with C++ I get the same error:
ERROR] 4: [graphShapeAnalyzer.cpp::processCheck::862] Error Code 4: Internal Error (/model/proposal_generator/TopK: K exceeds the maximum value allowed (3840).)
Looks like a pattern.

@RunningLeon
Collaborator

same for me : in my case the inference with python passes but with C++ I am getting the same error ERROR] 4: [graphShapeAnalyzer.cpp::processCheck::862] Error Code 4: Internal Error (/model/proposal_generator/TopK: K exceeds the maximum value allowed (3840).) look like a pattern

@datinje hi, could you try this PR #2343?

@SushmaDG

SushmaDG commented Jul 2, 2024

I have the same issue when I try to generate the TensorRT engine file. I have used the tensorrt deploy config as suggested in the comment above: #2236 (comment)

Error[4]: [graphShapeAnalyzer.cpp::nvinfer1::builder::`anonymous-namespace'::ShapeAnalyzerImpl::processCheck::862] 
Error Code 4: Internal Error (/TopK: K exceeds the maximum value allowed (3840).)

I also have the latest mmdeploy version that includes PR #2343, as suggested by @RunningLeon.
