
[js/webgpu] Support Reshape/Shape 21+ on jsep #21871

Merged
merged 3 commits into from
Aug 27, 2024

Conversation

qjia7
Contributor

@qjia7 qjia7 commented Aug 27, 2024

Description

#21618

With this PR, the cross-device copies (MemcpyToHost) can be removed entirely for the wav2vec2 model, and the overall time drops from 604 ms to 48 ms.

Motivation and Context

@qjia7
Contributor Author

qjia7 commented Aug 27, 2024

@guschmue @fs-eire @satyajandhyala Please take a look, thanks.

@qjia7 qjia7 changed the title [js/webgpu] Support Reshape 19+ on jsep [js/webgpu] Support Reshape/Shape 21+ on jsep Aug 27, 2024
@fs-eire
Contributor

fs-eire commented Aug 27, 2024

/azp run Windows ARM64 QNN CI Pipeline,Windows x64 QNN CI Pipeline,Windows CPU CI Pipeline,Windows GPU CUDA CI Pipeline,Windows GPU DML CI Pipeline,Windows GPU Doc Gen CI Pipeline,Windows GPU TensorRT CI Pipeline,ONNX Runtime Web CI Pipeline,Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline

@fs-eire
Contributor

fs-eire commented Aug 27, 2024

/azp run Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,orttraining-amd-gpu-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,onnxruntime-binary-size-checks-ci-pipeline

@fs-eire
Contributor

fs-eire commented Aug 27, 2024

/azp run Big Models,Linux Android Emulator QNN CI Pipeline,Android CI Pipeline,iOS CI Pipeline,ONNX Runtime React Native CI Pipeline


Azure Pipelines successfully started running 1 pipeline(s).


@fs-eire
Contributor

fs-eire commented Aug 27, 2024

Thank you for the fix!

There are two work items that could be the next step:

  • Go through all recently updated ONNX operators to make sure there are no similar issues.
  • Some operators, like Shape, Reshape, Squeeze, Unsqueeze, and Size, do not depend on the actual tensor data at all. ONNX Runtime should never insert MemCpy before those nodes and should not require EPs to register kernels for them. Even with those operators registered, ORT still does not work perfectly, because a CPU tensor used as input to Shape/Reshape causes a CPU-to-GPU upload as well. I saw this in a real SLM before (I don't remember exactly which, maybe phi-2).
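The second item above can be illustrated with a rough sketch. All names here, and the simplistic "next node is the consumer" graph model, are hypothetical illustrations, not ORT's actual graph representation:

```python
# Hypothetical sketch: flag Memcpy nodes inserted before operators whose
# outputs depend only on tensor shapes, never on tensor data. For such ops
# the device copy could be elided entirely.
SHAPE_ONLY_OPS = {"Shape", "Reshape", "Squeeze", "Unsqueeze", "Size"}

def unnecessary_memcpys(node_op_types):
    """Given a graph's node op types in topological order, return indices of
    Memcpy nodes that directly feed a shape-only op.

    Simplification: we treat node i+1 as the consumer of node i; a real
    check would walk actual graph edges.
    """
    flagged = []
    for i, op in enumerate(node_op_types):
        if op in ("MemcpyToHost", "MemcpyFromHost"):
            if i + 1 < len(node_op_types) and node_op_types[i + 1] in SHAPE_ONLY_OPS:
                flagged.append(i)
    return flagged
```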

@guschmue guschmue merged commit 2522220 into microsoft:main Aug 27, 2024
53 checks passed
@gyagp

gyagp commented Aug 27, 2024

I just wrote a script to compare the WebGPU and CPU ops (the script is uploaded as #21879, not for merge). A better approach would be to compare WebGPU against ONNX itself, but I need more time to explore that.
Online references:
Online references:
CPU: https://github.com/microsoft/onnxruntime/blob/main/docs/OperatorKernels.md#cpuexecutionprovider
WebGPU: https://github.com/microsoft/onnxruntime/blob/main/js/web/docs/webgpu-operators.md
ONNX: https://github.com/onnx/onnx/blob/main/docs/Operators.md

I put the results into different categories (2147483647, i.e. INT32_MAX, means "+": supported from that version onward), with some comments attached. Your suggestions are welcome!
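As a side note on the notation: the sentinel 2147483647 (INT32_MAX) marks an open-ended opset range, and a tiny helper like this hypothetical one (not part of the actual script) renders it as "+":

```python
# ONNX Runtime kernel registrations use INT32_MAX as the upper bound of an
# open-ended opset version range, e.g. (21, 2147483647) means "21 and later".
INT32_MAX = 2147483647

def format_range(lo, hi):
    """Render an opset version range as a human-readable string."""
    if hi == INT32_MAX:
        return f"{lo}+"
    if lo == hi:
        return str(lo)
    return f"{lo}-{hi}"
```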

[WebGPU EP needs fixes] // PR is needed
==Cast==
CPU: [(6, 12), (13, 18), (19, 20), (21, 2147483647)]
WebGPU: [(6, 8), (9, 12), (13, 18), (19, 2147483647)]
==ReduceMax==
CPU: [(1, 10), (11, 11), (12, 12), (13, 17), (18, 19), (20, 2147483647)]
WebGPU: [(1, 10), (11, 11), (12, 12), (13, 17), (18, 2147483647)]
==ReduceMin==
CPU: [(1, 10), (11, 11), (12, 12), (13, 17), (18, 19), (20, 2147483647)]
WebGPU: [(1, 10), (11, 11), (12, 12), (13, 17), (18, 2147483647)]
==Squeeze==
CPU: [(1, 10), (11, 12), (13, 20), (21, 2147483647)]
WebGPU: [(1, 10), (11, 12), (13, 2147483647)]
==Unsqueeze==
CPU: [(1, 10), (11, 12), (13, 20), (21, 2147483647)]
WebGPU: [(1, 10), (11, 12), (13, 2147483647)]
==Transpose==
CPU: [(1, 12), (13, 20), (21, 2147483647)]
WebGPU: [(1, 12), (13, 2147483647)]
==AveragePool==
CPU: [(7, 9), (10, 10), (11, 18), (19, 2147483647), (1, 2147483647)]
WebGPU: [(7, 9), (10, 10), (11, 2147483647)]
==Flatten==
CPU: [(1, 8), (9, 10), (11, 12), (13, 20), (21, 2147483647)]
WebGPU: [(1, 8), (9, 10), (11, 12), (13, 2147483647)]
==Pad==
CPU: [(2, 10), (11, 12), (13, 17), (18, 18), (19, 20), (21, 2147483647), (1, 2147483647)]
WebGPU: [(2, 10), (11, 12), (13, 17), (18, 18), (19, 2147483647)]
==If==
CPU: [(1, 10), (11, 12), (13, 15), (16, 18), (19, 20), (21, 2147483647)]
WebGPU: [(1, 10), (11, 12), (13, 18), (19, 2147483647)]

[multiple domains] // I think we're fine not to support com.microsoft domain
==Gelu==
CPU: [(20, 2147483647), (1, 2147483647)]
WebGPU: [(20, 2147483647)]
==Conv==
CPU: [(1, 10), (11, 2147483647), (1, 2147483647)]
WebGPU: [(1, 10), (11, 2147483647)]
==Range==
CPU: [(11, 2147483647), (1, 2147483647)]
WebGPU: [(11, 2147483647)]
==DequantizeLinear==
CPU: [(10, 12), (13, 18), (19, 20), (21, 2147483647), (1, 2147483647)]
WebGPU: [(10, 12), (13, 18), (19, 20), (21, 2147483647)]

[Old version ranges are not implemented] // I suppose we're OK not to support old definitions
==Reshape==
CPU: [(1, 4), (5, 12), (13, 13), (14, 18), (19, 20), (21, 2147483647)]
WebGPU: [(5, 12), (13, 13), (14, 18), (19, 20), (21, 2147483647)]
==ThresholdedRelu==
CPU: [(10, 2147483647), (1, 9)]
WebGPU: [(10, 2147483647)]
==DepthToSpace==
CPU: [(1, 10), (11, 12), (13, 2147483647)]
WebGPU: [(11, 12), (13, 2147483647)]
==LayerNormalization==
CPU: [(17, 2147483647), (1, 16)]
WebGPU: [(17, 2147483647)]

[CPU has more version ranges] // Are we fine with this?
==MatMul==
CPU: [(1, 8), (9, 12), (13, 2147483647)]
WebGPU: [(1, 12), (13, 2147483647)]

[WebGPU has more version ranges] // We should be fine to support more ranges in WebGPU
==MaxPool== // looks like a script error
CPU: [(1, 7), (8, 11), (12, 2147483647), (1, 2147483647)]
WebGPU: [(1, 7), (8, 9), (10, 10), (11, 11), (12, 2147483647)]
==Concat==
CPU: [(4, 10), (11, 12), (13, 2147483647)]
WebGPU: [(1, 3), (4, 10), (11, 12), (13, 2147483647)]
==Split==
CPU: [(2, 10), (11, 12), (13, 17), (18, 2147483647)]
WebGPU: [(1, 1), (2, 10), (11, 12), (13, 17), (18, 2147483647)]
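The kind of comparison behind the diff above could be sketched as follows. This is a hypothetical simplification (the real script is in #21879); the sample data is copied from the Transpose entry in the listing:

```python
# Hypothetical sketch: given per-op lists of registered (start, end) opset
# version ranges for two execution providers, report the ops whose ranges
# differ. Identical range splits are considered a match.
INT32_MAX = 2147483647

def mismatched_ops(cpu_kernels, webgpu_kernels):
    """Return ops registered by both EPs whose version ranges differ.

    cpu_kernels / webgpu_kernels: dict mapping op name -> list of
    (start_version, end_version) tuples.
    """
    return sorted(
        op for op, ranges in cpu_kernels.items()
        if op in webgpu_kernels and sorted(webgpu_kernels[op]) != sorted(ranges)
    )

# Example data, taken from the Transpose entry above: the CPU EP splits off
# a (21, INT32_MAX) range that the WebGPU EP does not register separately.
cpu = {"Transpose": [(1, 12), (13, 20), (21, INT32_MAX)]}
webgpu = {"Transpose": [(1, 12), (13, INT32_MAX)]}
```

Note that a boundary mismatch matters even when every concrete version is nominally covered: for opset 21+ models, ORT resolves kernels by registered range, so a missing (21, INT32_MAX) registration is exactly what caused the Reshape/Shape fallback this PR fixes.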

@gyagp

gyagp commented Aug 27, 2024

@fs-eire @guschmue Please comment on the diff above.

@fs-eire
Contributor

fs-eire commented Aug 30, 2024

The script helped a lot in reducing the complete operator list to a reasonable length. The operators on the list may still need a manual review.
