
[js/webgpu] Support Reshape/Shape 21+ on jsep #21871

Merged
merged 3 commits into from
Aug 27, 2024

Conversation

qjia7
Contributor

@qjia7 qjia7 commented Aug 27, 2024

Description

#21618

With this PR, the cross-device copies (MemcpyToHost) can be removed entirely for the wav2vec2 model, and the overall time drops from 604 ms to 48 ms.

Motivation and Context

@qjia7
Contributor Author

qjia7 commented Aug 27, 2024

@guschmue @fs-eire @satyajandhyala Please take a look, thanks.

@qjia7 qjia7 changed the title [js/webgpu] Support Reshape 19+ on jsep [js/webgpu] Support Reshape/Shape 21+ on jsep Aug 27, 2024
@fs-eire
Contributor

fs-eire commented Aug 27, 2024

/azp run Windows ARM64 QNN CI Pipeline,Windows x64 QNN CI Pipeline,Windows CPU CI Pipeline,Windows GPU CUDA CI Pipeline,Windows GPU DML CI Pipeline,Windows GPU Doc Gen CI Pipeline,Windows GPU TensorRT CI Pipeline,ONNX Runtime Web CI Pipeline,Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline

@fs-eire
Contributor

fs-eire commented Aug 27, 2024

/azp run Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,orttraining-amd-gpu-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,onnxruntime-binary-size-checks-ci-pipeline

@fs-eire
Contributor

fs-eire commented Aug 27, 2024

/azp run Big Models,Linux Android Emulator QNN CI Pipeline,Android CI Pipeline,iOS CI Pipeline,ONNX Runtime React Native CI Pipeline


Azure Pipelines successfully started running 1 pipeline(s).


@fs-eire
Contributor

fs-eire commented Aug 27, 2024

Thank you for the fix!

There are two work items that could be the next step:

  • Go through all recently updated ONNX operators to make sure there are no similar issues.
  • Some operators, like Shape, Reshape, Squeeze, Unsqueeze, and Size, do not depend on the actual tensor data at all. ONNX Runtime should never insert MemCpy before those nodes and should not require EPs to register kernels for them. Even with those operators registered, ORT still does not work perfectly, because a CPU tensor used as input to Shape/Reshape causes a CPU-to-GPU upload as well. I saw this in a real SLM before (I don't remember exactly which, maybe phi-2).
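The second item above can be illustrated with a rough sketch. All names here, and the simplistic "next node is the consumer" graph model, are hypothetical illustrations, not ORT's actual graph representation:

```python
# Hypothetical sketch: flag Memcpy nodes inserted before operators whose
# outputs depend only on tensor shapes, never on tensor data. For such ops
# the device copy could be elided entirely.
SHAPE_ONLY_OPS = {"Shape", "Reshape", "Squeeze", "Unsqueeze", "Size"}

def unnecessary_memcpys(node_op_types):
    """Given a graph's node op types in topological order, return indices of
    Memcpy nodes that directly feed a shape-only op.

    Simplification: we treat node i+1 as the consumer of node i; a real
    check would walk actual graph edges.
    """
    flagged = []
    for i, op in enumerate(node_op_types):
        if op in ("MemcpyToHost", "MemcpyFromHost"):
            if i + 1 < len(node_op_types) and node_op_types[i + 1] in SHAPE_ONLY_OPS:
                flagged.append(i)
    return flagged
```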

@guschmue guschmue merged commit 2522220 into microsoft:main Aug 27, 2024
53 checks passed
@gyagp

gyagp commented Aug 27, 2024

I just wrote a script to compare the WebGPU and CPU ops (the script is uploaded as #21879, not for merge). A better approach would be to compare WebGPU against ONNX itself, but I need more time to explore that.
Online references:
Online references:
CPU: https://github.com/microsoft/onnxruntime/blob/main/docs/OperatorKernels.md#cpuexecutionprovider
WebGPU: https://github.com/microsoft/onnxruntime/blob/main/js/web/docs/webgpu-operators.md
ONNX: https://github.com/onnx/onnx/blob/main/docs/Operators.md

I put the results into different categories (2147483647, i.e. INT32_MAX, means "+": supported from that version onward), with some comments attached. Your suggestions are welcome!
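As a side note on the notation: the sentinel 2147483647 (INT32_MAX) marks an open-ended opset range, and a tiny helper like this hypothetical one (not part of the actual script) renders it as "+":

```python
# ONNX Runtime kernel registrations use INT32_MAX as the upper bound of an
# open-ended opset version range, e.g. (21, 2147483647) means "21 and later".
INT32_MAX = 2147483647

def format_range(lo, hi):
    """Render an opset version range as a human-readable string."""
    if hi == INT32_MAX:
        return f"{lo}+"
    if lo == hi:
        return str(lo)
    return f"{lo}-{hi}"
```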

[WebGPU EP needs fixes] // PR is needed
==Cast==
CPU: [(6, 12), (13, 18), (19, 20), (21, 2147483647)]
WebGPU: [(6, 8), (9, 12), (13, 18), (19, 2147483647)]
==ReduceMax==
CPU: [(1, 10), (11, 11), (12, 12), (13, 17), (18, 19), (20, 2147483647)]
WebGPU: [(1, 10), (11, 11), (12, 12), (13, 17), (18, 2147483647)]
==ReduceMin==
CPU: [(1, 10), (11, 11), (12, 12), (13, 17), (18, 19), (20, 2147483647)]
WebGPU: [(1, 10), (11, 11), (12, 12), (13, 17), (18, 2147483647)]
==Squeeze==
CPU: [(1, 10), (11, 12), (13, 20), (21, 2147483647)]
WebGPU: [(1, 10), (11, 12), (13, 2147483647)]
==Unsqueeze==
CPU: [(1, 10), (11, 12), (13, 20), (21, 2147483647)]
WebGPU: [(1, 10), (11, 12), (13, 2147483647)]
==Transpose==
CPU: [(1, 12), (13, 20), (21, 2147483647)]
WebGPU: [(1, 12), (13, 2147483647)]
==AveragePool==
CPU: [(7, 9), (10, 10), (11, 18), (19, 2147483647), (1, 2147483647)]
WebGPU: [(7, 9), (10, 10), (11, 2147483647)]
==Flatten==
CPU: [(1, 8), (9, 10), (11, 12), (13, 20), (21, 2147483647)]
WebGPU: [(1, 8), (9, 10), (11, 12), (13, 2147483647)]
==Pad==
CPU: [(2, 10), (11, 12), (13, 17), (18, 18), (19, 20), (21, 2147483647), (1, 2147483647)]
WebGPU: [(2, 10), (11, 12), (13, 17), (18, 18), (19, 2147483647)]
==If==
CPU: [(1, 10), (11, 12), (13, 15), (16, 18), (19, 20), (21, 2147483647)]
WebGPU: [(1, 10), (11, 12), (13, 18), (19, 2147483647)]

[multiple domains] // I think we're fine not to support com.microsoft domain
==Gelu==
CPU: [(20, 2147483647), (1, 2147483647)]
WebGPU: [(20, 2147483647)]
==Conv==
CPU: [(1, 10), (11, 2147483647), (1, 2147483647)]
WebGPU: [(1, 10), (11, 2147483647)]
==Range==
CPU: [(11, 2147483647), (1, 2147483647)]
WebGPU: [(11, 2147483647)]
==DequantizeLinear==
CPU: [(10, 12), (13, 18), (19, 20), (21, 2147483647), (1, 2147483647)]
WebGPU: [(10, 12), (13, 18), (19, 20), (21, 2147483647)]

[Old version ranges are not implemented] // I suppose we're OK not to support old definitions
==Reshape==
CPU: [(1, 4), (5, 12), (13, 13), (14, 18), (19, 20), (21, 2147483647)]
WebGPU: [(5, 12), (13, 13), (14, 18), (19, 20), (21, 2147483647)]
==ThresholdedRelu==
CPU: [(10, 2147483647), (1, 9)]
WebGPU: [(10, 2147483647)]
==DepthToSpace==
CPU: [(1, 10), (11, 12), (13, 2147483647)]
WebGPU: [(11, 12), (13, 2147483647)]
==LayerNormalization==
CPU: [(17, 2147483647), (1, 16)]
WebGPU: [(17, 2147483647)]

[CPU has more version ranges] // Are we fine with this?
==MatMul==
CPU: [(1, 8), (9, 12), (13, 2147483647)]
WebGPU: [(1, 12), (13, 2147483647)]

[WebGPU has more version ranges] // We should be fine to support more ranges in WebGPU
==MaxPool== // looks like a script error
CPU: [(1, 7), (8, 11), (12, 2147483647), (1, 2147483647)]
WebGPU: [(1, 7), (8, 9), (10, 10), (11, 11), (12, 2147483647)]
==Concat==
CPU: [(4, 10), (11, 12), (13, 2147483647)]
WebGPU: [(1, 3), (4, 10), (11, 12), (13, 2147483647)]
==Split==
CPU: [(2, 10), (11, 12), (13, 17), (18, 2147483647)]
WebGPU: [(1, 1), (2, 10), (11, 12), (13, 17), (18, 2147483647)]
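The kind of comparison behind the diff above could be sketched as follows. This is a hypothetical simplification (the real script is in #21879); the sample data is copied from the Transpose entry in the listing:

```python
# Hypothetical sketch: given per-op lists of registered (start, end) opset
# version ranges for two execution providers, report the ops whose ranges
# differ. Identical range splits are considered a match.
INT32_MAX = 2147483647

def mismatched_ops(cpu_kernels, webgpu_kernels):
    """Return ops registered by both EPs whose version ranges differ.

    cpu_kernels / webgpu_kernels: dict mapping op name -> list of
    (start_version, end_version) tuples.
    """
    return sorted(
        op for op, ranges in cpu_kernels.items()
        if op in webgpu_kernels and sorted(webgpu_kernels[op]) != sorted(ranges)
    )

# Example data, taken from the Transpose entry above: the CPU EP splits off
# a (21, INT32_MAX) range that the WebGPU EP does not register separately.
cpu = {"Transpose": [(1, 12), (13, 20), (21, INT32_MAX)]}
webgpu = {"Transpose": [(1, 12), (13, INT32_MAX)]}
```

Note that a boundary mismatch matters even when every concrete version is nominally covered: for opset 21+ models, ORT resolves kernels by registered range, so a missing (21, INT32_MAX) registration is exactly what caused the Reshape/Shape fallback this PR fixes.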

@gyagp

gyagp commented Aug 27, 2024

@fs-eire @guschmue Please comment on the diff above.

@fs-eire
Contributor

fs-eire commented Aug 30, 2024

The script helped a lot in reducing the complete operator list to a reasonable length. The operators on the list may still need a manual review.
