Some models Failed on Windows CPU #568

Closed
mszhanyi opened this issue Oct 22, 2022 · 24 comments · Fixed by #572
@mszhanyi

Ask a Question

Question

"VGG_16_int8_opset12_zoo_CPU"
"SSD_int8_opset12_zoo_CPU"
"ShuffleNet_v2_int8_opset12_zoo_CPU"
"ResNet50_int8_opset12_zoo_CPU"
"ResNet50_qdq_opset12_zoo_CPU"
"MobileNet_v2_1_0_qdq_opset12_zoo_CPU"
"MobileNet_v2_1_0_int8_opset12_zoo_CPU"
"Inception_1_int8_opset12_zoo_CPU"
"Faster_R_CNN_R_50_FPN_int8_opset12_zoo_CPU"
"BERT_Squad_int8_opset12_zoo_CPU"
"EfficientNet_Lite4_qdq_opset11_zoo_CPU"
"EfficientNet_Lite4_int8_opset11_zoo_CPU"
"FCN_ResNet_50_opset11_zoo_CPU"
"FCN_ResNet_101_opset11_zoo_CPU"

Are these known issues?

Exception Message

2022-10-21T14:37:12.6254993Z 6: [  FAILED  ] ModelTests/ModelTest.Run/VGG_16_int8_opset12zoo_CPU, where GetParam() = (00000202C3999A30, 4-byte object <01-00 00-00>)
2022-10-21T14:37:12.6255920Z 6: [  FAILED  ] ModelTests/ModelTest.Run/SSD_int8_opset12zoo_CPU, where GetParam() = (00000202C39989F0, 4-byte object <01-00 00-00>)
2022-10-21T14:37:12.6256645Z 6: [  FAILED  ] ModelTests/ModelTest.Run/ShuffleNet_v2_int8_opset12zoo_CPU, where GetParam() = (00000202C3998D10, 4-byte object <01-00 00-00>)
2022-10-21T14:37:12.6257265Z 6: [  FAILED  ] ModelTests/ModelTest.Run/ResNet50_int8_opset12zoo_CPU, where GetParam() = (00000202C3998E50, 4-byte object <01-00 00-00>)
2022-10-21T14:37:12.6257873Z 6: [  FAILED  ] ModelTests/ModelTest.Run/ResNet50_qdq_opset12zoo_CPU, where GetParam() = (00000202C3999CB0, 4-byte object <01-00 00-00>)
2022-10-21T14:37:12.6258699Z 6: [  FAILED  ] ModelTests/ModelTest.Run/MobileNet_v2_1_0_qdq_opset12zoo_CPU, where GetParam() = (00000202C3998A90, 4-byte object <01-00 00-00>)
2022-10-21T14:37:12.6259382Z 6: [  FAILED  ] ModelTests/ModelTest.Run/MobileNet_v2_1_0_int8_opset12zoo_CPU, where GetParam() = (00000202C3999530, 4-byte object <01-00 00-00>)
2022-10-21T14:37:12.6260101Z 6: [  FAILED  ] ModelTests/ModelTest.Run/Inception_1_int8_opset12zoo_CPU, where GetParam() = (00000202C3998DB0, 4-byte object <01-00 00-00>)
2022-10-21T14:37:12.6260846Z 6: [  FAILED  ] ModelTests/ModelTest.Run/Faster_R_CNN_R_50_FPN_int8_opset12zoo_CPU, where GetParam() = (00000202C3998770, 4-byte object <01-00 00-00>)
2022-10-21T14:37:12.6261476Z 6: [  FAILED  ] ModelTests/ModelTest.Run/BERT_Squad_int8_opset12zoo_CPU, where GetParam() = (00000202CF60B8E0, 4-byte object <01-00 00-00>)
2022-10-21T14:37:12.6262077Z 6: [  FAILED  ] ModelTests/ModelTest.Run/FCN_ResNet_50_opset11zoo_CPU, where GetParam() = (0000020281105630, 4-byte object <01-00 00-00>)
2022-10-21T14:37:12.6262690Z 6: [  FAILED  ] ModelTests/ModelTest.Run/FCN_ResNet_101_opset11zoo_CPU, where GetParam() = (0000020281105E50, 4-byte object <01-00 00-00>)
2022-10-21T14:37:12.6263319Z 6: [  FAILED  ] ModelTests/ModelTest.Run/EfficientNet_Lite4_qdq_opset11zoo_CPU, where GetParam() = (0000020281106A30, 4-byte object <01-00 00-00>)
2022-10-21T14:37:12.6263957Z 6: [  FAILED  ] ModelTests/ModelTest.Run/EfficientNet_Lite4_int8_opset11zoo_CPU, where GetParam() = (00000202811054F0, 4-byte object <01-00 00-00>)
@jcwchen
Member

jcwchen commented Oct 23, 2022

Hi @mszhanyi,
Thanks for the report. What kinds of failures are these? Are they output mismatches? Regarding the -int8 or -qdq models, their outputs can differ across CPUs, for example with and without VNNI support. (See #522)

But FCN_ResNet_50 and FCN_ResNet_101 are not quantized models... I was wondering what the errors are for them?

@mszhanyi
Author

mszhanyi commented Oct 23, 2022

@jcwchen
The error message is

2022-10-21T14:25:40.4140241Z 6: [ RUN      ] ModelTests/ModelTest.Run/FCN_ResNet_50_opset11zoo_CPU
2022-10-21T14:25:41.5542157Z 6: D:\a\_work\1\s\winml\test\model\model_tests.cpp(131): error: Expected: results = session.Evaluate(binding, L"Testing") doesn't throw an exception.
2022-10-21T14:25:41.5549681Z 6:   Actual: it throws.
2022-10-21T14:25:41.5911396Z 6: unknown file: error: SEH exception with code 0xc0000005 thrown in the test body.
2022-10-21T14:25:41.5912289Z 6: [  FAILED  ] ModelTests/ModelTest.Run/FCN_ResNet_50_opset11zoo_CPU, where GetParam() = (0000020281105630, 4-byte object <01-00 00-00>) (1177 ms)

from the log
https://dev.azure.com/onnxruntime/2a773b67-e88b-4c7f-9fc0-87d31fea8ef2/_apis/build/builds/789802/logs/23

https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=789802&view=results

mszhanyi added a commit to microsoft/onnxruntime that referenced this issue Oct 25, 2022
…13407)

### Description
1. update model name structure in model_tests.cpp with source name, to avoid
   `Condition test_param_names.count(param_name) == 0 failed. Duplicate parameterized test name 'BERT_Squad_opset10_CPU'`
2. skip some failed models onnx/models#568


@jcwchen
Member

jcwchen commented Oct 25, 2022

Based on this line, it seems that their outputs mismatch even after accounting for tolerance. As I mentioned above, -int8 and -qdq models will produce different output without VNNI support, and originally their test data in the ONNX Model Zoo were produced by machines whose CPUs have VNNI support.

But for FCN_ResNet_50 and FCN_ResNet_101, their output should be correct and reproducible. I just tested them with onnxruntime==1.12.0 (CPU EP by default) and the original output matches the inferred one. May I ask what execution provider these tests were using?
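For context, a minimal sketch of the kind of local check described above: run a Model Zoo model with its bundled test_data_set on the ORT CPU EP and compare against the stored reference outputs. The directory layout follows the Model Zoo convention (model.onnx plus test_data_set_0/input_*.pb and output_*.pb); the model path and tolerances here are illustrative assumptions, not the exact script used in this thread.

```python
# Minimal sketch: validate a Model Zoo model against its bundled test data on
# the CPU EP. The model directory path and tolerances are illustrative only.
import glob
import os

import numpy as np
import onnx
import onnxruntime as ort
from onnx import numpy_helper

model_dir = "fcn-resnet50-11"  # hypothetical extracted archive
data_dir = os.path.join(model_dir, "test_data_set_0")

def load_tensors(pattern):
    # Each .pb file is a serialized onnx.TensorProto.
    arrays = []
    for path in sorted(glob.glob(os.path.join(data_dir, pattern))):
        t = onnx.TensorProto()
        with open(path, "rb") as f:
            t.ParseFromString(f.read())
        arrays.append(numpy_helper.to_array(t))
    return arrays

inputs = load_tensors("input_*.pb")
expected = load_tensors("output_*.pb")

sess = ort.InferenceSession(os.path.join(model_dir, "model.onnx"),
                            providers=["CPUExecutionProvider"])
# Assumes input_*.pb files are stored in the same order as the session inputs.
feed = {meta.name: arr for meta, arr in zip(sess.get_inputs(), inputs)}
actual = sess.run(None, feed)

for exp, act in zip(expected, actual):
    np.testing.assert_allclose(exp, act, rtol=1e-3, atol=1e-5)
print("outputs match within tolerance")
```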

@yuslepukhin

Is this an onnxruntime-specific discussion? If so, it would be better to move it there.

@mszhanyi
Author

mszhanyi commented Oct 28, 2022

@jcwchen Thanks for your reply.
The onnxruntime Windows CI runs on Azure Dsv5 machines; the CPU flags are
avx512bitalg, avx512bw, avx512cd, avx512dq, avx512f, avx512ifma, avx512vbmi, avx512vbmi2, avx512vl, avx512vnni, avx512vpopcntdq

And I noticed you said "Unfortunately I don't have machines with avx512_vnni support", so I am not sure whether machines with avx512_vnni and machines with only avx512f (without avx512_vnni) produce the same outputs.

Did you create the test data on an avx512f machine? I guess that may be the reason.

I upgraded my office desktop (a DELL Precision 5820) last year, and it has avx512vnni too.
cc @snnn
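As a side note, a quick way to check which of these flags a given host actually exposes, as a minimal sketch assuming the third-party py-cpuinfo package is installed (flag spellings vary by platform):

```python
# Minimal sketch: print whether the local CPU reports AVX-512F and AVX-512 VNNI.
# Requires the third-party py-cpuinfo package (pip install py-cpuinfo).
import cpuinfo

flags = set(cpuinfo.get_cpu_info().get("flags", []))
print("avx512f:    ", "avx512f" in flags)
# Linux typically reports avx512_vnni; some tools drop the underscore.
print("avx512_vnni:", bool({"avx512_vnni", "avx512vnni"} & flags))
```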

@snnn
Contributor

snnn commented Oct 28, 2022

@jcwchen Some Azure machines have VNNI. I can give you one.

@jcwchen
Member

jcwchen commented Oct 28, 2022

Sorry, I shouldn't have just said "VNNI support"; that causes confusion. To clarify: all -int8 and -qdq models in the ONNX Model Zoo were generated by machines with avx512f support, and they also passed CI (GitHub Actions machines sometimes do have avx512f support) in the ONNX Model Zoo by running the ORT CPU EP with the uploaded test_data_set. That's why I was wondering whether the CI you are running, which hits these output mismatch failures, has avx512f support or not.

These are the CPU flags of the GitHub Actions machines in the ONNX Model Zoo CI:

Flags: 3dnow, 3dnowprefetch, abm, adx, aes, apic, avx, avx2, avx512bw, avx512cd, avx512dq, avx512f, avx512vl, bmi1, bmi2, clflush, clflushopt, cmov, cx16, cx8, de, dts, erms, f16c, fma, fpu, fxsr, hle, ht, hypervisor, ia64, invpcid, lahf_lm, mca, mce, mmx, movbe, msr, mtrr, osxsave, pae, pat, pcid, pclmulqdq, pge, pni, popcnt, pse, pse36, rdrnd, rdseed, rtm, sep, serial, smap, smep, ss, sse, sse2, sse4_1, sse4_2, ssse3, tm, tsc, vme, xsave

@yuslepukhin

yuslepukhin commented Oct 28, 2022

There are access-violation SEH exceptions occurring. This means it crashes; it is not a C++ exception.
This has been happening for some time in the packaging pipeline, and only with WinML builds.

@yuslepukhin

yuslepukhin commented Oct 28, 2022 via email

@snnn
Contributor

snnn commented Oct 28, 2022

@yufenglee, if I understand correctly, this indicates we have bugs in some kernels that make use of VNNI. @jcwchen's machines do not have VNNI, so they didn't actually run the int8 models in int8 mode. But the ORT team's machines now have VNNI, so they caught the error.

@snnn
Contributor

snnn commented Oct 28, 2022

@yuslepukhin, the error mszhanyi showed us was generated on a CPU machine. Though it used WinML, it didn't use DirectML, so the error would have come from our CPU EP.

@yufenglee

@yufenglee, if I understand correctly, this indicates we have bugs in some kernels that make use of VNNI. @jcwchen's machines do not have VNNI, so they didn't actually run the int8 models in int8 mode. But the ORT team's machines now have VNNI, so they caught the error.

@snnn, this is not a bug. Quantization with U8S8 or S8S8 can saturate on machines without VNNI when computing quantized MatMul and Conv: https://onnxruntime.ai/docs/performance/quantization.html#when-and-why-do-i-need-to-try-u8u8.

We added a session option to avoid the issue for quantization with the QDQ format, but it leads to worse latency: https://github.com/microsoft/onnxruntime/blob/0b0c51e02890ea35d0e5023681f2362421e44ceb/include/onnxruntime/core/session/onnxruntime_session_options_config_keys.h#L101

For the test case, we should generate the output on a VNNI machine or with this option on, and also run the model tests with the option enabled on machines without VNNI.
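For reference, this is roughly how that option can be set from the onnxruntime Python API; the config key string is the one referenced later in this thread, while the model filename is just a placeholder (a sketch, not the exact test harness used here):

```python
# Minimal sketch: enable the x64 quantization-precision option before creating
# a session, to reduce u8s8/s8s8 saturation on machines without VNNI
# (at the cost of extra latency). The model filename is a placeholder.
import onnxruntime as ort

so = ort.SessionOptions()
so.add_session_config_entry("session.x64quantprecision", "1")

sess = ort.InferenceSession("resnet50-v1-12-int8.onnx", so,
                            providers=["CPUExecutionProvider"])
```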

@snnn
Contributor

snnn commented Oct 28, 2022

For the test case, we should generate the output on a VNNI machine or with this option on,

@jcwchen , what do you think?

linnealovespie pushed a commit to microsoft/onnxruntime that referenced this issue Oct 28, 2022
@jcwchen
Member

jcwchen commented Oct 29, 2022

Thank you everyone for the helpful pointers. I previously thought all of the existing test_data_set for -int8.onnx and -qdq.onnx models were generated by machines with VNNI support, but I am not very sure. At least models merged after this PR (#526) should be.

@mengniwang95 May I ask whether it is correct that all existing test_data_set you have uploaded for -int8.onnx and -qdq.onnx models were generated by machines with VNNI support?

@jcwchen
Member

jcwchen commented Oct 29, 2022

I ran some experiments locally on my machine without VNNI support and enabled so.add_session_config_entry("session.x64quantprecision", "1") for the InferenceSession, but the results seem similar:

  • Original model+test_data_set which previously CANNOT pass (output mismatch) locally: with session.x64quantprecision enabled, the original output still mismatches: yolov3-12-int8.tar.gz.
  • Original model+test_data_set which previously CAN pass (output matches) locally: with session.x64quantprecision enabled, the original output still matches: VGG_16_int8.tar.gz, MaskRCNN-12-int8.tar.gz.

@yufenglee please let me know if I misused it or misunderstood. Thank you!

Going forward I will try to run further experiments on machines with VNNI support. @snnn, I will reach out to you next week to learn how to get a VNNI machine from you. Thank you for letting me know.

@mengniwang95
Contributor

were generated by machines with VNNI support, but I am not very sure. At least models merged after this PR (#526) should be.

@mengniwang95 May I ask whether it is correct that all existing test_data_set you have uploaded for -int8.onnx and -qdq.onnx models were generated by machines with VNNI support?

All existing test_data_set for -int8/qdq.onnx models were generated on machines without VNNI support.

@mszhanyi
Author

mszhanyi commented Oct 31, 2022

@yuslepukhin, the error mszhanyi showed us was generated on a CPU machine. Though it used WinML, it didn't use DirectML, so the error would have come from our CPU EP.

For FCN_ResNet, I've opened a new issue in onnxruntime: microsoft/onnxruntime#13509

@yuslepukhin

@yuslepukhin, the error mszhanyi showed us was generated on a CPU machine. Though it used WinML, it didn't use DirectML, so the error would have come from our CPU EP.

That may be true, but it is also a fact that it only takes place in WinML builds.

@jcwchen
Member

jcwchen commented Nov 2, 2022

Thank you @snnn for providing machines with VNNI support. I have just run a few experiments.

I created several test cases on a non-VNNI machine with session.x64quantprecision enabled for -qdq.onnx and -int8.onnx models, but they still hit the same output mismatch errors on VNNI machines.

If the output from VNNI machines cannot be reproduced on non-VNNI machines (@yufenglee please correct me if I am wrong. Thank you!), I slightly lean toward keeping the original output generated from non-VNNI machines, for two reasons:

  1. Contributors won't always have machines with VNNI support. I am not sure whether it is reasonable to ask them to generate the test data for -int8 or -qdq models on VNNI machines.
  2. The current GitHub Actions CI in this repo does not have VNNI support anyway. If the output were generated on VNNI machines, the current CI in the ONNX Model Zoo could not validate it (always an output mismatch).

If the ONNX Model Zoo keeps the original output data, perhaps in ORT testing we can regenerate those test data on a VNNI machine on the fly, or just skip them for now. In this repo, we can add a description of this behavior difference between VNNI and non-VNNI machines for quantized ONNX models to prevent confusion. If anyone has other concerns, feel free to bring them up. Thanks!

@snnn
Contributor

snnn commented Nov 3, 2022

Here VNNI means 8-bit support. If your machine does not have 8-bit support, I think you should not use it to generate or test 8-bit models.

@jcwchen
Member

jcwchen commented Nov 4, 2022

Here VNNI means 8-bit support. If your machine does not have 8-bit support, I think you should not use it to generate or test 8-bit models.

I thought there might be use cases for 8-bit models on non-VNNI machines, but I could be wrong since I am not really familiar with quantization.

@mengniwang95 since you and your team are the main contributors of quantized models to the ONNX Model Zoo (thank you for the contribution!), may I ask for your opinion on having the test_data_set for quantized models generated on VNNI machines? Do you have VNNI machines to generate them? Thanks!

@snnn
Contributor

snnn commented Nov 4, 2022

Generally speaking, for the same model and same inputs, I don't think it's wrong that different hardware may generate different outputs. There is no unique answer for machine learning tasks, but we need to define what kinds of differences are tolerable. In this case, how different are they? Do we think the results generated on non-VNNI machines are correct or not?

@mengniwang95
Contributor

@jcwchen I have a VNNI machine to generate the test_data_set. In my opinion, if you have a VNNI machine to do the pre-CI test, it is okay to upload a VNNI test_data_set; otherwise it is not necessary, since a VNNI int8 model will get different outputs on a VNNI machine and a non-VNNI machine with the same input.

@jcwchen
Member

jcwchen commented Nov 9, 2022

@mengniwang95 Thank you for the feedback!

Generally speaking, for the same model and same inputs, I don't think it's wrong that different hardware may generate different outputs. There is no unique answer for machine learning tasks, but we need to define what kinds of differences are tolerable. In this case, how different are they? Do we think the results generated on non-VNNI machines are correct or not?

The difference in ORT inference results between VNNI and non-VNNI machines for quantized models is too significant. I am not sure whether the results generated on non-VNNI machines are reasonable, but in any case the results produced by VNNI machines should be accurate and more reliable. It makes sense to me that providing results from VNNI machines is better for preventing user confusion. I have updated those failing outputs with a VNNI machine in this PR: #572. Going forward, for newly checked-in quantized models, ideally we should provide output results from VNNI machines as well.

However, my only concern is that the current CI cannot verify output generated on VNNI machines... For now, these outputs for quantized models will be skipped by the CIs, and I can manually test new PRs from my end on a local VNNI machine. To make it automatic, we will need a self-hosted machine with VNNI support in GitHub Actions. I will create an issue to track this work item.
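For illustration, regenerating the reference outputs on a VNNI machine could look roughly like the sketch below: run the uploaded inputs through the CPU EP and rewrite the output_*.pb files in the Model Zoo test_data_set layout. The directory name is a placeholder, and this is not the exact script used for #572.

```python
# Minimal sketch: regenerate output_*.pb reference files for a quantized model
# on a VNNI machine, keeping the existing input_*.pb files. Paths are placeholders.
import glob
import os

import onnx
import onnxruntime as ort
from onnx import numpy_helper

model_dir = "vgg16-12-int8"
data_dir = os.path.join(model_dir, "test_data_set_0")

inputs = []
for path in sorted(glob.glob(os.path.join(data_dir, "input_*.pb"))):
    t = onnx.TensorProto()
    with open(path, "rb") as f:
        t.ParseFromString(f.read())
    inputs.append(numpy_helper.to_array(t))

sess = ort.InferenceSession(os.path.join(model_dir, "model.onnx"),
                            providers=["CPUExecutionProvider"])
# Assumes input_*.pb files are stored in the same order as the session inputs.
feed = {meta.name: arr for meta, arr in zip(sess.get_inputs(), inputs)}
outputs = sess.run(None, feed)

for i, (meta, arr) in enumerate(zip(sess.get_outputs(), outputs)):
    tensor = numpy_helper.from_array(arr, name=meta.name)
    with open(os.path.join(data_dir, f"output_{i}.pb"), "wb") as f:
        f.write(tensor.SerializeToString())
```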

nums11 pushed a commit to microsoft/onnxruntime that referenced this issue May 23, 2023