Some models Failed on Windows CPU #568
Comments
Hi @mszhanyi, FCN_ResNet_50 and FCN_ResNet_101 are not quantized models... I was wondering what the errors are for them?
@jcwchen
From the log: https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=789802&view=results
Referenced in microsoft/onnxruntime#13407: 1. Update the model name structure in model_tests.cpp with the source name, to avoid `Condition test_param_names.count(param_name) == 0 failed. Duplicate parameterized test name 'BERT_Squad_opset10_CPU'`. 2. Skip some failed models (onnx/models#568).
Based on this line, it seems that their outputs mismatch even when the tolerance is taken into account. As I mentioned above, -int8 and -qdq models will produce different output without VNNI support, and their test data in the ONNX Model Zoo were originally produced by machines with VNNI support in the CPU. But for FCN_ResNet_50 and FCN_ResNet_101, their output should be correct and reproducible. I just tested them with onnxruntime==1.12.0 (CPU EP by default) and the inferred output passes against the original output. May I ask what kind of execution provider these tests were using?
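For reference, here is a minimal sketch of how such a comparison against the uploaded test_data_set can be reproduced with the onnxruntime Python API. The model and data paths are placeholders, and the tolerances are illustrative assumptions rather than the exact values the CI uses:

```python
import glob
import numpy as np
import onnx
import onnxruntime as ort
from onnx import numpy_helper

# Placeholder paths: point them at an extracted Model Zoo model and its test data folder.
model_path = "fcn-resnet50-11/model.onnx"
data_dir = "fcn-resnet50-11/test_data_set_0"

sess = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
input_names = [i.name for i in sess.get_inputs()]

# Load the protobuf inputs and reference outputs shipped with the model.
inputs = [numpy_helper.to_array(onnx.load_tensor(p))
          for p in sorted(glob.glob(f"{data_dir}/input_*.pb"))]
expected = [numpy_helper.to_array(onnx.load_tensor(p))
            for p in sorted(glob.glob(f"{data_dir}/output_*.pb"))]

actual = sess.run(None, dict(zip(input_names, inputs)))

# Assumed tolerances for the comparison; the actual test harness may use different ones.
for got, ref in zip(actual, expected):
    np.testing.assert_allclose(got, ref, rtol=1e-3, atol=1e-4)
print("outputs match within tolerance")
```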
Is this an onnxruntime-specific discussion? If so, it would be better to move it there.
@jcwchen Thank you for your reply. Regarding what you said above: did you create the test data with a VNNI machine? I updated my office desktop (DELL Precision 5820) last year, and it has avx512vnni too.
@jcwchen Some Azure machines have VNNI. I can give you one.
Sorry, I shouldn't have just said VNNI support; that causes confusion. To clarify: all -int8 and -qdq models in the ONNX Model Zoo were generated by machines with avx512f support, and they also passed CI (GitHub Action machines sometimes do have avx512f support) in the ONNX Model Zoo by running the ORT CPU EP with the uploaded test_data_set. That's why I was wondering whether the CI you are running, which has these output mismatch failures, has avx512f support or not.
This is the CPU flags for GitHub Action in ONNX Model Zoo CI:
Flags: 3dnow, 3dnowprefetch, abm, adx, aes, apic, avx, avx2, avx512bw, avx512cd, avx512dq, avx512f, avx512vl, bmi1, bmi2, clflush, clflushopt, cmov, cx16, cx8, de, dts, erms, f16c, fma, fpu, fxsr, hle, ht, hypervisor, ia64, invpcid, lahf_lm, mca, mce, mmx, movbe, msr, mtrr, osxsave, pae, pat, pcid, pclmulqdq, pge, pni, popcnt, pse, pse36, rdrnd, rdseed, rtm, sep, serial, smap, smep, ss, sse, sse2, sse4_1, sse4_2, ssse3, tm, tsc, vme, xsave
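As a quick sanity check of which machines are involved, the CPU flags can be inspected programmatically. A small sketch, assuming the third-party py-cpuinfo package (on Linux the same flags are visible in /proc/cpuinfo):

```python
# Sketch: report whether this machine advertises avx512f / avx512_vnni.
# Assumes `pip install py-cpuinfo`; flag names follow the lowercase /proc/cpuinfo style.
import cpuinfo

flags = set(cpuinfo.get_cpu_info().get("flags", []))

print("avx512f    :", "avx512f" in flags)
print("avx512_vnni:", "avx512_vnni" in flags)

if "avx512_vnni" not in flags:
    print("No VNNI: -int8/-qdq model outputs may differ from VNNI-generated test data.")
```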
There are access violation SEH exceptions occurring, which means it crashes. It is not a C++ exception.
I think this is related to the failures we have seen before in WinML builds.
@yufenglee, if I understand correctly, this indicates we have bugs in some kernels that make use of VNNI. @jcwchen's machines do not have VNNI, so they didn't actually run the int8 models in int8 mode. But the ORT team's machines now have VNNI, so they caught the error.
@yuslepukhin, the error mszhanyi showed us was generated from a CPU machine. Although it used WinML, it didn't use DirectML, so the error would have come from our CPU EP.
@snnn, this is not a bug. Quantization with U8S8 or S8S8 can saturate on machines without VNNI when computing quantized MatMul and Conv: https://onnxruntime.ai/docs/performance/quantization.html#when-and-why-do-i-need-to-try-u8u8. We added a session option to avoid the issue for quantization with the QDQ format, but it leads to worse latency: https://github.com/microsoft/onnxruntime/blob/0b0c51e02890ea35d0e5023681f2362421e44ceb/include/onnxruntime/core/session/onnxruntime_session_options_config_keys.h#L101. For the test cases, we should generate the output on VNNI or with this option on, and also run the model tests with the option enabled on machines without VNNI.
@jcwchen, what do you think?
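If it helps, below is a minimal sketch of enabling such a session config entry from the onnxruntime Python API. The key string used here is an assumption; the exact name should be taken from the onnxruntime_session_options_config_keys.h line linked above:

```python
import onnxruntime as ort

so = ort.SessionOptions()
# Assumed config key for the x64 quantization precision mode mentioned above;
# verify the exact string in onnxruntime_session_options_config_keys.h.
so.add_session_config_entry("session.x64quantprecision", "1")

# "model-int8.onnx" is a placeholder for one of the failing quantized models.
sess = ort.InferenceSession("model-int8.onnx", sess_options=so,
                            providers=["CPUExecutionProvider"])
```

As noted above, enabling this option avoids the saturation on non-VNNI hardware at the cost of some latency.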
Thank you everyone for the helpful pointers. I previously thought all existing test_data_set for -int8.onnx and -qdq.onnx models were generated by machines with VNNI support, but I am not very sure. At least models merged after this PR should be: #526. @mengniwang95 May I ask whether it is correct that all existing test_data_set you have uploaded for -int8.onnx and -qdq.onnx models were generated by machines with VNNI support?
I ran some experiments locally on my machine without VNNI support and enabled the session option mentioned above ...
@yufenglee please let me know if I misused it or misunderstood. Thank you! Going forward I will try to run further experiments on machines with VNNI support. @snnn I will reach out to you next week to learn how to get a VNNI machine from you. Thank you for letting me know.
All existing test_data_set for -int8/-qdq.onnx models were generated without VNNI support.
For FCN_ResNet, I've opened a new issue in onnxruntime: microsoft/onnxruntime#13509
That may be true, but it is also a fact that it only takes place in WinML builds.
Thank you @snnn for providing machines with VNNI support. I have just run a few experiments. I created several test cases on a non-VNNI machine with the session option enabled. If output from VNNI machines cannot be reproduced on a non-VNNI machine (@yufenglee please correct me if I am wrong. Thank you!), I slightly lean toward keeping the original output, which was generated on non-VNNI machines, for two reasons:
If ONNX Model Zoo still keeps the original output data, perhaps in ORT testing we can regenerate those test data on a VNNI machine on the fly, or just skip them for now. In this repo, we can add a description of this behavior difference between VNNI and non-VNNI machines for quantized ONNX models to prevent confusion. If anyone has other concerns, feel free to bring them up. Thanks!
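For the on-the-fly regeneration idea, a rough sketch of overwriting a model's stored output_*.pb files with a fresh inference run on the current machine (paths are placeholders, and this is only one possible way to do it):

```python
import glob
import onnx
import onnxruntime as ort
from onnx import numpy_helper

model_path = "model-int8.onnx"   # placeholder
data_dir = "test_data_set_0"     # placeholder

sess = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
input_names = [i.name for i in sess.get_inputs()]
output_names = [o.name for o in sess.get_outputs()]

inputs = [numpy_helper.to_array(onnx.load_tensor(p))
          for p in sorted(glob.glob(f"{data_dir}/input_*.pb"))]

# Re-run inference on this machine (e.g. a VNNI machine) and overwrite the
# reference outputs so they match the local hardware's behavior.
for idx, arr in enumerate(sess.run(None, dict(zip(input_names, inputs)))):
    tensor = numpy_helper.from_array(arr, name=output_names[idx])
    onnx.save_tensor(tensor, f"{data_dir}/output_{idx}.pb")
```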
Here VNNI means 8-bit support. If your machine does not have 8-bit support, I think you should not use it to generate or test 8-bit models.
I thought there might be use cases for 8-bit models on non-VNNI machines, but I could be wrong since I am not really familiar with quantization. @mengniwang95, since you and your team are the main contributors of quantized models in the ONNX Model Zoo (thank you for the contribution!), may I ask your opinion on having test_data_set for quantized models generated on VNNI machines? Do you have VNNI machines to generate them? Thanks!
Generally speaking, for the same model and the same inputs, I don't think it's wrong if different hardware generates different outputs. There is no unique answer for machine learning tasks, but we need to define what kind of differences are tolerable. In this case, how different are they? Do we think the results generated on non-VNNI machines are correct or not?
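To put a number on "how different", a small sketch that summarizes the gap between two output tensors, for example one produced on a VNNI machine and one on a non-VNNI machine (the sample arrays are synthetic stand-ins):

```python
import numpy as np

def summarize_diff(actual, reference):
    """Report simple error metrics between two output tensors."""
    a = np.asarray(actual, dtype=np.float64)
    r = np.asarray(reference, dtype=np.float64)
    abs_diff = np.abs(a - r)
    rel_diff = abs_diff / (np.abs(r) + 1e-12)  # avoid division by zero
    return {
        "max_abs_diff": float(abs_diff.max()),
        "mean_abs_diff": float(abs_diff.mean()),
        "max_rel_diff": float(rel_diff.max()),
        "fraction_abs_diff_gt_1e-3": float((abs_diff > 1e-3).mean()),
    }

# Synthetic example standing in for the two machines' outputs.
vnni_out = np.random.rand(1, 1000).astype(np.float32)
non_vnni_out = vnni_out + np.random.normal(scale=1e-3, size=vnni_out.shape).astype(np.float32)
print(summarize_diff(non_vnni_out, vnni_out))
```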
@jcwchen I have a VNNI machine to generate test_data_set. In my opinion, if you have a VNNI machine to do the pre-CI test, it is okay to upload VNNI test_data_set; otherwise it is not necessary, since a VNNI int8 model will get different outputs on VNNI and non-VNNI machines with the same input.
@mengniwang95 Thank you for the feedback!
The difference in ORT inference results between VNNI and non-VNNI machines for quantized models is too significant. I am not sure whether the results generated on non-VNNI machines are reasonable, but in any case the results produced on VNNI machines should be accurate and more reliable. It makes sense to me that providing results from VNNI machines is better to prevent user confusion. I have updated those failed outputs with a VNNI machine in this PR: #572. Going forward, for newly checked-in quantized models, ideally we should provide output results from VNNI machines as well. However, my only concern is that the current CI cannot verify output generated on VNNI machines... For now, these outputs for quantized models will be skipped by the CIs, and I can manually test new PRs on my end with a local VNNI machine. To make it automatic, we will need a self-hosted machine with VNNI support in GitHub Actions. I will create an issue to track this work item.
Ask a Question
Question
"VGG_16_int8_opset12_zoo_CPU"
"SSD_int8_opset12_zoo_CPU"
"ShuffleNet_v2_int8_opset12_zoo_CPU"
"ResNet50_int8_opset12_zoo_CPU"
"ResNet50_qdq_opset12_zoo_CPU"
"MobileNet_v2_1_0_qdq_opset12_zoo_CPU"
"MobileNet_v2_1_0_int8_opset12_zoo_CPU"
"Inception_1_int8_opset12_zoo_CPU"
"Faster_R_CNN_R_50_FPN_int8_opset12_zoo_CPU"
"BERT_Squad_int8_opset12_zoo_CPU"
"EfficientNet_Lite4_qdq_opset11_zoo_CPU"
"EfficientNet_Lite4_int8_opset11_zoo_CPU"
"FCN_ResNet_50_opset11_zoo_CPU"
"FCN_ResNet_101_opset11_zoo_CPU"
Are these known issues?
Exception Message