Some models Failed on Windows CPU #568

Closed
mszhanyi opened this issue Oct 22, 2022 · 24 comments · Fixed by #572
@mszhanyi

Ask a Question

Question

"VGG_16_int8_opset12_zoo_CPU"
"SSD_int8_opset12_zoo_CPU"
"ShuffleNet_v2_int8_opset12_zoo_CPU"
"ResNet50_int8_opset12_zoo_CPU"
"ResNet50_qdq_opset12_zoo_CPU"
"MobileNet_v2_1_0_qdq_opset12_zoo_CPU"
"MobileNet_v2_1_0_int8_opset12_zoo_CPU"
"Inception_1_int8_opset12_zoo_CPU"
"Faster_R_CNN_R_50_FPN_int8_opset12_zoo_CPU"
"BERT_Squad_int8_opset12_zoo_CPU"
"EfficientNet_Lite4_qdq_opset11_zoo_CPU"
"EfficientNet_Lite4_int8_opset11_zoo_CPU"
"FCN_ResNet_50_opset11_zoo_CPU"
"FCN_ResNet_101_opset11_zoo_CPU"

Are these known issues?

Exception Message

2022-10-21T14:37:12.6254993Z 6: [  FAILED  ] ModelTests/ModelTest.Run/VGG_16_int8_opset12zoo_CPU, where GetParam() = (00000202C3999A30, 4-byte object <01-00 00-00>)
2022-10-21T14:37:12.6255920Z 6: [  FAILED  ] ModelTests/ModelTest.Run/SSD_int8_opset12zoo_CPU, where GetParam() = (00000202C39989F0, 4-byte object <01-00 00-00>)
2022-10-21T14:37:12.6256645Z 6: [  FAILED  ] ModelTests/ModelTest.Run/ShuffleNet_v2_int8_opset12zoo_CPU, where GetParam() = (00000202C3998D10, 4-byte object <01-00 00-00>)
2022-10-21T14:37:12.6257265Z 6: [  FAILED  ] ModelTests/ModelTest.Run/ResNet50_int8_opset12zoo_CPU, where GetParam() = (00000202C3998E50, 4-byte object <01-00 00-00>)
2022-10-21T14:37:12.6257873Z 6: [  FAILED  ] ModelTests/ModelTest.Run/ResNet50_qdq_opset12zoo_CPU, where GetParam() = (00000202C3999CB0, 4-byte object <01-00 00-00>)
2022-10-21T14:37:12.6258699Z 6: [  FAILED  ] ModelTests/ModelTest.Run/MobileNet_v2_1_0_qdq_opset12zoo_CPU, where GetParam() = (00000202C3998A90, 4-byte object <01-00 00-00>)
2022-10-21T14:37:12.6259382Z 6: [  FAILED  ] ModelTests/ModelTest.Run/MobileNet_v2_1_0_int8_opset12zoo_CPU, where GetParam() = (00000202C3999530, 4-byte object <01-00 00-00>)
2022-10-21T14:37:12.6260101Z 6: [  FAILED  ] ModelTests/ModelTest.Run/Inception_1_int8_opset12zoo_CPU, where GetParam() = (00000202C3998DB0, 4-byte object <01-00 00-00>)
2022-10-21T14:37:12.6260846Z 6: [  FAILED  ] ModelTests/ModelTest.Run/Faster_R_CNN_R_50_FPN_int8_opset12zoo_CPU, where GetParam() = (00000202C3998770, 4-byte object <01-00 00-00>)
2022-10-21T14:37:12.6261476Z 6: [  FAILED  ] ModelTests/ModelTest.Run/BERT_Squad_int8_opset12zoo_CPU, where GetParam() = (00000202CF60B8E0, 4-byte object <01-00 00-00>)
2022-10-21T14:37:12.6262077Z 6: [  FAILED  ] ModelTests/ModelTest.Run/FCN_ResNet_50_opset11zoo_CPU, where GetParam() = (0000020281105630, 4-byte object <01-00 00-00>)
2022-10-21T14:37:12.6262690Z 6: [  FAILED  ] ModelTests/ModelTest.Run/FCN_ResNet_101_opset11zoo_CPU, where GetParam() = (0000020281105E50, 4-byte object <01-00 00-00>)
2022-10-21T14:37:12.6263319Z 6: [  FAILED  ] ModelTests/ModelTest.Run/EfficientNet_Lite4_qdq_opset11zoo_CPU, where GetParam() = (0000020281106A30, 4-byte object <01-00 00-00>)
2022-10-21T14:37:12.6263957Z 6: [  FAILED  ] ModelTests/ModelTest.Run/EfficientNet_Lite4_int8_opset11zoo_CPU, where GetParam() = (00000202811054F0, 4-byte object <01-00 00-00>)
@jcwchen
Member

jcwchen commented Oct 23, 2022

Hi @mszhanyi,
Thanks for the report. What kinds of failures are these? Are they output mismatches? Regarding the -int8 or -qdq models, their outputs can differ across CPUs, for example with and without VNNI support. (See #522)

But FCN_ResNet_50 and FCN_ResNet_101 are not quantized models... I was wondering what the errors are for them?

@mszhanyi
Author

mszhanyi commented Oct 23, 2022

@jcwchen
The error message is

2022-10-21T14:25:40.4140241Z 6: [ RUN      ] ModelTests/ModelTest.Run/FCN_ResNet_50_opset11zoo_CPU
2022-10-21T14:25:41.5542157Z 6: D:\a\_work\1\s\winml\test\model\model_tests.cpp(131): error: Expected: results = session.Evaluate(binding, L"Testing") doesn't throw an exception.
2022-10-21T14:25:41.5549681Z 6:   Actual: it throws.
2022-10-21T14:25:41.5911396Z 6: unknown file: error: SEH exception with code 0xc0000005 thrown in the test body.
2022-10-21T14:25:41.5912289Z 6: [  FAILED  ] ModelTests/ModelTest.Run/FCN_ResNet_50_opset11zoo_CPU, where GetParam() = (0000020281105630, 4-byte object <01-00 00-00>) (1177 ms)

from the log
https://dev.azure.com/onnxruntime/2a773b67-e88b-4c7f-9fc0-87d31fea8ef2/_apis/build/builds/789802/logs/23

https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=789802&view=results

mszhanyi added a commit to microsoft/onnxruntime that referenced this issue Oct 25, 2022
…13407)

### Description
1. update model name structure in model_tests.cpp with source name, to avoid
   `Condition test_param_names.count(param_name) == 0 failed. Duplicate parameterized test name 'BERT_Squad_opset10_CPU'`
2. skip some failed models onnx/models#568


@jcwchen
Member

jcwchen commented Oct 25, 2022

Based on this line, it seems that their outputs mismatch even after accounting for tolerance. As I mentioned above, -int8 and -qdq models will produce different output without VNNI support, and originally their test data in the ONNX Model Zoo were produced by machines whose CPUs have VNNI support.

But for FCN_ResNet_50 and FCN_ResNet_101, their output should be correct and reproducible. I just tested them with onnxruntime==1.12.0 (CPU EP by default) and the original output matches the inferred one. May I ask what execution provider these tests were using?
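For context, a minimal sketch of the kind of local check described above: run a Model Zoo model with its bundled test_data_set on the ORT CPU EP and compare against the stored reference outputs. The directory layout follows the Model Zoo convention (model.onnx plus test_data_set_0/input_*.pb and output_*.pb); the model path and tolerances here are illustrative assumptions, not the exact script used in this thread.

```python
# Minimal sketch: validate a Model Zoo model against its bundled test data on
# the CPU EP. The model directory path and tolerances are illustrative only.
import glob
import os

import numpy as np
import onnx
import onnxruntime as ort
from onnx import numpy_helper

model_dir = "fcn-resnet50-11"  # hypothetical extracted archive
data_dir = os.path.join(model_dir, "test_data_set_0")

def load_tensors(pattern):
    # Each .pb file is a serialized onnx.TensorProto.
    arrays = []
    for path in sorted(glob.glob(os.path.join(data_dir, pattern))):
        t = onnx.TensorProto()
        with open(path, "rb") as f:
            t.ParseFromString(f.read())
        arrays.append(numpy_helper.to_array(t))
    return arrays

inputs = load_tensors("input_*.pb")
expected = load_tensors("output_*.pb")

sess = ort.InferenceSession(os.path.join(model_dir, "model.onnx"),
                            providers=["CPUExecutionProvider"])
# Assumes input_*.pb files are stored in the same order as the session inputs.
feed = {meta.name: arr for meta, arr in zip(sess.get_inputs(), inputs)}
actual = sess.run(None, feed)

for exp, act in zip(expected, actual):
    np.testing.assert_allclose(exp, act, rtol=1e-3, atol=1e-5)
print("outputs match within tolerance")
```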

@yuslepukhin

Is this an onnxruntime-specific discussion? If so, it would be better to move it there.

@mszhanyi
Author

mszhanyi commented Oct 28, 2022

@jcwchen Thanks for your reply.
The onnxruntime Windows CI runs on Azure Dsv5 machines; the CPU flags are
avx512bitalg, avx512bw, avx512cd, avx512dq, avx512f, avx512ifma, avx512vbmi, avx512vbmi2, avx512vl, avx512vnni, avx512vpopcntdq

And I noticed you said "Unfortunately I don't have machines with avx512_vnni support", so I am not sure whether machines with avx512_vnni and machines with only avx512f (without avx512_vnni) produce the same outputs.

Did you create the test data on an avx512f machine? I guess that may be the reason.

I upgraded my office desktop (a DELL Precision 5820) last year, and it has avx512vnni too.
cc @snnn
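As a side note, a quick way to check which of these flags a given host actually exposes, as a minimal sketch assuming the third-party py-cpuinfo package is installed (flag spellings vary by platform):

```python
# Minimal sketch: print whether the local CPU reports AVX-512F and AVX-512 VNNI.
# Requires the third-party py-cpuinfo package (pip install py-cpuinfo).
import cpuinfo

flags = set(cpuinfo.get_cpu_info().get("flags", []))
print("avx512f:    ", "avx512f" in flags)
# Linux typically reports avx512_vnni; some tools drop the underscore.
print("avx512_vnni:", bool({"avx512_vnni", "avx512vnni"} & flags))
```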

@snnn
Contributor

snnn commented Oct 28, 2022

@jcwchen Some Azure machines have VNNI. I can give you one.

@jcwchen
Member

jcwchen commented Oct 28, 2022

Sorry, I shouldn't have just said "VNNI support"; that causes confusion. To clarify: all -int8 and -qdq models in the ONNX Model Zoo were generated by machines with avx512f support, and they also passed CI (GitHub Actions machines sometimes do have avx512f support) in the ONNX Model Zoo by running the ORT CPU EP with the uploaded test_data_set. That's why I was wondering whether the CI you are running, which hits these output mismatch failures, has avx512f support or not.

These are the CPU flags of the GitHub Actions machines in the ONNX Model Zoo CI:

Flags: 3dnow, 3dnowprefetch, abm, adx, aes, apic, avx, avx2, avx512bw, avx512cd, avx512dq, avx512f, avx512vl, bmi1, bmi2, clflush, clflushopt, cmov, cx16, cx8, de, dts, erms, f16c, fma, fpu, fxsr, hle, ht, hypervisor, ia64, invpcid, lahf_lm, mca, mce, mmx, movbe, msr, mtrr, osxsave, pae, pat, pcid, pclmulqdq, pge, pni, popcnt, pse, pse36, rdrnd, rdseed, rtm, sep, serial, smap, smep, ss, sse, sse2, sse4_1, sse4_2, ssse3, tm, tsc, vme, xsave

@yuslepukhin

yuslepukhin commented Oct 28, 2022

There are access-violation SEH exceptions occurring. This means it crashes; it is not a C++ exception.
This has been happening for some time in the packaging pipeline, and only with WinML builds.

@yuslepukhin

yuslepukhin commented Oct 28, 2022 via email

@snnn
Contributor

snnn commented Oct 28, 2022

@yufenglee, if I understand correctly, this indicates we have bugs in some kernels that make use of VNNI. @jcwchen's machines do not have VNNI, so they didn't actually run the int8 models in int8 mode. But the ORT team's machines now have VNNI, so they caught the error.

@snnn
Contributor

snnn commented Oct 28, 2022

@yuslepukhin, the error mszhanyi showed us was generated on a CPU machine. Though it used WinML, it didn't use DirectML, so the error would have come from our CPU EP.

@yufenglee

@yufenglee, if I understand correctly, this indicates we have bugs in some kernels that make use of VNNI. @jcwchen's machines do not have VNNI, so they didn't actually run the int8 models in int8 mode. But the ORT team's machines now have VNNI, so they caught the error.

@snnn, this is not a bug. Quantization with U8S8 or S8S8 can saturate on machines without VNNI when computing quantized MatMul and Conv: https://onnxruntime.ai/docs/performance/quantization.html#when-and-why-do-i-need-to-try-u8u8.

We added a session option to avoid the issue for quantization with the QDQ format, but it leads to worse latency: https://github.com/microsoft/onnxruntime/blob/0b0c51e02890ea35d0e5023681f2362421e44ceb/include/onnxruntime/core/session/onnxruntime_session_options_config_keys.h#L101

For the test case, we should generate the output on a VNNI machine or with this option on, and also run the model tests with the option enabled on machines without VNNI.
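For reference, this is roughly how that option can be set from the onnxruntime Python API; the config key string is the one referenced later in this thread, while the model filename is just a placeholder (a sketch, not the exact test harness used here):

```python
# Minimal sketch: enable the x64 quantization-precision option before creating
# a session, to reduce u8s8/s8s8 saturation on machines without VNNI
# (at the cost of extra latency). The model filename is a placeholder.
import onnxruntime as ort

so = ort.SessionOptions()
so.add_session_config_entry("session.x64quantprecision", "1")

sess = ort.InferenceSession("resnet50-v1-12-int8.onnx", so,
                            providers=["CPUExecutionProvider"])
```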

@snnn
Contributor

snnn commented Oct 28, 2022

For the test case, we should generate the output on a VNNI machine or with this option on,

@jcwchen , what do you think?

linnealovespie pushed a commit to microsoft/onnxruntime that referenced this issue Oct 28, 2022
@jcwchen
Member

jcwchen commented Oct 29, 2022

Thank you everyone for the helpful pointers. I previously thought all of the existing test_data_set for -int8.onnx and -qdq.onnx models were generated by machines with VNNI support, but I am not very sure. At least models merged after this PR (#526) should be.

@mengniwang95 May I ask whether it is correct that all existing test_data_set you have uploaded for -int8.onnx and -qdq.onnx models were generated by machines with VNNI support?

@jcwchen
Member

jcwchen commented Oct 29, 2022

I ran some experiments locally on my machine without VNNI support and enabled so.add_session_config_entry("session.x64quantprecision", "1") for the InferenceSession, but the results seem similar:

  • Original model+test_data_set which previously CANNOT pass (output mismatch) locally: with session.x64quantprecision enabled, the original output still mismatches: yolov3-12-int8.tar.gz.
  • Original model+test_data_set which previously CAN pass (output matches) locally: with session.x64quantprecision enabled, the original output still matches: VGG_16_int8.tar.gz, MaskRCNN-12-int8.tar.gz.

@yufenglee please let me know if I misused it or misunderstood. Thank you!

Going forward I will try to run further experiments on machines with VNNI support. @snnn, I will reach out to you next week to learn how to get a VNNI machine from you. Thank you for letting me know.

@mengniwang95
Contributor

were generated by machines with VNNI support, but I am not very sure. At least models merged after this PR (#526) should be.

@mengniwang95 May I ask whether it is correct that all existing test_data_set you have uploaded for -int8.onnx and -qdq.onnx models were generated by machines with VNNI support?

All existing test_data_set for -int8/qdq.onnx models were generated on machines without VNNI support.

@mszhanyi
Author

mszhanyi commented Oct 31, 2022

@yuslepukhin, the error mszhanyi showed us was generated on a CPU machine. Though it used WinML, it didn't use DirectML, so the error would have come from our CPU EP.

For FCN_ResNet, I've opened a new issue in onnxruntime: microsoft/onnxruntime#13509

@yuslepukhin

@yuslepukhin, the error mszhanyi showed us was generated on a CPU machine. Though it used WinML, it didn't use DirectML, so the error would have come from our CPU EP.

That may be true, but it is also a fact that it only takes place in WinML builds.

@jcwchen
Member

jcwchen commented Nov 2, 2022

Thank you @snnn for providing machines with VNNI support. I have just run a few experiments.

I created several test cases on a non-VNNI machine with session.x64quantprecision enabled for -qdq.onnx and -int8.onnx models, but they still hit the same output mismatch errors on VNNI machines.

If the output from VNNI machines cannot be reproduced on non-VNNI machines (@yufenglee please correct me if I am wrong. Thank you!), I slightly lean toward keeping the original output generated from non-VNNI machines, for two reasons:

  1. Contributors won't always have machines with VNNI support. I am not sure whether it is reasonable to ask them to generate the test data for -int8 or -qdq models on VNNI machines.
  2. The current GitHub Actions CI in this repo does not have VNNI support anyway. If the output were generated on VNNI machines, the current CI in the ONNX Model Zoo could not validate it (always an output mismatch).

If the ONNX Model Zoo keeps the original output data, perhaps in ORT testing we can regenerate those test data on a VNNI machine on the fly, or just skip them for now. In this repo, we can add a description of this behavior difference between VNNI and non-VNNI machines for quantized ONNX models to prevent confusion. If anyone has other concerns, feel free to bring them up. Thanks!

@snnn
Contributor

snnn commented Nov 3, 2022

Here VNNI means 8-bit support. If your machine does not have 8-bit support, I think you should not use it to generate or test 8-bit models.

@jcwchen
Member

jcwchen commented Nov 4, 2022

Here VNNI means 8-bit support. If your machine does not have 8-bit support, I think you should not use it to generate or test 8-bit models.

I thought there might be use cases for 8-bit models on non-VNNI machines, but I could be wrong since I am not really familiar with quantization.

@mengniwang95 since you and your team are the main contributors of quantized models to the ONNX Model Zoo (thank you for the contribution!), may I ask for your opinion on having the test_data_set for quantized models generated on VNNI machines? Do you have VNNI machines to generate them? Thanks!

@snnn
Contributor

snnn commented Nov 4, 2022

Generally speaking, for the same model and same inputs, I don't think it's wrong that different hardware may generate different outputs. There is no unique answer for machine learning tasks, but we need to define what kinds of differences are tolerable. In this case, how different are they? Do we think the results generated on non-VNNI machines are correct or not?

@mengniwang95
Contributor

@jcwchen I have a VNNI machine to generate the test_data_set. In my opinion, if you have a VNNI machine to do the pre-CI test, it is okay to upload a VNNI test_data_set; otherwise it is not necessary, since a VNNI int8 model will get different outputs on a VNNI machine and a non-VNNI machine with the same input.

@jcwchen
Member

jcwchen commented Nov 9, 2022

@mengniwang95 Thank you for the feedback!

Generally speaking, for the same model and same inputs, I don't think it's wrong that different hardware may generate different outputs. There is no unique answer for machine learning tasks, but we need to define what kinds of differences are tolerable. In this case, how different are they? Do we think the results generated on non-VNNI machines are correct or not?

The difference in ORT inference results between VNNI and non-VNNI machines for quantized models is too significant. I am not sure whether the results generated on non-VNNI machines are reasonable, but in any case the results produced by VNNI machines should be accurate and more reliable. It makes sense to me that providing results from VNNI machines is better for preventing user confusion. I have updated those failing outputs with a VNNI machine in this PR: #572. Going forward, for newly checked-in quantized models, ideally we should provide output results from VNNI machines as well.

However, my only concern is that the current CI cannot verify output generated on VNNI machines... For now, these outputs for quantized models will be skipped by the CIs, and I can manually test new PRs from my end on a local VNNI machine. To make it automatic, we will need a self-hosted machine with VNNI support in GitHub Actions. I will create an issue to track this work item.
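For illustration, regenerating the reference outputs on a VNNI machine could look roughly like the sketch below: run the uploaded inputs through the CPU EP and rewrite the output_*.pb files in the Model Zoo test_data_set layout. The directory name is a placeholder, and this is not the exact script used for #572.

```python
# Minimal sketch: regenerate output_*.pb reference files for a quantized model
# on a VNNI machine, keeping the existing input_*.pb files. Paths are placeholders.
import glob
import os

import onnx
import onnxruntime as ort
from onnx import numpy_helper

model_dir = "vgg16-12-int8"
data_dir = os.path.join(model_dir, "test_data_set_0")

inputs = []
for path in sorted(glob.glob(os.path.join(data_dir, "input_*.pb"))):
    t = onnx.TensorProto()
    with open(path, "rb") as f:
        t.ParseFromString(f.read())
    inputs.append(numpy_helper.to_array(t))

sess = ort.InferenceSession(os.path.join(model_dir, "model.onnx"),
                            providers=["CPUExecutionProvider"])
# Assumes input_*.pb files are stored in the same order as the session inputs.
feed = {meta.name: arr for meta, arr in zip(sess.get_inputs(), inputs)}
outputs = sess.run(None, feed)

for i, (meta, arr) in enumerate(zip(sess.get_outputs(), outputs)):
    tensor = numpy_helper.from_array(arr, name=meta.name)
    with open(os.path.join(data_dir, f"output_{i}.pb"), "wb") as f:
        f.write(tensor.SerializeToString())
```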

nums11 pushed a commit to microsoft/onnxruntime that referenced this issue May 23, 2023