4x16s4 fp32-gemm kernel performs better than the default (5x16) kernel on Meteor Lake

By default, XNNPACK uses the 5x16 fp32-gemm kernel for `x86_fma3`, but we found that the 4x16s4 kernel performs better on a Meteor Lake CPU (`Intel(R) Core(TM) Ultra 7 155H`):

|  Benchmark  |  5x16 (us)  |  4x16s4 (us) | Reduction in inference time (%) |
| :---: |  :---: |  :---: | :---: |
| FP32MobileNetV1/T:1/real_time | 16193 | 10775 | 33.46 |
| FP32MobileNetV2/T:1/real_time | 8809 | 6626 | 24.78 |
| FP32MobileNetV3Large/T:1/real_time | 7756 | 6052 | 21.97 |
| FP32MobileNetV3Small/T:1/real_time | 2180 | 1970 | 9.63 |
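
For reference, the last column can be recomputed from the two timing columns as `(t_5x16 - t_4x16s4) / t_5x16 * 100`. The snippet below is only a small sketch of that arithmetic using the numbers from the table; the dictionary and function names are illustrative and are not part of XNNPACK or its benchmark tooling.

```python
# Timings in microseconds, copied from the table above: (5x16, 4x16s4).
timings_us = {
    "FP32MobileNetV1": (16193, 10775),
    "FP32MobileNetV2": (8809, 6626),
    "FP32MobileNetV3Large": (7756, 6052),
    "FP32MobileNetV3Small": (2180, 1970),
}


def reduction_percent(baseline_us: float, candidate_us: float) -> float:
    """Relative reduction in inference time, in percent of the baseline."""
    return (baseline_us - candidate_us) / baseline_us * 100.0


# Reduction per model, rounded to two decimals like the table.
reductions = {
    name: round(reduction_percent(base, cand), 2)
    for name, (base, cand) in timings_us.items()
}

# The same data expressed as a speedup factor (baseline time / candidate time).
speedups = {name: base / cand for name, (base, cand) in timings_us.items()}

for name in timings_us:
    print(f"{name}: {reductions[name]:.2f}% less time, {speedups[name]:.2f}x speedup")
```
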

Here is the code to reproduce the above data: https://github.com/xujuntwt95329/XNNPACK/tree/0143aab98634c866b319decca52590e1eb54b9dd

We can submit a PR if this change is welcome.

