4x16s4 fp32-gemm kernel has better performance than the default (5x16) kernel on Meteor Lake #6480

Open
@xujuntwt95329

Description

XNNPACK uses the 5x16 fp32-gemm kernel by default for x86_fma3, but we found that the 4x16s4 kernel performs better on a Meteor Lake CPU (Intel(R) Core(TM) Ultra 7 155H):

| benchmark | 5x16 (us) | 4x16s4 (us) | Reduction in inference time (%) |
| --- | --- | --- | --- |
| FP32MobileNetV1/T:1/real_time | 16193 | 10775 | 33.46 |
| FP32MobileNetV2/T:1/real_time | 8809 | 6626 | 24.78 |
| FP32MobileNetV3Large/T:1/real_time | 7756 | 6052 | 21.97 |
| FP32MobileNetV3Small/T:1/real_time | 2180 | 1970 | 9.63 |
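
For reference, the reduction column appears to be computed as `(t_5x16 - t_4x16s4) / t_5x16 * 100`. The short sketch below recomputes it from the timings reported above; the helper name `reduction_percent` and the dictionary layout are illustrative only and are not part of XNNPACK or its benchmark tooling.

```python
# Timings in microseconds, copied from the table above:
# benchmark name -> (5x16 time, 4x16s4 time).
timings_us = {
    "FP32MobileNetV1": (16193, 10775),
    "FP32MobileNetV2": (8809, 6626),
    "FP32MobileNetV3Large": (7756, 6052),
    "FP32MobileNetV3Small": (2180, 1970),
}


def reduction_percent(baseline_us, candidate_us):
    """Relative reduction in inference time, as a percentage of the baseline.

    Hypothetical helper for illustration; not an XNNPACK function.
    """
    return (baseline_us - candidate_us) / baseline_us * 100.0


for name, (baseline, candidate) in timings_us.items():
    print(f"{name}: {reduction_percent(baseline, candidate):.2f}% reduction")
```

Note that a reduction in time is not the same as a speedup factor: the MobileNetV1 row, for example, corresponds to `16193 / 10775`, roughly 1.5x.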

Here is the code to reproduce the above data: https://github.com/xujuntwt95329/XNNPACK/tree/0143aab98634c866b319decca52590e1eb54b9dd

We can submit a PR if this is welcome.
