How many warps can be run in parallel on a single shader core?

The Metal feature set tables specify that, beginning with the Apple4 family, the "Maximum threads per threadgroup" is 1024. Given that a single threadgroup is guaranteed to be run on the same GPU shader core, it means that a shader core of any new Apple GPU must be capable of running at least 1024/32 = 32 warps in parallel.

From the WWDC session "Scale compute workloads across Apple GPUs" (at 6:17):

For relatively complex kernels, 1K to 2K concurrent threads per shader core is considered a very good occupancy.

The cited sentence suggests that a single shader core is capable of running at least 2K threads in parallel (I assume 2K means 2048), which works out to 2048/32 = 64 warps in parallel.
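To make the arithmetic explicit, here is a minimal sketch of the two estimates above. It assumes a SIMD width of 32 threads per warp and reads "1K to 2K" as 1024 to 2048; the helper name `simd_groups_for` is made up for illustration and is not part of any Metal API.

```python
SIMD_WIDTH = 32  # assumption: threads per warp (SIMD-group) on Apple GPUs


def simd_groups_for(threads: int, simd_width: int = SIMD_WIDTH) -> int:
    """Return how many warps are needed to hold `threads` threads, rounded up."""
    return -(-threads // simd_width)  # ceiling division on integers


# "Maximum threads per threadgroup" from the Metal feature set tables.
max_threads_per_threadgroup = 1024

# "1K to 2K concurrent threads" from the WWDC quote, read as 1024 and 2048.
good_occupancy_threads = (1024, 2048)

print(simd_groups_for(max_threads_per_threadgroup))            # 32
print([simd_groups_for(t) for t in good_occupancy_threads])    # [32, 64]
```

These numbers only restate my estimates as warp counts; they say nothing about whether all of those warps actually execute at the same instant.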

However, I am curious what the theoretical maximum number of warps running in parallel on a single shader core is (it sounds like it is more than 64). The WWDC session describes 1K to 2K threads as only "very good" occupancy. How many threads would be "the best possible" occupancy?

Given that a single threadgroup is guaranteed to be run on the same GPU shader core, it means that a shader core of any new Apple GPU must be capable of running at least 1024/32 = 32 warps in parallel.

This calculation is based on the incorrect assumption that all threads in a threadgroup are guaranteed to execute simultaneously. Only the threads in a SIMD group are guaranteed to execute simultaneously.
