Are threadgroups executed by cores or execution units on Apple GPUs?

In CUDA terminology, a threadgroup is executed by a "streaming multiprocessor." In Metal terminology, is a threadgroup executed by a "core" or by an "execution unit" (within a core)? I can't find any resource online that answers this directly, and the ones I have found imply different things.

Regardless of the answer, why do Apple GPUs have this two-layer architecture of cores and execution units, whereas Nvidia has a single layer of streaming multiprocessors? Are both layers visible/accessible to the Metal programmer, or only one layer (whichever corresponds to threadgroups)? What's the purpose of the other layer?

Answered by jcookie in 744685022

I don't know whether the Wikipedia description is accurate. An Apple GPU core is a hardware device that has its own register file and cache (probably its own separate L2 cache). Logically, it also contains four 32-wide (1024-bit) SIMD ALUs. We don't know what the hardware actually looks like; it could be multiple smaller ALUs operating in lockstep. Since a threadgroup is guaranteed to share the on-device memory, all threads in a threadgroup will execute on the same GPU core (with the caveat that the driver might decide to move the threadgroup to a different core for various reasons).

As to the difference in layers: logically, there is none. Nvidia marketing focuses on “CUDA cores” (a single lane of an ALU), while Apple marketing focuses on “cores”. There are obviously differences in hardware architecture. From what I understand, Apple's cores are the minimal hardware units that can be individually replicated. The closest Nvidia equivalent is probably the GPC. Functionally, however, an Apple core is not unlike Nvidia’s SM. Most of these differences are likely due to the different approaches to rendering.
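To make the threadgroup-to-core mapping a bit more concrete, here is a rough Python sketch of how a threadgroup breaks down into 32-wide SIMD-groups, all of which would be resident on the same core. The SIMD width of 32 comes from the description above; the function name and the example threadgroup sizes are purely illustrative assumptions, not anything Metal exposes under these names.

```python
import math

SIMD_WIDTH = 32  # lanes per SIMD ALU, as described in the answer above


def simdgroups_in_threadgroup(threads_per_threadgroup: int) -> int:
    """Number of 32-wide SIMD-groups needed to cover one threadgroup."""
    if threads_per_threadgroup <= 0:
        raise ValueError("threadgroup size must be positive")
    # A partially filled SIMD-group still occupies a whole SIMD-group.
    return math.ceil(threads_per_threadgroup / SIMD_WIDTH)


# Hypothetical threadgroup sizes, chosen only for illustration.
for size in (32, 100, 1024):
    groups = simdgroups_in_threadgroup(size)
    print(f"{size} threads -> {groups} SIMD-group(s) on one core")
```

A threadgroup that is not a multiple of 32 threads still rounds up to whole SIMD-groups, which is one reason threadgroup sizes are usually chosen as multiples of the SIMD width.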

Hi Joseph, what document are you referring to that mentions this two-layer architecture of "cores" and "execution units"?

On Wikipedia, for example, it says "The M1 integrates an Apple designed[17] eight-core (seven in some base models) graphics processing unit (GPU). Each GPU core is split into 16 Execution Units, which each contain eight Arithmetic Logic Units (ALUs)."
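For what it's worth, the Wikipedia figures and jcookie's description seem to count the same number of scalar lanes per core, just grouped differently. The small Python sketch below only does that arithmetic; the constant names are made up for illustration, and the 32-bit lane width is an assumption that happens to reproduce the "1024-bit" figure.

```python
# Figures quoted from Wikipedia (M1): 16 execution units per core, 8 ALUs each.
EUS_PER_CORE = 16
ALUS_PER_EU = 8

# Figures from jcookie's answer: 4 SIMD ALUs per core, 32 lanes each.
SIMD_ALUS_PER_CORE = 4
LANES_PER_SIMD_ALU = 32
BITS_PER_LANE = 32  # assumption: 32-bit lanes, giving the quoted 1024 bits

wikipedia_lanes = EUS_PER_CORE * ALUS_PER_EU
answer_lanes = SIMD_ALUS_PER_CORE * LANES_PER_SIMD_ALU
simd_alu_bits = LANES_PER_SIMD_ALU * BITS_PER_LANE

print(wikipedia_lanes, answer_lanes, simd_alu_bits)  # prints: 128 128 1024
```

Both ways of slicing the core come out to 128 lanes, so the "execution unit" layer may simply be a different grouping of the same ALUs rather than a separate level that the Metal programmer has to target.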

