In CUDA terminology, a thread block (the equivalent of a Metal threadgroup) is executed by a "streaming multiprocessor." In Metal terminology, is a threadgroup executed by a "core" or by an "execution unit" (within a core)? I can't find a definitive answer online, and the resources I have found imply different things.
Regardless of the answer, why do Apple GPUs have this two-layer architecture of cores and execution units, whereas Nvidia has a single layer of streaming multiprocessors? Are both layers visible/accessible to the Metal programmer, or only one layer (whichever corresponds to threadgroups)? What's the purpose of the other layer?
I don't know whether the Wikipedia description is accurate. An Apple GPU core is a hardware unit that has its own register file and cache (probably its own separate L2 cache). Logically, it also contains four 32-wide (1024-bit) SIMD ALUs, although we don't know what the hardware actually looks like; it could be multiple smaller ALUs operating in lockstep. Since a threadgroup is guaranteed to share the on-device memory, all threads in a threadgroup will execute on the same GPU core (with the caveat that the driver might decide to move the threadgroup to a different core for various reasons).
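To make the mapping concrete, here is a rough back-of-the-envelope sketch in Python. It only does arithmetic with the figures quoted above (four SIMD ALUs per core, 32 lanes each), which are themselves unconfirmed, and the function names are made up for illustration. The even split across ALUs is an assumption; the real scheduler may behave differently.

```python
# Figures from the description above; treat them as assumptions,
# not confirmed hardware facts.
SIMD_WIDTH = 32          # lanes per SIMD ALU
SIMD_ALUS_PER_CORE = 4   # SIMD ALUs per GPU core


def lanes_per_core():
    """Total ALU lanes in one core under the assumed layout."""
    return SIMD_WIDTH * SIMD_ALUS_PER_CORE


def simdgroups_in_threadgroup(threads_per_threadgroup):
    """Number of 32-wide SIMD-groups needed to cover one threadgroup."""
    return (threads_per_threadgroup + SIMD_WIDTH - 1) // SIMD_WIDTH


def simdgroups_per_alu(threads_per_threadgroup):
    """SIMD-groups each ALU would hold if they were spread evenly.

    The even split is a hypothetical simplification.
    """
    groups = simdgroups_in_threadgroup(threads_per_threadgroup)
    return (groups + SIMD_ALUS_PER_CORE - 1) // SIMD_ALUS_PER_CORE


if __name__ == "__main__":
    for n in (64, 256, 1024):
        print(n, simdgroups_in_threadgroup(n), simdgroups_per_alu(n))
```

Under these assumptions a 1024-thread threadgroup is 32 SIMD-groups, all resident on one core, so the core has to interleave them over its four ALUs rather than run them all at once.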
As to the difference in layers, logically, there is none. Nvidia marketing focuses on “CUDA cores” (a single lane of an ALU), while Apple marketing focuses on “cores”. There are obviously differences in hardware architecture. From what I understand, Apple's cores are the minimal hardware units that can be individually replicated. The closest Nvidia equivalent is probably the GPC. Functionally, however, an Apple core is not unlike Nvidia’s SM. Most of these differences are likely due to the different approach to rendering.
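Since the two vendors count different things, a small conversion helps when comparing spec sheets. The sketch below assumes the layout described above (four 32-wide SIMD ALUs per Apple core); the helper names and the 8-core example are hypothetical, and the result compares counting conventions only, not performance.

```python
# Assumed layout from the description above: four 32-wide SIMD ALUs per core.
LANES_PER_APPLE_CORE = 4 * 32


def apple_cores_to_lanes(apple_cores):
    """ALU lanes in a GPU with the given number of Apple-style cores.

    One lane is what Nvidia marketing would count as one "CUDA core".
    """
    return apple_cores * LANES_PER_APPLE_CORE


def lanes_to_apple_cores(lanes):
    """Inverse conversion: lane count expressed in Apple-style cores."""
    return lanes / LANES_PER_APPLE_CORE


if __name__ == "__main__":
    # Hypothetical 8-core GPU, expressed as a lane count.
    print(apple_cores_to_lanes(8))
```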