GPU Hardware and Metal concerning Tile Memory

In the WWDC talks on Metal that I have watched so far, many of the videos discuss Apple's A-series chips (A11, A12, etc.) and the power they give to the developer, such as the ability to leverage tile memory by opting in to TBDR. On macOS (at least on Intel Macs without the M1 chip), TBDR is unavailable, and other features that leverage tile memory, like image blocks, are also unavailable. That made me wonder about the structure of the GPUs used on macOS and of external GPUs like the Blackmagic eGPU (which is currently hooked up to my computer). Is the concept of tile memory ubiquitous across GPU architectures?

For example, if in a Metal kernel function we declared
threadgroup float tgfloats[16];

Is this value stored in tile memory (threadgroup memory) on the Blackmagic? Or is there an equivalent storage that is dependent on hardware but available on all hardware in some form?
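For context, here is a minimal sketch of how such a threadgroup array is typically used. The kernel name, buffer bindings, and the reduction logic are illustrative, not part of the original question; the point is only that the same `threadgroup` declaration is backed by tile memory on a TBDR GPU and by a dedicated shared-memory cache on an IMR GPU:

```
#include <metal_stdlib>
using namespace metal;

// Illustrative kernel: each 16-thread threadgroup reduces 16 input
// values using the threadgroup array from the question. On a TBDR GPU
// this array lives in on-chip tile memory; on an IMR GPU (AMD, Intel)
// it lives in that GPU's dedicated threadgroup/shared-memory cache.
kernel void partialSum(device const float *input   [[buffer(0)]],
                       device float       *output  [[buffer(1)]],
                       uint tid  [[thread_position_in_threadgroup]],
                       uint gid  [[thread_position_in_grid]],
                       uint tgid [[threadgroup_position_in_grid]])
{
    threadgroup float tgfloats[16];

    tgfloats[tid] = input[gid];
    threadgroup_barrier(mem_flags::mem_threadgroup);

    // Tree reduction within the threadgroup.
    for (uint stride = 8; stride > 0; stride >>= 1) {
        if (tid < stride) {
            tgfloats[tid] += tgfloats[tid + stride];
        }
        threadgroup_barrier(mem_flags::mem_threadgroup);
    }

    // Thread 0 writes out this threadgroup's partial sum.
    if (tid == 0) {
        output[tgid] = tgfloats[0];
    }
}
```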

I know there are some WWDC sessions that deal with multiple GPUs, which will probably be helpful, but extra information is always useful. Any links to information about GPU hardware architectures would be appreciated as well.
Answered by Graphics and Games Engineer in 650526022
On M1, the tiles that are used to store render target data during fragment shader execution are used as threadgroup memory when a compute kernel executes. Although AMD and Intel GPUs do not have tile memory, as they are immediate mode renderers (IMR), they do have dedicated threadgroup memory caches for compute kernels. The characteristics of these caches, including bandwidth and size, differ between GPUs.

M1 and the iOS GPUs have some features that make using compute with rendering more efficient, including tile shaders and image blocks. These allow you to mix compute kernels with rendering and to use the on-chip tile memory to share data between shaders and compute kernels.
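As a rough sketch of what a tile shader looks like, the following kernel reads and modifies color-attachment data held in an imageblock without a round trip to device memory. This is illustrative and assumes the Metal Shading Language imageblock API; the struct name and the halving operation are made up for the example:

```
#include <metal_stdlib>
using namespace metal;

// Hypothetical imageblock layout: one color attachment, matching the
// render pass's color(0) attachment format.
struct TilePixel {
    half4 color [[color(0)]];
};

// Tile compute function: runs inside a render pass on TBDR GPUs and
// operates directly on the tile's imageblock in on-chip memory.
kernel void darkenTile(imageblock<TilePixel, imageblock_layout_implicit> blockData,
                       ushort2 tid [[thread_position_in_threadgroup]])
{
    TilePixel p = blockData.read(tid);
    p.color.rgb *= 0.5h;        // illustrative operation
    blockData.write(p, tid);
}
```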

Although AMD and Intel GPUs do not have these features, their immediate mode rendering architectures make mixing separate render and compute passes less costly than on M1 and iOS GPUs, and in many cases this allows them to overcome the advantages of using tile shaders.
It's more a consequence of the problems of scale:

As you increase the scale and number of parts, communication slows down.

The fast on-chip memory used for tile memory is comparatively small.

TBDR is about taking advantage of faster memory access with smaller amounts of data that can fit in smaller amounts of physical memory.

As far as I am aware, among desktop GPUs there was only one line of Nvidia cards that had something similar.

If you run a compute kernel on the M1, you would expect its memory access to be more efficient in this regard compared to other GPUs. However, for workloads that exceed the M1's storage and processing capacity, you would of course reach a point where the overall results make it seem underpowered.

This is how you decide which strategies and use cases are more appropriate for the M1 versus an eGPU. An eGPU is slow in communication no matter how powerful a card you put in it, so it really isn't appropriate for smooth interactive rendering views; it is better suited to non-realtime offscreen rendering. The many other pipeline benefits of TBDR come into play when your use case really matches what it is best at, but that may often not be the case, and you may still need multiple render passes.
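At runtime you can inspect which of the attached GPUs are TBDR-capable Apple GPUs and which are removable eGPUs. This host-side Swift sketch uses the `MTLDevice` API; `MTLGPUFamily.apple7` is the family that corresponds to M1-class GPUs:

```swift
import Metal

// Enumerate all GPUs visible to the system (macOS only) and report
// whether each supports the Apple TBDR feature set and whether it is
// a removable device such as an eGPU.
for device in MTLCopyAllDevices() {
    let tbdr = device.supportsFamily(.apple7)  // M1-class Apple GPU family
    let eGPU = device.isRemovable              // true for eGPUs
    print("\(device.name): TBDR=\(tbdr), eGPU=\(eGPU)")
}
```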
