How many 32-bit variables can I use concurrently in a single thread of a Metal compute kernel without them being spilled into device memory? Put another way: how many 32-bit registers does a single thread have available to itself?
Let's say that each thread of my compute kernel needs to store and work with its own array of N float variables, where N can be 128, 256, 512 or more. To achieve the maximum possible performance, I do not want the thread-local variables to get spilled into the slow device memory. I want all N variables to be stored "on-chip", in the thread memory space.
To make my question more concrete, let's say there is an array declared as thread float localArray[N]. Assuming an unrealistic hypothetical scenario where localArray is the only variable in the whole kernel, what is the maximum value of N for which no portion of localArray would get spilled into the device memory?
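For reference, here is a minimal sketch of the kind of kernel I have in mind. The kernel name, the buffer bindings and the value of N are placeholders I made up for illustration, and the array is declared at function scope, which as far as I understand places it in the thread address space. A real compiler may well optimize this toy example differently from a kernel that does real work.

```metal
#include <metal_stdlib>
using namespace metal;

// Hypothetical size; the question considers 128, 256, 512 or more.
constexpr constant uint N = 256;

// Hypothetical kernel: each thread fills and then sums its own private array.
kernel void localArrayKernel(device const float *input  [[buffer(0)]],
                             device float       *output [[buffer(1)]],
                             uint gid [[thread_position_in_grid]])
{
    // Function-scope array, private to this thread (the question's localArray).
    float localArray[N];

    // Fill the per-thread array from a single value loaded from device memory.
    const float seed = input[gid];
    for (uint i = 0; i < N; ++i) {
        localArray[i] = seed * float(i);
    }

    // Read every element back and write one result per thread.
    float sum = 0.0f;
    for (uint i = 0; i < N; ++i) {
        sum += localArray[i];
    }
    output[gid] = sum;
}
```

The question is how large N can be in a kernel like this before part of localArray no longer fits in on-chip storage.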
I searched in the Metal feature set tables, but I could not find any details.