issues with heavy register usage

I'm running a kernel with heavy register usage. I'd appreciate an explanation for any of the three issues below. Why do these issues exist, and what might I do to mitigate them?

Possibly relevant: I'm reading from and writing back to a host-shared buffer on an M2 chip.

  1. Only once have I gotten an error report saying there's not enough register space. Otherwise the kernel fails silently: it returns the wrong answer (all zeros) and reports nothing. Why can't it report the failure properly? Below, when I say the kernel fails, I mean it returns all zeros.
  2. I run the kernel with and without a threadgroup_barrier(mem_flags::mem_none), and this difference alone determines whether the kernel succeeds or fails. How can this be? What resource does a threadgroup_barrier require, even for execution synchronization alone (rather than memory synchronization), that can crash the kernel? Is it not just a matter of waiting, with no resource consumption?
  3. I run the kernel in a certain configuration and it succeeds. I run it in the same way but with half as many registers, and it fails.

These issues don't make sense to me, much less do I know how to fix them, so I must be misunderstanding some fundamental principles of execution.

This can only be a bug in Metal. I have the same issue: running the same compiled program more than once gives different output for the same input.
