Issue with int16 threadgroup memory

I'm running into an issue with threadgroup memory where data written to it seemingly gets lost when I use int8_t or int16_t element types:

#include <metal_stdlib>
using namespace metal;

kernel void kernel_function(device int16_t* R, uint index [[thread_position_in_threadgroup]]) {
    threadgroup int16_t shared[1];
    shared[index] = (int16_t) 42;

    threadgroup_barrier(mem_flags::mem_threadgroup);

    R[0] = shared[index];
}

When I execute this kernel (host code: https://gist.github.com/maleadt/ffcda8fc94f03f32347c3167ccca70a8), the output buffer contains zeros. If I change the element type from int16_t to int32_t (a find/replace in both the kernel and the host code), I get the expected results.

I'm new to Metal, so I guess I'm doing something wrong here. I'm using an M1 Pro on Monterey, with Xcode 13.4.1.

EDIT: interestingly, running under MTL_SHADER_VALIDATION=1 results in the expected output, so this does start to look like a miscompilation in the back-end.

Here index can be larger than the size of the threadgroup array shared, which has only 1 element. Any thread with index greater than 0 reads and writes past the end of the array, which is undefined behavior.
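For illustration, a bounds-safe variant of the kernel might look like the sketch below. It assumes the host dispatches at most 32 threads per threadgroup; THREADS_PER_GROUP is a hypothetical name introduced here, not something from the original code, and the host-side dispatch would have to match it.

```metal
#include <metal_stdlib>
using namespace metal;

// Hypothetical upper bound on threads per threadgroup (an assumption);
// the host must not dispatch larger threadgroups than this.
#define THREADS_PER_GROUP 32

kernel void kernel_function(device int16_t* R,
                            uint index [[thread_position_in_threadgroup]]) {
    // One slot per thread, so shared[index] stays in bounds.
    threadgroup int16_t shared[THREADS_PER_GROUP];

    // Guard the write in case the dispatch exceeds the assumed bound.
    if (index < THREADS_PER_GROUP) {
        shared[index] = (int16_t) 42;
    }

    // Keep the barrier outside the conditionals so every thread reaches it.
    threadgroup_barrier(mem_flags::mem_threadgroup);

    // Only thread 0 writes the result, which avoids several threads
    // storing to R[0] at once.
    if (index == 0) {
        R[0] = shared[0];
    }
}
```

With this sizing, every access to shared is within the array for any threadgroup of up to THREADS_PER_GROUP threads.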

This also reproduces with correctly sized threadgroup memory. Furthermore, it appears to be fixed now that I'm running macOS 13.1, so I'm guessing this was a bug that has since been fixed?

In fact, it was confirmed to me that this was indeed an LLVM back-end bug that was fixed in macOS 13 beta 4.
