I ran into a similar issue and reduced it to the case below.
Given an array uint32_t* inA, I want to output an array with each element increased by 1.
Everything works until the array length reaches 1024 * 1024 * 4, at which point the output array is all 0.
It still works when the array length is 1024 * 1024 * 4 - 1.
And, oddly, if I increase the array length to 1024 * 1024 * 4 + 128 * 128, it works again, which makes for a really weird workaround.
Could anyone explain why 1024 * 1024 * 4 is a special number?
Thanks
kernel void increase_array(
    /* param idx 0 - setBuffer */
    device const uint32_t* inA,
    /* param idx 1 - setBuffer */
    device uint32_t* result,
    /* the thread index */
    uint index [[thread_position_in_grid]]
)
{
    // The for-loop is replaced by a grid of threads, each of which
    // calls this function for one element.
    result[index] = inA[index] + 1;
}
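
To make the numbers concrete: 1024 * 1024 * 4 is 4,194,304 elements (2^22), which is 16 MiB of uint32_t per buffer. Below is a rough CPU-side reference for checking the GPU output against the intended "each element increased by 1" result. It is only a sketch: `expected_output` and `first_mismatch` are hypothetical helper names of my own, not part of the Metal API, and the host-side dispatch code is not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Element count at which the output turns to all 0 in the post.
constexpr std::size_t kFailingLength = 1024u * 1024u * 4u;  // 4,194,304 = 2^22
// Size of one buffer of that length, in bytes (16 MiB).
constexpr std::size_t kFailingBytes = kFailingLength * sizeof(std::uint32_t);

// Hypothetical helper: the result the kernel is meant to produce,
// i.e. every input element increased by 1 (wrapping on overflow).
std::vector<std::uint32_t> expected_output(const std::vector<std::uint32_t>& in) {
    std::vector<std::uint32_t> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = static_cast<std::uint32_t>(in[i] + 1u);
    }
    return out;
}

// Hypothetical helper: index of the first element of gpu_out that differs
// from the CPU reference, or in.size() if every element matches.
std::size_t first_mismatch(const std::vector<std::uint32_t>& in,
                           const std::vector<std::uint32_t>& gpu_out) {
    assert(in.size() == gpu_out.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (gpu_out[i] != static_cast<std::uint32_t>(in[i] + 1u)) {
            return i;
        }
    }
    return in.size();
}
```

Running `first_mismatch` on the buffer read back from the GPU for lengths 1024 * 1024 * 4 - 1, 1024 * 1024 * 4, and 1024 * 1024 * 4 + 128 * 128 would show whether the whole output is wrong or only part of it.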