Posts

Post not yet marked as solved
1 Replies
667 Views
Hi everyone!I have a kernel Metal function which basically looks like this:struct Matrix { half arr[562500]; //enough to store 750x750 matrix }; struct Output { half arr[12288]; }; kernel void compute_features(device Output& buffer [[ buffer(0) ]], const device Matrix& mtx_0 [[ buffer(1) ]], const device Matrix& mtx_1 [[ buffer(2) ]], constant short2& matSize [[ buffer(3) ]], constant float& offset [[ buffer(4) ]], ushort2 gid [[ thread_position_in_grid ]]) { for (int i = 0; i < 12; i++) { for (int j = 0; j < 12; j++) { int mat_id = i * matSize.x + j; half matrixValue_0 = mtx_0.mat[mat_id]; half matrixValue_1 = mtx_1.mat[mat_id] - offset; short someId_0 = 0; short someId_1 = 0; short someId_2 = 0; short someId_3 = 0; //those ids will be calculated at the code below half value = 0.h; //this value will be calculated at the code below //some math where `someId` and `value` are calculated with usage of `matrixValue_0` and `matrixValue_1` if (some_condition0) { buffer.arr[someId_0] += value; } if (some_condition1) { buffer.arr[someId_1] += value; } if (some_condition2) { buffer.arr[someId_2] += value; } if (some_condition3) { buffer.arr[someId_3] += value; } } } }I understand that this code has its down-sides - dynamic indexing and big loop. But unfortunately the algorithm I'm trying to express can not be implemented differently at that point.Now, this code runs very good at iPhone 7+, it takes around 200us per iteration, and I'm very happy with this number.BUT, I tried to run the exact same algorithm on iPhone XR and I was surprised to see that this algorithm takes around 1.0-1.2 ms to complete.With the help of XCode and it's magnificent GPU pipeline debugging tool I found out that my bottlenecks are:1) half matrixValue_0 = mtx_0.mat[mat_id]; half matrixValue_1 = mtx_1.mat[mat_id] - offset;It seems that significant part of processing time are spent in Memory Load operation.2)if (some_condition0) { buffer[someId_0] += value; } if (some_condition1) { buffer[someId_1] += value; } if (some_condition2) { buffer[someId_2] += value; } if (some_condition3) { buffer[someId_3] += value; }The major processing time are spent for Memory Store operation.For me it seems like iPhone XR quite struggles operating with device memory because bottle-necks are in places where I work with containers which are stored in device memory.I understand that I'm using dynamic indexing - compiler can not really predict what address in the container will be loaded/stored in certain iteration. But the code works very good on iPhone 7+, but not on iPhone XR.I suspect that it might have something to do with byte alignment. Can it be somehow related to that?I would love to hear some suggestions on this. Thanks in advance!
Posted Last updated
.