Hi everyone!
I have a Metal kernel function which basically looks like this:

```
#include <metal_stdlib>
using namespace metal;

struct Matrix {
    half arr[562500]; // enough to store a 750x750 matrix
};

struct Output {
    half arr[12288];
};

kernel void compute_features(device Output& buffer        [[ buffer(0) ]],
                             const device Matrix& mtx_0   [[ buffer(1) ]],
                             const device Matrix& mtx_1   [[ buffer(2) ]],
                             constant short2& matSize     [[ buffer(3) ]],
                             constant float& offset       [[ buffer(4) ]],
                             ushort2 gid                  [[ thread_position_in_grid ]])
{
    for (int i = 0; i < 12; i++) {
        for (int j = 0; j < 12; j++) {
            int mat_id = i * matSize.x + j;
            half matrixValue_0 = mtx_0.arr[mat_id];
            half matrixValue_1 = mtx_1.arr[mat_id] - offset;

            short someId_0 = 0;
            short someId_1 = 0;
            short someId_2 = 0;
            short someId_3 = 0;  // these ids are calculated in the code below
            half value = 0.0h;   // this value is calculated in the code below

            // some math where `someId_*`, `value` and the `some_condition*` flags
            // are calculated using `matrixValue_0` and `matrixValue_1`

            if (some_condition0) {
                buffer.arr[someId_0] += value;
            }
            if (some_condition1) {
                buffer.arr[someId_1] += value;
            }
            if (some_condition2) {
                buffer.arr[someId_2] += value;
            }
            if (some_condition3) {
                buffer.arr[someId_3] += value;
            }
        }
    }
}
```
I understand that this code has its downsides: dynamic indexing and a big loop. But unfortunately, the algorithm I'm trying to express cannot be implemented differently at this point.
Now, this code runs very well on an iPhone 7+: it takes around 200 µs per iteration, and I'm very happy with that number.

BUT, I tried to run the exact same algorithm on an iPhone XR, and I was surprised to see that it takes around 1.0-1.2 ms to complete.

With the help of Xcode and its magnificent GPU pipeline debugging tool, I found out that my bottlenecks are:

1) The matrix loads:

```
half matrixValue_0 = mtx_0.arr[mat_id];
half matrixValue_1 = mtx_1.arr[mat_id] - offset;
```

It seems that a significant part of the processing time is spent in the Memory Load operation.
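In the simplified kernel above every thread reads the same 12x12 window, so one idea I'm considering (untested, and only applicable if the threads in a threadgroup really do share that window in the full algorithm) is to stage the window into threadgroup memory once per threadgroup instead of loading it from device memory in every thread. A rough sketch, with the kernel name and tile handling invented for illustration:

```
#include <metal_stdlib>
using namespace metal;

// Matrix / Output are the same structs as above.
kernel void compute_features_tg(device Output& buffer        [[ buffer(0) ]],
                                const device Matrix& mtx_0   [[ buffer(1) ]],
                                const device Matrix& mtx_1   [[ buffer(2) ]],
                                constant short2& matSize     [[ buffer(3) ]],
                                constant float& offset       [[ buffer(4) ]],
                                ushort2 gid                  [[ thread_position_in_grid ]],
                                ushort tid                   [[ thread_index_in_threadgroup ]],
                                ushort2 tg_dim               [[ threads_per_threadgroup ]])
{
    // Stage the shared 12x12 window of both matrices into threadgroup memory,
    // so each value is read from device memory once per threadgroup instead of
    // once per thread.
    threadgroup half tile_0[144];
    threadgroup half tile_1[144];

    ushort tg_size = tg_dim.x * tg_dim.y;
    for (ushort idx = tid; idx < 144; idx += tg_size) {
        int mat_id = (idx / 12) * matSize.x + (idx % 12);
        tile_0[idx] = mtx_0.arr[mat_id];
        tile_1[idx] = mtx_1.arr[mat_id] - offset;
    }
    threadgroup_barrier(mem_flags::mem_threadgroup);

    for (int i = 0; i < 12; i++) {
        for (int j = 0; j < 12; j++) {
            half matrixValue_0 = tile_0[i * 12 + j];
            half matrixValue_1 = tile_1[i * 12 + j];
            // ... same math and conditional accumulation as in the original kernel ...
        }
    }
}
```

That would turn the two device loads per loop iteration into threadgroup-memory reads, at the cost of a barrier and 576 bytes of threadgroup memory.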
2) The conditional writes into the output buffer:

```
if (some_condition0) {
    buffer.arr[someId_0] += value;
}
if (some_condition1) {
    buffer.arr[someId_1] += value;
}
if (some_condition2) {
    buffer.arr[someId_2] += value;
}
if (some_condition3) {
    buffer.arr[someId_3] += value;
}
```

The major part of the processing time is spent in the Memory Store operation.
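To check how much of that time really goes into the scattered read-modify-writes (as opposed to the math), one experiment I have in mind is a profiling-only variant that accumulates everything into a register and does a single store per thread. The kernel name, the placeholder math/predicates and the final index are all made up for illustration, and the results are intentionally meaningless:

```
#include <metal_stdlib>
using namespace metal;

// Profiling probe only: keeps the loads and the branches, but replaces the
// scattered `+=` into device memory with one store per thread.
kernel void compute_features_store_probe(device Output& buffer        [[ buffer(0) ]],
                                         const device Matrix& mtx_0   [[ buffer(1) ]],
                                         const device Matrix& mtx_1   [[ buffer(2) ]],
                                         constant short2& matSize     [[ buffer(3) ]],
                                         constant float& offset       [[ buffer(4) ]],
                                         ushort2 gid                  [[ thread_position_in_grid ]])
{
    half local_acc = 0.0h;
    for (int i = 0; i < 12; i++) {
        for (int j = 0; j < 12; j++) {
            int mat_id = i * matSize.x + j;
            half matrixValue_0 = mtx_0.arr[mat_id];
            half matrixValue_1 = mtx_1.arr[mat_id] - offset;

            // Placeholders standing in for the real math (which also computes
            // someId_0 ... someId_3 in the actual kernel):
            half value = matrixValue_0 * matrixValue_1;
            bool some_condition0 = matrixValue_0 > matrixValue_1;
            bool some_condition1 = matrixValue_0 < matrixValue_1;
            bool some_condition2 = value > 0.0h;
            bool some_condition3 = value < 0.0h;

            if (some_condition0) { local_acc += value; }
            if (some_condition1) { local_acc += value; }
            if (some_condition2) { local_acc += value; }
            if (some_condition3) { local_acc += value; }
        }
    }
    // One store per thread; the index only has to be valid, not meaningful.
    buffer.arr[gid.x % 12288] = local_acc;
}
```

If this variant is dramatically faster on the XR, that would confirm the scattered `+=` into device memory as the problem.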
To me it looks like the iPhone XR struggles with device memory, because both bottlenecks are in places where I work with containers stored in device memory.

I understand that I'm using dynamic indexing: the compiler cannot predict which address in the container will be loaded or stored in a given iteration. But the code works very well on the iPhone 7+ and not on the iPhone XR.

I suspect it might have something to do with byte alignment. Could it be related to that?
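Regarding alignment: `half` elements are only 2-byte aligned, so one crude experiment would be to temporarily widen the storage to `float` (4-byte aligned, at twice the memory traffic) and see whether the XR timing moves. Hypothetical struct variants, just for that experiment:

```
// Diagnostic only: same layout, but 4-byte aligned float elements instead of
// 2-byte half. The kernel body stays the same apart from the element type.
struct MatrixF {
    float arr[562500]; // 750x750, now 4 bytes per element
};

struct OutputF {
    float arr[12288];
};
```

If the `float` version is not faster, alignment is probably not the culprit.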
I would love to hear some suggestions on this. Thanks in advance!