1 Reply
      Latest reply on Jan 9, 2020 9:31 AM by Fritzt
      SuperCoolEugene Level 1 (0 points)

        Hi everyone!

         

        I have a kernel Metal function which basically looks like this:

         

        struct Matrix { 
            half arr[562500]; //enough to store 750x750 matrix 
        };  
        
        struct Output { 
           half arr[12288]; 
        };   
        
        kernel void compute_features(device Output& buffer [[ buffer(0) ]], 
                                     const device Matrix& mtx_0 [[ buffer(1) ]], 
                                     const device Matrix& mtx_1 [[ buffer(2) ]], 
                                     constant short2& matSize [[ buffer(3) ]], 
                                     constant float& offset [[ buffer(4) ]], 
                                     ushort2 gid [[ thread_position_in_grid ]]) 
        {   
            for (int i = 0; i < 12; i++) { 
                for (int j = 0; j < 12; j++) {  
                    int mat_id = i * matSize.x + j;  
                    half matrixValue_0 = mtx_0.arr[mat_id]; 
                    half matrixValue_1 = mtx_1.arr[mat_id] - offset;  
                    short someId_0 = 0; 
                    short someId_1 = 0; 
                    short someId_2 = 0; 
                    short someId_3 = 0;  // these ids are calculated in the code below 
                    half value = 0.h;  // this value is calculated in the code below  
        
                    // some math where the `someId_*` indices and `value` are calculated from `matrixValue_0` and `matrixValue_1`  
        
                    if (some_condition0) { 
                        buffer.arr[someId_0] += value; 
                    }  
        
                    if (some_condition1) { 
                        buffer.arr[someId_1] += value; 
                    }  
        
                    if (some_condition2) { 
                        buffer.arr[someId_2] += value; 
                    }  
        
                    if (some_condition3) { 
                        buffer.arr[someId_3] += value; 
                    } 
                }
            }
        }

        I understand that this code has its downsides: dynamic indexing and a big loop. Unfortunately, the algorithm I'm trying to express cannot be implemented differently at this point.

         

        Now, this code runs very well on the iPhone 7+: it takes around 200 µs per iteration, and I'm very happy with this number.

        BUT when I ran the exact same algorithm on an iPhone XR, I was surprised to see that it takes around 1.0-1.2 ms to complete.

         

        With the help of Xcode and its magnificent GPU pipeline debugging tool, I found out that my bottlenecks are:

         

        1)

            
            half matrixValue_0 = mtx_0.arr[mat_id]; 
            half matrixValue_1 = mtx_1.arr[mat_id] - offset;

        It seems that a significant part of the processing time is spent in Memory Load operations.

         

        2)

        if (some_condition0) { 
            buffer.arr[someId_0] += value; 
        }
        
        if (some_condition1) { 
            buffer.arr[someId_1] += value;
        }  
        
        if (some_condition2) { 
            buffer.arr[someId_2] += value; 
        }  
        
        if (some_condition3) { 
            buffer.arr[someId_3] += value; 
        }

        Most of the processing time is spent in Memory Store operations.

         

        To me it seems that the iPhone XR struggles with device memory, because the bottlenecks are exactly the places where I access containers stored in device memory.

        I understand that I'm using dynamic indexing, so the compiler cannot predict which address in the container will be loaded or stored in a given iteration. Still, the same code works very well on the iPhone 7+ but not on the iPhone XR.

        I suspect that it might have something to do with byte alignment. Could it be related to that?
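
        One way to probe the alignment idea would be to compare the scalar loads in the kernel with wider loads over the same 12-element rows. The sketch below is only an experiment under assumptions, not a known fix: the kernel name `sum_window_packed`, the `result` buffer, and the binding indices are hypothetical. Since `half` is 2 bytes, element `mat_id` sits at byte offset `2 * mat_id`, so (assuming the buffer itself starts at an aligned address) a row start is 8-byte aligned only when `i * matSize.x` is a multiple of 4. `packed_half4` is used because it only requires 2-byte alignment, so the reads stay valid for any row start.

```metal
#include <metal_stdlib>
using namespace metal;

struct Matrix {
    half arr[562500]; // enough to store a 750x750 matrix
};

// Hypothetical experiment kernel (not part of the original code): sums the
// same 12x12 window, but reads each 12-element row as three packed_half4
// loads instead of twelve scalar half loads.
kernel void sum_window_packed(device float* result [[ buffer(0) ]],
                              const device Matrix& mtx_0 [[ buffer(1) ]],
                              constant short2& matSize [[ buffer(2) ]],
                              uint gid [[ thread_position_in_grid ]])
{
    float sum = 0.0f;
    for (int i = 0; i < 12; i++) {
        // Start of row i; its byte offset is 2 * i * matSize.x.
        const device packed_half4* row =
            (const device packed_half4*)(mtx_0.arr + i * matSize.x);
        for (int k = 0; k < 3; k++) {
            half4 v = half4(row[k]); // elements 4k .. 4k+3 of the row
            sum += float(v.x) + float(v.y) + float(v.z) + float(v.w);
        }
    }
    // As in the kernel above, every thread reads the same window; each
    // thread writes its own slot so the stores do not collide.
    result[gid] = sum;
}
```

        Timing this against a scalar version of the same loop on both devices might show whether load width or alignment matters at all; if the two variants take about the same time, alignment is probably not the cause.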

        I would love to hear some suggestions on this. Thanks in advance!

        • Re: Metal Performance on iPhone XR
          Fritzt Level 1 (10 points)

          Did you ever find out the cause? I'm experiencing the same on A12 devices (iPhone Xs 256GB + iPad Pro 11"). Access to large 3D textures seems to have been much faster on older devices. I wonder if this is related to the new TBDR architecture of the A11, although I remember it being faster on the iPhone X...