Custom Core Image box blur kernel much slower than equivalent built-in CIBoxBlur

I've created a custom box blur kernel that produces identical results to Apple's built-in box blur (CIBoxBlur), but my custom kernel is orders of magnitude slower. The built-in filter performs well, so I can only assume I'm doing something wrong. Below is my custom kernel in the Metal shading language. Can you spot why it's so slow?

#include <CoreImage/CoreImage.h>
#import <simd/simd.h>

extern "C" {
    namespace coreimage {
        float4 customBoxBlurFilterKernel(sampler src) {
            float2 crd = src.coord();
            
            int edge = 100;
            
            int minx = crd.x - edge;
            int maxx = crd.x + edge;
            int miny = crd.y - edge;
            int maxy = crd.y + edge;
            
            float4 sums = float4(0,0,0,0);
            float cnt = 0;
            // compute average of surrounding rgb values
            for(int row=miny; row < maxy; row++) {
                for(int col=minx; col < maxx; col++) {
                    float4 samp = src.sample(float2(col, row));
                    sums[0] += samp[0];
                    sums[1] += samp[1];
                    sums[2] += samp[2];
                    cnt += 1.;
                }
            }
            
            return float4(sums[0]/cnt, sums[1]/cnt, sums[2]/cnt, 1);
        }
    }
}

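It appears that you're not using a separable approach for your filter. Your kernel reads every pixel in the n x n window, so each output pixel costs O(n^2) texture samples; with edge = 100 that is 200 x 200 = 40,000 samples per output pixel. A box blur is separable: a horizontal 1D pass followed by a vertical 1D pass over its result gives the same output, which reduces the cost to O(2n) samples per pixel. I suggest you take a look at this video https://www.youtube.com/watch?v=SiJpkucGa1o to learn more about separable filters.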

I also suggest mipmapping the texture you want to blur and blurring one of the lower-resolution mip levels instead of the full-size image. You can then scale the result back up with linear sampling to avoid pixelation in the final result. This is the approach that Apple takes in this example for their implementation of a Gaussian blur filter. The same approach can be used for box blurs.

Finally, you can take advantage of the GPU's linear sampling to effectively halve the number of samples for the same result, which will make your filter much faster. To use this method, sample at the boundary between each pair of adjacent pixels instead of sampling each pixel individually; with linear filtering, a single fetch there returns the average of both pixels. More information on how this is used in Gaussian blurs can be found at: https://www.rastergrid.com/blog/2010/09/efficient-gaussian-blur-with-linear-sampling/#:~:text=see%20the%20difference.-,Linear%20sampling,-So%20far%2C%20we
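The linear sampling trick can be checked on the CPU with a small C++ sketch. This emulates linear filtering along one row by hand; sampleLinear and the two boxAverage functions are hypothetical names for this illustration, and it assumes the usual convention that texel i is centred at coordinate i + 0.5. It only covers an even number of texels; an odd-width window would need one extra single tap.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Emulate a linearly filtered fetch from one row, with texel i centred at i + 0.5
// and clamp-to-edge addressing (assumed conventions for this sketch).
float sampleLinear(const std::vector<float>& row, float u) {
    const int last = static_cast<int>(row.size()) - 1;
    const float t = u - 0.5f;
    const int base = static_cast<int>(std::floor(t));
    const float frac = t - static_cast<float>(base);
    const int i0 = std::max(0, std::min(base, last));
    const int i1 = std::max(0, std::min(base + 1, last));
    return row[i0] * (1.0f - frac) + row[i1] * frac;
}

// Reference: average `count` texels starting at `first`, one read per texel.
float boxAveragePoint(const std::vector<float>& row, int first, int count) {
    float sum = 0.0f;
    for (int k = 0; k < count; ++k)
        sum += sampleLinear(row, static_cast<float>(first + k) + 0.5f);  // texel centre
    return sum / static_cast<float>(count);
}

// Same average over 2 * pairs texels, but with one filtered read per pair.
float boxAverageLinear(const std::vector<float>& row, int first, int pairs) {
    float sum = 0.0f;
    for (int p = 0; p < pairs; ++p) {
        // The boundary between texels (first + 2p) and (first + 2p + 1) sits at
        // coordinate first + 2p + 1, where the filter weights both texels by 0.5.
        sum += sampleLinear(row, static_cast<float>(first + 2 * p) + 1.0f);
    }
    return sum / static_cast<float>(pairs);
}
```

For a row {1, 2, 3, 4, 5, 6, 7, 8}, averaging texels 2 through 5 takes four reads with boxAveragePoint(row, 2, 4) and two reads with boxAverageLinear(row, 2, 2); both give 4.5.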

@shaharbd Thanks for the excellent reply! You've given me much to look into here. I appreciate it.
