I would like to write a reduction-sum (ReduceSum) Metal shader like this one:
https://github.com/alibaba/MNN/blob/master/source/backend/metal/MetalReduction.metal#L32
Sometimes the reduced dimension is large while the other dimensions are small, so only a few threads can be launched and the kernel runs inefficiently.
Is there any way to optimize it?
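
To make the case concrete, here is a minimal sketch of the pattern I mean. It is not the MNN kernel itself: the struct `ReduceShape`, the kernel name `reduce_sum_naive`, and the buffer indices are hypothetical, and I am assuming the input is viewed as `[outside, axis, inside]` with one thread per output element.

```metal
#include <metal_stdlib>
using namespace metal;

// Hypothetical shape descriptor: the input is viewed as [outside, axis, inside].
struct ReduceShape {
    uint outside_size;
    uint axis_size;   // the reduced dimension
    uint inside_size;
};

// One thread per output element; each thread walks the whole reduced axis serially.
kernel void reduce_sum_naive(const device float   *in  [[buffer(0)]],
                             device float         *out [[buffer(1)]],
                             constant ReduceShape &s   [[buffer(2)]],
                             uint2 gid [[thread_position_in_grid]]) {
    // gid.x indexes the inside dimension, gid.y the outside dimension.
    if (gid.x >= s.inside_size || gid.y >= s.outside_size) return;

    // Offset of the first element along the reduced axis for this output.
    const uint base = gid.y * s.axis_size * s.inside_size + gid.x;

    float sum = 0.0f;
    for (uint i = 0; i < s.axis_size; ++i) {
        sum += in[base + i * s.inside_size];
    }
    out[gid.y * s.inside_size + gid.x] = sum;
}
```

With, say, `outside_size * inside_size = 4` and `axis_size = 1000000`, only 4 threads do any work, and each one performs a million additions in sequence, so most of the GPU sits idle.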