Hi, I have a neural network with many MPSCNNConvolution layers, plus some custom layers I implemented myself, such as a transpose convolution.
Debugging my app with GPU Frame Capture, I saw that my transpose convolution takes 4 ms to execute, while an MPSCNNConvolution takes between 5.5 and 7 ms most of the time, with one even reaching 13 ms. Taking a closer look at the convolutions, I saw that the threadgroup sizes were {2, 2, 4} or {2, 1, 4}. That is only 16 (or 8) threads per threadgroup, which is clearly too small for good performance, since it does not even fill the execution width of a single SIMD group.
I did not find a way to specify threadgroup sizes for an MPSCNNConvolution layer, so I am considering implementing the convolution myself if no such option exists. Is there a way to specify the threadgroup sizes? Or why are they so small?
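For context, if I do end up writing my own kernel, this is roughly how I would size the threadgroups, following the usual pattern of filling one SIMD group in x and stacking rows in y. The helper below is just a sketch with plain integers standing in for `threadExecutionWidth` and `maxTotalThreadsPerThreadgroup` from the compute pipeline state; the values 32 and 1024 are typical, not guaranteed:

```swift
// Sketch: pick a threadgroup size from the pipeline's execution width
// and maximum threads per threadgroup (hypothetical helper, not MPS API).
func threadgroupSize(executionWidth: Int,
                     maxThreadsPerGroup: Int) -> (width: Int, height: Int, depth: Int) {
    // Fill at least one full SIMD group along x...
    let width = executionWidth
    // ...then stack as many rows in y as the limit allows.
    let height = maxThreadsPerGroup / executionWidth
    return (width, height, 1)
}

// With typical values (threadExecutionWidth = 32, max = 1024):
let size = threadgroupSize(executionWidth: 32, maxThreadsPerGroup: 1024)
// size == (32, 32, 1) — 1024 threads, versus the 16 threads of a {2, 2, 4} group
```

In the real encoder I would read these from `MTLComputePipelineState` and pass the result to the dispatch call, but I'd rather not reimplement the convolution if MPS can be told to use larger groups.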
Thanks,
Mathias