MPSCNNConvolution using just 8 or 16 threads per threadgroup and therefore underperforming

Hi, I have a neural network with lots of MPSCNNConvolution layers. I also have other layers, such as a transpose convolution layer, which I implemented myself.

Debugging my app with GPU Frame Capture, I saw that my transpose convolution takes 4 ms to execute, while an MPSCNNConvolution usually takes between 5.5 and 7 ms, with one even reaching 13 ms. Taking a closer look at the convolutions, I saw that the threadgroup sizes were {2, 2, 4} or {2, 1, 4}, i.e. only 16 or 8 threads, which is clearly not enough for good performance since it does not even fill the thread execution width.

I did not find a way to specify the threadgroup sizes for an MPSCNNConvolution layer, so if there is none I am thinking about implementing the convolution myself. Is there a way to specify the threadgroup sizes? Or why are they so small?
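For comparison, in my own compute kernels I can pick the threadgroup size explicitly when encoding the dispatch, roughly like the sketch below (the pipeline, encoder and image dimensions are just placeholders from my setup, and the depth-1 threadgroup is a simplification). I found nothing equivalent on MPSCNNConvolution:

import Metal

// Rough sketch of how I pick threadgroup sizes for my own compute kernels;
// MPS chooses these internally and does not seem to expose them.
func dispatch(encoder: MTLComputeCommandEncoder,
              pipeline: MTLComputePipelineState,
              imageWidth: Int,
              imageHeight: Int) {
    let w = pipeline.threadExecutionWidth                 // SIMD width of the GPU
    let h = pipeline.maxTotalThreadsPerThreadgroup / w
    let threadsPerThreadgroup = MTLSize(width: w, height: h, depth: 1)
    let threadgroups = MTLSize(width: (imageWidth + w - 1) / w,
                               height: (imageHeight + h - 1) / h,
                               depth: 1)
    encoder.setComputePipelineState(pipeline)
    encoder.dispatchThreadgroups(threadgroups, threadsPerThreadgroup: threadsPerThreadgroup)
}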


Thanks,

Mathias

Replies

I've never gotten GPU Frame Capture to work reliably with compute shaders, so maybe what you're seeing is not really what is happening. I don't think there is a way to change the threadgroup sizes for MPS objects, because the whole point of the MPS framework is to hide that stuff from the developer.


However, I am interested in your transposed convolution. I assume you implemented this as the backward pass of convolution. If so, it is equivalent to doing a forward convolution with the kernel weights flipped horizontally and vertically (i.e. the kernel weights reversed in memory).
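To make the flipping concrete, here is a rough sketch of what I mean by reversing the weights, assuming the [outputChannels][kernelHeight][kernelWidth][inputChannels] layout that MPSCNNConvolution expects (the function name and signature are just illustrative):

// Flip a convolution kernel spatially (180° rotation per output/input channel pair).
// Assumed weight layout: [outputChannels][kernelHeight][kernelWidth][inputChannels].
func flippedWeights(_ weights: [Float],
                    outputChannels: Int,
                    kernelHeight: Int,
                    kernelWidth: Int,
                    inputChannels: Int) -> [Float] {
    var flipped = weights
    for o in 0..<outputChannels {
        for y in 0..<kernelHeight {
            for x in 0..<kernelWidth {
                for i in 0..<inputChannels {
                    let src = ((o * kernelHeight + y) * kernelWidth + x) * inputChannels + i
                    let dst = ((o * kernelHeight + (kernelHeight - 1 - y)) * kernelWidth
                               + (kernelWidth - 1 - x)) * inputChannels + i
                    flipped[dst] = weights[src]
                }
            }
        }
    }
    return flipped
}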


So I am curious: if you use MPSCNNConvolution with a certain kernel and then use your transposed convolution but with the kernel weights reversed, does this actually give the same output (within a certain amount of precision)?
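If you do try it, a simple element-wise comparison with a relative tolerance should be enough to check, something along these lines (assuming both outputs have already been copied into [Float] arrays; MPS typically works with half-precision images, so exact equality won't hold):

// Compare the MPSCNNConvolution output against the custom transposed
// convolution's output, element by element, within a relative tolerance.
func outputsMatch(_ a: [Float], _ b: [Float], tolerance: Float = 1e-2) -> Bool {
    guard a.count == b.count else { return false }
    for (x, y) in zip(a, b) {
        let scale = max(abs(x), abs(y), 1.0)
        if abs(x - y) > tolerance * scale { return false }
    }
    return true
}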

I did not try reversing the kernel weights.

I also noticed that GPU Frame Capture was not working for me prior to Xcode 8.3, but now it seems to work. It also reports the correct threadgroup sizes for all of my custom layers, so I suppose the values it reports for MPSCNNConvolution are correct as well.