Metal & Compute shaders: performance issue

Hi,


I'm having a test run with compute shaders on Metal. I started with the MetalImageProcessing sample code, which basically does the following:


- computes a gray version of an input image into a texture

- renders that texture to the display


The source image & texture are 800x533 (see sample code)

I'm running this on an A7 iPad (running iOS 9b4), and I'm a bit surprised to only achieve 40 FPS with that setup. I tried tweaking the threadgroup size & the shader code itself, with no luck. The CPU gauge in Xcode shows 24.6 ms, the GPU gauge 18 ms.


This brings a number of questions:


- is this subpar performance due to the beta OS, the hardware, or something else?

- is it due to the compute shader (as opposed to vertex/fragment, I should say)?

- are compute shaders a recommended approach?


I'm pretty sure I could achieve at least a 5x improvement (if not much more) using GLES vertex / fragment on the same hardware, which is why I'd like more clues before committing to Metal. Or maybe I'll revert to vertex/fragment even with Metal.


Thank you.

Replies

I don't know about the iOS platforms, but on OS X I have measured between one third and half the performance of OpenCL, with compute kernels that are the exact counterparts of the OpenCL ones.


I guess, and I hope, that this is due to driver immaturity.


It would be really useful to get feedback from the Metal team on this forum. Right now we have to walk in the dark: insufficiently detailed documentation, suspected driver bugs from beta to beta, no word from Apple staff, no responses to our radars, and so on... This is not a professional attitude. Sure, we could burn a TSI, but that is not allowed on a beta... I'm sure Apple's engineers are really busy, but I guess we could all benefit from each other's experience.

Hi, thank you for your answer. I'm about to do some more experiments:


- try the same algorithm with Metal / vertex + fragment instead of Metal / compute

- try the same thing with GLES / vertex + fragment (to FBO then display)


Then I'll see whether A8 vs. A7 makes any difference. I'm also interested in getting some clues from Apple engineers.

That's interesting, so please keep us up to date!


And don't hesitate to file a radar with your results. I sent mine, with two projects (an OpenCL and a Metal one), last week. If an Apple employee is reading here, this is its number: 21831201

Finished my tests. Render to texture with color-to-gray conversion, then render the result to the display. Image size: 800 x 533 (starting from the MetalImageProcessing sample).


I tried 4 APIs: GLES, Core Image (to an MTL texture, then a Metal view; iOS 9 only), Metal Compute, and Metal Graphics (vertex + fragment),

on 4 different configurations: A7 running iOS 9; A7, A8, and A8X running iOS 8.


Fastest is Metal Graphics, with GLES a very close second. That can be explained by the very simple geometry, just 2 triangles.

Metal Compute is behind: about 5x slower than Metal Graphics on A7 (iOS 8 & 9), 1.5x slower on A8. Compute is about 30% faster on iOS 9 than on iOS 8 (A7). Core Image is on par with Metal Compute as far as I can tell.


I used combined device utilization + FPS as a metric.


Note: the A8 measures might be less accurate because the GPU is so powerful that it hides the real load.


The conclusion is that Metal Graphics can be a better bet if you have a choice of API for a given algorithm. I imagine this can vary a lot depending on what the processing does (number / distance of samples, computations, etc.).

Thank you for sharing!

That's quite interesting, and it confirms that compute kernels are not (yet?) optimized... Why such a difference from the Metal graphics path, when the compiler has to do a similar job?


I have noticed that there are options for the Metal compiler optimization level in Xcode. On OS X, it seems that there is absolutely no difference between selecting one level or another. Have you seen any improvement by switching between these optimization levels?

I did not try different Metal compiler settings (the default ones for Instruments builds / Release). It would be hard to redo the tests now, but it's something I'd like to do when we get the GM.


As for compute vs. graphics performance, my understanding is that it is not down to shader code performance, but more to how data is fed to the GPU. Textured rendering is an extremely common and optimized path (for games), while compute may be a more recent & less mature path, with a different kind of overhead. Anyway, that could change with version updates / hardware releases.

Can you please file a radar with a reproducible case and your findings? We will take a look ASAP.

I noticed the same poor performance running the MetalImageProcessing sample on an iPhone 5S. It takes around 13 ms to run the compute kernel, which is kind of extreme for the computations it is doing. Even the RenderCommandEncoder takes 2.x ms just to render a simple textured quad... This is really, really bad performance compared to the OpenGL pipeline. I built it with the recent Xcode 7 GM. Can anyone shed some light on this?

I was also testing with an iPhone 5S. I implemented a StreamScan algorithm with Metal compute and tested it with different arrays of float32 values.

With iOS 8.4, scanning 8192 values takes 1.2 ms.

With iOS 9 (released yesterday) the time went up to 2.1 ms. ALMOST TWICE AS LONG!

Times with other array lengths were also consistently almost two times slower.

What is going on?

On iOS 9, what's the bet an array of float64 values takes 2.1ms, too?