iPhone 8 GPU compute kernel performance drops significantly after upgrading to Xcode 11

Hi, I'm asking for help, and also to see if anyone else has run into the same issue.


My app has some compute kernel functions (pre-compiled into a .metallib) that run on the iOS GPU. The performance was good on iPhone 8 before upgrading to Xcode 11, but after upgrading to Xcode 11, the same compute kernel is 40% slower!


I suspect it is due to some unpublished changes in the Metal compiler in Xcode 11. The performance goes back to normal if I re-compile the .metallib with Xcode 10.3 and simply drop it into the app built with Xcode 11.
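
For context, the app loads the precompiled library at runtime roughly like this (a minimal sketch, not my actual code; the file name "Kernels.metallib" and the kernel name "myKernel" are placeholders):

```swift
import Foundation
import Metal

// Minimal sketch: load the precompiled .metallib shipped in the app bundle
// and build a compute pipeline from it. File and kernel names are placeholders.
func makeKernelPipeline(device: MTLDevice) throws -> MTLComputePipelineState {
    guard let path = Bundle.main.path(forResource: "Kernels", ofType: "metallib") else {
        fatalError("precompiled library not found in bundle")
    }
    // The library is loaded exactly as it was compiled, so a .metallib built
    // by the Xcode 10.3 toolchain can be swapped in without touching app code.
    let library = try device.makeLibrary(filepath: path)
    guard let function = library.makeFunction(name: "myKernel") else {
        fatalError("kernel function not found in library")
    }
    return try device.makeComputePipelineState(function: function)
}
```

This is why swapping the library between toolchains requires no change to the app itself.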


Has the GPU register allocation strategy been changed in Xcode 11's compiler? I'm not sure.


I compared the LLVM IR output between Xcode 10 and Xcode 11; it seems only the order and arrangement of some for-loop blocks are different. But why does this have such a large negative impact on performance?


Could Apple explain why, or what has happened to the Metal compiler in Xcode 11? This behavior is really strange and mysterious.


Regards.

Replies

I can attach the AIR files compiled with Xcode 10 and Xcode 11 for comparison if needed.

I've nothing to add on the actual issue (will have to test this out though, as it could affect some of my stuff) but if you'd like to look at what's actually happening with the compiler, I posted some details on how to decompile the shader cache in this thread: https://forums.developer.apple.com/thread/119625

Thanks. I will follow your link and see if I can find anything useful.
I compared the LLVM IR between Xcode 10 and Xcode 11 without finding any important differences. So strange!

Have you checked in the profiler that it's the compute kernel itself that's slower, and not something happening outside it causing a wait?
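For example, the command buffer's GPU timestamps give the time the GPU actually spent on the dispatch, independent of any CPU-side waits (a rough sketch; the pipeline and dispatch sizes stand in for your real kernel setup):

```swift
import Foundation
import Metal

// Rough sketch: measure the GPU time of a single compute dispatch using the
// command buffer's GPU timestamps (available since iOS 10.3), so CPU-side
// scheduling or waits don't skew the comparison.
func measureKernelGPUTime(device: MTLDevice,
                          pipeline: MTLComputePipelineState,
                          threadgroups: MTLSize,
                          threadsPerThreadgroup: MTLSize) -> CFTimeInterval {
    let queue = device.makeCommandQueue()!
    let commandBuffer = queue.makeCommandBuffer()!
    let encoder = commandBuffer.makeComputeCommandEncoder()!
    encoder.setComputePipelineState(pipeline)
    // ... bind the kernel's buffers/textures here ...
    encoder.dispatchThreadgroups(threadgroups, threadsPerThreadgroup: threadsPerThreadgroup)
    encoder.endEncoding()
    commandBuffer.commit()
    commandBuffer.waitUntilCompleted()
    // Time the GPU actually spent executing this command buffer.
    return commandBuffer.gpuEndTime - commandBuffer.gpuStartTime
}
```

Comparing this number for the Xcode 10 and Xcode 11 libraries would rule out anything happening on the CPU side.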

Yep, I just replaced the AIR file generated by Xcode 11 with the one from Xcode 10, leaving all other framework code unchanged, and the performance comes back. This should be a fair comparison.


Actually, the two LLVM IRs are not exactly the same, but it seems only the order and arrangement of some for-loop (phi) blocks differ. I have no idea why this minor difference in IR has such a large negative impact on performance.


It sounds like there is currently no solution, since nobody outside Apple has the ISA documentation or an assembler for Apple's GPUs (because of the closed ecosystem). Things are out of our control at the Metal language level: even a slight difference in the IR (caused by upgrading to a newer Xcode version) can lead to a huge performance drop in the final machine code. That's too bad!