I have a project that solves the viscoelastic equation for sound transmission in biological media: https://github.com/ProteusMRIgHIFU/BabelViscoFDTD. The code supports CUDA, OpenCL, Metal, and OpenMP backends, and we have done a lot of fine-tuning on each backend to get the best possible performance on each platform. The numerical simulation and the hardware used are described at the link above.

Here is a summary of the results. First of all, the M1 Max is a knockout against both AMD and Nvidia, but only when using OpenCL. The OpenMP performance of the M1 Max is also excellent. It is simply mind-blowing that the M1 Max is neck and neck with an Nvidia RTX A6000, which costs more than the MacBook Pro used for the test.

Metal results, on the other hand, are inconsistent. Metal shows excellent results on the AMD W6800 Pro (the best computing time of all tested GPUs), but not on a Vega 56 or the M1 Max. For all Metal-capable processors, we used the first formula recommended at https://developer.apple.com/documentation/metal/calculating_threadgroup_and_grid_sizes.
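For readers who have not seen that article, here is a minimal sketch of the sizing rule as I understand it, written in plain Python so the arithmetic is easy to check. It is not the code from the repository. The function names are hypothetical, and the values 32 and 1024 are only assumed examples; the real numbers come from the compute pipeline state's `threadExecutionWidth` and `maxTotalThreadsPerThreadgroup` at runtime.

```python
def threadgroup_size(thread_execution_width, max_total_threads_per_threadgroup):
    """Return a (w, h, 1) threadgroup: w is the execution width and
    h is whatever is left of the per-threadgroup thread budget."""
    w = thread_execution_width
    h = max_total_threads_per_threadgroup // w
    return (w, h, 1)


def threadgroups_per_grid(grid, threadgroup):
    """Ceiling division per dimension, for the case where the grid must be
    covered by whole threadgroups (no non-uniform threadgroup support)."""
    return tuple((g + t - 1) // t for g, t in zip(grid, threadgroup))


# Assumed example values, not measurements from any of the tested GPUs.
tg = threadgroup_size(32, 1024)
groups = threadgroups_per_grid((100, 70, 1), tg)
```

With these assumed inputs the threadgroup is 32 x 32 x 1, and a 100 x 70 grid needs 4 x 3 threadgroups to be fully covered.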
Further tests with different domain sizes showed that the M1 Max with OpenCL can do even better than the A6000, but Metal still lags far behind.
Is there something else for the M1 Max with Metal that I could be missing, or that is worth exploring? I want to be sure our applications are future-proof. It was surprising to find OpenCL still alive in Monterey, and we know it is supposed to be discontinued at some point.
I just want to wrap up this thread, as we finally managed to bring OpenCL and Metal to the same level of performance. It took a lot of changes (replacing the C++ wrappers with Swift, and later with a modified Python library that compiles kernels on the fly), but ultimately the biggest difference came from packing as many constants as possible into #define statements instead of passing them through constant memory. Once all those changes were done, the M1 Max's Metal performance was finally slightly better than OpenCL, an improvement of 300%, which was dramatic. For those interested, the newest and much simpler code is at https://github.com/ProteusMRIgHIFU/BabelViscoFDTD. Below is a screenshot of the performance test (I also pushed it to a more challenging test that better illustrates how the M1 processors stand against the A6000).
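To make the #define idea concrete, here is a minimal sketch of how constants can be baked into kernel source text before it is compiled at runtime. This is only an illustration of the general technique, not the code in BabelViscoFDTD: `build_kernel_source`, `N_TOTAL`, `DT`, and the toy kernel are all hypothetical names, and the step that hands the string to the Metal compiler is left out.

```python
def build_kernel_source(kernel_body, constants):
    """Prepend one '#define NAME value' line per constant to the kernel text."""
    lines = []
    for name, value in constants.items():
        if isinstance(value, float):
            literal = repr(value) + "f"  # single-precision literal, e.g. 0.5f
        else:
            literal = str(int(value))    # integer constants are emitted as-is
        lines.append(f"#define {name} {literal}")
    return "\n".join(lines) + "\n" + kernel_body


# Hypothetical toy kernel: it uses N_TOTAL and DT as compile-time macros
# instead of reading them from a constant buffer.
KERNEL_BODY = """#include <metal_stdlib>
using namespace metal;

kernel void scale(device float *p [[buffer(0)]],
                  uint i [[thread_position_in_grid]])
{
    if (i < N_TOTAL) p[i] *= DT;
}
"""

source = build_kernel_source(KERNEL_BODY, {"N_TOTAL": 1000000, "DT": 0.5})
```

Because the values are plain literals by the time the kernel is compiled, the compiler can fold them into the generated code. The trade-off is that the kernel has to be recompiled whenever one of the constants changes.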