Comment on Surprising HPC results with M1 Max (OpenCL is stellar, Metal not so much)
Indeed, all three backends use sync conditions similar to those you describe, so they are being compared on the same grounds as best I can. I managed to recode things so that the splitting can be tailored very granularly using macros (from running the original large kernels to splitting them into multiple mini-kernels), which ensures that only the strictly necessary code is present when each mini-kernel is compiled. The mini-kernel approach (and putting all of them in a single encoder before doing a commit) improved the Metal computing time on the M1 Max by roughly 15%, but it is still far from the OpenCL time (157 s with Metal vs 57 s with OpenCL). I verified with the gputrace that none of the mini-kernels show register spilling anymore. My guess is that Metal power management may be limiting the execution; these are indeed very compute-intensive kernels. The original large kernels perform 3D operations over 30+ 3D arrays. Splitting into mini-kernels limited the number of buffers accessed by each one.

Something worth noting is that argument passing works differently in Metal than in OpenCL and CUDA. As mentioned above, these kernels need to access a lot of separate 3D arrays. In CUDA, they are passed as a single parameter structure that contains pointers to all the 3D arrays. In OpenCL, this didn't work, so I simply pass around 55 input parameters. In Metal, however, I have to do some packaging to merge several 3D arrays into a single buffer so that all the information can be passed in 32 or fewer input parameters. The indexing into the arrays can be managed neatly via macros. In Metal, this means that accessing the beginning of array X always involves adding an offset, an operation that is not present in the CUDA and OpenCL versions. But I do not think this should translate into a big penalty. As noted in the first post, the W6800 Pro Metal execution was the best of the pack, and it includes these same operations.
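A minimal Python sketch of the packing idea described above, not the solver's actual code: several 3D arrays are laid end to end in one flat buffer, and every access adds the array's base offset to its linear index. The array names (`Vx`, `Vy`, `Sigma_xx`), their shapes, the helper names, and the row-major layout are all assumptions made for illustration.

```python
from array import array

def pack_arrays(shapes):
    """Return the base offset (in elements) of each 3D array and the total size."""
    offsets = {}
    total = 0
    for name, (nx, ny, nz) in shapes.items():
        offsets[name] = total          # this array starts where the previous one ended
        total += nx * ny * nz
    return offsets, total

def flat_index(offsets, shapes, name, i, j, k):
    """Index of element (i, j, k) of array `name` inside the packed buffer."""
    nx, ny, nz = shapes[name]
    # Base offset plus the usual row-major linear index (layout is an assumption).
    return offsets[name] + (i * ny + j) * nz + k

# Hypothetical arrays sharing one buffer, as they would share one kernel argument.
shapes = {"Vx": (2, 3, 4), "Vy": (2, 3, 4), "Sigma_xx": (3, 3, 3)}
offsets, total = pack_arrays(shapes)
buffer = array("f", [0.0] * total)     # single flat float32 buffer

# Write to the last element of Vy through the offset-based index.
buffer[flat_index(offsets, shapes, "Vy", 1, 2, 3)] = 1.5
```

In a kernel, the equivalent of `flat_index` would typically be a macro, so the extra cost per access is one integer addition of the base offset.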
Coming back to power management, is there a programmatic way to disable Metal power management that could be explored? edit: typo
Jan ’22
Comment on Surprising HPC results with M1 Max (OpenCL is stellar, Metal not so much)
Awesome, I just created the report FB9882670, so you can now take a look at the Instruments trace. As noted in the report, I ran a shorter simulation for the capture than the one used for the table above, but it still shows the same level of difference between OpenCL and Metal on the M1 Max. In the Instruments capture, you will find the following 3 runs:
Run 1: Metal-based capture on the M1 Max using the newest kernel implementation, in which I split my problem into "mini-kernels" based on my chats with the engineers in the Developer Forums thread. This version shows the best Metal-based performance, but it is still significantly slower than OpenCL.
Run 2: OpenCL-based capture on the M1 Max, showing the best performance. This one used the original (very large) kernels for my FDTD solver.
Run 3: Metal-based capture of the same kernels used for OpenCL. This shows the worst performance of the 3 runs.
Looking forward to hearing your thoughts on the potential issues I may have introduced in my Metal implementation. Cheers S
Feb ’22
Comment on Surprising HPC results with M1 Max (OpenCL is stellar, Metal not so much)
Absolutely. I have already published the mini-kernel changes to the pip repository, since they show a clear improvement. I created a new branch, NewTestMetal, where I added a new script, SimpleBenchmark.py, along with instructions to prepare the environment and run the test. It should be quite straightforward to get it all running in a couple of minutes; having a healthy brew installation should be pretty much all that is needed for everything to run smoothly. Let me know of any issues. Worth mentioning: this test is a bit shorter than the one used to create the comparison table shown above, but it still shows a similar difference in computing time (3 s with OpenCL vs 8 s with Metal). So this setup is sufficient, and faster to work with, for identifying issues and doing fine-tuning.
Feb ’22