Surprising HPC results with M1 Max (OpenCL is stellar, Metal not so much)

I have a project that solves the viscoelastic equation for sound transmission in biological media: https://github.com/ProteusMRIgHIFU/BabelViscoFDTD. The code supports CUDA, OpenCL, Metal, and OpenMP backends, and we have done a lot of fine-tuning on each backend to get the best performance possible on each platform. Details of the numerical simulation and the hardware used are given in the link above. Here is a summary of the results: first of all, the M1 Max is a knockout against both AMD and Nvidia, but only when using OpenCL. Worth noting, the OpenMP performance of the M1 Max is also excellent. It is simply mind-blowing that the M1 Max is neck and neck with an Nvidia RTX A6000 that costs more than the MacBook Pro used for the test. The Metal results, on the other hand, are a bit inconsistent: Metal shows excellent results on the AMD W6800 Pro (the best computing time of all tested GPUs), but not so much on a Vega 56 or the M1 Max. For all Metal-capable processors, we used the first formula recommended at https://developer.apple.com/documentation/metal/calculating_threadgroup_and_grid_sizes.
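For reference, this is roughly how we size the threadgroups and grid following that first formula (a minimal Swift sketch; pipelineState, computeEncoder, and the domain dimensions stand in for the actual variables in our wrapper):

// First approach from Apple's threadgroup-sizing documentation:
// threadgroups of (threadExecutionWidth x maxTotalThreads/width x 1).
let w = pipelineState.threadExecutionWidth
let h = pipelineState.maxTotalThreadsPerThreadgroup / w
let threadsPerThreadgroup = MTLSize(width: w, height: h, depth: 1)

// Dispatch over the full simulation domain; dispatchThreads lets Metal
// handle non-uniform threadgroups at the grid edges.
let threadsPerGrid = MTLSize(width: domainWidth, height: domainHeight, depth: domainDepth)
computeEncoder.dispatchThreads(threadsPerGrid, threadsPerThreadgroup: threadsPerThreadgroup)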

Further tests with different domain sizes showed that the M1 Max with OpenCL can get even better results than the A6000, but Metal remains lagging by a lot.

Is there something else I could be missing or worth exploring for the M1 Max with Metal? I want to be sure our applications are future-proof; it was already surprising that OpenCL is still alive in Monterey, but we know it is supposed to be discontinued at some point.

Hi Sam,

We've looked at your project some, but we haven't had time to go through it thoroughly. We theorize that there is some pathological case you're hitting in the compiler. Since it's such a large delta, it could even be something as bad as your kernels spilling out of the register space, but it's hard to know exactly how that could happen.

It would be helpful if you performed a Metal capture. This will give you utilization and limiter statistics which may give a hint as to what's going awry here.

Hi guys, thanks for taking a look, really appreciate it.

I tried to explore the Metal capture at some point, but because the project is coded outside Xcode, I didn't manage to make it work. For example, I tried to follow these suggestions (https://alia-traces.github.io/metal/tools/xcode/2020/07/18/adding-framecapture-outside-of-xcode.html) (https://developer.apple.com/documentation/metal/frame_capture_debugging_tools/capturing_gpu_command_data_programmatically?language=objc#3325255) to capture Metal outside Xcode into a gputrace file, but MTLCaptureManager.supportsDestination(.gpuTraceDocument) reported that it is not supported. I read that I needed to enable it with an Info.plist with the flag MetalCaptureEnabled set to true.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>MetalCaptureEnabled</key>
    <true/>
</dict>
</plist>

But I never managed to make it work. I put Info.plist in the directory where I run my application (via Python), as suggested here (https://stackoverflow.com/questions/63707866/macos-metal-failing-to-capture-gpu-frame-from-command-line-app), as well as next to the libraries, the Python binary, etc., but it never seemed to be taken into account.
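For reference, this is roughly the programmatic capture path I am attempting (a simplified Swift sketch; the real calls go through our C/Python wrapper, and the output path is just illustrative):

import Foundation
import Metal

let device = MTLCreateSystemDefaultDevice()!
let captureManager = MTLCaptureManager.shared()

// This is the check that currently reports "not supported" for me.
guard captureManager.supportsDestination(.gpuTraceDocument) else {
    fatalError("Capturing to a .gputrace document is not supported")
}

let descriptor = MTLCaptureDescriptor()
descriptor.captureObject = device
descriptor.destination = .gpuTraceDocument
descriptor.outputURL = URL(fileURLWithPath: "BabelViscoFDTD.gputrace")

try! captureManager.startCapture(with: descriptor)
// ... encode and commit the command buffers to be captured ...
captureManager.stopCapture()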

Do you have any other suggestions to enable the capture? The same goes for the Instruments trace: any hint on how to do it from a command-line application would be welcome.

Cheers,

S

Amazing, that METAL_CAPTURE_ENABLED should be mentioned in the official documentation! :) It worked without a problem; now I can capture the gputrace.

There is indeed some register spilling in 2 of the 3 kernels run in the main for loop of the FDTD process. As mentioned, I need to call MTLCommandBuffer.waitUntilCompleted() at each iteration, otherwise the results become unstable given the nature of the finite-difference time-domain method (the stress values must finish being calculated before the particle values are calculated, and so on).
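To illustrate the synchronization constraint, this is roughly the per-time-step pattern (a Swift sketch with placeholder pipeline names, not the actual wrapper code):

for _ in 0..<numberOfTimeSteps {
    for pipeline in [stressPipeline, particlePipeline] {   // placeholder names
        let commandBuffer = commandQueue.makeCommandBuffer()!
        let encoder = commandBuffer.makeComputeCommandEncoder()!
        encoder.setComputePipelineState(pipeline)
        // ... setBuffer(...) calls for the simulation arrays ...
        encoder.dispatchThreads(gridSize, threadsPerThreadgroup: threadsPerThreadgroup)
        encoder.endEncoding()
        commandBuffer.commit()
        // Stress values must be fully written before the particle kernel reads them.
        commandBuffer.waitUntilCompleted()
    }
}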

Regarding the spilling, after analyzing the trace I have an idea to try: split the most register-intensive kernels (they are currently massive) into mini-kernels, which should reduce register pressure. These mini-kernels can be run independently within a single command buffer in a single encoding operation, which will hopefully maximize occupancy. It will be interesting to see how all 3 backends (OpenCL, Metal, and CUDA) behave, since this splitting will be common to all of them.
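A sketch of what I have in mind for the mini-kernel encoding (again with placeholder pipeline names; the default serial compute encoder keeps the dispatches ordered):

let commandBuffer = commandQueue.makeCommandBuffer()!
let encoder = commandBuffer.makeComputeCommandEncoder()!   // serial dispatch by default
for pipeline in [stressXPipeline, stressYPipeline, stressZPipeline] {   // placeholder mini-kernels
    encoder.setComputePipelineState(pipeline)
    // ... shared setBuffer(...) calls ...
    encoder.dispatchThreads(gridSize, threadsPerThreadgroup: threadsPerThreadgroup)
}
encoder.endEncoding()
commandBuffer.commit()
commandBuffer.waitUntilCompleted()   // still one wait per time step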

Thanks again for the METAL_CAPTURE_ENABLED info; I can see I will use it heavily in the future.

The Instruments trace shows that the OpenCL execution has higher occupancy and much higher memory bandwidth (390 GB/s in OpenCL vs 110 GB/s in Metal). The memory-access strategy of the OpenCL run alone could explain the difference. Keep in mind, though, that this OpenCL execution still used the very large kernels, so memory is being accessed aggressively; even so, it is nice to see it hitting peak performance.

This is the Metal Instruments overview using the "mini-kernels" approach

and here OpenCL using the original large kernels

From the Instruments trace (and apologies, since I'm just learning how to use it correctly), besides seeing some of this evidence, how can I go into more detail to identify what specific bottleneck may explain the difference?

So we took a look at the code a bit and found a couple of items to note.

  • It doesn't appear that the OpenCL path is actually waiting anywhere, or at least we couldn't find where it does this. There appears to be logic that selects which API to call to process each portion of work. Each batch creates a queue on the Metal side and also waits for completion on every iteration inside the processing function. However, we don't see this in the OpenCL path.
  • This segment of code recreates buffers every iteration. Buffers are meant to be reused, so recreating them this often can incur significant overhead. The data you're passing to the kernel with these buffers is small enough that you can probably just use the MTLComputeCommandEncoder.setBytes(_:length:index:) method to send that data.
  • Here it looks like you're recreating the compute pipeline each iteration. In the worst case, this means you're invoking the shader compiler every time, which is an incredibly expensive operation. This should be done before any processing, and the pipeline should be reused each iteration (a combined sketch of the last two points follows this list).
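A minimal sketch of both suggestions together, assuming a hypothetical StepConstants struct and kernel name rather than your project's actual code, and that the simulation buffers and grid sizes already exist:

import Metal

struct StepConstants {                       // hypothetical small per-iteration constants
    var timeStep: Float
    var stepIndex: UInt32
}

let device = MTLCreateSystemDefaultDevice()!
let queue = device.makeCommandQueue()!
let library = try! device.makeDefaultLibrary(bundle: .main)
let function = library.makeFunction(name: "StressKernel")!              // illustrative name
let pipeline = try! device.makeComputePipelineState(function: function) // built once, outside the loop

for n in 0..<numberOfTimeSteps {
    var constants = StepConstants(timeStep: dt, stepIndex: UInt32(n))
    let commandBuffer = queue.makeCommandBuffer()!
    let encoder = commandBuffer.makeComputeCommandEncoder()!
    encoder.setComputePipelineState(pipeline)                 // reused, never rebuilt
    encoder.setBuffer(fieldBuffer, offset: 0, index: 0)       // large simulation arrays stay in MTLBuffers
    // Small, frequently changing data goes through setBytes, with no per-iteration buffer allocation.
    encoder.setBytes(&constants, length: MemoryLayout<StepConstants>.stride, index: 1)
    encoder.dispatchThreads(gridSize, threadsPerThreadgroup: threadsPerThreadgroup)
    encoder.endEncoding()
    commandBuffer.commit()
    commandBuffer.waitUntilCompleted()
}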

Also, it's difficult to tell what's happening in the Instruments traces you posted, as we can't see the command buffers and encoders that are responsible for the work measured here. Including the Metal Application and GPU tracks would tell us that.

To send us info, it's probably easier if you create a Feedback Assistant report (and post the number here). You can attach any files with the data you're gathering.

Awesome, thanks for taking a look, and apologies, I should have pointed you to the right spot beforehand... you are looking at the wrong function :). That segment of code belongs to the Rayleigh-Sommerfeld integral that is included in the library to complement the FDTD solution of the viscoelastic solver; the solver itself is mainly covered in the FDTD3D_GPU_VERSION.h file. I'm linking to the experimental branch FirstTestAppleSilicon, which has all the recent "mini-kernels" changes; those have not yet been merged into the main branch. All the recent Instruments profiling tests were done with that branch.

As a side note, yes, in the ForwardSimple function I should move the defaultLibrary.makeFunction(name: "ForwardSimpleMetal")! call out of the main loop, but that function is already working quite well; the one that really initiated this thread is in FDTD3D_GPU_VERSION.h.

So back to FDTD3D_GPU_VERSION.h: there you can see the compilation of all 3 GPU backends controlled with macro conditions (i.e., #if defined(METAL) ...). You can see (here) that for the code in question I didn't make the mistake of creating the library inside the main loop. Be aware that for FDTD3D_GPU_VERSION.h I used the mtlpp wrapper instead of a Swift-based interface, since FDTD3D_GPU_VERSION.h was originally developed as a C-based Python extension (I have a student who will work on changing that so we can use Swift instead, but that is for another day). You can also see there how each of the GPU backends gets its synchronization.

Btw, the mini-kernel Metal approach showed significant improvements for the AMD GPUs: the W6800 is now showing even better results, and the Vega 56 now also shows better performance than OpenCL. Here you can see the updated benchmarks.

The Vega 56 showed the most dramatic improvement, from 144 s to 83 s after using the mini-kernels approach. I haven't yet completed the proper tests in OpenCL and CUDA to see how the mini-kernels approach works with those backends. But at least this exercise has been quite productive for improving performance on Metal-supported GPUs; it is clear that with adequate fine-tuning Metal can and should do better than OpenCL. Now I just need to keep investigating how to ensure Metal shows the same trend on the M1 Max as it does with the AMD GPUs.

Thanks again for taking a look.

It is a pity this seems to have fallen through the cracks, and now I have a bigger issue. The latest macOS 12.3 update truly broke the operation of my W6800 GPU!!! I can no longer run any of my Metal code; it just hangs there indefinitely. I have seen multiple reports of a drop in performance with eGPUs associated with the latest macOS release. Please check what is going on! This is highly disruptive.

Hi Sam,

Your issue hasn't been lost. We're still looking at it, but I'm sorry for the lack of updates.

Wow, so none of your code runs. Do you have any more details you can share? Does the whole system hang, or just your app?

Thanks for the reply. The problem with my eGPU was fixed just yesterday with the release of macOS 12.3.1; there were reports all over many forums of eGPU issues associated with 12.3. After the update, my programs were able to run again; it was just my app that hung. Anyway, it is not a problem anymore. Looking forward to hearing more on the original issue. Maybe to give some extra motivation: given the excellent experience with the M1 Max, we are now considering using Mac Studios with the M1 Ultra (and anything in your pipeline of new processors) for neuronavigated procedures in human participants, where this numerical model running in OpenCL/Metal will be tightly integrated, so being future-proof with Metal is truly critical for us moving forward.

Hi guys, any updates on what I could try out? Or is this something you think could be addressed in the future with Metal 3 once it becomes available?

Accepted Answer

I just want to wrap up this thread, as we finally managed to bring OpenCL and Metal to the same level of performance. It took a lot of changes (first replacing the C++ wrappers with Swift, and later with a modified Python library that compiles kernels on the fly), but ultimately the biggest difference came from packing as many constants as possible into #define statements instead of passing them through constant memory. Once all those changes were done, the M1 Max's Metal performance is now slightly better than OpenCL's, an improvement of 300%, which is dramatic. For those interested, the newest and much simplified code is at https://github.com/ProteusMRIgHIFU/BabelViscoFDTD. Below is a screenshot of the performance test (I also pushed it to a more challenging test that better illustrates how the M1 processors stand vs the A6000).
